Researchers at Google have created a new framework for AI research agents that outperforms leading systems from OpenAI, Perplexity, and others on key benchmarks.
Inspired by the human writing process, the new agent, named Test-Time Diffusion Deep Researcher (TTD-DR), drafts, gathers information, and makes iterative adjustments.
By combining diffusion mechanisms with evolutionary algorithms, the system produces more thorough and accurate research on complex subjects.
This architecture could enable a new generation of custom research assistants for businesses, aimed at high-value tasks that conventional retrieval-augmented generation (RAG) systems struggle with, such as producing a competitive analysis or a market entry report.
The paper’s authors state that addressing these real-world commercial use cases was the system’s main goal.
The Limitations Of Today’s Deep Research Agents
Deep research (DR) agents are designed to handle complex questions that go well beyond a simple search. They plan with large language models (LLMs), gather information through tools like web search, and then apply test-time scaling techniques such as chain-of-thought (CoT) reasoning, best-of-N sampling, and Monte Carlo Tree Search to synthesize the results into a comprehensive report.
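To make one of these test-time scaling techniques concrete, best-of-N sampling can be sketched in a few lines. The `generate` and `score` callables below are hypothetical stand-ins for an LLM sampler and a quality scorer; they are not part of Google's system.

```python
def best_of_n(generate, score, n=4):
    """Sample n candidate answers and keep the highest-scoring one."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)

# Toy illustration: the stand-in "LLM" emits strings of increasing
# length, and the stand-in "scorer" simply prefers longer answers.
best = best_of_n(generate=lambda i: "a" * (i + 1), score=len, n=3)
```

The trade-off is cost: each extra candidate is another full generation, which is one reason DR agents look for more structured ways to spend test-time compute.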
Many of these systems, however, share fundamental design flaws. Most publicly available DR agents rely on test-time tools and algorithms whose structure bears little resemblance to human cognition. In open-source agents, the planning, searching, and content-generation stages are often rigidly linear or parallel, making it difficult for the different stages of research to inform and correct one another.
According to the authors of the report, “This highlights the need for a more cohesive, purpose-built framework for DR agents that mimics or surpasses human research capabilities and indicates a fundamental limitation in current DR agent work.”
Unlike the linear approach of most AI agents, human researchers work iteratively. They typically begin with a high-level plan, produce an initial draft, and then go through several rounds of revision, seeking out fresh information to support their claims and close gaps along the way.
This is the foundation of TTD-DR. The concept treats writing a research report as a diffusion process, in which an initial “noisy” draft is gradually refined into a polished final product. As the researchers describe it, a diffusion model first produces a noisy draft, which a denoising module then refines into higher-quality (or higher-resolution) output with the help of retrieval tools.
Two main mechanisms accomplish this. The first, which the researchers call “Denoising with Retrieval,” starts from an initial draft and refines it iteratively: at each step, the agent uses the current draft to generate new search queries, gathers outside information, and incorporates it to “denoise” the report, adding detail and correcting errors.
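The denoising-with-retrieval loop described above can be sketched as follows. All four helper functions are hypothetical stand-ins for LLM and retrieval calls, not Google's actual API; only the loop structure reflects the paper's described workflow.

```python
def draft_report(question):
    # Hypothetical stand-in for the initial, "noisy" LLM draft.
    return f"DRAFT: {question}"

def generate_queries(draft):
    # Hypothetical: derive search queries from gaps in the current draft.
    return [f"sources on {draft.split(':')[-1].strip()}"]

def retrieve(query):
    # Hypothetical retrieval call (web search, in the paper).
    return f"[evidence: {query}]"

def revise(draft, evidence):
    # Hypothetical: an LLM folds retrieved evidence back into the
    # draft, adding detail and correcting errors ("denoising").
    return draft + " " + " ".join(evidence)

def denoise_with_retrieval(question, steps=3):
    """Iteratively refine a noisy initial draft with retrieved evidence."""
    draft = draft_report(question)
    for _ in range(steps):
        queries = generate_queries(draft)
        evidence = [retrieve(q) for q in queries]
        draft = revise(draft, evidence)
    return draft

report = denoise_with_retrieval("EU market entry for fintech", steps=2)
```

The key design choice is that the draft itself drives the next round of search, so later retrievals target whatever the current report is still missing.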
The second mechanism, “Self-Evolution,” lets each of the agent’s three components (the planner, the question generator, and the answer synthesizer) autonomously optimize its own performance. This component-level evolution matters because it makes the “report denoising more effective,” Rujun Han, a Google research scientist and co-author of the work, explained in comments to VentureBeat. It resembles an evolutionary process in which each part of the system gets better at its particular function over time, feeding the main revision process context of steadily increasing quality.
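A minimal sketch of component-level self-evolution, under the assumption that each component exposes tunable variants scored by a fitness function, might look like the following. The temperature example and the optimum of 0.7 are entirely hypothetical; the paper does not specify what each component evolves.

```python
import random

def self_evolve(variants, fitness, mutate, generations=20, seed=0):
    """Keep the fittest variants of a component and mutate them,
    so the component improves at its role over successive rounds."""
    rng = random.Random(seed)
    pool = list(variants)
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        survivors = pool[:2]                      # keep the two fittest
        pool = survivors + [mutate(v, rng) for v in survivors]
    return max(pool, key=fitness)

# Toy stand-in: evolve a sampling "temperature" for a query generator
# toward a hypothetical optimum of 0.7.
fitness = lambda t: -abs(t - 0.7)
mutate = lambda t, rng: min(1.0, max(0.0, t + rng.uniform(-0.1, 0.1)))
best_temp = self_evolve([0.1, 0.5, 1.0], fitness, mutate)
```

Because survivors are always carried forward, the best variant never regresses, which is what lets each component supply steadily better context to the main denoising loop.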
According to the authors, “the complex interaction and complementary pairing of these two algorithms are essential for attaining superior research results.” Reports produced through this iterative process are not only more accurate but also more logically coherent. Because the system was evaluated on helpfulness, which encompasses fluency and coherence, Han notes, its performance gains are a clear indicator of its ability to generate coherent, well-structured business documents.
Like the deep research products from OpenAI, Perplexity, and Grok, the resulting research companion is “capable of generating helpful and comprehensive reports for complex research questions across diverse industry domains, including finance, biomedical, recreation, and technology,” according to the paper.
TTD-DR In Operation
To build and test their architecture, the researchers used Gemini 2.5 Pro as the primary LLM (other models can be substituted) within Google’s Agent Development Kit (ADK), a flexible platform for orchestrating complex AI workflows.
TTD-DR was benchmarked against leading open-source and commercial systems, including Grok DeepSearch, Perplexity Deep Research, OpenAI Deep Research, and the open-source GPT-Researcher.
The evaluation focused on two primary tasks. For generating long-form comprehensive reports, the researchers combined the DeepConsult benchmark, a set of business and consulting prompts, with their own LongForm Research dataset. For answering multi-hop questions that demand extensive search and reasoning, they tested the agent on challenging academic and real-world benchmarks such as Humanity’s Last Exam (HLE) and GAIA.
According to the findings, TTD-DR consistently outperformed its rivals. In long-form report generation, it beat OpenAI Deep Research with win rates of 69.1% and 74.5% on the two datasets. It also topped OpenAI’s system by 4.8%, 7.7%, and 1.7% on the three benchmarks requiring multi-hop reasoning to produce concise answers.
The Future Of Test-Time Diffusion
Although the current study focuses on text-based reports that rely on web search, the framework is designed to be highly flexible. Han says the team plans to extend it with additional tools for complex business workflows.
A similar “test-time diffusion” approach could produce complex software, a detailed financial model, or a multi-phase marketing campaign, with an initial “draft” of the project iteratively improved using fresh data and feedback from a range of specialized tools.
According to Han, “all of these tools can be naturally incorporated in our framework,” suggesting that this draft-centric methodology could serve as the foundation for a wide range of complex, multi-step AI agents.

