> DeepSearch runs through an iterative loop of searching, reading, and reasoning until it finds the optimal answer. [...]
> DeepResearch builds upon DeepSearch by adding a structured framework for generating long research reports
Given these definitions, I think DeepSearch is the more valuable and interesting pattern. It's effectively RAG built using tools in a loop, which is much more likely to answer questions well than more traditional RAG, where there is only one attempt to find relevant documents to include in a single prompt to an LLM.
DeepResearch is a cosmetic enhancement that wraps the results in a "report" - it looks impressive but IMO is much more likely to lead to inaccurate or misleading results.
More notes here: https://simonwillison.net/2025/Mar/4/deepsearch-deepresearch...
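To make that "tools in a loop" idea concrete, here is a minimal sketch; `llm` and `web_search` are hypothetical stand-ins, not any particular API:

```python
# Minimal sketch of "RAG as tools in a loop"; `llm` and `web_search` are
# placeholders for whatever model API and search backend you actually use.
def deep_search(question, llm, web_search, max_steps=10):
    notes = []
    for _ in range(max_steps):
        # Let the model decide the next action given everything gathered so far.
        action = llm(
            f"Question: {question}\nNotes so far: {notes}\n"
            "Reply with either SEARCH:<query> or ANSWER:<answer>."
        )
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        notes.append(web_search(action[len("SEARCH:"):].strip()))  # read step
    return "I don't know"  # give up rather than guess
```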
I think that, done well, deep research can be more than that. At a minimum, before the "deep search" you'd need some calls to an LLM to figure out what to look for, which places would be best to look (i.e. sources, trust, etc.), how to tabulate the data gathered, and so on. Just as deep search is "RAG with tools in a loop", so can (and should) deep research be.
Think of the analogy of using aider to go straight to code versus using it to first /architect and then code - but for any task that lends itself to (re)searching. At least it would catch useless tangents faster.
At the end of the day, what's fascinating about LLM-based agents is that you can almost always add another layer of abstraction on top. No matter what you build, you can always come at it from another angle. That's really cool imo, and something Hassabis has hinted at lately in some podcasts.
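As a rough illustration of that "plan first, then search" step (purely a sketch; `llm` and `deep_search` are hypothetical stand-ins, and the JSON contract is made up):

```python
import json

# Sketch only: one LLM call to plan what to look for and where, then a
# deep-search pass per question, then a tabulation pass at the end.
def plan_then_research(topic, llm, deep_search):
    plan = json.loads(llm(
        f"Topic: {topic}\n"
        'Return JSON with keys "questions", "preferred_sources", "output_format".'
    ))
    findings = {q: deep_search(q) for q in plan["questions"]}
    return llm(
        f"Using sources like {plan['preferred_sources']}, tabulate these findings "
        f"as {plan['output_format']}:\n{json.dumps(findings)}"
    )
```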
I think that one part of the deep loop needs to be a check-in on expectations and goals…
So instead of throwing one deep task at it, I find that bots work better with small, iterative chunks of objectives…
I haven't formulated it completely yet, but as an example I've been working extensively with Cursor's whole Anthropic-abstraction AI-as-a-service:
So many folks suffer from the "generating…" quagmire;
And I found that telling the bot to “break any response into smaller chunks to avoid context limitations” works incredibly well…
So when my scaffold is complete, the goal is to use Fabric Patterns as nursery assignments for the deep bots, whereby they constantly check in.
Prior to "deep" things I found this to work really well by telling the bots to obsessively maintain a development_diary.md and .json tracking of actions. Even so, their memory is super small, so I envisioned a multi-layer setup where the initial agents' actions feed the context of the agents who follow along - a waterfall of context between agents, so as to avoid context loss on super deep iterative research…
(I'll type out something more salient when I have a KVM…)
(But I hope that doesn’t sound stupid)
You tell the agent "grok this persona to accomplish the task".
Thank you.
-
Second, thank you for bringing up Jina; I recently discovered it and immediately began building a Thing based on it:
I want to use its functions to ferret out all the entanglements in the WEF Leadership roster, similar to the NGO fractal connections - I'm doing that with every WEF member, through to Congress.
I would truly wish to work with you on such, if so inclined…
I prefer to build "dossiers" rather than reports, represented as JSON schemas.
I’m on mobile so will provide more details when at machine…
Looping through a dossier of connections is much more thoughtful than a “report” imo.
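For what it's worth, here's one possible shape for such a dossier entry (field names are purely illustrative assumptions, not Jina's schema or anyone's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class Connection:
    target: str                      # the related entity
    relationship: str                # e.g. "board member", "funder"
    sources: list[str] = field(default_factory=list)  # citation URLs

@dataclass
class DossierEntry:
    entity: str
    summary: str
    connections: list[Connection] = field(default_factory=list)
```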
I need to see you on someone’s podcast, else you and I should make one!
What I want is a “podcast” with audience participation..
The lex fridman DeepSeek episode was so awesome but I have so many questions and I get exceedingly frustrated when lex doesn’t ask what may seem obv to us HNers…
-
Back topic:
Reports are flat; dossiers are malleable.
As I mentioned, my goal is fractal visuals (in minamGL) of the true entanglements from the WEF out.
Much like Mike Benz on USAID - using Jina deep research, extraction, etc. will pull back the veil on the truth of the globalist agenda seeking control and will reveal true relationships, loyalties and connections.
It's been running through my head for decades, and I finally feel that Jina is a tool that can start to reveal what I and so many others can plainly see but can't verify.
anyway - I just watched it.
Fantastic.
---
What are your thoughts on API overload -- we will need bots upon bots to keep track of the APIs...
Meaning... are we to establish an Internet Protocol ~~ AI Protocol addressing schema - whereby every API henceforth has an IPv6 addy?
and instead of "domain names" for sale -- you are purchasing and registering an IPv6-->API.MY.SERVICE
Yup, I got the same impression reading this article - and the Jina one, too. Like with langchain and agents, people are making chained function calls or a loop sound like it is the second coming, or a Nobel prize-worthy discovery. It's not - it's obvious. It's just expensive to get to work reliably and productize.
Folks associated with that project were some of the first people to talk about patterns like this publicly online. With GPT-3-Davinci it was not at all obvious that chaining these calls would work well, to most people, and I think the community and team around langchain and associated projects did a pretty good job of helping to popularize some of the patterns such as they are now obvious.
That said, I agree with your overall impression.
DeepResearch (note the absence of the space character) is the name that Han Xiao proposes for the general pattern of generating a research-style report after running multiple searches.
You might implement that pattern using prompt engineering or using some custom trained model or through other means. If the eventual output looks like a report and it ran multiple searches along the way it fits Han's "DeepResearch" definition.
I guess they will take any opportunity to follow the leader if they worry that they're at risk of similar branding issues here too.
No, that's not what Xiao said here. Here's the relevant quote
> It often begins by creating a table of contents, then systematically applies DeepSearch to each required section – from introduction through related work and methodology, all the way to the conclusion. Each section is generated by feeding specific research questions into the DeepSearch. The final phase involves consolidating all sections into a single prompt to improve the overall narrative coherence.
(I also recommend that you stare very hard at the diagrams.)
Let me paraphrase what Xiao is saying here:
A DeepSearch is a primitive — it does mostly the same thing a regular LLM query does, but with a lot of trained-in thinking and searching work, to ensure that it is producing a rigorous answer to your question. Which is great: it means that DeepSearch is more likely to say "I don't know" than to hallucinate an answer. (This is extremely important as a building block; an agent needs to know when a query has failed so it can try again / try something else.)
However, DeepSearch alone still "hallucinates" in one particular way: it "hallucinates understanding" of the topic, thinking that it already has a complete mental toolkit of concepts needed to solve your problem. It will never say "solving this sub-problem seems to require inventing a new tool" and so "branch off" to another recursed DeepSearch to determine how to do that. Instead, it'll try to solve your problem with the toolkit it has — and if that toolkit is insufficient, it will simply fail.
Which, again, is great in some ways. It means that a single DeepSearch will do a (semi-)bounded amount of work. Which means that the costs of each marginal additional DeepSearch call are predictable.
But it also means that you can't ask DeepSearch itself to:
• come up with a mathematical proof of something, where any useful proof strategy will implicitly require inventing new math concepts to use as tools in solving the problem.
• do investigative journalism that involves "chasing leads" down a digraph of paths; evaluating what those leads have to say; and using that info to determine new leads.
• "code me a Facebook clone" — and have it understand that doing so involves iteratively/recursively building out a software architecture composed of many modules — where it won't be able to see the need for many of those modules at "design time", but will only "discover" the need to write them once it gets to implementation time of dependent modules and realizes that to achieve some goal, it must call into some code / entire library that doesn't exist yet. (And then make a buy-vs-build decision on writing that code vs pulling in a dependency... which requires researching the space of available packages in the ecosystem, and how well they solve the problem... and so on.)
A DeepResearch model, meanwhile, is a model that looks at a question, and says "is this a leaf question that can be answered directly — or is this a question that needs to be broken down and tackled by parts, perhaps with some of the parts themselves being unknowns until earlier parts are solved?"
A DeepResearch model does a lot of top-level work — probably using DeepSearch! — to test the "leaf-ness" of your question; and to break down non-leaf questions into a "battle plan" for solving the problem. It then attempts solutions to these component problems — not by calling DeepSearch, but by recursively calling itself (where that forked child will call DeepSearch if the sub-problem is leaf-y, or break down the sub-problem further if not.)
A DeepResearch model will then take the derived solutions for dependent problems into account in the solution space for depending problems. (A DeepResearch model may also be trained to notice when it's "worked into a corner" by coming up with early-phase solutions that make later phases impossible, and to backtrack and solve the earlier phases differently, now with in-context knowledge of the constraints of the later phases.)
Once a DeepResearch model finds a successful solution to all subproblems, it takes the hierarchy of thinking/searching logs it generated in the process, and strips out all the dead-ends and backtracking, to present a comprehensible linear "success path." (Probably it does this as the end-step of each recursive self-call, before returning to self, to minimize the amount of data returned.)
Note how this last reporting step isn't "generating a report" for human consumption; it's a DeepResearch call "generating a report" for its parent DeepResearch call to consume. That's special sauce. (And if you think about it, the top-level call to this whole thing is probably going to use a non-DeepResearch model at the end to rephrase the top-level DeepResearch result from a machine-readable recurse-result report into a human-readable report. It might even use a DeepSearch model to do that!)
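A loose sketch of that recursion, as I read it (this is not Han Xiao's actual implementation; `llm`, `deep_search`, and the prompts are all hypothetical):

```python
# Loose sketch of the recursive DeepResearch pattern described above.
def deep_research(question, llm, deep_search, depth=0, max_depth=3):
    # Test "leaf-ness": can this question be answered directly?
    if depth >= max_depth or llm(f"Is this directly answerable? {question}") == "yes":
        return deep_search(question)  # leaf question: hand off to DeepSearch
    # Non-leaf: break the question down and recurse on each sub-problem.
    sub_questions = llm(f"Break into sub-questions: {question}").splitlines()
    sub_reports = [
        deep_research(q, llm, deep_search, depth + 1, max_depth)
        for q in sub_questions if q.strip()
    ]
    # Strip dead ends and return a linear "success path" for the parent call.
    return llm(
        f"Consolidate into one coherent answer to {question!r}:\n" + "\n".join(sub_reports)
    )
```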
---
Bonus tangent:
Despite DeepSearch + DeepResearch using a scientific-research metaphor, I think an enlightening comparison is with intelligence agencies.
DeepSearch alone does what an individual intelligence analyst does. You hand them an individually-actionable question; they run through a "branching, but vaguely bounded in time" process of thinking and searching, generating a thinking log in the process, eventually arriving at a conclusion; they hand you back an answer to your question, with a lot of citations — or they "throw an exception" and tell you that the facts available to the agency cannot support a conclusion at this time.
Meanwhile, DeepResearch does what an intelligence agency as a whole does:
1. You send the agency a high-level strategic Request For Information;
2. the agency puts together a workgroup composed of people with trained-in expertise with breaking down problems (Intelligence Managers), and domain-matter experts with a wide-ranging gestalt picture of the problem space (Senior Intelligence Analysts), and tasks them with breaking down the problem into sub-problems;
3. some of these sub-problems are actionable — they can be assigned directly for research by a ground-level analyst; some of these sub-problems have prerequisite work that must be done to gather intelligence in the field; and some of these sub-problems are unknown unknowns — missing parts of the map that cannot be "planned into" until other sub-problems are resolved.
4. from there, the problem gets "scheduled" — in parallel, (the first batch of) individually-actionable questions get sent to analysts, and any field missions to gather pre-requisite intelligence are kicked off for planning (involving spawning new sub-workgroups!)
5. the top-level workgroup persists after their first meeting, asynchronously observing the reports from actionable questions; scheduling newly-actionable questions to analysts once field data comes in to be chewed on; and exploring newly-legible parts of the map to outline further sub-problems.
6. If this scheduling process runs out of work to schedule, it's either because the top-level question is now answerable, or because the process has worked itself into a corner. In the former case, a final summary reporting step is kicked off, usually assigned to a senior analyst. In the latter case, the workgroup reconvenes to figure out how to backtrack out of the corner and pursue alternate avenues. (Note that, if they have the time, they'll probably make "if this strategy produces results that are unworkable in a later step" plans for every possible step in their original plan, in advance, so that the "scheduling engine" of analyst assignments and fieldwork need never run dry waiting for the workgroup to come up with a new plan.)
> DeepResearch builds upon DeepSearch by adding a structured framework for generating long research reports.
But I then called it "a cosmetic enhancement", really to be slightly dismissive of it - I'm a skeptic of the report format because I think the way it's presented makes the information look more solid than it actually is. My complaint is at the aesthetic level, not about the (impressive) way the report synthesis is engineered.
So yeah, I'm being inaccurate and a bit catty about it.
Your explanation is much closer to what Han described, and much more useful than mine.
I thought this was funny at the time, but I think as more time passes it does highlight the stark gulf that exists between the capability of the most advanced AI systems and what we expect as "normal competency" from the most average person.
Yes, but now we're at the point where we can compare AI to a person, whereas five years ago the gap was so big that that was just unthinkable.
Tale as old as time. We can make nice software systems, but general-purpose AI / agents aren't here yet.
In short: the natural order of things is that computers are better at thinking, and people are better at manual labor. Which is the opposite of what we wanted.
I choose a direction and apply force.
One of the implications of OpenAI's DR is that frontier labs are more likely to train specific models for a bunch of tasks, resulting in the kind of quality that wrappers will find hard to replicate. This is leading towards model + post-training RL as the product, instead of keeping them separate from the final wrapper as the product. Might be interesting times if the trajectory continues.
PS: There is also Genspark MoA[2], which creates an in-depth report on a given prompt using a mixture of agents. From what I have seen across 5-6 generations, this is very effective.
[1]: https://x.com/_philschmid/status/1896569401979081073 (I might be misunderstanding this, but this seems to be a native call instead of an explicit one)
Prediction for Next Hot Thing in Q4 2025 / Q1 2026: someone will make the Nobel prize-worthy discovery that you can stuff the results of your deep search into a database (vector or otherwise) and then use it to compile a higher-quality report from a much larger set of sources.
We'll call it DeepRAG or Retrieval Augmented Deep Research or something.
Prediction for Q2 2026: next Nobel prize awarded for realizing you may as well stop treating report generation as the core aspect of "deep research" (as it obviously makes no sense, but hey, time traveler's spoilers, sorry!), stop at the "stuff search results into a database" step, and let users "chat with the search results", making the research an interactive process.
We'll call this huge scientific breakthrough "DeepRAG With Human Feedback", or "DRHF".
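In the spirit of the joke, the "breakthrough" amounts to roughly this much glue code (everything here - `embed`, `vector_store`, `llm` - is a stand-in, not a real library API):

```python
# Tongue-in-cheek sketch: persist deep-search results, then retrieve against
# them when the user chats. All names are placeholders for your own stack.
def ingest(search_results, embed, vector_store):
    for doc in search_results:
        vector_store.add(embedding=embed(doc), document=doc)

def chat_with_results(question, embed, vector_store, llm, k=5):
    context = vector_store.nearest(embed(question), k=k)
    return llm(f"Answer using only this context:\n{context}\n\nQ: {question}")
```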
https://leehanchung.github.io/assets/img/2025-02-26/05-quadr...
Real innovation is going in with task-specific models like AlphaFold.
LLMs are starting to become more task-specific too, as we've seen with the performance of reasoning models on their specific tasks.
I imagine we’ll see LLMs trained specifically for medical purposes, legal purposes, code purposes, and maybe even editorial purposes.
All useful in their own way, but none of them even close to sci-fi.
Prior to gpt-3 AI was rarely used in marketing or to talk about any number of ML methods.
Nowadays “AI” is just the new “smart” for marketing products.
Terms change. The current usage of AGI, especially in the context I was talking about, is specifically marketing from LLM providers.
I'd argue that the term AGI, when used in a non-fiction context, has always been a meaningless marketing term of some kind.
> Prior to gpt-3 AI was rarely used in marketing or to talk about any number of ML methods.
In the decade prior to GPT-3, AI was frequently used in marketing to talk about any ML methods, up to and including linear regression. This obviously ramped up heavily after "Deep Learning" got coined as a term.
AI now actually means something in marketing, but the only reason for that is that calling out to an LLM is even simpler than adding linear regression to your product somewhere.
As for AGI, that was a hot topic in some circles (that are now dismissed as "AI doomers") for decades. In fact, OpenAI started with people associated with, or at least within the sphere of influence of, the LessWrong community, which both influenced the naming and perspective the "LLM industry" started with, and briefly put the outputs of LessWrong into the spotlight - which is why now everyone uses terms like "AGI" and "alignment" and "AI safety".
However, unlike "alignment", which got completely butchered as a term, AGI still roughly means what it meant before - which is basically the birth of a man-made god. That would be true of AGI as a "meaningless marketing term" too, if the people so positive about it paused to follow through on the implications beyond "oh, it's like ChatGPT, but actually good at everything".
Well, now it is not: it is now "the difference between something with outputs that sound plausible vs something with outputs which are properly checked".
You’re confusing several different ideas here.
The idea you’re talking about is called “the bitter lesson.” It (very basically) says that a model with more compute put behind it will perform better than a cleverer method which may use less compute. Has nothing to do with being “generic.” It’s also worth noting that, afaik, it’s an accurate observation, but not a law or a fact. It may not hold forever.
Either way, I’m not arguing against that. I’m saying that LLMs are too general to be useful in specific, specialized, domains.
Sure bigger generic models perform (increasingly marginally) better at the benchmarks we’ve cooked up, but they’re still too general to be that useful in any specific context. That’s the entire reason RAG exists in the first place.
I’m saying that a language model trained on a specific domain will perform better at tasks in that domain than a similar sized model (in terms of compute) trained on a lot of different, unrelated text.
For instance, a model trained specifically on code will produce better code than a similarly sized model trained on all available text.
I really hope that example makes what I’m saying self-evident.
If you compare it with, e.g., a model for self-driving cars – generic text models will not win, because they operate in a different domain.
In all cases, trying to optimize on subset/specialized tasks within a domain is not worth the investment, because the state of the art will be held by larger models working on the whole available set.
> You're explaining it nicely and then seem to make a mistake that contradicts what you've just said – because code and text share a domain (text-based)
"Text" is not the domain that matters.
The whole trick behind LLMs being as capable as they are, is that they're able to tease out concepts from all that training text - concepts of any kind, from things to ideas to patterns of thinking. The latent space of those models has enough dimensions to encode just about any semantic relationship as some distinct direction, and practice shows this is exactly what happens. That's what makes style transfer pretty much a vector operation (instead of "-King +Woman", think "-Academic, +Funny"), why LLMs are so good at translating between languages, from spec to code, and why adding modalities worked so well.
With LLMs, the common domain between "text" and "code" is not "text", but the way humans think, and the way they understand reality. It's not the raw sequences of tokens that map between, say, poetry or academic texts and code - it's the patterns of thought behind those sequences of tokens.
Code is a specific domain - beyond being the lifeblood of programs, it's also an exercise in a specific way of thinking, taken up to 11. That's why learning code turned out to be crucial for improving general reasoning abilities of LLMs (the same is, IMO, true for humans, but it's harder to demonstrate a solid proof). And conversely, text in general provides context for code that would be hard to infer from code alone.
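As a toy illustration of the "style transfer as a vector operation" claim (assuming you already have some text-embedding function; nothing here is specific to any real model):

```python
import numpy as np

def style_shift(text_vec, from_style_vec, to_style_vec):
    # The claimed trick: move along a direction in latent space,
    # e.g. "-Academic, +Funny", analogous to the classic "-King, +Woman".
    return text_vec - from_style_vec + to_style_vec

def nearest_label(candidates, query_vec):
    # Pick the candidate (dict of {label: vector}) closest by cosine similarity.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidates, key=lambda label: cos(candidates[label], query_vec))
```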
All digital data is just 1s and 0s.
Do you think a model trained on raw bytes would perform coding tasks better than a model trained on code?
I have a strong hunch that there’s some Goldilocks zone of specificity for statistical model performance and I don’t think “all text” is in that zone.
Here is the article for “the bitter lesson.” [0]
It says that general machine-learning strategies which use more compute are better at learning a given data set than a strategy tailor-made for that set.
This does not imply that training on a more general dataset will yield more performance than using a more specific dataset.
The lesson is about machine learning methods, not about end-model performance at a specific task.
Imagine a logistic regression model vs an expert system for determining real estate prices.
The lesson tells us that, given more and more compute, the logistic regression model will perform better than the expert system.
The lesson does not imply that when given 2 logistic regression models, one trained on global real estate data and one trained on local, that the former would outperform.
I realize this is a fine distinction and that I may not be explaining it as well as I could if I were speaking, but it’s an important distinction nonetheless.
[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
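To make the "same model class, different training data" point concrete, here is a toy sketch with synthetic data (nothing here comes from the essay; the "pricing rules" are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: the "local" market follows one rule,
# the "global" data mostly follows a different one.
X_local = rng.normal(size=(500, 4))
y_local = (X_local[:, 0] + 0.5 * X_local[:, 1] > 0).astype(int)
X_global = rng.normal(size=(5000, 4))
y_global = (X_global[:, 2] - X_global[:, 3] > 0).astype(int)

# Evaluate both models on the *local* task.
X_test = rng.normal(size=(200, 4))
y_test = (X_test[:, 0] + 0.5 * X_test[:, 1] > 0).astype(int)

local_model = LogisticRegression().fit(X_local, y_local)
global_model = LogisticRegression().fit(X_global, y_global)
print("local-data model on local test:", local_model.score(X_test, y_test))
print("global-data model on local test:", global_model.score(X_test, y_test))
```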
TIL: The term AGI, which we've been using since at least 1997[0] was invented by time-traveling LLM companies in the 2020s.
[0]: https://ai.stackexchange.com/questions/20231/who-first-coine...
I'm happy to see some acknowledgement of the world before LLMs. This is an old problem, and one I (or my team, really) was working on at the time of DALL-E & ChatGPT's explosion. As the article indicated, we deemed 3.5 unacceptable for Q&A almost immediately, as the failure rate was too high for operational reporting in such a demanding industry (legal). We instead employed SQuAD and polished up the output with an LLM.
These new reasoning models that effectively retrofit Q&A capabilities (an extractive task) onto a generative model are impressive, but I can't help but think that it's putting the cart before the horse and will inevitably give diminishing returns in performance. Time will tell, I suppose.
Funnily enough, Amazon will pick products for you to compare, but the compared items are usually terrible, and you can't just add whatever you want, or choose columns.
With Grok, I'll have it remove columns, add columns, shorten responses, so on and so forth.
OpenAI tends to have something like 20 high quality sources selected and goes very deep in the specific topic, producing something like 20-50 pages of research in all areas and adjacent areas. It takes a lot to read but is quite good.
Perplexity tends to hit something like 60 or more sources, goes fairly shallow, answers some questions in general ways but is excellent at giving you the surface area of the problem space and thoughts about where to go deeper if needed.
OpenAI takes a lot longer to complete, perhaps 20x longer. This factors heavily into whether you want a surface-y answer now or a deep answer later.
I think it really comes down to your own workflow. You sometimes want to be more imperative (select the sources yourself to generate a report) and sometimes more declarative (let a DFS/BFS algo go and split a query into subqueries and go down rabbit holes until some depth and then aggregate).
I've been trying different ways of optimizing the former, but I am fascinated by the more end-to-end flows that systems like STORM do.
But there are so many ways to configure GPT Researcher for all kinds of budgets, so I wonder if this comparison really pushed the output or just went with defaults and got default midranges for comparison.
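For the "declarative" flavour, a hedged sketch of the split-into-subqueries idea (BFS to a fixed depth, then aggregate); `llm` and `search` are stand-ins, and the prompts are illustrative rather than how GPT Researcher or STORM actually work:

```python
from collections import deque

def bfs_research(query, llm, search, max_depth=2):
    frontier = deque([(query, 0)])
    findings = []
    while frontier:
        q, depth = frontier.popleft()
        findings.append((q, search(q)))   # answer this node
        if depth < max_depth:             # ...then expand its rabbit holes
            for sub in llm(f"List follow-up questions for: {q}").splitlines():
                if sub.strip():
                    frontier.append((sub.strip(), depth + 1))
    return llm("Aggregate these findings:\n" +
               "\n".join(f"{q}: {r}" for q, r in findings))
```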
LLMs are not accountable for anything.
The best case scenario, as I see it, is that we become more critical of research in general, and start training AI to help us identify these issues. And then we could perhaps utilize GANs to improve report generation.
You may repeat instructions multiple times, but it ignores them or fails to understand a solution.