I as a human know how to find this information. The game day rosters for many NFL teams are available on many sites. It would be tedious but possible for me to find this number. It might take an hour of my time.
But despite this being a relatively easy research task, all of the deep research tools I tried (OpenAI, Google, and Perplexity) completely failed and just gave me a general estimate.
Based on this article I tried that search just using o3 without deep research and it still failed miserably.
So even though it might be a good check today, it might not remain such a good benchmark.
I think we need a way to keep updating prompts, without somehow increasing their complexity, to properly verify model improvements. ARC Deep Research, anyone?
There are some NBA fan sites that do keep track of some of these tournament level final metrics.
I am not sure whether the model needs the exact answer, or whether backlinks to the sites where it can be found are enough. Maybe just documenting how to do it could do the job as well...
Google AI Studio gave me an exact answer of 2227 as a possible answer and linked to these comments, because there is a comment further down which claims that is the exact answer. The comment was 2 hours old when I ran the prompt.
It also provided a code example of how to find it using the python nfl data library mentioned in one of the comments here.
"For the 2023‑24 NBA regular season (which ran from October 24, 2023 to April 14, 2024), a total of 561 distinct players logged at least one game appearance, as indexed by their “Rk” on the Basketball‑Reference “Player Stats: Totals” page (the final rank shown is 561)"
Doing a quick search on my own, this number seems like it could be correct.
Technically I can probably do it in about 10 minutes, because I've worked with these kinds of stats before and know about packages that will get you this basically instantly (https://pypi.org/project/nfl-data-py/).
It's exactly 4 lines of code to find the correct answer, which is 2,227.
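Not the commenter's exact code, but a sketch of roughly what those few lines could look like with nfl-data-py; the function and column names used here (import_weekly_rosters, "status", "player_id") are recalled from the package docs and may differ between versions:

```python
# A sketch of counting distinct players on 2023 game-day rosters with
# nfl-data-py. "status == 'ACT'" is an approximation of "was on a
# game-day roster"; check the package docs for the exact fields.
import nfl_data_py as nfl

rosters = nfl.import_weekly_rosters([2023])      # one row per player per week
active = rosters[rosters["status"] == "ACT"]     # keep weekly actives only
print(active["player_id"].nunique())             # count distinct players
```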
Assuming I didn't know about that package, though, I'd open a site like Pro Football Reference, middle-click on each game to open it in a new tab, click through the tabs, copy-paste the rosters into Sublime Text, do some regex to get the names one per line, drop the one-per-line list into sortmylist or a similar utility, dedupe it, and then paste it back into Sublime Text to get the line count.
That would probably take me about an hour.
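For what it's worth, the dedupe-and-count step at the end of that manual workflow is tiny in code too; a minimal sketch, assuming the regex step has already produced a file with one name per line (the file name is hypothetical):

```python
# "rosters.txt" is a hypothetical file produced by the copy/paste + regex step:
# one player name per line, possibly with duplicates.
with open("rosters.txt") as f:
    names = {line.strip() for line in f if line.strip()}
print(len(names))  # number of unique players
```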
If it "might take an hour of my time" to get the correct answer, then there's a low bar for trying a shortcut that might not work.
The deep research capabilities are much better suited to more qualitative research / aggregation.
Unfortunately sentiment analysis like "Tell me how you feel about how many players the NFL has" is just way less useful than: "Tell me how many players the NFL has."
Because it failed miserably at the very simple task of looking through some scattered charts, the human asking should blame themselves for this basic failure and trust it to do better with much harder and more specialized tasks?
His point is that the two tasks are very different at their core, and deep research is better at teasing out an accurate "fuzzy" answer from a swamp of interrelated data, and a data scientist is better at getting an accurate answer for a precise, sharply-defined question from a sea of comma-separated numbers.
A human readily understands that "hold the onions, hots on the side" means to not serve any onions and to place any spicy components of the sandwich in a separate container rather than on the sandwich itself. A machine needs to do a lot of educated guessing to decide whether it's being asked to keep the onions in its "hand" for a moment or keep them off the sandwich entirely, and whether black pepper used in the barbeque sauce needs to be separated and placed in a pile along with the habanero peppers.
I understand that there are fuzzy tasks that AIs/algorithms are terrible at, which seem really simple for a human mind, and this hasn't gone away with the latest generations of LLMs. That's fine and I wouldn't criticize an AI for failing at something like the instructions you describe, for example.
However, in this case the human was asking for very specific, cut-and-dried information from easily available NFL rosters. Again, if an AI fails at that, especially because you didn't phrase the question "just so", then sorry, but no, it's not much more trustworthy for deep research and data-scientist inquiries.
What in any case makes you think the data scientists will use superior phrasing to tease better results out of an LLM under more complexity?
> If your reaction to this is “surely typing out the code is faster than typing out an English instruction of it”, all I can tell you is that it really isn’t for me any more. Code needs to be correct. English has enormous room for shortcuts, and vagaries, and typos, and saying things like “use that popular HTTP library” if you can’t remember the name off the top of your head.
Using LLMs as part of my coding work speeds me up by a significant amount.
Precise aggregation is what so many juniors do in so many fields of work it's not even funny...
Not to mention that we validate whether to trust a human expert's opinion by their ability to deliver measurably correct judgements, the very thing LLMs seem to not be good at.
Gemini 2.5 Pro and o3/o4-mini seem to have crossed a threshold for a bunch of things (at least for me) in the last few weeks.
Tasteful, effective use of the search tool for o3/o4-mini is one of those. Being able to "reason" effectively over long context inputs (particularly useful for understanding and debugging larger volumes of code) is another.
One could use the above workflow in the same way and argue that natural-language search is more intuitive than keyword-based search. But I don't think that brings any meaningful productivity improvement.
> Being able to "reason" effectively over long context inputs (particularly useful for understanding and debugging larger volumes of code) is another.
Any time I see this "wish" pop up, my suggestion is to try a disassembler to reverse engineer some binary, to really understand the problem of coming up with a theory of a program (based on Naur's definition). Individual statements are always clear (programming languages are formal and have no ambiguity). The issue is grouping them, unambiguously defining the semantics of those groups, and finding the links between them, recursively.
Once that's done, what you'll have is a domain. And you could have skipped the whole thing by just learning the domain from a domain expert. So the only reason to do this is that the code doesn't really implement the domain (bugs), or that it's hidden purposefully. The most productive workflow there is to learn the domain first, then either find the discrepancies (first case) or focus on the missing parts (second case). In the first case, the easiest approach is writing tests, and the more complete one is formal verification of the software.
Sure, at least I have someone to blame in that case. But in my experience, the AI is at least as reliable as a person I don't personally know.
Less time and money spent training those nurses, which you can then spend on training specialists. And your expert system will take less time to update than retraining thousands of doctors every time some new protocol or drug is released.
The problem is when the AI makes a catastrophic prediction and the layman can't see it.
I don't see how it is really different with AI
How do you know this?
It really is a game changer when the search engine
I find that an AI performing multiple searches on variations of keywords, and aggregating the top results across keywords is more extensive than most people, myself included, would do.
I had luck once asking what its search queries were. It usually provides the references.
Edit: from https://help.kagi.com/kagi/ai/assistant.html it looks like the answer is "all of them":
> Access to the latest and most performant large language models from OpenAI, Anthropic, Meta, Google, Mistral, Amazon, Alibaba and DeepSeek
1. Technically it might be possible to search the Internet, but it might not surface correct and/or useful information.
2. High-value information that would make a research report valuable is rarely public or free. This holds especially true in capital-intensive or regulated industries.
ChatGPT plus an extra $30/month for search access to a specific archive would make sense to me.
Brilliant (and one less thing I have to build)!
This is RAG but for API docs.
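A minimal sketch of what "RAG for API docs" amounts to: retrieve the chunks of documentation most relevant to the question, then paste them into the prompt. The toy chunks, the TF-IDF retriever, and the prompt assembly below are placeholders for whatever docs, embedding model, and LLM you actually use:

```python
# Toy "RAG for API docs": rank doc chunks against the question with TF-IDF,
# keep the top matches, and build a prompt around them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_chunks = [
    "requests.get(url, params=None, **kwargs) sends a GET request.",
    "requests.post(url, data=None, json=None) sends a POST request.",
    "Session objects persist cookies across requests.",
]
question = "How do I send JSON in a POST request?"

matrix = TfidfVectorizer().fit_transform(doc_chunks + [question])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
top = [doc_chunks[i] for i in scores.argsort()[::-1][:2]]

prompt = "Answer using only these docs:\n" + "\n".join(top) + f"\n\nQ: {question}"
print(prompt)  # send this to whichever model you're using
```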
Sure, it's not a journal - but in some fields (Machine Learning, Math) it seems like everyone also uploads their stuff there. So if the models can crawl sites like arXiv, at least there's some decent stuff to be found.
It would be great if, for a deep-search tool for ML, I could just use arXiv as a source and have the agent search it. But so far I have not found a working arXiv tool that does this well.
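For a home-grown version, the public arXiv Atom API is enough to give an agent a basic arXiv search tool; a sketch (the ranking/relevance problem is the hard part this doesn't solve):

```python
# A basic arXiv "search tool" an agent could call, using the public arXiv
# Atom API at export.arxiv.org. Query parameters (search_query, start,
# max_results) follow the arXiv API docs; feedparser is a third-party
# dependency (pip install feedparser).
import urllib.parse
import urllib.request

import feedparser

def search_arxiv(query: str, max_results: int = 5):
    params = urllib.parse.urlencode(
        {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    )
    url = f"http://export.arxiv.org/api/query?{params}"
    with urllib.request.urlopen(url) as resp:
        feed = feedparser.parse(resp.read())
    return [(entry.title, entry.link) for entry in feed.entries]

for title, link in search_arxiv("retrieval augmented generation"):
    print(title, "-", link)
```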
I believe that most positions are resolved if:
1) you accept that these are fundamentally narrative tools. They build stories, in whatever style you wish: stories of code, stories of project reports, stories of conversations.
2) this is balanced by the idea that the core of everything in our shared information economy is Verification.
The reason experts get use out of these tools, is because they can verify when the output is close enough to be indistinguishable from expert effort.
Domain experts also do another level of verification (hopefully) which is to check if the generated content computes correctly as a result - based on their mental model of their domain.
I would predict that LLMs are deadly in the hands of people who can't gauge the output and will end up driving themselves off a cliff, while experts will be able to use them effectively on tasks where verifying the output has a comparative effort advantage over producing it.
The first result was WB, which I gave to it as the first example and am already using. Results 2 and 3 were the mainstream services which it helpfully marked in the table as not having the features I need. Result 4 looked promising but was discontinued 3 years ago. Result 5 was an actual option which I'm trying out (but may not work for other reasons).
So, 1/5 usable results. That was mildly helpful I guess, but it appeared a lot more helpful on the surface than it was. And I don't seem to have the ability to say "nice try but dig deeper".
My question is, how to reproduce this level of functionality locally, in a "home lab" type setting. I fully expect the various AI companies to follow the exact same business model as any other VC-funded tech outfit: free service (you're the product) -> paid service (you're still the product) -> paid service with advertising baked in (now you're unabashedly the product).
I fear that with LLM-based offerings, the advertising will be increasingly inseparable, and eventually undetectable, from the actual useful information we seek. I'd like to get a "clean" capsule of the world's compendium of knowledge with this amazing ability to self-reason, before it's truly corrupted.
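One rough way to approximate this in a home lab, sketched under the assumption that you run a local SearxNG instance (with its JSON output enabled) for search and Ollama for the model; the URLs, model name, and response fields below reflect a typical default setup and are assumptions, not a drop-in recipe:

```python
# Home-lab "search + summarize" loop: query local SearxNG, feed the top
# snippets to a local model served by Ollama, return its answer.
import requests

SEARX_URL = "http://localhost:8080/search"
OLLAMA_URL = "http://localhost:11434/api/generate"

def research(question: str) -> str:
    # 1. Search locally and grab a few result snippets.
    hits = requests.get(SEARX_URL, params={"q": question, "format": "json"}).json()
    snippets = "\n".join(
        f"- {r['title']}: {r.get('content', '')}" for r in hits["results"][:5]
    )
    # 2. Ask the local model to answer from those snippets only.
    prompt = f"Using only these search snippets, answer: {question}\n{snippets}"
    resp = requests.post(
        OLLAMA_URL, json={"model": "llama3", "prompt": prompt, "stream": False}
    )
    return resp.json()["response"]

print(research("Which sites track game-day rosters for NFL teams?"))
```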
First one, geolocation a photo I saw in a museum. It didn’t find a definitive answer but it sure turned up a lot of fascinating info in its research.
Second one, I asked it to suggest a new line of enquiry in the Madeleine McCann missing person case. It made the interesting suggestion that the 30 minute phone call the suspect made on the evening of the disappearance, from a place near the location of the abduction, was actually a sort of “lookout call” to an accomplice nearby.
Quite impressed. This is a great investigative tool.
> “Google is still showing slop for Encanto 2!” (Link is provided)
I believe quite strongly that Google is making a serious misstep in this area, the “supposed answer text pinned at the top above the actual search results.”
For years they showed something in this area which was directly quoted from what I assume was a shortlist of non-BS sites, so users were conditioned over time that if they just wanted a simple answer, like when a certain movie came out or whether a certain show had been canceled, they might as well trust it.
Now it seems like they have given over that previous real estate to a far less reliable feature, which simply feeds any old garbage it finds anywhere into a credulous LLM and takes whatever pops out. 90% of people that I witness using Google today simply read that text and never click any results.
As a result, Google is now pretty much always even less accurate at the job of answering questions than if you posed that same question to ChatGPT, because GPT seems to be drawing from its overall weights which tend toward basic reality, whereas Google’s “Answer” seems to be summarizing a random 1-5 articles from the Spam Web, with zero discrimination between fact, satire, fiction, and propaganda. How can they keep doing this and not expect it to go badly?
I secondarily wonder how an LLM solves the trust problem in web search, which is traditionally solved (and now gamed) through PageRank. ChatGPT doesn't seem to be as easily fooled by spam as direct search is.
How much of this is Bing (or whatever the underlying search engine is) getting better, versus how much is it the LLMs getting better at knowing what a good result for a query looks like?
Or perhaps it has to do with the richer questions that get asked to chat vs search?
Kagi has this already, it's great. Choose a result, click the three-dot menu, choose "Ask questions about this page." I love to do this with hosted man pages to discover ways to combine the available flags (and to discover what is there)
I find most code LLMs write to be subpar but Kagi can definitely write a better ffmpeg line than I can when I use this approach
You don't hear a lot of buzz around them, but that's kind of what Perplexity lets you do. (Possibly Phind too, but it's been a while since I used them.)
the whole contraption uses ~10 different models, but more can easily be plugged into the initial generation phase. happy to demo it sometime! [edit: email on profile].
Conveniently, Gemini is the best frontier model for everything else, and they're very interested and well positioned (if not the best positioned) to also be the best at deep research. Let's check back in 3-6 months.
1) Their AI models aren't half bad. Gemini 2.5 seems to be doing quite well relative to some competitors.
2) They know how to scale this stuff. They have their own hardware, lots of data, etc.
Scaling is of course the hard part. Doing things at Google scale means doing it well while still making a profit. Most AI companies are just converting VC cash into GPUs and energy. VC subsidized AI is nice at a small scale but cripplingly expensive at a larger scale. Google can't do this; they are too large for that. But they are vertically integrated, build their own data centers, with their own TPUs, etc. So, once this starts happening at their scale, they might just have an advantage.
A lot of what we are seeing is them learning to walk before they start running faster. Most of the world has no clue what Perplexity is, or any notion of the pros and cons of Claude 3.7 Sonnet vs. o4-mini-high. None of that stuff matters long term. What matters is who can do this stuff well enough for billions of people.
So, I wouldn't count them out. But none of this stuff guarantees success either, of course.
I just had a research report last night that looked at 400 sources when I asked it to help identify a first edition Origin of Species (it did a great job too, correctly explaining how to identify a true first edition from chimeral ones).
Individual model vendors cannot build such a product, as they are biased towards their own models; they would not allow you to choose models from competitors.
> The user-facing Google Gemini app can search too, but it doesn’t show me what it’s searching for.
Gemini 2.5 Pro is also capable of search as part of its chain of thought; it needs light prodding to show URLs, but it'll do so and is good at it.

Unrelated point, but I'm going to keep saying this anywhere Google engineers may be reading: the main problem with Gemini is their horrendous web app, riddled with 5 annoying bugs that I identified as a casual user after a week. I assume it's in such a bad state because they don't actually use the app and use the API instead, but come on. You solved the hard problem of making the world's best overall model but are squandering it on the world's worst user interface.
It's a great tool, but sometimes frustrating.
I need to get from A to B via C via public transport in a big metropolis.
Now C could be one of say 5 different locations of a bank branch, electronics retailer, blood test lab or whatever, so there's multiple ways of going about this.
I would like a chatbot solution that compares all the different options and lays them out ranked by time from A to B. Is this doable today?
This may exclude some clever routes that shave off 3 minutes if you do the correct parkour... but it means I can now put my phone down and enjoy the journey without tracking it like a hawk.
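The ranking part, at least, is mechanical once you have a routing API; a sketch assuming the Google Directions API's transit mode (the key and addresses are placeholders), which a chatbot with tool access could drive the same way:

```python
# Rank candidate intermediate stops (the possible C locations) by total
# transit time A->C plus C->B, using the Google Directions API in transit mode.
import requests

API_KEY = "YOUR_KEY"
DIRECTIONS_URL = "https://maps.googleapis.com/maps/api/directions/json"

def transit_seconds(origin: str, destination: str) -> int:
    params = {"origin": origin, "destination": destination,
              "mode": "transit", "key": API_KEY}
    route = requests.get(DIRECTIONS_URL, params=params).json()["routes"][0]
    return sum(leg["duration"]["value"] for leg in route["legs"])

A, B = "home address", "destination address"
candidates = ["branch 1 address", "branch 2 address", "branch 3 address"]

totals = {c: transit_seconds(A, c) + transit_seconds(c, B) for c in candidates}
for c, seconds in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{c}: ~{seconds // 60} min door to door")
```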
Don't forget xAI's Grok!
For example, a lot of the "sources" cited in Google's AI Overview (notably not a deep research product) are not official, just sites that probably rank high in SEO. I want the original source, or a reliable source, not joeswebsite dot com (no offense to this website if it indeed exists).
That's what makes o3/o4-mini-driven search notable to me: those models appear to have much better taste in which searches to run and which sources to consider.
Biologists, mathematicians, physicists, philosophers and the like seem to have an open-ended benefit from the research which AI is now starting to enable. I kind of envy them.
Unless one moves into AI research?
That doesn't mean they won't try, though. I think the replication crisis has illustrated how many researchers actually care about correctness versus just publishing papers.
Scientists are meant to be good at verifying and double-checking results - similar to how journalists have to learn to derive the truth from unreliable sources.
These are skills that turn out to be crucial when working with LLMs.
Verifying and double-checking results requires replicating experiments, doesn't it?
> similar to how journalists have to learn to derive the truth from unreliable sources
I think maybe you are giving journalists too much credit here, or you have a very low standard for "truth"
You cannot, no matter how good you are, derive truth from faulty data
Figuring out that the data is faulty is part of research.
There is still no possible way that a journalist can arrive at correct information, no matter how good they are, if they only have faulty data to go on.
A friend of mine is an investigative reporter for a major publication. They once told me that an effective trick for figuring out what's happening in a political story is to play different sources off against each other - tell one source snippets of information you've got from another source to see if they'll rebut or support it, or if they'll leak you a new detail because what you've got already makes them look bad.
Obviously these sources are all inherently biased and flawed! They'll lie to you because they have an agenda. Your job is to figure out that agenda and figure out which bits are true.
The best way to confirm a fact is to hear about it from multiple sources who don't know who else you are talking to.
That's part of how the human intelligence side of journalism works. This is why I think journalists are particularly well suited to dealing with LLMs - human sources lie and mislead and hallucinate to them all the time already. They know how to get (as close as possible) to the truth.
All three of those things are things that software engineers are rather reliably bad at and cut corners on, because they are the least engaging and least interesting parts of the job of building software.
I've seen a few people state that they don't like using LLMs because it takes away the fun part (writing the code) and leaves them with the bits they don't enjoy.
Are bad engineers
> AI-assisted development
Are also bad engineers
- shooting buildings in Gaza https://apnews.com/article/israel-palestinians-ai-weapons-43...
- compiling a list of information on Government workers in US https://www.msn.com/en-us/news/politics/elon-musk-s-doge-usi...
- creating a few lousy music videos
I'd argue we'd be better off SLOWING DOWN with that shit