If you are a skilled mathematician, it is quite easy to verify (as of 7th April) that models both excel at novel calculations on held-out problems and mostly shit the bed when asked for proofs.
Gary cites these USAMO results as evidence of contamination influencing benchmark results, but that view is not consistent with the models' strong performance on clearly held-out tasks (the ARC test, AIME 25, HMMT 25, etc.).
If you really care, you can test this by inventing problems! It is a very very verifiable claim about the world.
In any case, this is not the pundit you want. There are many ways to make a bear case that are much saner than this.
I'm not trying to be snarky, I'm just wondering why I would care that a tech giant has failed to cross the "GPT-5" threshold. What's the significance of that to an ordinary user?
This is good for self-hosters and devs, who will be able to run near-SOTA models like QwQ locally. I’m near the point where I’m going to cancel my ChatGPT Plus and Claude subscriptions.
If you’re not already self-hosting, building your own local agents, and building your own MCPs/Tools, I would encourage you to try it (simple stack: ollama, pydanticAI, fastmcp, QwQ 32B, Llama 3.2 3B). If you don’t have a fancy GPU or an M1+, try out QwQ on Groq or Flash 2.0 Lite with the Gemini API; it’s super cheap and fast, and they are basically equivalent to (if not better than) the ChatGPT you were paying for 16 months ago.
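If the MCP part sounds intimidating, it isn't. Here's a rough sketch of what a fastmcp tool server looks like (the server name and word_count tool are made-up examples for illustration, and decorator details may differ a bit between fastmcp versions):

    # Minimal MCP tool server with fastmcp (pip install fastmcp).
    # An MCP-capable agent/client can discover and call these tools.
    from fastmcp import FastMCP

    mcp = FastMCP("local-tools")  # server name is arbitrary

    @mcp.tool()
    def word_count(text: str) -> int:
        """Count the number of words in a piece of text."""
        return len(text.split())

    if __name__ == "__main__":
        # Defaults to stdio transport, which is what most MCP clients expect.
        mcp.run()

Point an MCP-aware client at that over stdio (pydanticAI can do this, last I checked) and your local QwQ/Llama model can call word_count like any other tool.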
This is more for those curious about whether AI is tulip bulbs.
There are unintentional ideological camps on AI; one is mad about a financial/interest bubble*, and it is maintained by content like this every 4 to 8 weeks. (Randomly selected representative comment, in this same discussion: https://news.ycombinator.com/item?id=43618256)
* reminiscent, to me, of how I felt about Uber for years and years and years until I sort of moved on when it survived COVID.
I know I write off some of the time I spend working with LLMs as an investment in the future. If someone told me this is as good as they'll get, I would definitely invest less time working with them.
What is the negative effect I'm not seeing? Bad code? Economic waste in datacenter investment? Wasted effort of researchers who could be solving other problems?
I've been writing software for over a decade, and I’ve never been as productive as I am now. Jumping into a new codebase - even in unfamiliar areas like a React frontend - is so much easier. I’m routinely contributing to frontend projects, which I never did before.
There is some discipline required to avoid the temptation to just push AI-generated code, but otherwise, it works like magic.
https://en.wikipedia.org/wiki/AI_winter
That said, the techniques that have come out of machine-learning research in recent years are indeed powerful, for certain, constrained purposes. That was true of other AI technologies in the past, but we don't call them AI anymore; we call them things like "rules engines". These days, you could take a current-year press release about AI and use a cloud-to-butt-style filter to replace all occurrences of "AI" with "statistics" and see virtually zero drop in factuality (though a considerable drop in market sizzle).
GPT-2 to GPT-3: February 2019 to June 2020 = 16 months.
GPT-3 to GPT-4: June 2020 to March 2023 = 33 months.
Looks like the time to get to the next level is doubling. So we can expect GPT-5 sometime around September 2028 (roughly 66 months after March 2023).
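Back-of-the-envelope, if you want to check that (dates rounded to the first of the month; the doubling assumption is obviously the speculative part):

    from datetime import date

    def add_months(d: date, months: int) -> date:
        """Add whole months to a date (day clamped to the 1st for simplicity)."""
        total = d.year * 12 + (d.month - 1) + months
        return date(total // 12, total % 12 + 1, 1)

    gpt4_release = date(2023, 3, 1)   # GPT-4: March 2023
    gap_3_to_4 = 33                   # GPT-3 -> GPT-4 took ~33 months
    next_gap = gap_3_to_4 * 2         # assume the gap doubles again

    print(add_months(gpt4_release, next_gap))  # 2028-09-01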
Feels like people are being premature about claiming AI winter or that it is somehow a scandal that we don't already have GPT-5.
It's going to take time. We need some more patience in this industry.
We can't yet say what the future holds. The naysayers who were so confident that LLMs were stochastic parrots are now embarrassingly wrong. This article sounds like that. Whether we are actually at a dead end or not is unknown. Why are people talking with such utter conviction when nobody truly understands what's going on internally with LLMs?
The meanings of the words "lie" and "hallucinate" can be assigned to algorithms other than the one that powers the human brain. If something states something untrue or fabricates details, then it is by definition lying or hallucinating.
It has nothing to do with anthropomorphization.
Additionally, don’t think of what the human brain does as some higher-level thing. We are algorithms just as much as the LLM is an algorithm.
What else is new?