This is important for anyone who thinks these systems are thinking, reasoning, and learning from them, or that they’re having a conversation with them, i.e. 90% of LLM users.
Why do you think the results of this paper contradict these claims at all?
Reminder that "thinking" is an ill-defined term like others, and the question whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans - either treat bugs as bugs and build systems resilient to bugs, or accept the baseline error rate if it's low enough.
I think there is a valid insight here which many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than doing these tasks themselves.
For example if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job.
But if I tell the LLM to write a Python or Node.js script to do the same, I get significantly better results. And it's often faster to generate and run the script than to let the LLM process large SQL files.
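To make that concrete, here is roughly the kind of script I mean. It is a minimal sketch, not a real tool: the input format (table, index name, ordered columns) and all the index names are made up for illustration, and "redundant" is assumed to mean "columns are a strict leading prefix of another index on the same table".

```python
from collections import defaultdict

# Hypothetical input format: (table, index_name, ordered column list).
# A real script would parse these out of the schema dump instead.
INDEXES = [
    ("users", "idx_users_email", ["email"]),
    ("users", "idx_users_email_created", ["email", "created_at"]),
    ("users", "idx_users_name", ["name"]),
    ("orders", "idx_orders_user", ["user_id"]),
]


def find_redundant_indexes(indexes):
    """Return (redundant, covering) index-name pairs.

    Assumption: an index is redundant when its columns are a strict
    leading prefix of another index on the same table.
    """
    by_table = defaultdict(list)
    for table, name, columns in indexes:
        by_table[table].append((name, columns))

    redundant = []
    for entries in by_table.values():
        for name, cols in entries:
            for other_name, other_cols in entries:
                if (name != other_name
                        and len(cols) < len(other_cols)
                        and other_cols[:len(cols)] == cols):
                    redundant.append((name, other_name))
    return redundant


if __name__ == "__main__":
    for name, covering in find_redundant_indexes(INDEXES):
        print(f"{name} is covered by {covering}")
```

This ignores unique constraints and partial indexes, which can make a "prefix" index non-redundant, so treat the output as a list of candidates to review rather than a list of things to drop.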
This is well known and not that interesting to me - ask the model to use python to solve any of these questions and it will get it right every time.
A more accurate analogy for humans would be to imagine that every word had a colour. You are told that there is also a sequence of different colours that corresponds to the same colour as that word. You are even given a book showing every combination to memorise.
You learn the colours well enough that you can read and write coherently using them.
Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit and you know that the colour can also be constructed by crimson followed by Disney-blond. Now, do both of those contain chocolate-brown or just one of them, how many?
It requires exercising memory to do a task that is underrepresented in the training data, because humans simply do not have to do the task at all when the answer can be derived from the question representation. Humans don't have the ability the LLMs would need either, but the letter representation doesn't require that ability.
> For the multiplication task, note that agents that make external calls to a calculator tool may have ZEH = ∞. While ZEH = ∞ does have meaning, in this paper we primarily evaluate the LLM itself without external tool calls
The models can count to infinity if you give them access to tools. The production models do this.
Not that the paper is wrong, it is still interesting to measure the core neural network of a model. But modern models use tools.
What the user sees is the total behavior of the entire system, not whether the system has internal divisions and separations.
At least in theory.
An LLM is a router and completely stateless aside from the context you feed into it. Attention is just routing the probability distribution of the next token, and I'm not sure that's going to accumulate much in a single pass.
No, it doesn't. It makes sense that they can't count the r's because they don't have access to the actual word, only tokens that might represent parts or the whole of the word.
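A toy illustration of the difference, in Python. The word "strawberry", the token split and the token IDs are all assumptions made up for the example; a real tokenizer may segment the word differently.

```python
word = "strawberry"

# Character-level view: counting is a one-liner.
char_count = word.count("r")

# Hypothetical token split, for illustration only.
tokens = ["str", "aw", "berry"]
# Made-up IDs standing in for what the model actually receives.
token_ids = [101, 202, 303]

# The pieces still concatenate to the original word...
assert "".join(tokens) == word

# ...but the per-token letter counts are only visible here because we
# have the strings. The model sees opaque IDs and has to have memorised
# what each one spells.
per_token = {tok: tok.count("r") for tok in tokens}

print(char_count)
print(per_token)
```

With access to the characters the answer falls out of `str.count`; from `token_ids` alone there is nothing to count.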
You wouldn't say that a human who doesn't know how to read isn't reliable in everything, just in reading.
Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2 year old is able to count to even 10 it's through memorization and not understanding. It takes them like 2 more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.
What this points at is the abstraction/emergence crux of it all. Why does an otherwise very capable LLM such as the GPT-5 series, despite having been trained on vastly more examples of frontend code of all shapes, sizes and quality levels, struggle to abstract all that training data to the point where it can output a frontend that deviates from the clearly used examples?
If LLMs, as they are now, were comparable with human learning, there'd be no scenario where a model that can provide output solving highly advanced equations cannot count properly.
Similarly, a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on.
These models, I think at this point there is little doubt, are impressive tools, but they still do not generalise or abstract information in the way a human mind does. Doesn't make them less impactful for industries, etc. but it makes any comparison to humans not very suitable.
This paper has nothing to do with any questions starting with "why". It provides a metric for quantifying error on specific tasks.
> If LLMs, as they are now, were comparable with human learning
I think I missed the part where they need to be.
> struggle to abstract all that training data to the point where outputting any frontend that deviates from the clearly used examples? ... a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on
There is a very big and very important difference between producing the same thing again and not being able to produce something else. When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Long before AI there was this thing called Twitter Bootstrap. It dominated the web for...much longer than it should have. And that tragedy was done entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
[I've edited this comment for content and format]
Ok, that's better than comparing LLMs to humans. ZSL, however, did not prove anything of that sort false years ago, as it was mainly concerned with assessing whether LLMs rely solely on precise instruction training or can generalise to a very limited degree beyond the initial tuning. That has never allowed for comparing human learning to LLM training.
Ironically, you are writing this under a paper that shows just that:
A model that cannot determine a short string's parity cannot have abstracted from the training data to arrive at the far more impressive and complicated maths challenges which it successfully solves in output. Some of the solutions we have seen in output require such innate understanding that, if there is no generalisation far deeper than ZSL has ever shown, then this must come from training. Simple multiplication, etc. maybe, not the tasks people such as Easy Riders [0] throw at these models.
This paper shows exactly that even with ZSL, these models only abstract in an incredibly limited manner, and a lot of the capabilities we see in the output are specifically trained, not generalised. Yes, generalisation in a limited capacity can happen, but no, it is not nearly enough to yield some of the results we are seeing. I have also not said, either here or in my initial comment, that LLMs are only capable of outputting what their training data provides, merely that given what GPT-5 has been trained on, if these models had gained any deeper abstraction during training, it'd be able to provide more than one frontend style.
Or to put it more simply, if the output provided can be useful for maths at the Bachelor level and beyond, and this capability is generalised as you believe, these tasks would not be a struggle for the model.
No human who can program, solve advanced math problems, or can talk about advanced problem domains at expert level, however, would fail to count to 5.
This is not a mere "LLMs, like humans, also need to be taught this" but points to a fundamental mismatch about how humans and LLMs learn.
(And even if they merely needed to be taught, why would their huge corpus fail to cover that "teaching", but cover way more advanced topics in math solving and other domains?)
Many animals can count. Counting is recognizing that the box with 3 apples is preferable to the one with 2 apples.
Yes, 2 year olds might struggle with the externalization of numeric identities but if you have 1 M&M in one hand and 5 in the other and ask which they want, they’ll take the 5.
LLMs have the language part down, but fundamentally can’t count.
However many animals can distinguish independently small numbers, like 3 or 5, and recognize them whenever they see them.
So in this respect, there is little difference between humans and many animals. Humans learn to count to arbitrarily big numbers, but they can still easily recognize only small numbers.
This is called subitizing. It's distinct from counting. We can see the difference in humans with Simultanagnosia, who are unable to count beyond the subitizing range. Subitizing is categorizing the scale of a small gestalt group.
The only thing I've ever seen where an animal appeared to demonstrate counting (up to 3) without training was in rhesus monkeys (maybe also chimpanzees?), but even that experiment could be explained through temporal gestalt. (It's the only reason I know of for them to not have been able to go higher than 3 in that experiment in the context of many other things that they can do.)
I completely agree with you. LLMs are regurgitation machines with less intellect than a toddler, you nailed it.
AI is here!
It would be interesting to actively track how far along each successive model gets...
> Yes — ((((()))))) is balanced.
> It has 6 opening ( and 6 closing ), and they’re properly nested.
Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.
> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).
> A balanced version would be: ((((()))))
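For reference, the check itself is a few lines of Python. The program the model wrote isn't shown here, so this is just a sketch of the usual depth-counter approach, with function names of my own choosing:

```python
def is_balanced(s: str) -> bool:
    """Return True if every ')' closes an earlier '(' and none stay open."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing paren with nothing left to close.
                return False
    return depth == 0


def paren_counts(s: str) -> tuple[int, int]:
    """Return (number of '(', number of ')') in s."""
    return s.count("("), s.count(")")


if __name__ == "__main__":
    for candidate in ["((((())))))", "((((()))))"]:
        print(candidate, paren_counts(candidate), is_balanced(candidate))
```

Equal counts alone are not enough (`")("` has one of each and is not balanced), which is why the running depth is tracked instead of just comparing totals.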
Testing a couple of different models without a harness such that no tool calls are possible would be interesting
The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no. And then realized I asked for "sound" not letters. It then subsequently got the rest of the "sounds-like" tests I did.
Clearly, my ChatGPT is just better than yours.
When LLMs can't count r's: see? LLMs can't think. Hoax!
When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!
You just can't reason with the anti-LLM group.
Followed by lots of "works perfectly for me, why are people even talking about this?"
I can't say what exactly they're doing behind the scenes but it's a consistent pattern among the big SOTA model providers. With obvious incentive to "fix" the problem so users will then organically "debunk" the meme as they try it themselves and share their experiences.
>You just can't reason with the anti-LLM group.
On the contrary, the reasoning is simple and consistent:
That LLMs can't count r's shows that LLMs don't actually think the way we understand thought (since nobody with the kind of high skills they have in other areas would fail that). And because of that, there are (likely) patches for commonly reported cases, since it's a race to IPO and benchmark-maxxing is very much conceivable.
Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
So yes.
And the valuations. Trillion dollar grifter industry.
{
  "model": "gpt-5.2-2025-12-11",
  "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
  "input": "((((())))))",
  "temperature": 0
}
> Lower reasoning effort
> The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
> Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
> With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
———————-
So in the paper, the model very likely used no reasoning tokens. (It only uses them if you specifically ask for it in the prompt.) What is the point of such a paper? We already know that reasoning tokens are necessary.
Edit: I actually ran the prompt and this was the response
{
  "model": "gpt-5.2-2025-12-11",
  "output_text": "Yes",
  "reasoning": {
    "effort": "none",
    "summary": null
  },
  "usage": {
    "input_tokens": 26,
    "output_tokens": 5,
    "total_tokens": 31,
    "output_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
So reasoning_tokens used were zero. So this whole paper is kinda useless and misleading. Did this get peer reviewed or something?
> are the following parenthesis balanced? ((())))
> No, the parentheses are not balanced.
> Here is the breakdown:
Opening parentheses (: 3
Closing parentheses ): 4
... following up with:
> what about these? ((((())))
> Yes, the parentheses are balanced.
> Here is the breakdown:
Opening parentheses (: 5
Closing parentheses ): 5
... and uses ~5,000 tokens to get the wrong answer.
I tried this with Gemini: (i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re)
and it tripped.
The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
It's only surprising to people who still think they're going to build God out of LLMs.
There's plenty of rage to go around on literally every divisive topic, and it's not the place we want discussions to come from here.
"Eschew flamebait. Avoid generic tangents."
"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."
Their viewpoint on this technology has become part of the identity for some unfortunately and any position that isn't either "AGI imminent" or "This is useless" can cause some major emotions.
Thing is, this finding being the case (along with all other LLM limits) does not mean that these models aren't impactful and shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than a narrow extreme.
Also, mental health impact, job losses for white collar workers, privacy issues, concerns of rights holders on training data collection, all the current day impacts of LLMs are easily brushed aside by someone believing that LLMs are near the "everyone dies" stage, which just so happens to be helpful if one were to run a lab. Same if you believe these are useless and will never get better, any discussion about real-life impacts is seen as trying to slowly get them to accept LLMs as a reality, when to them, they never were and never will be.
He's retired so I guess there's no harm in letting him try
In this case your intuition is completely valid and yet another case of misleading.
FIFY, it's not endemic to here or LLMs. Point out Mac issues to an Apple fan, problems with a vehicle to <insert car/brand/model> fan, that their favorite band sucks, that their voted representative is a PoS.
Most people aren't completely objective about everything and thus have some non-objective emotional attachment to things they like. A subset of those people perceive criticism as a personal attack, are compelled to defend their position, or are otherwise unable to accept/internalize that criticism so they respond with anger or rage.
This is stupid enough even in the realm of sports fandom, but how does it make any sense in science? Imagine if any time we studied or enumerated the cognitive biases and logical fallacies in human thinking the gut response of these same people was an immediate "yeah, well dogs are even stupider!" No shit, but it's a non sequitur. Are we forever banned from studying the capabilities and limitations of software systems because humans also have limitations?
Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...
Maybe this is a factor?