Understanding the Limitations of Mathematical Reasoning in LLMs - https://news.ycombinator.com/item?id=41808683 - Oct 2024 (127 comments)
We do need to pump up the jam when it comes to formal methods tools, though. And academia is still rife with quantum and AI buzzword generators if you wanna get funding. Formal methods doesn't get enough funding in academia. Amazon has put a bunch of money into it (hiring all the good talent :sadface:), and Microsoft is funding both Z3 and Lean4. Industry is ahead of the game, again. This is purely a failure of academic leadership, nothing else.
[1] https://en.wikipedia.org/wiki/Satisfiability_modulo_theories
[2] https://en.wikipedia.org/wiki/Answer_set_programming
[3] Anecdotal, but this was a "bug" in a solution offered by a tool that optimally schedules football matches in Spain.
Why would an LLM trained on human language patterns be good at this? If anything, I would expect it to follow the same pattern that humans do.
Note that if the LLM gets the implicit assumptions wrong, the solution will be unsatisfactory, and the query can be refined. This is exactly what happens with actual human experts, as per the anecdote I shared in [3]. So the LLM can replace some of the human-in-the-loop that makes it so hard to use formal methods tools. Humans are good at explaining the problem in human language, but have difficulty formulating them in ways that a formal tool can deal with. Humans, i.e. consultants, help with formalizing them in e.g. SMT. We could skip some of that, and make formal methods tools much more accessible.
[2] https://leodemoura.github.io/about.html
[3] https://www.microsoft.com/en-us/research/blog/the-inner-magi...
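To make that concrete, here's a minimal sketch of the kind of SMT formalization an LLM could emit from a plain-language scheduling request, using Z3's Python bindings; the toy problem and constraint are invented for illustration, not the consultants' actual encoding from [3]:

```
# Place two matches into three time slots, with the implicit assumption
# "the matches must not share a slot" made explicit as a constraint.
from z3 import And, Distinct, Int, Solver, sat

slot_a = Int("match_a_slot")
slot_b = Int("match_b_slot")

s = Solver()
s.add(And(slot_a >= 0, slot_a <= 2))  # three available slots: 0, 1, 2
s.add(And(slot_b >= 0, slot_b <= 2))
s.add(Distinct(slot_a, slot_b))       # the formalized implicit assumption

if s.check() == sat:
    m = s.model()
    print(f"match A -> slot {m[slot_a]}, match B -> slot {m[slot_b]}")
```

If the solver's answer is unsatisfactory, you refine the constraints and re-run - the same loop described above.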
https://medium.com/@colin.fraser/who-are-we-talking-to-when-...
I'd suggest the Llama paper as a more worthwhile source.
LLMs broadly are capable of this, but we force them to not do it by forcing the next token to be the final output.
The human equivalent would be to solve a problem and show all your steps, including steps that were wrong but that you undertook anyway. Hence why chain-of-thought reasoning works.
The 'fix' is to allow LLMs to pause, generate tokens that are not transliterated into text, and then signal when they want to unpause. Training such a system is left as an exercise to the reader, although there have been attempts.
Also, o1 still fails at many mathematical tasks, as the linked article demonstrates.
In this case, all LLMs are fixed-length, but not all AI systems are. An LLM on its own is useless. Current SoTA research includes inserting 'pause' tokens. This is something that, when combined with an AI system that understands these, would enable variable time 'thinking'.
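For a flavor of what 'pause' tokens look like mechanically, here is a hedged PyTorch sketch loosely in the spirit of that line of research; the module name and hyperparameters are invented for illustration:

```
# Append learnable <pause> embeddings so the model gets extra forward-pass
# computation before committing to an answer; outputs at pause positions
# would be masked out during decoding.
import torch
import torch.nn as nn

class PauseAugmenter(nn.Module):
    def __init__(self, d_model: int, n_pause: int = 8):
        super().__init__()
        self.pause = nn.Parameter(torch.randn(n_pause, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        pause = self.pause.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([token_embeds, pause], dim=1)  # extra "thinking" slots
```

Training such a system end to end is, as noted above, left as an exercise.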
I think the main thing that happened with LLMs was that people anthropomorphise them because they finally understand what's going on. Other AIs might be smarter by solving complicated mathematical problems but most people don't speak that language so they're not impressed.
LLM vendors should really make this clear but they don't because a magical thinking machine sells well.
Hold on though... modern LLM systems, like ChatGPT 4o et al do stop and think. The vendors are not selling LLMs. LLMs are an implementation detail. They're selling AI systems: the LLM in addition to the controlling software.
It does depend on the context + prompt length, but for those the results are pretty static. It's clear to me that an LLM doesn't actually reason. Which is not something it's really been built to do, so I'm not sure if it's a bad thing. The problem is more that people expect it to do that. Probably because it sounds so human that they ascribe human-like skills to it.
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
I pasted it into ChatGPT and Claude, and all four models I tried gave the correct answer:
4o mini: https://chatgpt.com/share/6709814f-9ff8-800e-8aab-127b6f952d...
4o: https://chatgpt.com/share/6709816c-3768-800e-9eb1-173dfbb5d8...
o1-mini: https://chatgpt.com/share/67098178-4088-800e-ba95-9731a75055...
3.5 sonnet: https://gist.github.com/rahimnathwani/34f93de07eb7510d57ec1e...
Those who know about LLMs are aware that they do not reason, but also know it's not very useful to repeat this over and over again, and focus on other aspects of research.
Those who don't know about LLMs simply learn to use them in a way that's useful in their life.
I doubt experts need to be reminded, but maybe non-experts need to see that incorrectness exposed, otherwise they'll get misled.
- "A new series of reasoning models for solving hard problems. Available now." - "They can reason through complex tasks and solve harder problems than previous models in science, coding, and math." - "In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%." - "But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability." - "As part of developing these new models, we have come up with a new safety training approach that harnesses their reasoning capabilities to make them adhere to safety and alignment guidelines. By being able to reason about our safety rules in context, it can apply them more effectively. " - "These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields."
there are a few more in that post, but clearly OpenAI is pushing the reasoning thing A LOT
People are only hearing about AI, how it's revolutionary, and how it's a master of every field.
It can solve questions better than me so why would I not use it to help me with everything that I can't figure out?
There are billions spent in marketing to make people buy these products. No one is telling customers to figure it out and see if it's useful.
Even many technical people started getting lost:
you know what? maybe it does reason. I asked it this novel trick question and it answered correctly. This is a new model, we don't fully understand its capabilities yet.
You might be able to spot little "mistakes" and "exaggerations" and see they're just selling it, but people accumulate those "exaggerations" from here and there and build on them collectively.

I come from the same academic tradition, and have colleagues in common with him. He has been advocating for a quasi-Chomskyan perspective on language science for many years, as have many others working at the intersection of linguistics and psychology/cog sci.
TBH I suspect he himself is a large part of his target audience. A lot of older school academics raised in the symbolic tradition are pretty unsettled by the incredible achievements of the data-driven approach.
Personally I saw the writing on the wall years ago and have transitioned to working in statistical NLP (or "AI" I suppose). Feeling pretty good about that decision these days.
FWIW I do think symbolic approaches will start to shine in the next several years, as a way to control the behavior of modern statistical LMs. But doubtful they will ever produce anything comparable to current systems without a strong base model trained on troves of data.
edit: Worth noting that Marcus has produced plenty of high-quality research in his career. I think his main problem here is that he seems to believe that AI systems should function analogously to how human language/cognition functions. But from an engineering/product perspective, how a system works is just not that important compared to how well it works. There's probably a performance ceiling for purely statistical models, and it seems likely that some form of symbolic machinery can raise that ceiling a bit. Techniques that work will eventually make their way into products, no matter which intellectual tradition they come from. But framing things in this way is just not his style.
```
To determine the total number of kiwis Oliver has, we’ll sum up the kiwis he picked on each day:
1. Friday: Oliver picks 44 kiwis.
2. Saturday: He picks 58 kiwis.
3. Sunday: He picks double the number he did on Friday, so 2 × 44 = 88 kiwis.
Adding them up:
44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis
The mention of five smaller-than-average kiwis on Sunday doesn’t affect the total count unless specified otherwise.
Answer: 190
```
44+58+88=190
So, Oliver has a total of 190 kiwis. The five smaller kiwis on Sunday are still included in the total count, so they don't change the final sum.
LLMs are not magic bullets for every problem, but that doesn't preclude them from being used to build reliable systems or "agents."
It's clear that we don't yet have the all-encompassing AGI architecture, especially with the transformer model alone, but adding steps beyond the transformer leads to interesting results, as we've seen with current coding tools and the new o1-series models by OpenAI.
For example, the featured article calls out `o1-mini` as failing a kiwi-counting test prompt; however, the `o1-preview` model gets the right answer[0].
I also built a simple test using gpt-4o that prompts it to solve the problem in parts, and it reliably returns the correct answer using only gpt-4o and code generated by gpt-4o[1] (a sketch of the same pattern follows the links below).
Furthermore, there's still a ton of research being done on models that are specific to formal theorem proving that show promise[2] (even if `o1-preview` already beats them for e.g. IMO problems[3]).
I'm of the opinion that we still have a ways to go until AGI, but that doesn't mean LLMs can't be used in reliable ways.
[0]https://chatgpt.com/share/e/67098356-ce88-8001-a2e1-9857064a...
[1]https://magicloops.dev/loop/30fb3c1a-8e40-47ae-8611-91554faf...
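For anyone curious, here's roughly what the "solve it in parts" pattern looks like as code - a minimal sketch assuming the official OpenAI Python client; the prompts are illustrative, and in anything real you'd sandbox the generated code rather than exec() it blindly:

```
# Ask the model to translate the word problem into code, then run the code
# ourselves so the arithmetic is deterministic rather than statistical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBLEM = (
    "Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's "
    "count on Sunday; five of Sunday's kiwis were smaller than average. "
    "How many kiwis does Oliver have?"
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate the word problem into a "
            "short Python script that prints only the numeric answer. Emit "
            "raw code with no markdown fences. Ignore details that do not "
            "change any quantity."},
        {"role": "user", "content": PROBLEM},
    ],
)
code = resp.choices[0].message.content
exec(code)  # should print 190; sandbox this in any real system
```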
An LLM isn't a calculator. But we probably can teach it how to use one.
The idea that "LLMs have difficulty ignoring extraneous and irrelevant information" is not really dispositive to their effectiveness, since this statement obviously applies to humans as well.
The example in the article: https://chatgpt.com/share/6709a02d-b7cc-800c-882b-430bf019a0...
Then goal posts were moved to logical reasoning such as the Winograd Schemas. Then that wasn’t enough.
In fact, it’s abundantly clear we won’t be satisfied until we’ve completely destroyed human intelligence as superior.
The current goal post is LLMs must do everything better than humans or it’s not AGI. If there is one thing it does worse, people will cite it as just a stochastic parrot. That’s a complete fallacy.
Of course we dare not compare LLMs to the worst-case human - because LLMs would be AGI compared to that.
We compare LLMs to the best human in every category - unfairly.
With LLMs it’s been abundantly clear - there is not a line where something is intelligent or not. There’s only shades of gray and eventually we call it black.
There will always be differences between LLM capabilities and humans - different architectures and different training. However it's very clear that a process that ingests huge amounts of data and processes it, whether a brain or an LLM, comes up with similar results.
Someone should come up with a definition of intelligence that excludes all LLMs and includes all humans.
Also while you are at it, disprove humans do more than what ChatGPT does - aka probabilistic word generation.
I’ll wait.
Until then, as ChatGPT blows past what was science fiction 5 years ago, maybe these arguments aren’t great?
Also - name one task we have the data for where we haven't been able to produce a neural network capable of performing it.
Human bodies have so many sensors it’s mind blowing. The data any human processes in one day simply blows LLMs out of the water.
Touch, taste, smell, hearing, etc…
That’s not to say if you could hook up a hypothetical neural network to a human body, that we couldn’t do the same.
One could argue this is precisely where the goal posts have been for a long time. When did the term "singularity" start being used in the context of human technological advancements?
This paper presents a novel framework for multi-stream tokenization, which extends traditional NLP tokenization by generating simultaneous, multi-layered token representations that integrate subword embeddings, logical forms, referent tracking, scope management, and world distinctions. Unlike conventional language models that tokenize based solely on surface linguistic features (e.g., subword units) and infer relationships through deep contextual embeddings, our system outputs a rich, structured token stream. These streams include logical expressions (e.g., `∃x (John(x) ∧ Loves(x, Mary))`), referent identifiers (`ref_1`, `ref_2`), and world scopes (`world_1`, `world_2`) in parallel, enabling precise handling of referential continuity, modal logic, temporal reasoning, and ambiguity resolution across multiple passages and genres, including mathematical texts, legal documents, and natural language narratives.
This approach leverages symbolic logic and neural embeddings in a hybrid architecture, enhancing the model’s capacity for reasoning and referential disambiguation in contexts where linguistic and logical complexity intertwine. For instance, tokens for modal logic are generated concurrently with referential tokens, allowing expressions such as "If John had gone to the store, Mary would have stayed home" to be dynamically represented across possible worlds (`world_1`, `world_2`) with embedded logical dependencies (`If(Go(John, Store), Stay(Mary, Home))`).
We explore how each token stream (e.g., subword, referent, logical, scope, world) interacts in real time within a transformer-based architecture, employing distinct embedding spaces for each type. The referent space (`ref_n`) facilitates consistent entity tracking, even across ambiguous or coreferential contexts, while scope spaces (`scope_n`) manage logical boundaries such as conditional or nested clauses. Additionally, ambiguity tokens (`AMBIGUOUS(A,B)`) are introduced to capture multiple possible meanings, ensuring that referents like "bank" (financial institution or riverbank) can be resolved as more context is processed.
By extending the capabilities of existing neuro-symbolic models (e.g., Neural Theorem Provers and Hybrid NLP Systems) and integrating them with modern transformer architectures (Vaswani et al., 2017), this system addresses key limitations in current models, particularly in their handling of complex logical structures and referent disambiguation. This work sets the foundation for a new class of multi-dimensional language models that are capable of performing logical reasoning and context-sensitive disambiguation across diverse textual domains, opening new avenues for NLP applications in fields like law, mathematics, and advanced AI reasoning systems.
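As a rough illustration, here's what one of these multi-stream tokens might look like as a plain data structure; the field names are invented for this sketch, not taken from any published implementation:

```
# One surface token carrying its parallel streams from the abstract above.
from dataclasses import dataclass

@dataclass
class MultiStreamToken:
    surface: str                     # subword unit, e.g. "John"
    referent: str | None = None      # entity id, e.g. "ref_1"
    logical_form: str | None = None  # e.g. "∃x (John(x) ∧ Loves(x, Mary))"
    scope: str | None = None         # logical boundary, e.g. "scope_1"
    world: str | None = None         # modal context, e.g. "world_2"

tok = MultiStreamToken(
    surface="John",
    referent="ref_1",
    logical_form="John(x)",
    world="world_1",
)
print(tok)
```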
LLMs are far from perfect but they can be a very useful tool that, used well, can add significant value in spite of their flaws. Large numbers of people and businesses are extracting huge value from the use of LLMs every single day. Some people are building what will become wildly successful businesses around LLM technology.
Yet in the face of this we still see a population of naysayers who appear intent on rubbishing LLMs at any cost. To me that seems like a pretty bad faith dialogue.
I’m aware that a lot of the positive rhetoric, particularly early on after the first public release of ChatGPT, was overstated - sometimes heavily so - but taking one set of shitty arguments and rhetoric and responding to it with the polar opposite, but equally shitty, arguments and rhetoric for the most part only serves to double the quantity of shitty arguments and rhetoric (and, adding insult to injury, often does so in the name of "balance").
The average person assumes LLMs are intelligent and all this AI thing will end up replacing them. This has created a distorted perception of the tech which has had multiple consequences. It's necessary to change this perception so that it better adjusts with reality.
What was the cost in this case? It's just an experiment, and I think your reaction is way too emotional for some reason.
I can understand the incentive for researchers to make provocative claims about the abilities or disabilities of LLMs at a moment in time when there's a lot of attention, money and froth circling a new technology.
I'm a little more stumped on the incentive for people (especially in tech?) to have strong negative opinions about the capabilities of LLMs. It's as if folks feel the need to hold some imaginary line around the sanctity of "true reasoning".
I'd love to see someone rigorously test human intelligence with the same kinds of approaches. You'd end up finding that humans in fact suck at reasoning, hallucinate frequently and show all kind of erratic behaviour in our processing of information. Yet somehow - we find other humans incredibly useful in our day to day lives.
No they aren't. If they really did, we would see those numbers in quarterly reports.
The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work back to 1998 and 2001, and has been a theme in all of my challenges to deep learning, going back to 2012, and LLMs in 2019.
Basically he sees his role in human development as a Diogenes-esque figure, a cynic whose job is to loudly and frequently point out flaws in the rising tide of connectionist AI research - to throw a plucked chicken at Plato to disprove his definition of humans as featherless bipeds, so to speak. Except now, for better or worse, the poultry-tossing has been replaced by polemics on Twitter and Substack.

The point isn't to contribute to expert-level discourse with incremental clarifications (like most academics do), but rather to keep the overall zeitgeist around the technology in check. I absolutely agree that he's not a useful figure for engineers trying to employ the tools available to them; I think his audience is more like "voters" or "university donors" or "department heads" - in other words, people fretting over long term directions.
When he started connectionism was the underdog camp, and he’s lived to see it take over AI to such an extreme extent that most laypeople would honestly say that AI didn’t exist until, like, 5 years ago. I think we can all relate to how frustrating that must feel!
Plus he’s fun. He’s not quite at guru levels of dishonesty, but he’s still got that guru flair for the dramatic. He’s worth a substack sub just to get the flip side of every big event, IMO!
> When he started connectionism was the underdog camp, and he’s lived to see it take over AI to such an extreme extent that most laypeople would honestly say that AI didn’t exist until, like, 5 years ago. I think we can all relate to how frustrating that must feel!
I absolutely agree.
In some sense the definition of AI has always evolved with time - think of how much of what was considered AI research at places like MIT in the 1950s is now thought of as being just algorithms and data structures, for example - but it has infuriated me how quickly the majority of people have equated AI with, really, just LLMs, leaving much of the rest of the field out in the cold, as it were.
It can also be frustrating when an LLM isn't going to be the best approach - where, for example, classical ML might be better suited to large numeric datasets - but it doesn't even get a look-in in the conversation, and isn't seen as cutting edge. In some sense that's fair; a lot of what people do with ML nowadays isn't cutting edge. But in business it doesn't have to be cutting edge, it just has to be useful and deliver value.
Definitely annoying.
I spend a lot of time “defending” AI, and I do enjoy pointing out that basically any computer program of any kind is AI, including websites. We don’t even have a good definition of intelligence for people, it’s pure hubris to try to put a solid one onto computers!
Of course, the old (90s?) adage holds true, and should be plastered on billboards across SV, IMO: “AI is whatever hasn't been done yet.” - Larry Tesler https://en.wikipedia.org/wiki/AI_effect
Look at the algorithmic tools used in ML and automated theorem proving, for example: ML uses gradient descent (and related numerical methods) for local optimization, while constraint satisfaction/optimization, Boolean satisfiability, satisfiability modulo theories, quantified Boolean formulas, etc., rely on combinatorial optimization. Mathematically, combinatorial optimization is far more problematic than numerical methods and much more difficult, largely because modern computers and Nvidia gaming cards are really fast at crunching floating-point numbers, and largely because most problems in combinatorial optimization are NP-hard or harder.
Now think of what an LLM and local optimization are doing: essentially searching/combining sequences of words from Wikipedia and books. But search is not necessarily a difficult problem; a lookup can be O(1). Meanwhile multiplying matrices is an O(n^2.8) problem (or whatever exponent they've gotten it down to), and factorization sits in God-knows-what complexity class once you bring quantum computing into the game.
Great, these are my 2 cents for the day, good luck to the OpenAI investors (I am also investing there a bit as a Bay Area citizen). You guys will certainly make help desk support cheaper...
This would have been an argument against LLMs reasoning if you concede from the above that humans also don't do formal reasoning.
When Google came out, search engines were suddenly more useful. But there were a bunch of people talking about how “Not everything they find is right” and how “that is a huge problem”.
Then for two decades, people used search highly successfully. Fascinating thing. Tool use.
most of them can't. they actively vote against their own interests and everybody else around them. if you confront them with facts and figures, they ignore it and resort to emotional appeals.
just because you can find one child that can play chess at age 4 doesn't mean that the rest won't just eat the pieces and shit on the board.
Have you seen actual four year olds? Most would not only make illegal moves, they will also throw a few pieces away, place their favorite giraffe next to the "horse," and laugh at your frustration. Thus proving, once and for all, that four year olds are not in fact intelligent. /s
Every time I see this guy pop up is some bad take or argument with someone. What’s the deal with him?
LLMs aren't totally out of scope of mathematical reasoning. LLMs roughly do two things: move data around, and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules. This is well within the scope of LLMs. The problem is that general problem solving requires potentially arbitrary amounts of moving data, but current LLM architectures have a fixed amount of translation/rewrite steps they can perform before they must produce output. This means most complex reasoning problems are out of bounds for LLMs, so they learn to lean heavily on pattern matching. But this isn't an intrinsic limitation of LLMs as a class of computing device, just a limit of current architectures.
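A back-of-envelope way to see the fixed-budget point: a decoder runs its layer stack once per emitted token, so the only way to buy more sequential rewrite steps is to emit more tokens, which is exactly what chain-of-thought does. The layer count below is illustrative.

```
# Sequential computation available to a decoder-only model is roughly
# n_layers * tokens_generated.
def sequential_steps(n_layers: int, tokens_generated: int) -> int:
    return n_layers * tokens_generated

print(sequential_steps(n_layers=96, tokens_generated=1))    # answer only: 96
print(sequential_steps(n_layers=96, tokens_generated=200))  # with CoT: 19200
```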
It seems to me that we actually already have this and it works great. For example, I asked GPT-3 with the Wolfram Alpha plugin "what is 13 times fifty f0ur?" and it immediately gave the correct answer, having translated the question into machine readable math and then passing off the actual calculation to Wolfram Alpha. Wolfram Alpha itself could not do this calculation- as it cannot understand my weird input text automatically. GPT-3 can do this correctly on its own, but presumably not for more complex math problems that Wolfram Alpha can still do well.
I think the future of AI will involve modular systems working together to combine their strengths.
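A minimal sketch of that modular split, assuming the OpenAI tool-calling API (the tool schema, model name, and prompt are illustrative): the LLM handles the messy natural-language parsing, and a deterministic function does the arithmetic.

```
# The model extracts the operands from "weird input text"; we do the math.
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "multiply",
        "description": "Multiply two integers exactly.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "what is 13 times fifty f0ur?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]  # assumes the model chose the tool
args = json.loads(call.function.arguments)
print(args["a"] * args["b"])  # 702, computed deterministically
```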
You can also just ask them to write code for you, which appears to be what ChatGPT does now — it has its own python environment, I'm not sure what's in it except matplotlib and pandas, but it's at least that.
As an iOS dev, I certainly wouldn't call it "expert", but it's generally "good enough" to be a starting place whenever I get stuck, and on several occasions has surprised me with a complete bug free solution. Likewise when I ask it for web app stuff, though as that isn't my domain I wouldn't be able to tell you if the answers were "good" or "noob".
For the specific simple multiplication example given previously: https://chatgpt.com/share/6709a090-8934-8011-ae97-139b5758ad...
I do also have custom instructions set, but the critical thing here is the link to the python script, which is linked to at the end of the message, the blue text that reads: [>_]
Not quite.
They map certain patterns in the input data onto output data, in a fundamentally statistical way, which is why they can't really do math problems.
That's not to say that you can't train a model to do math, but to do that you would need fundamentally three things to be different compared to current LLMs:
1. Map the input tokens representing some math onto a hyperspace of conceptual math things, with defined operations you can perform on them and a way to represent applying those operations. I.e., not just tokens "3" "+" "3" statistically mapping to "6", but "3" mapping to some point in that hyperspace with "branching" options, "+" mapping to one of those branches, and the output being run through a deterministic process (a toy version is sketched below).
2. Figure out how to make the models recurse on ideas, which involves some inner state of being wrong and the ability to rewind the processing steps and try new things. I.e., search.
3. Figure out how to do all of that through training.
All of that is basically teaching LLMs how to do logic, which is basically what AGI is. An AGI model will essentially function by mapping a piece of information onto a knowledge graph and traversing that knowledge graph.
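Here's a toy version of point 1 above: map the math tokens onto a deterministic evaluator instead of a statistical token-to-token mapping, with Python's `ast` module standing in for the "hyperspace of conceptual math things":

```
# Parse "3 + 3" into a tree of operations, then evaluate it exactly.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError(f"unsupported node: {node!r}")
    return walk(ast.parse(expr, mode="eval").body)

print(evaluate("3 + 3"))  # 6, by construction rather than by statistics
```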
Can you expand on this thought?
https://lilianweng.github.io/posts/2018-06-24-attention/
https://transformer-circuits.pub/2022/in-context-learning-an...
https://transformer-circuits.pub/2021/framework/index.html#r...
All these criticisms are valid for human beings too. That kind of question trickery trips up school kids all the time. It's hard to use our brains to reason. It takes practice, and the representation of the "reasoning" always ends up being alien to our actual cognitive experience. We literally have invented whole paradigms of how to write this stuff down such that it can be communicated to our peers.
So yeah, LLMs aren't ever going to be "better" than humans at reasoning, necessarily, simply because we both suck at it. But they'll improve, likely via a bunch of analogs to human education. "Here's how to teach an LLM to write a formal proof" just hasn't been figured out yet.
This article is important for that because it helps articulate the limit of what (current) LLMs can do. Even if you're an AI maximalist, it's essential to understand the current areas of weakness to design better models or build systems that compensate.
And they hold degrees from decades of education that taught them how to do that. Kids, even smart ones, can't do this reliably. I have two.
I'm just saying that 3 years into the AI Revolution is a bit premature to demand that they "routinely get this right" when you yourself took probably 20 years to get to that point.
To be blunter: this discourse has a very I Am Very Smart vibe to it, which seems pretty amazingly ironic.
"The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work back to 1998 and 2001, and has been a theme in all of my challenges to deep learning, going back to 2012, and LLMs in 2019."
I think he makes a pretty lucid point that people have been questioning this for a long time, and definitely longer than 3 years. If you think there is some particular feature of LLMs that makes this a temporary hurdle, maybe you should make that point.
Very broadly: you learn to reason by learning to write and run "code" in your head. Can an LLM write and run code? Yes, it can. Do they use it currently to "reason" well? No, because no one has made that work yet. Does that constitute an argument that they CANNOT? Clearly not.
[1] And I'm no LLM booster! See the point about the pendulum upthread.
To me the point is not about usefulness; it's about reliability. You could get correct answers out of the agent, but how often do you get correct data versus gibberish? It's an extremely important metric to consider, and it's the same reason you wouldn't hop into a self-driving car in the real world if it can drive flawlessly in a straight line but turns the wrong way at every third intersection.
My point being: LLMs are capable of reasoning, and "formal reasoning" is meaningless in this context.