Understanding the Limitations of Mathematical Reasoning in LLMs - https://news.ycombinator.com/item?id=41808683 - Oct 2024 (127 comments)
We do need to pump up the jam when it comes to formal methods tools, though. And academia is still rife with quantum and AI buzzword generators if you wanna get funding. Formal methods doesn't get enough funding in academia. Amazon has put a bunch of money into it (hiring all the good talent :sadface:), and Microsoft is funding both Z3 and Lean4. Industry is ahead of the game, again. This is purely a failure of academic leadership, nothing else.
[1] https://en.wikipedia.org/wiki/Satisfiability_modulo_theories
[2] https://en.wikipedia.org/wiki/Answer_set_programming
[3] Anecdotal, but this was a "bug" in a solution offered by a tool that optimally schedules football matches in Spain.
Why would an LLM trained on human language patterns be good at this? If anything, I would expect it to follow the same pattern that humans do.
Note that if the LLM gets the implicit assumptions wrong, the solution will be unsatisfactory, and the query can be refined. This is exactly what happens with actual human experts, as per the anecdote I shared in [3]. So the LLM can replace some of the human-in-the-loop that makes it so hard to use formal methods tools. Humans are good at explaining the problem in human language, but have difficulty formulating them in ways that a formal tool can deal with. Humans, i.e. consultants, help with formalizing them in e.g. SMT. We could skip some of that, and make formal methods tools much more accessible.
[2] https://leodemoura.github.io/about.html
[3] https://www.microsoft.com/en-us/research/blog/the-inner-magi...
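To make that concrete, here's a minimal sketch of the kind of SMT formalization an LLM could emit from a plain-language scheduling request, using Z3's Python bindings; the toy problem and constraint are invented for illustration, not the consultants' actual encoding from [3]:

```
# Place two matches into three time slots, with the implicit assumption
# "the matches must not share a slot" made explicit as a constraint.
from z3 import And, Distinct, Int, Solver, sat

slot_a = Int("match_a_slot")
slot_b = Int("match_b_slot")

s = Solver()
s.add(And(slot_a >= 0, slot_a <= 2))  # three available slots: 0, 1, 2
s.add(And(slot_b >= 0, slot_b <= 2))
s.add(Distinct(slot_a, slot_b))       # the formalized implicit assumption

if s.check() == sat:
    m = s.model()
    print(f"match A -> slot {m[slot_a]}, match B -> slot {m[slot_b]}")
```

If the solver's answer is unsatisfactory, you refine the constraints and re-run - the same loop described above.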
https://medium.com/@colin.fraser/who-are-we-talking-to-when-...
I'd suggest the Llama paper as a more worthwhile source.
LLMs broadly are capable of this, but we force them to not do it by forcing the next token to be the final output.
The human equivalent would be to solve a problem and show all your steps, including steps that were wrong but that you undertook anyway. Hence why chain-of-thought reasoning works.
The 'fix' is to allow LLMs to pause, generate tokens that are not transliterated into text, and then signal when they want to unpause. Training such a system is left as an exercise to the reader, although there have been attempts.
Also, o1 still fails at many mathematical tasks, as the linked article demonstrates.
In this case, all LLMs are fixed-length, but not all AI systems are. An LLM on its own is useless. Current SoTA research includes inserting 'pause' tokens. This is something that, when combined with an AI system that understands these, would enable variable time 'thinking'.
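For a flavor of what 'pause' tokens look like mechanically, here is a hedged PyTorch sketch loosely in the spirit of that line of research; the module name and hyperparameters are invented for illustration:

```
# Append learnable <pause> embeddings so the model gets extra forward-pass
# computation before committing to an answer; outputs at pause positions
# would be masked out during decoding.
import torch
import torch.nn as nn

class PauseAugmenter(nn.Module):
    def __init__(self, d_model: int, n_pause: int = 8):
        super().__init__()
        self.pause = nn.Parameter(torch.randn(n_pause, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model)
        batch = token_embeds.size(0)
        pause = self.pause.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([token_embeds, pause], dim=1)  # extra "thinking" slots
```

Training such a system end to end is, as noted above, left as an exercise.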
I think the main thing that happened with LLMs was that people anthropomorphise them because they finally understand what's going on. Other AIs might be smarter by solving complicated mathematical problems but most people don't speak that language so they're not impressed.
LLM vendors should really make this clear but they don't because a magical thinking machine sells well.
Hold on though... modern LLM systems, like ChatGPT 4o et al do stop and think. The vendors are not selling LLMs. LLMs are an implementation detail. They're selling AI systems: the LLM in addition to the controlling software.
It does depend on the context + prompt length, but for those the results are pretty static. It's clear to me that an LLM doesn't actually reason. Which is not something it's really been built to do, so I'm not sure if it's a bad thing. The problem is more that people expect it to do that. Probably because it sounds so human that they ascribe human-like skills to it.
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
I pasted it into ChatGPT and Claude, and all four models I tried gave the correct answer:
4o mini: https://chatgpt.com/share/6709814f-9ff8-800e-8aab-127b6f952d...
4o: https://chatgpt.com/share/6709816c-3768-800e-9eb1-173dfbb5d8...
o1-mini: https://chatgpt.com/share/67098178-4088-800e-ba95-9731a75055...
3.5 sonnet: https://gist.github.com/rahimnathwani/34f93de07eb7510d57ec1e...
Those who know about LLMs are aware that they do not reason, but also know it's not very useful to repeat this over and over again, and focus on other aspects of research.
Those who don't know about LLMs simply learn to use them in a way that's useful in their life.
I doubt experts need to be reminded, but maybe non-experts need to see that incorrectness exposed, otherwise they'll get misled.
- "A new series of reasoning models for solving hard problems. Available now." - "They can reason through complex tasks and solve harder problems than previous models in science, coding, and math." - "In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%." - "But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability." - "As part of developing these new models, we have come up with a new safety training approach that harnesses their reasoning capabilities to make them adhere to safety and alignment guidelines. By being able to reason about our safety rules in context, it can apply them more effectively. " - "These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields."
there are a few more in that post, but clearly OpenAI is pushing the reasoning thing A LOT
People are only hearing about AI, how it's revolutionary, and how it's a master of every field.
It can solve questions better than me so why would I not use it to help me with everything that I can't figure out?
There are billions spent in marketing to make people buy these products. No one is telling customers to figure it out and see if it's useful.
Even many technical people started getting lost:
you know what? maybe it does reason. I asked it this novel trick question and it answered correctly. This is a new model, we don't fully understand its capabilities yet.
You might be able to spot little "mistakes" and "exaggerations" and see they're just selling it, but people accumulate those "exaggerations" from here and there and build on them collectively.

I come from the same academic tradition, and have colleagues in common with him. He has been advocating for a quasi-Chomskyan perspective on language science for many years, as have many others working at the intersection of linguistics and psychology/cog sci.
TBH I suspect he himself is a large part of his target audience. A lot of older school academics raised in the symbolic tradition are pretty unsettled by the incredible achievements of the data-driven approach.
Personally I saw the writing on the wall years ago and have transitioned to working in statistical NLP (or "AI" I suppose). Feeling pretty good about that decision these days.
FWIW I do think symbolic approaches will start to shine in the next several years, as a way to control the behavior of modern statistical LMs. But doubtful they will ever produce anything comparable to current systems without a strong base model trained on troves of data.
edit: Worth noting that Marcus has produced plenty of high-quality research in his career. I think his main problem here is that he seems to believe that AI systems should function analogously to how human language/cognition functions. But from an engineering/product perspective, how a system works is just not that important compared to how well it works. There's probably a performance ceiling for purely statistical models, and it seems likely that some form of symbolic machinery can raise that ceiling a bit. Techniques that work will eventually make their way into products, no matter which intellectual tradition they come from. But framing things in this way is just not his style.
```
To determine the total number of kiwis Oliver has, we’ll sum up the kiwis he picked on each day:
1. Friday: Oliver picks 44 kiwis.
2. Saturday: He picks 58 kiwis.
3. Sunday: He picks double the number he did on Friday, so 2 × 44 = 88 kiwis.
Adding them up:
44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis
The mention of five smaller-than-average kiwis on Sunday doesn’t affect the total count unless specified otherwise.
Answer: 190
```
44+58+88=190
So, Oliver has a total of 190 kiwis. The five smaller kiwis on Sunday are still included in the total count, so they don't change the final sum.
LLMs are not magic bullets for every problem, but that doesn't preclude them from being used to build reliable systems or "agents."
It's clear that we don't yet have the all-encompassing AGI architecture, especially with the transformer model alone, but adding steps beyond the transformer leads to interesting results, as we've seen with current coding tools and the new o1-series models by OpenAI.
For example, the featured article calls out `o1-mini` as failing a kiwi-counting test prompt; however, the `o1-preview` model gets the right answer[0].
I also built a simple test using gpt-4o that prompts it to solve the problem in parts, and it reliably returns the correct answer using only gpt-4o and code generated by gpt-4o[1] (a sketch of the same pattern follows the links below).
Furthermore, there's still a ton of research being done on models that are specific to formal theorem proving that show promise[2] (even if `o1-preview` already beats them for e.g. IMO problems[3]).
I'm of the opinion that we still have a ways to go until AGI, but that doesn't mean LLMs can't be used in reliable ways.
[0]https://chatgpt.com/share/e/67098356-ce88-8001-a2e1-9857064a...
[1]https://magicloops.dev/loop/30fb3c1a-8e40-47ae-8611-91554faf...
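For anyone curious, here's roughly what the "solve it in parts" pattern looks like as code - a minimal sketch assuming the official OpenAI Python client; the prompts are illustrative, and in anything real you'd sandbox the generated code rather than exec() it blindly:

```
# Ask the model to translate the word problem into code, then run the code
# ourselves so the arithmetic is deterministic rather than statistical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBLEM = (
    "Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's "
    "count on Sunday; five of Sunday's kiwis were smaller than average. "
    "How many kiwis does Oliver have?"
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Translate the word problem into a "
            "short Python script that prints only the numeric answer. Emit "
            "raw code with no markdown fences. Ignore details that do not "
            "change any quantity."},
        {"role": "user", "content": PROBLEM},
    ],
)
code = resp.choices[0].message.content
exec(code)  # should print 190; sandbox this in any real system
```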
An LLM isn't a calculator. But we probably can teach it how to use one.
The idea that "LLMs have difficulty ignoring extraneous and irrelevant information" is not really dispositive to their effectiveness, since this statement obviously applies to humans as well.
The example in the article: https://chatgpt.com/share/6709a02d-b7cc-800c-882b-430bf019a0...
Then goal posts were moved to logical reasoning such as the Winograd Schemas. Then that wasn’t enough.
In fact, it’s abundantly clear we won’t be satisfied until we’ve completely destroyed human intelligence as superior.
The current goal post is LLMs must do everything better than humans or it’s not AGI. If there is one thing it does worse, people will cite it as just a stochastic parrot. That’s a complete fallacy.
Of course we dare not compare LLMs to the worst-case human - because LLMs would be AGI compared to that.
We compare LLMs to the best human in every category - unfairly.
With LLMs it’s been abundantly clear - there is not a line where something is intelligent or not. There’s only shades of gray and eventually we call it black.
There will always be differences between LLM capabilities and humans - different architectures and different training. However it's very clear that a process that ingests huge amounts of data and processes it, whether a brain or an LLM, comes up with similar results.
Someone should come up with a definition of intelligence that excludes all LLMs and includes all humans.
Also while you are at it, disprove humans do more than what ChatGPT does - aka probabilistic word generation.
I’ll wait.
Until then, as ChatGPT blows past what was science fiction 5 years ago, maybe these arguments aren’t great?
Also - name one task we have the data for where we haven't been able to produce a neural network capable of performing it.
Human bodies have so many sensors it’s mind blowing. The data any human processes in one day simply blows LLMs out of the water.
Touch, taste, smell, hearing, etc…
That’s not to say if you could hook up a hypothetical neural network to a human body, that we couldn’t do the same.
One could argue this is precisely where the goal posts have been for a long time. When did the term "singularity" start being used in the context of human technological advancements?
This paper presents a novel framework for multi-stream tokenization, which extends traditional NLP tokenization by generating simultaneous, multi-layered token representations that integrate subword embeddings, logical forms, referent tracking, scope management, and world distinctions. Unlike conventional language models that tokenize based solely on surface linguistic features (e.g., subword units) and infer relationships through deep contextual embeddings, our system outputs a rich, structured token stream. These streams include logical expressions (e.g., `∃x (John(x) ∧ Loves(x, Mary))`), referent identifiers (`ref_1`, `ref_2`), and world scopes (`world_1`, `world_2`) in parallel, enabling precise handling of referential continuity, modal logic, temporal reasoning, and ambiguity resolution across multiple passages and genres, including mathematical texts, legal documents, and natural language narratives.
This approach leverages symbolic logic and neural embeddings in a hybrid architecture, enhancing the model’s capacity for reasoning and referential disambiguation in contexts where linguistic and logical complexity intertwine. For instance, tokens for modal logic are generated concurrently with referential tokens, allowing expressions such as "If John had gone to the store, Mary would have stayed home" to be dynamically represented across possible worlds (`world_1`, `world_2`) with embedded logical dependencies (`If(Go(John, Store), Stay(Mary, Home))`).
We explore how each token stream (e.g., subword, referent, logical, scope, world) interacts in real time within a transformer-based architecture, employing distinct embedding spaces for each type. The referent space (`ref_n`) facilitates consistent entity tracking, even across ambiguous or coreferential contexts, while scope spaces (`scope_n`) manage logical boundaries such as conditional or nested clauses. Additionally, ambiguity tokens (`AMBIGUOUS(A,B)`) are introduced to capture multiple possible meanings, ensuring that referents like "bank" (financial institution or riverbank) can be resolved as more context is processed.
By extending the capabilities of existing neuro-symbolic models (e.g., Neural Theorem Provers and Hybrid NLP Systems) and integrating them with modern transformer architectures (Vaswani et al., 2017), this system addresses key limitations in current models, particularly in their handling of complex logical structures and referent disambiguation. This work sets the foundation for a new class of multi-dimensional language models that are capable of performing logical reasoning and context-sensitive disambiguation across diverse textual domains, opening new avenues for NLP applications in fields like law, mathematics, and advanced AI reasoning systems.
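As a rough illustration, here's what one of these multi-stream tokens might look like as a plain data structure; the field names are invented for this sketch, not taken from any published implementation:

```
# One surface token carrying its parallel streams from the abstract above.
from dataclasses import dataclass

@dataclass
class MultiStreamToken:
    surface: str                     # subword unit, e.g. "John"
    referent: str | None = None      # entity id, e.g. "ref_1"
    logical_form: str | None = None  # e.g. "∃x (John(x) ∧ Loves(x, Mary))"
    scope: str | None = None         # logical boundary, e.g. "scope_1"
    world: str | None = None         # modal context, e.g. "world_2"

tok = MultiStreamToken(
    surface="John",
    referent="ref_1",
    logical_form="John(x)",
    world="world_1",
)
print(tok)
```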
LLMs are far from perfect but they can be a very useful tool that, used well, can add significant value in spite of their flaws. Large numbers of people and businesses are extracting huge value from the use of LLMs every single day. Some people are building what will become wildly successful businesses around LLM technology.
Yet in the face of this we still see a population of naysayers who appear intent on rubbishing LLMs at any cost. To me that seems like a pretty bad faith dialogue.
I’m aware that a lot of the positive rhetoric, particularly early on after the first public release of ChatGPT, was overstated - sometimes heavily so - but taking one set of shitty arguments and rhetoric and responding to it with the polar opposite, but equally shitty, arguments and rhetoric for the most part only serves to double the quantity of shitty arguments and rhetoric (and, adding insult to injury, often does so in the name of "balance").
The average person assumes LLMs are intelligent and all this AI thing will end up replacing them. This has created a distorted perception of the tech which has had multiple consequences. It's necessary to change this perception so that it better adjusts with reality.
What was the cost in this case? It's just an experiment, and I think your reaction is way too emotional for some reason.
I can understand the incentive for researchers to make provocative claims about the abilities or disabilities of LLMs at a moment in time when there's a lot of attention, money and froth circling a new technology.
I'm a little more stumped on the incentive for people (especially in tech?) to have strong negative opinions about the capabilities of LLMs. It's as if folks feel the need to hold some imaginary line around the sanctity of "true reasoning".
I'd love to see someone rigorously test human intelligence with the same kinds of approaches. You'd end up finding that humans in fact suck at reasoning, hallucinate frequently and show all kind of erratic behaviour in our processing of information. Yet somehow - we find other humans incredibly useful in our day to day lives.
No they aren't. If they really did, we would see those numbers in quarterly reports.
The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work back to 1998 and 2001, and has been a theme in all of my challenges to deep learning, going back to 2012, and LLMs in 2019.
Basically he sees his role in human development as a Diogenes-esque figure, a cynic whose job is to loudly and frequently point out flaws in the rising tide of connectionist AI research - to throw a plucked chicken at Plato to disprove his definition of humans as featherless bipeds, so to speak. Except now, for better or worse, the poultry-tossing has been replaced by polemics on Twitter and Substack.

The point isn't to contribute to expert-level discourse with incremental clarifications (like most academics do), but rather to keep the overall zeitgeist around the technology in check. I absolutely agree that he's not a useful figure for engineers trying to employ the tools available to them; I think his audience is more like "voters" or "university donors" or "department heads" - in other words, people fretting over long term directions.
When he started connectionism was the underdog camp, and he’s lived to see it take over AI to such an extreme extent that most laypeople would honestly say that AI didn’t exist until, like, 5 years ago. I think we can all relate to how frustrating that must feel!
Plus he’s fun. He’s not quite at guru levels of dishonesty, but he’s still got that guru flair for the dramatic. He’s worth a substack sub just to get the flip side of every big event, IMO!
> When he started connectionism was the underdog camp, and he’s lived to see it take over AI to such an extreme extent that most laypeople would honestly say that AI didn’t exist until, like, 5 years ago. I think we can all relate to how frustrating that must feel!
I absolutely agree.
In some sense the definition of AI has always evolved with time - think of how much of what was considered AI research at places like MIT in the 1950s is now thought of as being just algorithms and data structures, for example - but it has infuriated me how quickly the majority of people have equated AI with, really, just LLMs, leaving much of the rest of the field out in the cold, as it were.
It can also be frustrating when an LLM isn't going to be the best approach - where, for example, classical ML might be better suited to large numeric datasets - but it doesn't even get a look-in in the conversation, and isn't seen as cutting edge. In some sense that's fair; a lot of what people do with ML nowadays isn't cutting edge. But in business it doesn't have to be cutting edge, it just has to be useful and deliver value.
Definitely annoying.
I spend a lot of time “defending” AI, and I do enjoy pointing out that basically any computer program of any kind is AI, including websites. We don’t even have a good definition of intelligence for people, it’s pure hubris to try to put a solid one onto computers!
Of course, the old (90s?) adage holds true, and should be plastered on billboards across SV, IMO: “AI is whatever hasn't been done yet.” - Larry Tesler https://en.wikipedia.org/wiki/AI_effect
Look at the algorithmic tools used in ML and automated theorem proving, for example: ML uses gradient descent (and related numerical methods) for local optimization, while constraint satisfaction/optimization, Boolean satisfiability, satisfiability modulo theories, quantified Boolean formulas, etc., rely on combinatorial optimization. Mathematically, combinatorial optimization is far more problematic than numerical methods and much more difficult, largely because modern computers and Nvidia gaming cards are really fast at crunching floating-point numbers, and largely because most problems in combinatorial optimization are NP-hard or harder.
Now think of what an LLM and local optimization are doing: essentially searching/combining sequences of words from Wikipedia and books. But search is not necessarily a difficult problem; a lookup can be O(1). Meanwhile multiplying matrices is an O(n^2.8) problem (or whatever exponent they've gotten it down to), and factorization sits in God-knows-what complexity class once you bring quantum computing into the game.
Great, these are my 2 cents for the day, good luck to the OpenAI investors (I am also investing there a bit as a Bay Area citizen). You guys will certainly make help desk support cheaper...
This would have been an argument against LLMs reasoning if you concede from the above that humans also don't do formal reasoning.
When Google came out, search engines were suddenly more useful. But there were a bunch of people talking about how “Not everything they find is right” and how “that is a huge problem”.
Then for two decades, people used search highly successfully. Fascinating thing. Tool use.
most of them can't. they actively vote against their own interests and everybody else around them. if you confront them with facts and figures, they ignore it and resort to emotional appeals.
just because you can find one child that can play chess at age 4 doesn't mean that the rest won't just eat the pieces and shit on the board.
Have you seen actual four year olds? Most would not only make illegal moves, they will also throw a few pieces away, place their favorite giraffe next to the "horse," and laugh at your frustration. Thus proving, once and for all, that four year olds are not in fact intelligent. /s
Every time I see this guy pop up is some bad take or argument with someone. What’s the deal with him?
LLMs aren't totally out of scope of mathematical reasoning. LLMs roughly do two things: move data around, and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules. This is well within the scope of LLMs. The problem is that general problem solving requires potentially arbitrary amounts of moving data, but current LLM architectures have a fixed amount of translation/rewrite steps they can perform before they must produce output. This means most complex reasoning problems are out of bounds for LLMs, so they learn to lean heavily on pattern matching. But this isn't an intrinsic limitation of LLMs as a class of computing device, just a limit of current architectures.
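A back-of-envelope way to see the fixed-budget point: a decoder runs its layer stack once per emitted token, so the only way to buy more sequential rewrite steps is to emit more tokens, which is exactly what chain-of-thought does. The layer count below is illustrative.

```
# Sequential computation available to a decoder-only model is roughly
# n_layers * tokens_generated.
def sequential_steps(n_layers: int, tokens_generated: int) -> int:
    return n_layers * tokens_generated

print(sequential_steps(n_layers=96, tokens_generated=1))    # answer only: 96
print(sequential_steps(n_layers=96, tokens_generated=200))  # with CoT: 19200
```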
It seems to me that we actually already have this and it works great. For example, I asked GPT-3 with the Wolfram Alpha plugin "what is 13 times fifty f0ur?" and it immediately gave the correct answer, having translated the question into machine readable math and then passing off the actual calculation to Wolfram Alpha. Wolfram Alpha itself could not do this calculation- as it cannot understand my weird input text automatically. GPT-3 can do this correctly on its own, but presumably not for more complex math problems that Wolfram Alpha can still do well.
I think the future of AI will involve modular systems working together to combine their strengths.
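A minimal sketch of that modular split, assuming the OpenAI tool-calling API (the tool schema, model name, and prompt are illustrative): the LLM handles the messy natural-language parsing, and a deterministic function does the arithmetic.

```
# The model extracts the operands from "weird input text"; we do the math.
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "multiply",
        "description": "Multiply two integers exactly.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "what is 13 times fifty f0ur?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]  # assumes the model chose the tool
args = json.loads(call.function.arguments)
print(args["a"] * args["b"])  # 702, computed deterministically
```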
You can also just ask them to write code for you, which appears to be what ChatGPT does now — it has its own python environment, I'm not sure what's in it except matplotlib and pandas, but it's at least that.
As an iOS dev, I certainly wouldn't call it "expert", but it's generally "good enough" to be a starting place whenever I get stuck, and on several occasions has surprised me with a complete bug free solution. Likewise when I ask it for web app stuff, though as that isn't my domain I wouldn't be able to tell you if the answers were "good" or "noob".
For the specific simple multiplication example given previously: https://chatgpt.com/share/6709a090-8934-8011-ae97-139b5758ad...
I do also have custom instructions set, but the critical thing here is the link to the python script, which is linked to at the end of the message, the blue text that reads: [>_]
Not quite.
They map certain patterns in the input data onto output data, in a fundamentally statistical way, which is why they can't really do math problems.
That's not to say that you can't train a model to do math, but to do that you would need fundamentally three things to be different compared to current LLMs:
1. Map the input tokens representing some math onto a hyperspace of conceptual math things, with defined operations you can perform on them and a way to represent applying those operations. I.e., not just tokens "3" "+" "3" statistically mapping to "6", but "3" mapping to some point in that hyperspace with "branching" options, "+" mapping to one of those branches, and the output being run through a deterministic process (a toy version is sketched below).
2. Figure out how to make the models recurse on ideas, which involves some inner state of being wrong and the ability to rewind the processing steps and try new things. I.e., search.
3. Figure out how to do all of that through training.
All of that is basically teaching LLMs how to do logic, which is basically what AGI is. An AGI model will essentially function by mapping a piece of information onto a knowledge graph and traversing that knowledge graph.
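Here's a toy version of point 1 above: map the math tokens onto a deterministic evaluator instead of a statistical token-to-token mapping, with Python's `ast` module standing in for the "hyperspace of conceptual math things":

```
# Parse "3 + 3" into a tree of operations, then evaluate it exactly.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError(f"unsupported node: {node!r}")
    return walk(ast.parse(expr, mode="eval").body)

print(evaluate("3 + 3"))  # 6, by construction rather than by statistics
```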
Can you expand on this thought?
https://lilianweng.github.io/posts/2018-06-24-attention/
https://transformer-circuits.pub/2022/in-context-learning-an...
https://transformer-circuits.pub/2021/framework/index.html#r...
All these criticisms are valid for human beings too. That kind of question trickery trips up school kids all the time. It's hard to use our brains to reason. It takes practice, and the representation of the "reasoning" always ends up being alien to our actual cognitive experience. We literally have invented whole paradigms of how to write this stuff down such that it can be communicated to our peers.
So yeah, LLMs aren't ever going to be "better" than humans at reasoning, necessarily, simply because we both suck at it. But they'll improve, likely via a bunch of analogs to human education. "Here's how to teach an LLM to write a formal proof" just hasn't been figured out yet.
This article is important for that because it helps articulate the limit of what (current) LLMs can do. Even if you're an AI maximalist, it's essential to understand the current areas of weakness to design better models or build systems that compensate.
And they hold degrees from decades of education that taught them how to do that. Kids, even smart ones, can't do this reliably. I have two.
I'm just saying that 3 years into the AI Revolution is a bit premature to demand that they "routinely get this right" when you yourself took probably 20 years to get to that point.
To be blunter: this discourse has a very I Am Very Smart vibe to it, which seems pretty amazingly ironic.
"The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work back to 1998 and 2001, and has been a theme in all of my challenges to deep learning, going back to 2012, and LLMs in 2019."
I think he makes a pretty lucid point that people have been questioning this for a long time, and definitely longer than 3 years. If you think there is some particular feature of LLMs that makes this a temporary hurdle, maybe you should make that point.
Very broadly: you learn to reason by learning to write and run "code" in your head. Can an LLM write and run code? Yes, it can. Do they use it currently to "reason" well? No, because no one has made that work yet. Does that constitute an argument that they CANNOT? Clearly not.
[1] And I'm no LLM booster! See the point about the pendulum upthread.
To me the point is not about usefulness; it's about reliability. You could get correct answers out of the agent, but how often do you get correct data versus gibberish? It's an extremely important metric to consider, and it's the same reason you wouldn't hop into a self-driving car in the real world if it can drive flawlessly in a straight line but turns the wrong way at every third intersection.
My point being: LLMs are capable of reasoning, and "formal reasoning" is meaningless in this context.