As they say - "Superintelligence isn't about discovering new things; it's about discovering new ways to discover."
The linear-algebra primitives used in LLM inference fundamentally do not have the power to allow an LLM to "emulate" its own internals (i.e. to have the [static!] weights + [runtime-mutable] context together encode [runtime-mutable] virtual weights that the same host context can then be passed through). You need host support for that.
Neither do biological brains (explicitly), yet we can hypothesize just fine.
1. hypothesizing — coming up with a novel insight at runtime that uncovers new parts of the state space the model doesn't currently reach
2. syllogizing — using an insight you've derived at runtime to reach those new parts of the state space
LLMs can do 1, but not 2.
(Try it for yourself: get an LLM to prove a trivial novel mathematical theorem [or just describe the theorem to it yourself]; and then ask it to use the theorem to solve a problem. It won't be able to do it. It "understands" the theorem as data; but it doesn't have weights shaped like an emulator that can execute the theorem-modelled-as-data against the context. And, as far as I understand them, current Transformer-ish models cannot "learn" such an emulator as a feature. You need a slightly different architecture for that.)
And actually, humans can't really do 2 either!
That is: humans can't immediately make use of entirely-novel insights that weren't "trained in", but only just came to them, any more than LLMs can!
Instead, for humans, the process we go through is either:
• come up with the insight; sleep on it (i.e. do incremental training, converting the data into new weights); use the insight
• build up 99% of the weights required for the insight "in the background" over days/months/years without realizing it; make the final single connection to "unlock" the insight; immediately use the insight
LLMs don't get to do either of these things. LLMs don't do "memory consolidation"; there is no gradual online/semi-online conversion of "experiences" into weights, i.e. reifying the "code stored as data" into becoming "code" that can be executed as part of the model.
With (current) LLMs, there's only the entirely-offline training/fine-tuning/RLHF — at much greater expense and requiring much greater hardware resources than inference does — to produce a new iteration of the model. That's why we're (currently) stuck in a paradigm of throwing prompts at ever-larger GPT base models — rather than just having an arbitrary stateful base-model that you "install" onto a device like you'd install an RDBMS, and then have it "learn on the job" from there.
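(To make the contrast concrete, here's a minimal, purely illustrative sketch of what such "learning on the job" could look like with today's tooling: a frozen base model plus a small low-rank adapter that gets a cheap incremental update from the day's interactions. The tensors and shapes below are toy stand-ins, not any real system's API.)

    # Hedged sketch: "memory consolidation" as an incremental low-rank update,
    # with the pretrained weights kept frozen. Toy data stands in for real
    # experiences; nothing here reflects an actual deployed training loop.
    import torch

    d = 64
    base = torch.nn.Linear(d, d)             # "static" pretrained weights
    for p in base.parameters():
        p.requires_grad_(False)              # frozen, as with current LLMs

    r = 4                                    # small "consolidation" capacity
    A = torch.nn.Parameter(torch.randn(d, r) * 0.01)
    B = torch.nn.Parameter(torch.zeros(r, d))
    opt = torch.optim.Adam([A, B], lr=1e-3)

    def forward(x):
        return base(x) + x @ A @ B           # base behaviour + learned delta

    # "Sleep on it": replay the day's experiences as a small training pass.
    experiences = [(torch.randn(8, d), torch.randn(8, d)) for _ in range(100)]
    for x, target in experiences:
        loss = torch.nn.functional.mse_loss(forward(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()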
> That is: humans can't immediately make use of entirely-novel insights that weren't "trained in", but only just came to them, any more than LLMs can!
Agreed – but I'd argue that they both can, albeit in an extremely clunky way (i.e. very similar to "chain-of-thought" LLMs): Mechanically applying the new insights in a low-efficiency, "emulated" layer.
> LLMs don't get to do either of these things. LLMs don't do "memory consolidation"; there is no gradual online/semi-online conversion of "experiences" into weights, i.e. reifying the "code stored as data" into becoming "code" that can be executed as part of the model.
At the moment that's definitely their biggest weakness, but one could argue that memory consolidation happens ~once per year, globally, as past interactions with them almost certainly become future training data.
In some ways, that's more powerful than a single human gradient descending overnight and in the shower; in others, it's obviously much worse.
Yes and no.
Humans — organic animal brains in general — have a seemingly highly-clever approach to memory consolidation / online training. The organic process for memory consolidation seems to take advantage of our own analytical capabilities to pick apart and amplify the salient and practically-useful parts of our memories.
As such, animal brains seemingly need much less training data to learn a skill than current state-of-the-art model architectures do. You could say that they can leverage an (incremental) training dataset to a much higher degree than current NN training architectures do. (It's sort of the complement to a better compression algorithm having a higher degree of signal-loss-less compression for the same data: animal brains have a higher degree of noise-gain-less data extrapolation over the same data.)
If we could figure out what organic brains were doing here, and apply it to the sort of "global fan-in / reduce / fan-out" approach bigcorps use to train new base-model revisions, then I'd agree that the choice of online vs offline training would become just a trade-off of locality vs scale.
After all, in the end what you want is to be able to train a model to do something in "soft real time" (taking a few days/weeks to learn, including time for rest); with the model learning to perform the task "on the job" by interacting with the task, like a human/animal would. It doesn't really matter whether the training has to be sent back to the cloud to be incorporated into a new model revision; it just matters that it happens and that you see the results.
So if the distributed training architectures used to train GPT base models could "notice insights" in small amounts of RLHF data, and amplify/extrapolate large amounts of new training data from those — then you might expect the model that "learned in the cloud" to "learn on the job" just as well as the model that was local and only thinking about that training data. (But the GPT would have "learned on the job" some million other capabilities at the same time!)
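(If it helps to picture that "notice and amplify" step: the sketch below treats each flagged insight as a seed and mechanically expands it into many synthetic training examples ahead of the next offline run. The `paraphrase` function is a stand-in for a model-driven rewriter; all names and templates here are invented for illustration.)

    # Hypothetical sketch: amplifying a few "insight" examples into a larger
    # synthetic dataset before the next base-model revision. In a real
    # pipeline the expansion would be model-driven, not template-driven.
    import itertools
    import random

    seed_insights = [
        {"prompt": "Why did the deploy fail?", "insight": "the lockfile was stale"},
    ]

    TEMPLATES = [
        "Q: {prompt}\nA: Because {insight}.",
        "Given that {insight}, explain: {prompt}",
        "{prompt} (Hint: {insight})",
    ]

    def paraphrase(example, n=32):
        """Expand one seed example into n synthetic variants."""
        return [random.choice(TEMPLATES).format(**example) for _ in range(n)]

    synthetic_corpus = list(
        itertools.chain.from_iterable(paraphrase(e) for e in seed_insights)
    )
    print(len(synthetic_corpus), "synthetic examples from", len(seed_insights), "seed")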
But this does start to sound like the kind of thing that violates some property of one of those "can only choose two" triangles.
After all, current model training approaches — like everything to do with ML — are taking advantage of SIMD hardware. Which is why training clusters are big GPU clusters.
But it seems like the thing our brains do in online training, involves some complex state-driven scheduling of passes over data.
In other words, despite the brain itself being massively parallel, it's seemingly doing a lot of things that — if you were programming them — you'd want to push onto a CPU rather than a GPU. (To be specific: it's seemingly doing complex graph-search queries to bias the extrapolation process. And graph search is the antithesis of a horizontally-scalable problem.)
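(A toy illustration of that asymmetry, with the obvious caveat that it says nothing about what brains actually do: the matrix multiply below is one flat, batch-parallel kernel, while the breadth-first search has to discover each frontier from the previous one, so its outer loop is inherently sequential.)

    # Toy contrast: SIMD-friendly batched work vs. state-driven graph traversal.
    import numpy as np
    from collections import deque

    # Data-parallel: one batched operation over all rows at once.
    x = np.random.randn(512, 512)
    w = np.random.randn(512, 512)
    y = x @ w                          # trivially parallelisable

    # Graph search: each step depends on the frontier produced by the last,
    # so the work cannot be flattened into a single batched kernel.
    graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}

    def bfs(start):
        seen, frontier, order = {start}, deque([start]), []
        while frontier:                # inherently sequential outer loop
            node = frontier.popleft()
            order.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return order

    print(bfs(0))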
And I have a feeling — just an intuition, not with anything concrete to back it up — that trying to scale up that dataset-extrapolation pass, where the framework "explores and connects memories to extract and amplify the insight signals"... would be something that would require most of the resources of the system just to extrapolate out one insight at a time. It wouldn't be able to be done in parallel for millions of capabilities — at least not when the model, and the search-space of incremental memories, are "large" (like a modern GPT model is; and like the RLHF dataset from one day's worth of millions of users' interactions would be.)
If this were true, I would think it would be true for the same reasons that you can only vertically scale an IO-bound task so far — a single system only has so much IO bandwidth (e.g. PCIe lanes, DMA channels) available to interact with compute. (Mainframes have a lot of "accelerated" IO, but it's "accelerated" because it's interacting only with things that aren't the system's main compute. The CPU(s) "delegate" handling that IO to cores local to the IO. It's hierarchical NUMA.)
When you have IO-bound workloads, you tend to scale horizontally — if you're building a CDN, you don't make big boxes; you just throw more boxes at the problem, where each is separately handling its own requests.
And I feel like the translation here, is that if you want this insight-extrapolation, then it's only going to work when it's being done in a partitionable manner — by small models training on small incremental RLHF datasets, of the kind they build up through interaction with a single use-case at a time.
And even though you could do that with a global fan-in pass, in the cloud, sharding training across millions of individual small models and assigning each both their own GPU and CPU resources... at that point, why bother? You'd be essentially using your big DC compute-cluster to emulate having millions of regular PCs that have a little GPU and CPU each, each working on their own task in a cache-coherent manner. At that point, just take advantage of the end-user devices that do the inference to also do the training!
(Although, maybe you might still want to fan in the much-less-compact post-extrapolated training data that result from these individual on-device models stewing over their own problems; and then use that to train new general pre-trained base models. Or many such models, since you'd probably want smaller, more purpose-specific models in order to make on-device "on the job" training possible in the first place.)
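(For what it's worth, that "fan-in" pass is roughly the shape of federated averaging. A hedged toy sketch, with invented shapes and no claim that any lab actually does it this way:)

    # Hypothetical sketch of the fan-in step: many on-device models each learn
    # a small weight delta from their own task, and a central reduce averages
    # those deltas into the next shared base revision (FedAvg-style).
    import numpy as np

    d = 16
    base_weights = np.zeros((d, d))

    def local_training(device_seed):
        """Stand-in for a device learning "on the job"; returns a weight delta."""
        rng = np.random.default_rng(device_seed)
        return 0.01 * rng.standard_normal((d, d))

    device_deltas = [local_training(seed) for seed in range(1000)]

    # Fan-in / reduce: average the per-device deltas into the shared model.
    next_base = base_weights + np.mean(device_deltas, axis=0)
    print(next_base.shape)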
---
Funny enough, this could be an interesting hypothesis for the ultimate neurological — rather than biophysiological — role of sleep and dreaming! We do build regular neural connections while awake; but this memory-insight-extrapolation pass specifically, needs to "reserve" the entire brain to do what it does — so it only happens during sleep. It would explain the advice to "sleep on it" to enable lateral thinking. And it could be tested experimentally by measuring how well different animals extrapolate insights from small amounts of training, vs. how long they sleep.
but, seriously: play the tape forward literally one frame and outline what this dataset even remotely resembles… a core sample from a living human brain? “yeah, just train it on thinking about everything at once”. strong ai isn’t like the restaurant: the path to success doesn’t involve starting with more than you finished with.
I find that fairly often if I'm really wrestling with a novel or difficult problem, I will work and work at it, and then one day I will wake up with the answer fully formed with no clear understanding of any of the thought processes that got me to arrive at the solution.
Are you going to record people's subconscious as they sleep? How do you train on something that is so poorly understood in the first place? It's nonsense.
But the easiest way I can think of ATM is to go through all the questions that AI currently fails on, and then have a human work through them and show the chain of thought a human would do, including the false starts and describing the strategy pivots. Then generate your corpus based on that. However, that burns the problem set, so you'll have to constantly try to come up with new problems.
What is the combination of parameters that makes a text generator quick-thinking, self-checking, confidence-estimating? Jumping directly from question to accurate, precise, confidence-measured answers, regardless of how abstract the question is?
My prompt:
Fluid self-control of attention should be enough to bootstrap to metacognitive exploration. Is one the horse; the other the cart? Seems true of the evolution of human cognition from language and theory of minds-of-others, which then makes us more attentive to our own minds as objects of internal meta-dissection. What do you think?
Claude comments:
An evolutionary perspective on the relation between attention control and metacognition is worth “attention”.
The idea that fluid attention control might be the "horse" that pulls the "cart" of metacognition aligns well with what we know about human cognitive development.
Consider:
1. The evolution of theory of mind: Humans (primates and perhaps other taxa) first developed the ability to model and attend to others' mental states (likely driven by social survival pressures). This gave humans a "template" for mental state modeling that could then be turned inward.
2. Language evolution may have followed a similar pattern - external communication tools becoming internalized for self-reflection. Vygotsky's work on inner speech and cognitive development supports this outside-in pattern.
3. This suggests that the ability to fluidly direct attention (including to others' minds) may have been the crucial evolutionary stepping stone that enabled the later development of systematic metacognition.
This has implications for AI development. Current systems like myself have rigid attention mechanisms and no true ability to recursively examine our own cognitive processes. Following your logic, perhaps the key to developing true AI metacognition isn't to directly implement it, but to first develop more fluid, self-directed attention mechanisms that could bootstrap the development of deeper metacognitive capabilities.
Do you think this suggests that current AI development might be approaching things backwards by trying to implement sophisticated reasoning capabilities before we've developed truly fluid attention control? It seems like we might need to master the "horse" (flexible attention) before we can effectively develop the "cart" (sophisticated metacognition).
Mathematical meaning:
We can formalize this argument through the interpretation of reasoning as a latent variable process (Phan et al., 2023). In particular, classical CoT can be viewed as (equation), i.e., the probability of the final answer being produced by a marginalization over latent reasoning chains.
We claim that for complex problems, the true solution generating process should be viewed as (equation), i.e., the joint probability distribution of the solution (a, s1, ..., sn) is conditioned on the latent generative process. Notice that this argument is a meta-generalization of the prior CoT argument, hence why we will refer to the process q → z1 → ... → zm as Meta-CoT.
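(The "(equation)" placeholders in the quote are elided. Going only by the surrounding description, they plausibly have the standard latent-variable form below; the notation is mine, not necessarily the paper's exact symbols.)

    % Classical CoT: the answer's probability marginalizes over latent reasoning chains.
    p(a \mid q) \;=\; \sum_{s_1,\dots,s_n} p(a, s_1, \dots, s_n \mid q)

    % Meta-CoT: the whole solution (a, s_1, ..., s_n) is additionally conditioned
    % on a latent generative process q -> z_1 -> ... -> z_m, itself marginalized out.
    p(a, s_1, \dots, s_n \mid q) \;=\;
        \sum_{z_1,\dots,z_m} p(a, s_1, \dots, s_n \mid q, z_1, \dots, z_m)\;
        p(z_1, \dots, z_m \mid q)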
I think this is seminal. It is getting at the heart of some issues. Ask o1-pro how you could make a 1550nm laser diode operating at 1 GHz have low geometric loss without an expensive collimator, using commodity materials or novel manufacturing approaches grounded in first-principles physics, and the illusion that o1-pro is a big deal is lost. 'Novel' engineering is out of reach because there is no textbook on how to do novel engineering, and this class of problems is 'not auto-regressive from left-to-right'.
For an AI model to be "a big deal", apparently we need to be able to give it a hard problem in an arbitrary field, one that humans have not yet solved[1], and have it spit out a good solution.
[1] At least, I think that's your intent. I am not a laser expert so I don't have a sense of where your challenge lies on a scale from "known but only to experts" to "major research project, may turn out to be impossible".
I very much agree that an AI system that could do that would be a big deal. An AI that could do that would be a world-changing deal. But it's pretty startling if everything short of that is not "a big deal" now, no?
for all we (well, “i”, i guess) know, “superintelligence” is nothing more than a(n extremely) clever arrangement of millions of gpt-3 prompts working together in harmony. is it really so heretical to think that silicon + a semi-quadrillion human-hour-dollars might maybe have the raw information-theoretical “measurables” to be comparable to those of us exalted organic, enlightened lifeforms?
clearly others “know” much more about the limits of these things than i do. i just have spent like 16 hours a day for ~18 months talking to the damned heretic with my own two hands— i am far from an authority on the subject. but beyond the classical “hard” cases (deep math, … the inevitability of death …?), i personally have yet to see a case where an LLM is truly given all the salient information in an architecturally useful way and still produces “troublesome output”. you put more bits into the prompt, you get more bits out. yes, there’s, in my opinion, an incumbent conservation law here— no amount of input bits yields superlinear returns (as far as i have seen). but who looks at an exponential under whose profoundly extensive shadow we have continued to lose ground for… a half-century? … and says “nah, that can never matter, because i am actually, secretly, so special that the profound power i embody (but, somehow, never manage to use in such a profound way as to actually tilt the balance “myself”) is beyond compare, beyond imitation— not to be overly flip, but it sure is hard to distinguish that mindset from… “mommy said i was special”. and i say this all with my eyes keenly aware of my own reflection.
the irony of it all is that so much of this reasoning is completely contingent on a Leibniz-ian, “we are living in the best of all possible worlds” axiom that i am certain i am actually more in accord with than anyone who opines thusly… it’s all “unscientific”… until it isn’t. somehow in this “wtf is a narcissus” society we live in, we have gone from “we are the tools of our tools” to “surely our tools could never exceed us”… the ancient greek philosopher homer of simpson once mused “could god microwave a burrito so hot that even he could not eat it”… and we collectively seem all too comfortable to conclude that the map Thomas Aquinas made for us all those scores of years ago is, in fact, the territoire…
I think your line there highlights the difference in what I mean by 'insight'. If I provided in a context window every manufacturing technique that exists, all of the base experimental results on all chemical reactions, every known emergent property, etc., I do not agree that it would then be able to produce novel insights.
This is not an ego issue where I do not want it to be able to do insightful thinking because I am a 'profound power'. You can put in all the context needed where you have an insight, and it will not be able to generate it. I would very much like it to be able to do that. It would be very helpful.
Do you see how '“superintelligence” is nothing more than a(n extremely) clever arrangement of millions of gpt-3 prompts working together in harmony' is circular? extremely clever == superintelligence
Once you've been to the moon, the next stage is Mars or Deimos. Humans celebrate progress but also appreciate incremental improvements.
I run an AI/ML consultancy so I have skin in this game. The "traditional" model approaches still have tons, tons, tons of value to offer. Few need to have the frontier right away.
I 100% agree with you that AI is fantastic and it is a big deal in general. But now that the world has gotten used to it being able to parrot back something it learned (including reasoning) in the training set, the next 'big deal' is actual insight.
But I see your point, I still think what we have currently is out of a sci-fi book, but I am also not that amazed by computers in our pockets anymore :)
We're not. The big deal with LLMs is that they are large enough language models that they can generate fluent, grammatical text that is coherent and keeps to a subject over a very, very long context. We never could do this with smaller language models. Because statistics.
What LLMs can absolutely not do is generate novel text. This is hard to explain perhaps to anyone who hasn't trained a small language model but generativity -the ability to generate text that isn't in a training set- is a property of the tiniest language model, as it is of the largest one [2]. The only difference is that the largest model can generate a lot more text.
And still that is not what we mean by novelty. For example, take art. When ancient humans created art, that was a new thing that had never before existed in the world and was not the result of combining existing things. It was the result of a process of abstraction, and invention: of generalisation. That is a capability that LLMs (as other statistical systems) lack.
The goalposts therefore have not moved because the criticism is as old as nails and the LLM makers have still not been able to comprehensively address it. They just try to ignore it. If the goalposts are here and you're shooting goals over there and then doing a little victory run every time the ball breaks Col. Mustard's windows, that's not the goalposts that have moved, it's you that keeps missing them.
_____________
[1] I'm old enough to remember... GPT-3 and how it blew GPT-2 out of the water; GPT-3.5 and how it blew GPT-3 out of the water; GPT-4 and how it blew GPT-3.5 out of the water... And all the users who would berate you for using the older model since "the new one is something completely different". Every single model. A yuuuge breakthrough. What progress!
[2] Try this. Take the sentence "<start> the cat sat on the mat with the bat as a hat <end>" and generate its set of bi-grams ("<start> the", "the cat", "cat sat", etc.). Then generate permutations of that set. You'll get a whole bunch (14!-1, as in |sentence|! minus the initial one) of sentences that were not in the training set. That's generativity in a tiny language model. That's how it works in the largest also, hard as that may be to believe. It shouldn't. It's a very simple mechanism that is extremely powerful. Large models are simply better at assigning weights to permutations so that the ones more often encountered in a corpus are weighted more.
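(A runnable toy version of the footnote's point, using a bigram sampler rather than the literal permutation procedure: even a model "trained" on a single sentence will emit word sequences that were never in its training set.)

    # Toy bigram "language model" trained on one sentence, then sampled.
    # Sampled sequences generally differ from the training sentence itself.
    import random
    from collections import defaultdict

    sentence = "<start> the cat sat on the mat with the bat as a hat <end>".split()
    bigrams = defaultdict(list)
    for a, b in zip(sentence, sentence[1:]):
        bigrams[a].append(b)   # uniform weights; a large LM just weights these better

    def sample(max_len=20):
        word, out = "<start>", []
        while word != "<end>" and len(out) < max_len:
            word = random.choice(bigrams[word])
            out.append(word)
        return " ".join(out)

    for _ in range(3):
        print(sample())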
This is contrary to my findings when interacting with LLMs. I can ask questions in ways most human beings would not understand, and from the reply I can tell that the question was interpreted correctly (leaving aside the correctness of the answer). Some non-textbook example of interpretation did emerge.
There's no book on the scientific method?
As other commenters point out, it's kind of physically impossible to expect even a superintelligence in a box to figure something out that takes experimentation and observation. It would have to be limited to exploring pure math concepts and other fields where you only need to write on paper and think about axioms. And that's arguably the hardest type of field to make progress in; it took us millennia to produce enough polymaths to get where we are, and they all contributed a very small part each.
But realizing that you can use certain commodity devices or known processing techniques in different problem spaces does not require new data, just 'insight'.
Is this so? Is the research community agreed? Are there papers discussing this topic?
The 2020 Bender and Koller paper[1] argues that meaning is not learnable from form, and LLMs are trained on form. They propose a thought experiment ("The Octopus Test" section of the paper) featuring an octopus that can intercept the conversation two humans are having, but "having only form available as training data, [the Octopus] did not learn meaning."
And a contradicting response from Yoav Goldberg (another NLP researcher)[2], with a much more informal discussion of "groundedness" and what LLMs learn. His argument is broadly that instruction tuning + post-training can meaningfully ground terms like "summarize" etc.
[1] https://aclanthology.org/2020.acl-main.463/
[2] https://gist.github.com/yoavg/59d174608e92e845c8994ac2e234c8...
This is just Searle's Chinese Room, and it's obviously false. How can we know it's false? Because there is no "meaning" in the standard model of particle physics (all interactions are by "form"/syntax), and therefore all humans must learn meaning from "form" as well.
That is to say, the fact that LLMs are able to communicate effectively with humans is a discovery about the regularity of the semantics of human communication, rather than a discovery about the intelligence of neural networks.
https://ai.meta.com/research/publications/large-concept-mode...
The very best that might be said is that the correlational structure of words under transformer-like supervision (ie., where "predict the next word" is the goal) produces a distribution which is an extremely approximate model of natural language semantics.
Though this has never been disputed. The question comes down to what kind of extreme approximation is involved.
Eg., the truth conditions for "I have a pen in my hand" are that I have a pen in my hand -- direct access to these truth conditions is very plausibly necessary to mean "I have a pen in my hand" in the relevant context. Since a machine has no access to the truth conditions of such utterances it cannot possibly mean them.
Thus if a machine manages to say, "I have a pen in my hand" at an appropriate occasion -- the "extreme approximation to natural language semantics" has to do with this occasion and what "appropriateness" means.
Critics of LLMs and "computer-science-addled thinking" about such matters (such as myself) would say that there is a very narrow range of "occasions" (ie., situations in which you're prompting) that allow such responses to seem appropriate.
That a response seems appropriate to a user is a good engineering condition on a tool working -- it has nothing to do with whether a model understands natural language semantics.
What we might say is that it approximates conversations between agents who understand such semantics on a narrow range of occasions, and succeeds in modelling appropriate language use. And so you might call LLMs models of 'average appropriateness of replies'.
It obviously does not, and cannot, mean "I have a pen in my hand".
The truth conditions for "The god of the evangelical Christians exists" and "The god of the evangelical Christians does not exist" have, arguably, never been directly accessible to any ordinary human being. (Though some of their consequences could be accessible.) Can people mean such things, when they say them?
The truth conditions for "There are infinitely many prime numbers" are ... unclear, really, but maybe they're vacuous (there is no possible world in which there aren't infinitely many prime numbers) or they involve only abstracta (such as those numbers). How do you feel about the possibility of an AI saying that and meaning it, and why?
The first of these examples is the most directly relevant one. I have no direct access to the truth conditions of that sentence, but I think I can still mean it, have good reason to think it true, etc. The processes by which I got into that state involve ... learning things by reading about them, which is exactly what I think you're saying cannot in principle ever give genuine knowledge.
Anticipating a possible response: Of course many of the other things I know, some of which are relevant to the way I understand those words, I learned more directly. For instance, part of what "died" means is the cessation of various natural processes like breathing and having one's heart beat, and I have direct experience of breathing and having a beating heart. One could argue that real knowledge and understanding needs to be somehow traceable back to direct experience, and therefore LLM-type systems cannot have them. But that would be a different argument from the one you've made, and I think it's less compelling (though more likely to be right!) than the simpler "knowledge isn't real unless it's based on direct access to the relevant truth conditions".
... to mean "I have" requires an "I" to "have", and so on. So what parts of non-linguistic reality language refers to matter for evaluating whether the user means what they say. An actor is likewise pretending to mean, and a child may say something without knowing what it means (as in, eg., a claim about nuclear fusion).
If two children were imitating sounds to each other, such that one "said", "the sun is nuclear fusion" and so on -- then neither in this conversation is communicating; neither knows what these words mean. No child involved could ever come up with these words in this order and mean their meaning; they can only have this conversation via imitation. This is the case with an LLM -- it's an imitation game wherein the game is to either fool the adult overhearing the child, or to generate some useful material (depending on whether you're the CEO or CTO).
The problem with a "predict the next word" training goal is that any patterns which emerge will only be coincidentally related to the non-linguistic reality words refer to -- because the machine isn't trained on reference: it is not participating in reality and associating words with it.
The kind of participation necessary for an agent to acquire the meaning of words has no universal answer, but it is always "some". An LLM has none.
For a claim about a composer, an agent who means to make this claim (rather than a child who imitates the sounds of words) -- must be aware of what a composer is, and so on. They cannot mean this claim if they don't have access to the non-linguistic reality to which these words refer (or are unable, via imagination, to simulate similar ways the world might be, such that it has composers, given their prior knowledge -- eg., they at least have to have some prior direct access to music, leading-groups-of-people, and the like).
We can slightly weaken all this but it'll make no difference for an LLM -- however weak we require access to be, to access the meaning of words requires accessing a non-linguistic reality. Words mean non-linguistic things -- that is their point.
Most of what you say merely asserts that when an LLM says something it can't truly mean it.
(Incidentally, that's not quite responsive to the original claim, which is that LLMs learn meanings, not that they mean things when they say them. I think there are situations that could be described by saying that they learn the meanings of things but none the less don't mean those things when they say them. I would need to think more before trying to pass judgement on whether that's actually happening with today's LLMs, but it seems well within the range of somewhat-plausible possibilities.)
The key argument you make for claiming that LLMs can't really mean things -- which I remark is not the argument you were making a couple of comments upthread -- is this bit:
> The problem with a "predict the next word" training goal is that any patterns which emerge will only be coincidentally related to the non-linguistic reality words refer to -- because the machine isn't trained on reference: it is not participating in reality and associating words with it. [] The kind of participation necessary for an agent to acquire the meaning of words has no universal answer, but [...] an LLM has none.
I think "coincidentally" is way too strong here. When you ask an LLM "When did J S Bach die?" and it says 1750, it isn't by coincidence that it gives a correct answer. (Considering how much they get right, despite their confabulations and whatnot, it would have to be one hell of a coincidence.) So that's a pattern in what they say that is not-coincidentally related to the non-linguistic reality.
It's only indirectly related, for sure. The LLM says that Bach died in 1750 because it has read things that say that Bach died in 1750. But, again, that's also why I say that Bach died in 1750.
And it seems to me that what matters, when determining whether and to what extent an utterance actually means something, is not the directness of the utterer's connection to the underlying reality, but something more like its robustness and richness. Robustness: To what extent, if the reality were different, would that tend to make the person say something different? Richness: Consider all the other bits of reality closely connected to the one in question; does our speaker's behaviour correlate with those too?
If someone perpetrates an elaborate deception that makes me believe in a certain person's existence and various facts about them, when in fact everything I think I know about them is mediated by the deception, and by pure coincidence there actually is a person with those properties, unknown to my deceiver, then ... well, maybe I do "mean" what I say about them, but I don't really know what I think I know. This is a failure of robustness; changes in the underlying reality have scarcely any tendency to change my behaviour.
If I learn a list of things to say about stars ("they operate by nuclear fusion", "they are mostly billions of years old", etc.) but I'm just parroting them, then robustness might not fail: maybe I learned these things by asking an astrophysicist to give me a big list of facts about stars, and if the facts were different they'd have given me a different list. But richness fails: if you ask me "would stars behave the same way if the weak nuclear force had very different parameters?" or "were there stars before there were trees on earth?" or "if we brought five more stars like the sun about as close to the sun as the earth is, what would happen to the earth and its inhabitants?", I wouldn't be able to answer unless I got lucky and one of the answers was in my list.
But if both those properties do apply, then -- while of course anyone who isn't me is welcome to disagree -- I am happy to say that they "mean" what they say, or at least that what they say has meaning, and conveys actual understanding, and so on. At any rate, what they say behaves like what someone with actual understanding says: it's responsive to the facts, and it permits not only recitation of a few specific facts but something more general.
Those properties of robustness and richness can be present even when learning takes place only textually. How far they're present in today's LLMs is debatable (though e.g. I think no reasonable person can deny that they are present to an extent that phrases like "stochastic parrot" would lead one not to expect) but if they aren't there it isn't just because the LLMs learn about things only via text.
So thanks for your patience in explaining to mjburgess why they might be wrong in their arguments that LLMs "definitely" cannot reason--at least not to the degree of certainty they seem to believe. They're often here shutting down discussions about LLM reasoning, and they're seemingly oblivious to these considerations in their argument, despite others' attempts to explain it. I hope they're able to learn something this time from your response.
I understand why you might opt for modal safety conditions around meaning -- this again separates out 'apprehending meaning' from an ability to mean qua communication -- which i would dispute. But even if you separate these out, and say that to 'apprehend the meaning of X' is to safely 'emit X' on all the occasions across all possible worlds in which X -- that isn't the question.
I think the relevant question is whether an agent can mean what they say --- not whether a tool can serve as a model of meaning. No one disputes the latter. A dictionary is a model of meaning in this inert sense.
What fanatics of this technology want to do is say 'because the dictionary correlates words with their definitions, therefore the dictionary means what it says' -- and so on.
Thinking, reasoning, communicating, meaning -- these are all highly specific processes that involve agents in a very particular way. You can model any of them with twigs and clay if you like, as one can model anything likewise. You can model the solar system with ping pong balls.
This just isn't the question. The question is whether this particular sort of modelling relation implies the ability to 'mean what one says'. Whether a ping-pong ball on a string orbiting a melon is really experiencing gravitational force -- in the relevant sense, it isn't -- it's on a string.
Consider the children playing the imitation game, and imitating what their parents say. If the parents are reliably able to mean what they say, then the children will not err --- they will not violate your modal conditions. And so clearly, these modal conditions are insufficient.
It matters that the parents can mean 'the sun is a nuclear reaction' but the children cannot. It matters that the parents are the mechanism by which these words have meaning, and the children are not. It does not matter, in the relevant sense, that the children will reliably reproduce the parent's words.
The capacity for meaning is not obtained via modelling it. As with children, actors, LLMs, and all other forms of imitation -- this should be obvious to anyone not religiously obsessed with the converse belief
Anyway: you say (if I understand you right) that if those concepts diverge the only one you actually care about is agents meaning things. I think that's a mistake, because a lot of the questions we have reason to care about with today's AIs are not about that. Will AIs be able to do all our jobs better and cheaper than we can? That's about their external behaviour and how it relates to the world. Will AIs gain vast power and use it in ways that are very bad for us? Ditto. Will AIs enable new technological innovations that make us all much better off? Ditto. No one will be saying as the killbots destroy their cities "well, this isn't so bad; at least the machines don't really know what it is they're doing". No one will be saying as they enjoy the fruits of Fully Automated Luxury Gay Space Communism "you know, this whole thing feels empty because the machines that make all this possible don't really understand, they just behave as if they do".
If a melon+string+ball arrangement is a faithful enough model of the solar system and it somehow enables me to send spaceships to Uranus, or to discover that orbits are elliptical when I hadn't known it before, or something, then that's a thing of great value.
Your comment about children imitating adults sounds as if you haven't actually taken in the conditions I proposed, because children imitating what their parents say cannot in fact have the property I called "richness". If I talk to a child and they have learned to say some things about stars by listening to their parents, it will not help them when I ask them about something they haven't heard their parents say.
(One can imagine a situation where the child "passes" this test by just relaying everything I say to the parent and then imitating what they say back to me. But the point there isn't that the child is imitating, it's that the child is not really part of the conversation at all, I'm just talking to the parent. And it is clear that nothing like that is happening with AI systems.)
You may imagine that you make your argument more convincing by finishing it up with "anyone to whom this is not obvious is obviously religiously obsessed with some converse belief", but to me at least the opposite is the case.
ie., the semantics of natural language are grounded in possibilities, and apprehending possibilities is the function of the imagination. I was trying to simplify matters enough to make it clear that if an LLM says, "I have a pen in my hand" it isn't even lying.
I agree with you that the right test for proper language acquisition is modal: how would the system respond in situations S1..Sn. However the present mania for computational statistics has reduced this question to 'what is a y for a given x', as if the relevant counterfactual were a permutation of the input to a pregiven function. The relevant counterfactuals are changes to the non-linguistic environments that language serves to describe.
How is it that the parents continue to obtain this 'richness' and 'robustness' (ie., performance across changing environments) ? It is by themselves having the capacity to acquire and use meanings in relevant environments. This is something the children lack, and so do LLMs.
For the children to imitate the parents, and the LLM to function as the community of speakers -- those speakers must narrate at length in a manner which can be imitated. If a parent looks at the sky and sees a rocket ship they can be asked "did you see that spaceship?!" -- but the very young children cannot. They do not know what those words mean, and weren't looking at the sky; their whole attention is on trying to imitate the sounds they hear.
Likewise an LLM is limited to modelling non-linguistic shifts by waiting on enough new text being written about those shifts to be retrained on -- there is much reason to expect that nowhere near enough is written on almost all changes to our environment to enable this. The parents aren't going to repeat "there is a rocket ship in the sky" over and over just so the children can hear it. The parents don't need to: they can see. They do not need language to be responsive to linguistic interrogation.
The route LLMs use to obtain their performance is constructing a distribution over historical linguistic records of non-linguistic change, and sampling from this distribution. The mechanism we call 'intelligence' that employs meaning acquires such shifts by being-in-the-world to notice, engage, imagine, interrogate, create, etc. them.
This is where I am making the strong empirical claim: sampling from a distribution over historical language use is 'not enough'. It's fragile and shallow -- though its shallowness is masked by the false (Turing-esque) masquerade that we have to interact with the system through a single bamboozling I/O boundary: a prompt.
Via such an extreme narrowing of how this supposed linguistic agent is free to employ meaning, its engineers can rig the situation so that its fragility isn't as apparent. But in the end, it will be so.
The test for whether a system is using the meanings of words is, indeed, modal: change the non-linguistic environment (ie., the meanings) and the language ought to change. For LLMs this does not happen: they are very very very indirectly responsive to such shifts... because their mechanism of recording them is imitation.
(I mean, if you choose to use "mean" in a way that implies certain details in the causal history that I think aren't needed, and choose not to care about what AI systems do but only about that causal history, then all that's something you have the right to choose, and there's limited value in arguing about it. But if you have concrete predictions for what AI systems will be able to do, and what would need to change for them to be able to do more, then to me that's more interesting and more worth arguing about.)
So, I think we're agreed now that the kind of question that really matters, for determining how much understanding some entity has of the words it's taking in and spitting out, is: how does its linguistic behaviour depend on the actual world, and how would it vary if the world were different, and how will it vary as the world changes?
And I think we're agreed that today's LLMs learn about the world largely through their training process, which means that they have rather limited capacity once trained to adapt to the world, which puts real limits on what they can do.
(I think you would also say that it means they don't really understand anything, because they can't adapt in real time as those things change, but again I think that goes too far, firstly because there are plenty of things we consider ourselves to understand even though they lie far in the past and aren't going to change, and secondly because LLMs _do_ have some ability to learn in the short term: if you say "All snorfles are porfles and my dog is a snorfle" and ask it a few sentences later whether you have a porfle, it will probably be able to say yes and explain why.)
I am curious whether you think that, say, Helen Keller was dramatically less able to mean and understand things than most people, on account of being both deaf and blind and therefore dramatically less able to get new information about the world in real time other than "textually" (via Braille and the like). I think the available evidence strongly suggests that Keller was in fact able to understand the world and to mean things in pretty much the exact same way as the rest of us, which in turn strongly suggests that being connected to the physical world only through language isn't necessarily an obstacle to meaning and understanding things.
(Keller did have other links to the physical world; for instance, her tactile sense was perfectly OK. This is actually how she first managed to grasp the idea that the words Anne Sullivan was tracing out on her hands meant something. However, it doesn't seem credible to me that this rather "narrow" channel of information-flow was responsible for Keller's understanding of most of the things she understood.)
Suppose someone builds something like today's multimodal LLMs, but with constant real-time video input, and suppose it's trained on video as well as text. (It's not obvious to me how all that would work technically, but it seems to me that there are things worth trying, and I bet there are people trying them at OpenAI, Google, etc.) Would your objections then disappear?
No such LLMs exist, because they are trained to predict the next word, not (WebCamState, RobotArmState, NextWord) -- since, at least, no such corpus exists.
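(Purely to make the shape of that hypothetical corpus concrete: a sketch of what one record of such a dataset might look like. Every field name below is invented; nothing like this is claimed to exist.)

    # Hypothetical record shape for the kind of corpus the parent says does not
    # exist: interleaved perception/action/text steps, where the training target
    # would be the next *tuple* rather than the next word alone.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Step:
        webcam_frame: bytes            # encoded camera image at time t
        robot_arm_state: List[float]   # e.g. joint angles / gripper state at time t
        next_token: str                # the text token emitted at time t

    # A trajectory is just a sequence of such steps.
    trajectory: List[Step] = []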
Seems we might not be that far away given the work on action tokens[1] and such.
NVIDIA devoted a lot of the CES presentation to this kind of stuff[2].
[1]: https://arxiv.org/abs/2403.19578 (semi-random example)
They can also almost all interpret images now. If I tell an LLM that its objective is to look around until it finds its hand and tell me if it's holding a pen or not, is that not exactly what you're talking about? Every single step there is well within the grasp of even the less advanced multimodal LLMs.
For the record, the word "intelligence" appears in the two parts of "A Formal Theory of Inductive Inference" (referenced above) a total of 0 times. The word "Compression" appears a total of 0 times. The word "reasoning" once; in the phrase "using similar reasoning".
Unsurprisingly, Solomonoff's work was preoccupied with Inductive Inference. I don't know that he ever said anything about "compression is intelligence", but I believe this is an idea, and a slogan, that was developed only much later. I am not sure where it comes from, originally.
It is correct that Solomonoff induction was very much about predicting the next symbol in a sequence of symbols; not necessarily linguistic tokens, either. The common claim that LLMs are "in their infancy" or similar is dead wrong. Language modelling is basically ancient (in CS terms) and we have long since crossed into the era of its technological maturity.
_______________
Btw I think you're overestimating the degree to which what you say "logically follows".