That was a bit of a shock to me, so I wanted to share this thought. Basically, I think it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.
Hope this is useful to someone.
LLMs are trained to do whole-book prediction: at training time we throw in whole books at a time. It's only when sampling that we do one or a few tokens at a time.
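To make the training/sampling split concrete, here is a minimal sketch. It assumes a toy count-based bigram model rather than a real transformer, and the text and all names (`train_bigram`, `sequence_loss`, `sample`) are made up for illustration. The only point is that the training loss is computed over every position of the text at once, while generation emits one token per step.

```python
import math
import random

def train_bigram(tokens, vocab):
    """Count-based 'training': every position in the text contributes at once."""
    counts = {a: {b: 1.0 for b in vocab} for a in vocab}  # add-one smoothing
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1.0
    # Normalize each row of counts into next-token probabilities.
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}

def sequence_loss(model, tokens):
    """Average cross-entropy over all positions of the sequence."""
    losses = [-math.log(model[p][n]) for p, n in zip(tokens, tokens[1:])]
    return sum(losses) / len(losses)

def sample(model, start, n, rng):
    """Sampling: one token at a time, each conditioned on the previous one."""
    out = [start]
    for _ in range(n):
        row = model[out[-1]]
        out.append(rng.choices(list(row), weights=list(row.values()))[0])
    return out

# Toy "book" (an assumption for illustration only).
text = "the cat sat on the mat and the cat ran".split()
vocab = sorted(set(text))
model = train_bigram(text, vocab)
loss = sequence_loss(model, text)          # one number for the whole text
generated = sample(model, "the", 5, random.Random(0))
```

A real model conditions on the whole prefix instead of just the previous token, but the shape is the same: one loss over the full sequence during training, a token-by-token loop during sampling.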
honking intensifies
WHERE DO YOU GET THESE BOOKS?!
But as stated above, next token prediction is a misleading frame for the training process. While the sampling is indeed happening 1 token at a time, due to the training process, much more is going on in the latent space where the model has its internal stream of information.
You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.
All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. They review a single piece of writing, and it takes up to an hour, and it costs $50. They are probably running the LLM very painstakingly and repeatedly over combinations of sections of your text, allowing it to reason about the things you've written in a lot more detail than you get with a plain run of a long-context model (due to the limitations of sparse attention).
It's neat. I wonder about what other kinds of tasks we could improve AI performance at by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).
We could run Claude on our code and call it a day, but we have hundreds of style, safety, etc rules on a very large C++ codebase with intricate behaviour (cooperative multitasking be fun).
So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.
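A rough sketch of that fan-out, with loud assumptions: `review_chunk` is a hypothetical stand-in for a single agent run (here a "rule" is just a forbidden substring, which is nothing like a real style rule), and the batch size and worker count are arbitrary. It only shows the shape of splitting a long rule list across parallel reviewers and merging their findings.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def review_chunk(rules, diff):
    # Hypothetical stand-in for one agent invocation. A real agent would get
    # the rule text plus the diff in its prompt; here we just match substrings.
    return [rule for rule in rules if rule in diff]

def parallel_review(rules, diff, rules_per_agent=2, max_workers=4):
    """Give each 'agent' a small batch of rules so it has less context to track."""
    batches = chunk(rules, rules_per_agent)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda batch: review_chunk(batch, diff), batches)
    # Flatten the per-agent findings, preserving batch order.
    return [finding for batch in results for finding in batch]

# Made-up rules and diff, for illustration only.
rules = ["goto", "new ", "reinterpret_cast", "using namespace std", "printf"]
diff = 'int* p = new int[4];\nprintf("%d", *reinterpret_cast<int*>(p));'
findings = parallel_review(rules, diff)
```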
"Scaling time", on the other hand, is useless. You can just divide the problem among subagents until each piece finishes within a few minutes, and that also increases quality because each agent has less context and more focus.
I made a cursed CPU in the game 'Turing Complete' and had an older version of Claude build me an assembler for it.
Good luck finding THAT in the training data. :-P
(just to be sure, I then had it write actual programs in that new assembly language)
Claude 4.5: not overfitted too much -- does the right thing 6/10 times.
Claude 4.6: overfitted -- does the right thing 2/10 times.
OpenAI 5.3: overfitted -- does the right thing 3/10 times.
These aren't perfect benchmarks, but it lets me know how much babysitting I need to do.
My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.
This is just as stuck in a moment in time as "they only do next word prediction". What does this even mean anymore? Are we supposed to believe that a review of this paper, which wasn't written when that model (it's putatively not an "LLM", but IDK enough about it to be pushy there) was trained, was somehow in the training data? Does that even make sense? We're not in the regime of regurgitating training data (if we really ever were). We need to let go of these frames which were barely true when they took hold. Some new shit is afoot.
Similarly, if there are millions of academic papers and thousands of peer reviews in the training data, a review of this exact paper doesn't need to be in there for the LLM to write something convincing. (I say "convincing" rather than "correct" since the author himself admits that he doesn't agree with all the LLM's comments.)
I tend to recommend people learn these things from first principles (e.g. build a small neural network, explore deep learning, build a language model) to gain a better intuition. There's really no "magic" at work here.
Claude figured out how the language worked and debugged segfaults until the compiler compiled, and then until the program did. That might not be magic, but it shows a level of sophistication where referring to “statistics” is about as meaningful as describing a person as the statistics of electrical impulses between neurons.
What you wrote would apply to a human approaching this task as well, sans the “many trillion lines of code”.
What behavior would you need to see for that explanation to no longer hold? Because it seems like it explains too much.
Took me a bit of messing around, but try to write out each state sequentially, with a check step between each.
I think they should be the perfect tool for finding methods or results in one field which look like they could be used in another field.
This is an interesting claim to me. Are there any models that exist that have been trained with a (single digit) number omitted from the training data?
If such a model does exist, how does it represent the answer? (What symbol does it use for the '7'?)
"I don't know how you get here from "predict the next word"" is not really so much a statement of ignorance where someone needs you to step in but a reflection that perhaps the tech is not so easily explained as that. No magic needs to be present for that to be the case.
Predict the next word is a terrible summary of what these machines do though, they certainly do more than that, but there are significant limitations.
‘Reasoning’ etc are marketing terms and we should not trust the claims made by companies who make these models.
The Turing test had too much confidence in humans it seems.
What would that be?
I do not personally feel it resembles thinking or reasoning though and really object to that framing because it is misleading many people.
What does that even mean? Their weights are essentially created by training. There aren't some magic golden weights that are then distorted.
LLMs miss very important concepts, like the concept of a fact. There is no "true", just consensus text on the internet given a certain context. Like that study recently where LLMs gave wrong info if there was the biography of a poor person in the context.
And of course they also miss things like embodiment, mirror neurons etc.
If an LLM makes a mistake, it will tell you it is sorry. But does it really feel sorry?
I don't get this. When you say "predict the next word" what you mean is "predict the word that someone who understands would write next". This cannot be done without an understanding that is as complete as that of the human whose behaviour you are trying to predict. Otherwise you'd have the paradox that understanding doesn't influence behaviour.
There's also this blog post: https://julianmichael.org/blog/2020/07/23/to-dissect-an-octo... (which IMO is better to read than the paper)
The language model predicts the next syllable by FIRST arriving at a point in space that represents UNDERSTANDING of the input language. This was true all the way back in 2017, at the time of Attention Is All You Need. Google had a beautiful explainer page on how transformers worked, which I am struggling to find. Found it. https://research.google/blog/transformer-a-novel-neural-netw...
The example was and is simple and perfect. The word bank exists. You can tell what bank means by its proximity to words, such as river or vault. You compare bank to every word in a sentence to decide which bank it is. Rinse, repeat. A lot. You then add all the meanings together. Language models are making a frequency association of every word to every other word, and then summing it to create understanding of complex ideas, even if it doesn't understand what it is understanding and has never seen it before.
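Here is a toy version of the "bank" example, as a sketch only. The 2-d embeddings are invented by hand (axis 0 roughly "water", axis 1 roughly "money"), there are no learned query/key/value projections, and the `scale` factor is arbitrary. It just shows the compare-to-every-word, then weighted-sum step.

```python
import math

# Hypothetical hand-made embeddings: axis 0 ~ "water", axis 1 ~ "money".
EMB = {
    "river": [1.0, 0.0],
    "vault": [0.0, 1.0],
    "bank":  [0.5, 0.5],   # ambiguous on its own
    "the":   [0.0, 0.0],
}

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(word, sentence, scale=4.0):
    """Compare `word` to every word in the sentence, then sum the weighted meanings."""
    q = EMB[word]
    scores = [scale * sum(a * b for a, b in zip(q, EMB[w])) for w in sentence]
    weights = softmax(scores)
    return [sum(wt * EMB[w][d] for wt, w in zip(weights, sentence))
            for d in range(len(q))]

river_bank = attend("bank", ["the", "river", "bank"])   # leans toward "water"
vault_bank = attend("bank", ["the", "bank", "vault"])   # leans toward "money"
```

The same word ends up with a different vector depending on its neighbours, which is the whole trick in miniature.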
That all happens BEFORE "autocompleting the next syllable."
The magic part of LLMs is understanding the input. Being able to use that to make an educated guess of what comes next is really a lucky side effect. The fact that you can chain that together indefinitely with some random number generator thrown in and keep saying new things is pretty nifty, but a bit of a show stealer.
What really amazes me about transformers is that they completely ignored prescriptive linguistic trees and grammar rules and let the process decode the semantic structure fluidly and on the fly. (I know Google uses encode/decode backwards from what I am saying here.) This lets people write crazy run-on sentences that break every rule of English (or your favorite language) but are still parsable as instructions.
It is really helpful to remember that transformers' origins are in language translation. They are designed to take text and apply a modification to it, while keeping the meaning static. They accomplish this by first decoding meaning. The fact that they then pivoted from translation to autocomplete is a useful thing to remember when talking to them. A task a language model excels at is taking text, reducing it to meaning, and applying a template. So a good test might be "take Frankenstein, and turn it into a Magic School Bus episode." Frankenstein is reduced to meaning, the Magic School Bus format is the template, the meaning is output in the form of the template. This is a translation, although from English to English, represented as two completely different forms. Saying "find all the Wild Rice recipes you can, normalize their ingredients to 2 cups of broth, and create a table with ingredient ranges (min-max) for each ingredient option" is closer to a translation than it is to "autocomplete." Input -> Meaning -> Template -> Output. With my last example the template itself is also generated from its own meaning calculation.
A lot has changed since 2017, but the interpreter being the real technical achievement still holds true imho. I am more impressed with AI's ability to parse what I am saying than I am by its output (image models notwithstanding).
It does not have an understanding, it pattern matches the "idea shape" of words in the "idea space" of training data and calculates the "idea shape" that is likely to follow considering all the "idea shape" patterns in its training data.
It mimics understanding. It feels mysterious to us because we cannot imagine the mapping of a corpus of text to this "idea space".
It is quite similar to how mysterious a computer playing a movie can appear, if you are not aware of mapping of movie to a set of pictures, pictures to pixels, and pixels to co-ordinates and colors codes.
When AI marketing (ab)uses the word, it is to project the appearance of human equivalence. And I don't like to fall for it.
Yea
>encoding a meaning is understanding.
encoding a meaning is encoding. Nothing more!
No need to gatekeep the word "understanding" behind subjective human experience eg qualia.
Yea, I think gatekeeping is needed exactly for the same reason. Make up another word if you want..
it does much more than this. first layer has an attention mechanism on all previous tokens and spits out an activation representing some sum of all relations between the tokens. then the next layer spits out an activation representing relations of relations, and the next layer and so forth. the llm is capable of deducing a hierarchy of structural information embedded in the text.
not clear to me how this isn't "understanding".
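For what it's worth, the "relations of relations" stacking can be sketched in a few lines. This is a toy under heavy assumptions: no learned weights, no MLP blocks, made-up 2-d inputs, and a plain residual add. It only illustrates that each position mixes in earlier positions, and that a second layer mixes the already-mixed outputs.

```python
import math

def causal_attention_layer(xs):
    """One toy layer: position i mixes information from positions 0..i only."""
    out = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]                      # causal mask: no peeking ahead
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        mixed = [sum(w * v[d] for w, v in zip(weights, visible))
                 for d in range(len(q))]
        # Residual connection: keep the token's own vector, add the mixture.
        out.append([a + b for a, b in zip(q, mixed)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # invented inputs
layer1 = causal_attention_layer(tokens)          # relations between tokens
layer2 = causal_attention_layer(layer1)          # relations between those relations
```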
Understanding would be a bit generous of a term for that I guess, but that also depends on the definition of understanding.
Google chose the word understanding.
Just because Transformers work well on the "Natural language understanding" task in AI, doesn't mean that a Transformer actually "understands" language in the human sense.
LLMs can be really good at "get all arguments against this", "Incorporate this viewpoint in this text while making it more concise", "Are these views actually contradicting, or can I write it such that they align? Consider incentives".
If you know what you're doing and understand the matter deeply (and that is very important) you'll find that the LLM is sometimes better at wording what you actually mean, especially when not writing in your native language. Of course, you study the generated text, make small changes, make it yours, make sure you feel comfortable with it etc. But man can it get you over that "how am I going to write this down"-hump.
Also: "Make an executive summary" "Make more concise", are great. Often you need to de-linkedIn the text, or tell it to "not sound like an American waiter", and "be business-casual", "adopt style of rest of doc", etc. But it works wonders.
There's the 3b1b video series which does a pretty good job, but now we are interfacing with models that probably have parameter counts in each layer larger than the first models that we interacted with.
The novel insights that these models can produce are truly shocking, I would guess even for someone who does understand the latest techniques.
[1] https://www.manning.com/books/build-a-large-language-model-f...
But.. I recently had a LLM suggest an approach to negative mold-making that was novel to me. Long story, but basically isolating the gross geometry and using NURBS booleans for that, plus mesh addition/subtraction for details.
I’m sure there’s prior art out there, but that’s true for pretty much everything.
I haven't done any 3D modeling so I'll take your word for it but I can tell you that I am working on a very simple interpreter & bytecode compiler for a subset of Erlang & I have yet to see anything novel or even useful from any of the coding assistants. One might naively think that there is enough literature on interpreters & compilers for coding agents to pretty much accomplish the task in one go but that's not what happens in practice.
The difference in quality of output between Claude Sonnet and Claude Opus is around an order of magnitude.
The results that you can get from agent mode vs using a chat bot are around two orders of magnitude.
have you run these models in an agent mode that allows for executing the tests, the agent views the output, and iterates on its own for a while? up to an hour or so?
you will get vastly different output if you ask the agent to write 200 of its own test cases, and then have it iterate from there
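The loop being described looks roughly like the sketch below. Everything here is a stand-in: `propose` plays the role of the model and simply returns a better draft on the second attempt, and the test cases are invented. A real agent would generate code, run an actual test suite, and read the real output.

```python
def run_tests(func, cases):
    """Return the failing cases as (args, expected, got) tuples."""
    failures = []
    for args, expected in cases:
        got = func(*args)
        if got != expected:
            failures.append((args, expected, got))
    return failures

def agent_loop(propose, cases, max_iterations=5):
    """Propose a candidate, run the tests, feed failures back, repeat."""
    feedback = []
    for attempt in range(1, max_iterations + 1):
        candidate = propose(attempt, feedback)
        feedback = run_tests(candidate, cases)
        if not feedback:
            return candidate, attempt
    return None, max_iterations

def propose(attempt, feedback):
    # Hypothetical "model": a fixed list of drafts, the first one buggy.
    drafts = [lambda a, b: a - b, lambda a, b: a + b]
    return drafts[min(attempt, len(drafts)) - 1]

cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
solution, attempts = agent_loop(propose, cases)
```

The more test cases the loop has to check against, the more signal each iteration gets, which is presumably why asking for 200 of them up front helps.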
My advice: ask for more than what you think it can do. #1 mistake is failing to give enough context about goals, constraints, priorities.
Don’t ask “complete this one small task”, ask “hey I’m working on this big project, docs are here, source is there, I’m not sure how to do that, come up with a plan”
The next-word bit may be slightly higher than an individual transistor, possibly functional units.
Now the machines are getting better than we are. It's exciting and a little bit terrifying.
We were polymers that evolved intelligence. Now the sand is becoming smart.
Then AI companies should stop looking for investors and instead play stock markets with all that predictive powers!
You mean, money sucking companies, right?
>You're removed from orders of magnitude in upside potential if you have to wait for the public markets.
because that won't work. That is why!
Is that what you (and all people) are in your job function? A money suck?
Do you ever buy anything for food, shelter, and clothing? Do you have hobbies?
Capitalism means we don't have to all be hunter-gatherers, and I'm pretty keen on that trade.
> because that won't work. That is why!
This is the forum for a venture capital firm. A lot of the folks here build things with the intention of creating value and getting compensated for that value creation. Other valid options are sitting at home and playing video games, reading a book, or posting on HN.
I like working on problems where I'm the customer and where I would buy the product if it existed. Turns out, there tend to be other people who would buy my software too.
RL is where the magic comes from, and RL is more than just "predict the next word". It has agents and environments and actions and rewards.
The question puts the horse behind the buggy. The main point isn't the "from"; it is how you get to “predict the next word.” During training the LLM builds inside itself a compressed, aggregated representation (a model) of what is fed into it. Given that model, you can "predict the next word", and you can do a lot of other things as well.
As a starting point for understanding, I'd suggest looking back at the foundational result that started it all, the "sentiment neuron":
https://openai.com/index/unsupervised-sentiment-neuron/
"simply predicting the next character in Amazon reviews resulted in discovering the concept of sentiment. "
Suppose you prompted the underlying LLM with "You are an expert reviewer in..." and a bunch of instructions followed by the paper. LLM knows from the training that 'expert reviewer' is an important term (skipping over and oversimplifying here) and my response should be framed as what I know an expert reviewer would write. LLMs are good at picking up (or copying) the patterns of response, but the underlying layer that evaluates things against a structural and logical understanding is missing. So, in corner cases, you get responses that are framed impressively but do not contain any meaningful inputs. This trait makes LLMs great at demos but weak at consistently finding novel interesting things.
If the above is true, the author will find after several reviews that the agent they use keeps picking up on the same/similar things (collapsed behavior that makes it good at coding type tasks) and is blind to some other obvious things it should have picked up on. This is not a criticism, many humans are often just as collapsed in their 'reasoning'.
LLMs are good at 8 out of 10 tasks, but you don't know which 8.
Or do you think the article's author wrote this as an ad? He's a reputable academic who seems impressed with an AI tool he used and is honestly sharing his thoughts.
For reference he published the 80 page inflation mini-book 2 weeks ago asking for feedback: https://www.grumpy-economist.com/p/inflation
Ghuntley used to be reputable on here, then the crypto money looked too juicy.
Remade the conversation with personal information stripped here https://chatgpt.com/share/699fef77-b530-8007-a4ed-c3dda9461d...
I’m not too familiar with the history, but the import of this article is brushing up on my nose hairs in a way that makes me think a sort of neo-Sophistry is on the horizon.
That is my take too, I was surprised to see how many people object to their works being trained on. It's how you can leave your mark, opening access for AI, and in the last 25 years opening to people (no restrictions on access, being indexed in Google).
Your words will be like a drop in the ocean, an ocean where the water volume keeps increasing every year. Also, if nobody reads anything anymore, what's the point?
Your surprise to people’s objections makes sense if you can’t count.
the value being extracted via LLM techniques is new value, which did not previously exist. The producer(s) of the old data had an asking price, which was taken by the LLM trainers. They cannot make the argument that since the LLM is producing new value, they should retroactively update their old asking price for their works.
They could update their asking price for any new works they produce. They also have the right to ask their works not be used for training, etc. But they cannot ask their old works to be paid for by the new uses in LLM in a retroactive way.
This is... blatantly untrue?
https://arstechnica.com/tech-policy/2026/02/microsoft-remove...
https://www.theatlantic.com/technology/archive/2025/03/libge...
That's to say, most people recognize when they're getting fucked over and are correct to object to it.
Actually we have an awful lot of those.
I'm not sure if emergent is quite the right term here. We carefully craft a scenario to produce a usable gradient for a black box optimizer. We fully expect nontrivial predictions of future state to result in increasingly rich world models out of necessity.
It gets back to the age old observation about any sufficiently accurate model being of equal complexity as the system it models. "Predict the next word" is but a single example of the general principle at play.
This is an admission that we don't know how it emerges.
Sure, we expect the behavior to emerge, but we don't know how.
The "black box" bit refers to a generic, interchangeable optimization algorithm that simply makes the number go down (or up or whatever).
There are certainly various details about the internal workings of models that we don't properly understand but a blanket claim about the whole is erroneous.
[1] https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F
We fully do. There is a significant quality difference between English language output and other languages which lends a huge hint as to what is actually happening behind the scenes.
> but how exactly does anthill behavior come from ant behavior?
You can't smell what ants can. If you did I'm sure it would be evident.
1. Can you reveal "what's actually happening behind the scenes" beyond the hint you gave? I can't figure it out.
2. Can you explain how an ant's sense of smell leads to anthills?
Ant 0: doesn’t seem to be dangerous here. I’ll drop a scent.
Ant 1: oh cool, a safe place. And I didn’t die either. I’ll reinforce that.
Ant 142,857,098,277: cool anthill.
?
Not your fault obviously, but they have not yet described what that huge hint is, and I'm just at the edge of my seat with anticipation here.
And this is an ad, I assume.
I use LLMs for different research-related tasks and I surely can relate. In the past few months, the latest models have become better than me at many tasks. And I am not an ad.
Scientific: It's a combined response from the collective unconscious of everyone participating. In other words, it's a probabilistic result of an "answer" to the question everyone hears.
Occult: If an entity is present, it's basically the unshielded response of that entity, collectively moving everyone's body the same way, as a form of a mild channel. Since Ouija doesn't specify making a circle and requesting the presence of a specific entity, there's a good chance of some being hostile. Or you all get nothing at all, and basically garbage as part of the divination/communication.
But comparing Ouija to LLMs? The LLM, with the same weights, with the same hyperparameters, and same questions will give the same answers. That is deterministic, at least in that narrow sense. An Ouija board is not deterministic, and cannot be tested in any meaningful scientific sense.
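A small sketch of that narrow sense of determinism, assuming the random seed counts among the "hyperparameters". The logits below are invented stand-ins for fixed weights, and `sample_tokens` is a made-up helper, not any real library's API.

```python
import math
import random

# Hypothetical fixed "weights": the scores never change between runs.
LOGITS = {"yes": 2.0, "no": 1.0, "maybe": 0.5}

def sample_tokens(n, temperature, seed):
    """Draw n tokens; same logits, temperature and seed give the same output."""
    rng = random.Random(seed)
    words = list(LOGITS)
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        best = max(words, key=LOGITS.get)
        return [best] * n
    weights = [math.exp(LOGITS[w] / temperature) for w in words]
    return [rng.choices(words, weights=weights)[0] for _ in range(n)]

run_a = sample_tokens(8, 1.0, seed=42)
run_b = sample_tokens(8, 1.0, seed=42)   # identical to run_a
greedy = sample_tokens(3, 0, seed=1)      # no randomness involved at all
```

Change the seed and the sampled answer can change; keep everything fixed and it repeats, which is exactly what you can't arrange with a Ouija board.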
At least AI-haters don’t seem to be talking about “stochastic parrots” quite so much now. Maybe they finally got the memo.
That is the exact thing to say because that is exactly what it does, despite how it does so.
It is not useful to say it if you are an AI-shill though. You brought up AI-haters, so I think I am entitled to bring up AI-shills.
On the other hand, calling these tools "intelligent", capable of "reasoning" and "thought", is not only more confusing and can never be simplified, but dishonest and borderline gaslighting.
To me LLMs are incredibly simple. Next word, next sentence, next paragraph, and next answer are stacked attention layers which identify manifolds and run in reverse to then keep the attention head on track for the next token. It’s pretty straightforward math, and you can sit down and make a tiny LLM pretty easily on your home computer with a good-sized bag of words and context.
To me it’s baffling that everyone goes around constantly saying that not even Nobel prize winners know how this works and that it’s a huge mystery.
Has anyone thought to ask the actual people like me and others who invented this?
I’m hesitant to call this an outright win, though.
Perhaps the review service the author is using is really good.
Almost certainly the taste, expertise and experience of the author is doing unseen heavy lifting.
I found that using prompts to do submission reviews for conferences tended to make my output worse, not better.
Letting the LLM analyze submissions resulted in me disconnecting from the content. To the point I would forget submissions after I closed the tab.
I ended up going back to doing things manually, using them as a sanity check.
On the flip side, weaker submissions using generative tools became a nightmare, because you had to wade through paragraphs of fluff to realize there was no substantive point.
It’s to the point that I dread reviewing.
I am going to guess that this is more useful for experts, who will make stronger submissions, than for novices and journeymen, who will still make foundational errors.
Ah yes, the famous "Cut GDP in half, abolish public schooling and use that as a control" experiment. Majority of economic "models" are entirely correlational without any mechanistic explanation whatsoever or an explanation so superficial that it contradicts either itself or observed reality.
If you look deeper and read the explanatory notes of economic laws, the model may refer to some publications, but then the actual figures plugged into the model are explained as "these values have been observed to lead to the desired outcomes, therefore are set without any modeling or validation, hope for the best, lesssgoooo".
Personally I believe them, considering the content of the article.
If anything, even the included quotes from Refine don't smell much of typical AI, but maybe I am less discerning there. I did notice the em-dashes though!
Sort of the lowest hanging fruit imaginable. Just because it became "fundamental" to the process doesn't mean it gained any quality.