I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
https://arxiv.org/abs/2402.14903
You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
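For illustration, here's a minimal sketch of that right-to-left grouping on a bare digit string (the papers' actual pre-tokenizer patterns may differ; a real pre-tokenizer would anchor on a non-digit boundary rather than end-of-string):

    import re

    def group_digits_right_to_left(digits: str) -> list[str]:
        # Split into groups of up to 3 digits, anchored from the right:
        # "1234567" -> ["1", "234", "567"] instead of ["123", "456", "7"].
        # The lookahead only allows a match when the remaining digits
        # form whole groups of 3.
        return re.findall(r"\d{1,3}(?=(?:\d{3})*$)", digits)

    print(group_digits_right_to_left("1234567"))  # ['1', '234', '567']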
The bitter lesson is that general methods and systems that learn trump trying to manually embed/program human knowledge into the system, so clever architecture is OK and expected.
Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.
DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.
Any time I hear the goal being "AGI" in the context of these LLMs, I feel like listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.
Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There's no iterative improvements here that will get you to AGI.
> "You're not understanding, are you? The brain does the thinking. The meat."
> "Thinking meat! You're asking me to believe in thinking meat!"
https://www.mit.edu/people/dpolicar/writing/prose/text/think...
It feels more interesting to note that this time, it is different. I've been watching the field since the 90s when I first dabbled in crude neural nets. I am informed there was hype before, but in my time I've never seen progress like we've made in the last five years. If you showed it to people from the 90s, it would be mind blowing. And it keeps improving incrementally, and I do not think that is going to stop. The state of AI today is the worst it will ever be (trivially obvious but still capable of shocking me).
What I'm trying to say is that the shocking success of LLMs has become a powerful engine of progress, creating a positive feedback loop that is dramatically increasing investment, attracting top talent, and sharpening the focus of research into the next frontiers of artificial intelligence.
90's? It's mind blowing to me now.
My daily driver laptop is (internally) a Thinkpad T480, a very middle-of-the-road business-class laptop from 2018.
It now talks to me. Usually knowledgeably, in a variety of common languages, using software I can download and run for free. It understands human relationships and motivations. It can offer reasonable advice and write simple programs from a description. It notices my tone and tries to adapt its manner.
All of this was inconceivable when I bought the laptop - I would have called it very unrealistic sci-fi. I am trying not to forget that.
And I don't follow; we've had vehicles capable of reaching the moon for over 55 years.
The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens of length less than (context_len - prefix_len).
There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" is doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter.
The tree-growing comment was a reference to another comment earlier in the comment chain.
And why?
The claim was that it isn't possible in principle for "DAGs" or "immutable architectures" to be intelligent. That statement is confusing some theoretical results that aren't applicable to how LLMs work (output context is mutation).
I'm not claiming that compute makes them intelligent. I'm pointing out that it is certainly possible, and at that level of compute it should be plausible. Feel free to share any theoretical results you think demonstrate the impossibility of "DAG" intelligence and are applicable.
LLMs might never be able to crunch numbers reliably, however I expect they should be very good at identifying the right formula and the inputs for a problem ("i need the answer to x*y, where x=12938762.3 and y=902832.2332"). Then they can call a math engine (calculator or wolfram alpha or whatever) to do the actual computation. That's what humans do anyway!
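As a rough sketch of that division of labor (the CALC convention and the restricted evaluator below are made up for illustration, standing in for a real math engine):

    import re

    def run_tool_calls(model_output: str) -> str:
        """Replace 'CALC: <expr>' spans emitted by the model with computed values."""
        def evaluate(match: re.Match) -> str:
            expr = match.group(1)
            # Only allow digits, whitespace, and arithmetic operators before eval'ing.
            if not re.fullmatch(r"[\d\s\.\+\-\*\/\(\)eE]+", expr):
                return match.group(0)
            return str(eval(expr))
        return re.sub(r"CALC:\s*([^\n]+)", evaluate, model_output)

    print(run_tool_calls("The answer is CALC: 12938762.3 * 902832.2332"))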
A system then samples from that distribution, typically with randomness, and some optimizations used when running them introduce additional randomness, but it's important to understand that the models themselves are not random.
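A minimal sketch of that split (numbers are arbitrary): the logits for a given prefix are a fixed function of the input; randomness only enters when a token is drawn from the resulting distribution.

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])             # same prefix -> same logits, every time
    probs = np.exp(logits) / np.exp(logits).sum()  # deterministic softmax

    greedy = int(np.argmax(probs))                         # deterministic decoding
    sampled = int(np.random.choice(len(probs), p=probs))   # stochastic decoding
    print(probs, greedy, sampled)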
It's best to assume that the relationship between input and output of an LLM is not deterministic, similar to something like using a Google search API.
To the extent we've already found that to be the case, it's perhaps the weirdest part of this whole "paradigm shift."
But from time to time, doing this does require doing arithmetic correctly (to correctly add two exponents or whatever), so it would be nice to be able to trust that.
I imagine there are other uses for basic arithmetic too: QA applications over data that quotes statistics, and such.
It sounds weird, but try writing your problem in LaTeX - I don't know why, but I've found a couple of models to be incredibly capable at solving mathematical problems if you write them in LaTeX.
UPD: Found the paper: - https://huggingface.co/papers/2502.09741 - https://fouriernumber.github.io/
In the paper mentioned, a “number” is a single sort-of “token” with a numeric value, so the network deals with numbers as real numbers, separately from their character representation. All the math happens directly on the “number value”. In the majority of current models, numbers are handled as sequences of characters.
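The linked paper's exact construction is more involved, but here's a hedged sketch of the general idea (sin/cos features of the numeric value at several scales; the periods and sizes here are arbitrary):

    import numpy as np

    def numeric_embedding(x: float, periods=(1, 10, 100, 1000, 10000)) -> np.ndarray:
        # Encode the value of x directly, rather than as a sequence of digit tokens.
        feats = []
        for T in periods:
            feats += [np.sin(2 * np.pi * x / T), np.cos(2 * np.pi * x / T)]
        return np.array(feats)

    print(numeric_embedding(1234.0).round(3))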
To draw an analogy: our human brains have specialized regions.
Why not implement a part of the AI brain that's not neural nets, but instead circuitry specialized to math?
Maybe a dumb question since I'm a layperson!
Of course, the extra rules have to be logically consistent with the base S and K combinators, otherwise you will get wrong results. But if the inconsistent rule is complicated enough to be used only infrequently, you will still get correct results most of the time.
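A tiny sketch of rewriting with just the two base rules (the term representation here is made up for illustration): K a b -> a, and S f g x -> (f x)(g x).

    # A term is "S", "K", a variable name, or a 2-tuple (function, argument).
    def reduce_once(t):
        if isinstance(t, tuple):
            f, x = t
            if isinstance(f, tuple) and f[0] == "K":          # K a b -> a
                return f[1], True
            if isinstance(f, tuple) and isinstance(f[0], tuple) and f[0][0] == "S":
                return ((f[0][1], x), (f[1], x)), True        # S f g x -> (f x)(g x)
            for i, sub in enumerate(t):                       # otherwise reduce a subterm
                new, changed = reduce_once(sub)
                if changed:
                    return ((new, x) if i == 0 else (f, new)), True
        return t, False

    def normalize(t, limit=1000):
        for _ in range(limit):
            t, changed = reduce_once(t)
            if not changed:
                break
        return t

    # S K K behaves as the identity: ((S K) K) x -> x
    print(normalize(((("S", "K"), "K"), "x")))  # prints: x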
Which brings me to LLMs and transformers. I posit that transformers are essentially learned systems of rules that are applied to a somewhat fuzzily known set of combinators (programs), each represented by a token (the term being represented by the embedding vector). However, the rules learned are not necessarily consistent (as happens in the source data), so you get an occasional logical error (I don't want to call it hallucination because it's a different phenomenon from the nondeterminism and extrapolation of LLMs).
This explains the collapse from the famous paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin... One infrequent but inconsistent rule is enough to poison the well, due to the logical principle of explosion. It also clearly cannot be completely fixed with more training data.
(There is also an analogy to Terry Tao's stages of mathematical thinking: https://terrytao.wordpress.com/career-advice/theres-more-to-... Pre-rigorous corresponds to a somewhat random set of likely inconsistent logical rules, rigorous to a small set of obviously consistent rules, like only S and K, and post-rigorous to a large set of rules that have been vetted for consistency.)
What is the "solution" to this? Well, I think during training you somehow need to make sure that the transformer rules learned by the LLM are logically consistent for the strictly logical fragment of the human language that is relevant to logical and programming problems. Which is admittedly not an easy task (I doubt it's even possible within the NN framework).
Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies that we have a maximum of 1k degrees of freedom (or rank) on our output. The model is able to pick any single one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ is inherently limited to 1k unique linear components.
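A minimal sketch of that bottleneck, with the sizes above used purely for illustration (W and h are random stand-ins for the unembedding matrix and a final hidden state):

    import numpy as np

    d, vocab = 1_000, 15_000
    W = np.random.randn(vocab, d)     # output (unembedding) projection
    h = np.random.randn(d)            # final hidden state for one position
    logits = W @ h                    # 15k logits, but confined to a d-dimensional subspace
    # Whatever h is, the logit vector is a linear combination of W's 1k columns,
    # so the family of reachable distributions has at most ~1k degrees of freedom.
    print(np.linalg.matrix_rank(W))   # <= 1000, never the full vocab size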
12288 dimensions (GPT3 size) can fit more than 40 billion nearly perpendicular vectors.
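A quick numerical check of that claim (sample size and seed are arbitrary): random directions in high dimensions are nearly orthogonal, with pairwise cosines concentrated near zero.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 12288, 1000                               # GPT-3 width, a modest sample
    V = rng.normal(size=(n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit vectors
    cos = V @ V.T
    off_diag = np.abs(cos[~np.eye(n, dtype=bool)])
    print(off_diag.max(), off_diag.mean())           # both small, on the order of 1/sqrt(d)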
However, I'm talking about the probability distribution of tokens.
Detecting and preventing unargmaxable outputs in bottlenecked neural networks, Andreas Grivas (2024)
If I remember correctly, that's not true because of the nonlinearities, which provide the model with more expressivity. The transformation from 15k to 1k is rarely an affine map; it's usually highly non-linear.
The real bitter lesson in AI is that we don't really know what we're doing. We're hacking on models looking for architectures that train well but we don't fully understand why they work. Because we don't fully understand it, we can't design anything optimal or know how good a solution can possibly get.
Well, technically, that's not true: The entire idea behind complexity theory is that there are some tasks that you can't throw more hardware at - at least not for interesting problem sizes or remotely feasible amounts of hardware.
I wonder if we'll reach a similar situation in AI where "throw more context/layers/training data at the problem" won't help anymore and people will be forced to care more about understanding again.
More precisely, I think producing a good fast merge of ca. 5 lists was a problem I didn't have good answers for, but maybe I was too fixated on a streaming solution and didn't apply enough tricks.
Also, solution testing is mandatory. Luckily, you can ask an RNG for that, too, as long as you have tests for the testers already written.
Maybe the hope is that you won't have to manually map the universal algorithm to your specific problem and can just train the transformer to figure it out instead, but there are few proofs that transformers can solve all problems in some complexity class through training instead of manual construction.
Of course, DeepSeek was forced to take the optimisation approach but got to the end in time to stake a claim. So YMMV.
When all you have is a hammer... It makes a lot of sense that a transformation layer that makes the tokens more semantically relevant will help optimize the entire network after it and increase the effective size of your context window. And one of the main immediate obstacles stopping those models from being intelligent is context window size.
On the other hand, the current models already cost something on the order of the median country's GDP to train, and they are nowhere close to that in value. The saying that "if brute force didn't solve your problem, you didn't apply enough force" is intended to be taken as a joke.
https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)
Models are expensive, but they're not that expensive.
[0] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nomi...
The largest economy (US) has a GDP of $27.7 trillion.
The smallest economy (Tuvalu) has a GDP of $62.3 million.
The 48 billion number represents the middle point where half of all countries have larger GDPs and half have smaller GDPs.
Note: 1946 CPI = 19.5, 2025 CPI = 321.465, which makes for an increase by a factor of about 16.49.
CPI{2025} / CPI{1946} * Price{1946} = Price{2025}
to obtain the price adjusted for inflation?
That is the only way I was able to arrive at the same number you got: $6,594,153.846. TIL.
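For the record, a quick script that reproduces those numbers (the $400,000 starting figure is back-solved from the quoted result; it isn't stated above):

    cpi_1946, cpi_2025 = 19.5, 321.465
    price_1946 = 400_000
    factor = cpi_2025 / cpi_1946        # ~16.49, the "increase" above
    price_2025 = price_1946 * factor
    print(factor, price_2025)           # ~16.49, ~6,594,153.85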
If the article is correct, and this is the best way to make them, their price will explode.
Step 2: ?
Step 3: Profit.
It’s not enough to have the biggest model, or the best model per dollar spent; you still need to figure out how to make money with it. It’s not clear that vastly increased expenditure will produce a good ROI.
Is this really true?
This is just blatantly false.
> According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI’s GPT-4 used an estimated $78 million worth of compute to train, while Google’s Gemini Ultra cost $191 million for compute.
https://hai.stanford.edu/ai-index/2024-ai-index-report
No need to even open up the actual report to find that. Just scroll down the page to read the 'key takeaways'.
Specifically, they made tokens for runs of 4, 8, 12, or 16 spaces, or something like that.
I assume you started programming some time this millennium? That's the only way I can explain this "take".
Can someone (who knows about LLMs) explain why the r's in strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less informed people, that got picked up?
In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
I'm not sure about what you mean about them not "seeing" the tokens. They definitely receive a representation of each token as input.
Please take another look at my original comment. I was being precise about the distinction between what's structurally possible to generalize vs memorize.
Count the number of Rs in this sequence: [496, 675, 15717]
Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
Human: Which is the easier of these formulas
1. x = SQRT(4)
2. x = SQRT(123567889.987654321)
Computer: They're both the same.
[496, 675, 15717] is the GPT-4 representation of the tokens. In order to determine which letters the token represents, it needs to learn the relationship between "str" and [496]. It can learn the representation (since it can spell it out as "S-T-R" or "1. S, 2. T, 3. R" or whatever) but it adds an extra step.
The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?
It seems like the longer context length makes the trade-off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers it does appear that math is significantly worse when the model doesn't have access to individual digits (early Llama math results, for example). Once they changed the digit tokenization, the math performance improved.
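An easy way to see what the model actually receives is OpenAI's tiktoken package (the IDs and splits depend on the encoding; this assumes the GPT-4-era cl100k_base):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                              # a short list of integer token IDs
    print([enc.decode([i]) for i in ids])   # the substrings those IDs stand for
    # The character-level structure ("how many r's?") is not visible in the IDs;
    # the model has to learn each token's spelling as a separate association.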
GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://news.ycombinator.com/item?id=39728870 )
More recent research:
https://huggingface.co/spaces/huggingface/number-tokenizatio...
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903
https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...
https://twitter.com/yuntiandeng/status/1836114401213989366
If anything I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring use of COT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained in the right way can do arithmetic, where are the papers solving the "count R in strawberry" problem?
IME Reddit would scream "tokenization" at the strawberry meme until blue in the face, assuring themselves that better tokenization meant the problem would be solved. Meanwhile, RLHF'ers were/are being paid en masse to solve the problem by correcting thousands of these "counting"/perfect-syntax prompts and problems. To me, since RLHF work was being paid to tackle these problems, it couldn't be a simple tokenization problem. If there were a tokenization bottleneck whose fixing would solve the problem, we would not be getting paid so much money to RLHF syntax-perfect prompts (think of Sudoku-type games and heavy syntax-based problems).
No, the reason models are better at these problems now is RLHF. And before you say, well, now models have learned how to count in general, I say we just need to widen the abstraction a tiny bit and the models will fail again. And this will be the story of LLMs forever: they will never take the lead on their own, and it's not how humans process information, but it can still be useful.
As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. But of course you can't do this, you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.
Actually, I asked Grok this question a while ago when probing how well it could count vowels in a word. It got it right by listing every letter individually. I then asked it to count without listing the letters and it was a couple of letters off. I asked it how it was counting without listing the letters and its answer was pretty fascinating, with a seeming awareness of its own internal processes:
Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.
When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.
> Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
https://www.anthropic.com/news/tracing-thoughts-language-mod...
It seems to be about as useful as asking a person how their hippocampus works: they might be able to make something up, or repeat a vaguely remembered bit of neuroscience, but they don't actually have access to their own hippocampus' internal workings, so if they're correct it's by accident.
Concretely: we learn a medium-sized model that takes a partial tokenization and outputs a probability distribution over the endpoints of the next token (say we let token lengths range from 1 to 64 bytes, so the model outputs 64 logits). Then we do a beam search to find the, say, 4 most likely tokenizations. Then we run the transformer on all four tokenizations, and we take the expected value of the loss to be the final loss.
If we train this on prompt-response pairs, so that it only has to learn what to say and doesn't have to predict the context, then it could learn to skim boring stuff by patching it into ~64 byte tokens. Or more if we want.
And ofc we'd use a short context byte level transformer to encode/decode tokens to vectors. Idk this idea is kinda half baked.
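For what it's worth, here's a rough, runnable sketch of that loop with stand-in components (the segmenter is a random stub; the beam width, max token length, and scoring are all placeholders):

    import numpy as np

    MAX_LEN, BEAM = 64, 4

    def segmenter_logits(prefix: bytes) -> np.ndarray:
        # Stand-in for the medium-sized model: one logit per candidate token length.
        rng = np.random.default_rng(len(prefix))
        return rng.normal(size=MAX_LEN)

    def beam_segment(data: bytes):
        beams = [([], 0.0)]  # (list of token byte-strings, log-prob so far)
        while any(sum(map(len, toks)) < len(data) for toks, _ in beams):
            candidates = []
            for toks, lp in beams:
                consumed = sum(map(len, toks))
                if consumed == len(data):
                    candidates.append((toks, lp))
                    continue
                logits = segmenter_logits(data[:consumed])
                logprobs = logits - np.logaddexp.reduce(logits)
                for length in range(1, min(MAX_LEN, len(data) - consumed) + 1):
                    candidates.append((toks + [data[consumed:consumed + length]],
                                       lp + logprobs[length - 1]))
            beams = sorted(candidates, key=lambda b: -b[1])[:BEAM]
        return beams

    for toks, lp in beam_segment(b"some boring boilerplate text"):
        print(round(lp, 2), toks)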
I'm a total noob in ML. I just had to vent something for not understanding this stuff and realizing that knowing physics doesn't mean you can grok ML mechanics.
Maybe there could be something like a mixture-of-experts but with a thousand experts and each has its own tokenization.
Fortunately I don't actually understand this stuff, so I am going to go ahead and congratulate myself on my brilliant ideas and let the geniuses work out the details. :P
There’s no reason to assume it’s the best solution. It might be the case that a better tokenization scheme is needed for math, reasoning, video, etc models.
That said, the hand-coded nature of tokenization certainly seems in dire need of a better solution, something that can be learned end to end. And it looks like we are getting closer with every iteration.
So any system that predicts the optimization with a general solver can scale better than heuristic or constrained-space solvers.
Until recently, there have been no general solvers at that scale.
> As it's been pointed out countless times - if the trend of ML research could be summarised, it'd be the adherence to The Bitter Lesson - opt for general-purpose methods that leverage large amounts of compute and data over crafted methods by domain experts
But we're only 1 sentence in, and this is already a failure of science communication at several levels.
1. The sentence structure and grammar are simply horrible
2. This is condescending: "pointed out countless times" - has it?
3. The reference to Sutton's essay is oblique, easy to miss
4. Outside of AI circles, "Bitter Lesson" is not very well known. If you didn't already know about it, this doesn't help.
If I remember correctly, GPT3.5's tokenizer treated Cyrillic as individual characters, and GPT3.5 was pretty good at Russian.
Maybe if you have infinite compute you don't worry about software design. Meanwhile in the real world...
Not only that, but where did all these compute-optimized solutions come from? Oh yeah, millions of man-hours of optimizing and testing algorithmic solutions. So unless you are some head-in-the-clouds tenured professor, just keep on doing your optimizations and job as usual.
I’m hoping someday that dude releases an essay called The Cold Comfort. But it’s impossible to predict when or who it will help, so don’t wait for it.
That is why it's called bitter; it isn't a fun realization.
Of course, instead of the beach one could spend those Y months improving the algorithms... but it's never wise to bid against yourself if you don't have to.
A corollary is that to maximize your beach time you should work on the biggest N possible, neatly explaining the popularity of AI startups.
Agreed that overfitting the bitter lesson often leads to slopping piles of compute and hardware at problems that could just be deterministic.
This is all explained in the original essay: http://www.incompleteideas.net/IncIdeas/BitterLesson.html