I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
https://arxiv.org/abs/2402.14903
You tokenize right to left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
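For illustration, here's a minimal sketch of that right-to-left grouping on a bare digit string (the papers' actual pre-tokenizer patterns may differ; a real pre-tokenizer would anchor on a non-digit boundary rather than end-of-string):

    import re

    def group_digits_right_to_left(digits: str) -> list[str]:
        # Split into groups of up to 3 digits, anchored from the right:
        # "1234567" -> ["1", "234", "567"] instead of ["123", "456", "7"].
        # The lookahead only allows a match when the remaining digits
        # form whole groups of 3.
        return re.findall(r"\d{1,3}(?=(?:\d{3})*$)", digits)

    print(group_digits_right_to_left("1234567"))  # ['1', '234', '567']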
The bitter lesson is that general methods and systems that learn trump trying to manually embed/program human knowledge into the system, so clever architecture is OK and expected.
Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.
DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.
Any time I hear the goal being "AGI" in the context of these LLMs, I feel like listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.
Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There's no iterative improvements here that will get you to AGI.
> "You're not understanding, are you? The brain does the thinking. The meat."
> "Thinking meat! You're asking me to believe in thinking meat!"
https://www.mit.edu/people/dpolicar/writing/prose/text/think...
It feels more interesting to note that this time, it is different. I've been watching the field since the 90s when I first dabbled in crude neural nets. I am informed there was hype before, but in my time I've never seen progress like we've made in the last five years. If you showed it to people from the 90s, it would be mind blowing. And it keeps improving incrementally, and I do not think that is going to stop. The state of AI today is the worst it will ever be (trivially obvious but still capable of shocking me).
What I'm trying to say is that the shocking success of LLMs has become a powerful engine of progress, creating a positive feedback loop that is dramatically increasing investment, attracting top talent, and sharpening the focus of research into the next frontiers of artificial intelligence.
90's? It's mind blowing to me now.
My daily driver laptop is (internally) a Thinkpad T480, a very middle-of-the-road business-class laptop from 2018.
It now talks to me. Usually knowledgeably, in a variety of common languages, using software I can download and run for free. It understands human relationships and motivations. It can offer reasonable advice and write simple programs from a description. It notices my tone and tries to adapt its manner.
All of this was inconceivable when I bought the laptop - I would have called it very unrealistic sci-fi. I am trying not to forget that.
And I don't follow; we've had vehicles capable of reaching the moon for over 55 years.
The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens of length less than (context_len - prefix_len).
There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" is doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter.
The tree-growing comment was a reference to another comment earlier in the comment chain.
And why?
The claim was that it isn't possible in principle for "DAGs" or "immutable architectures" to be intelligent. That statement is confusing some theoretical results that aren't applicable to how LLMs work (output context is mutation).
I'm not claiming that compute makes them intelligent. I'm pointing out that it is certainly possible, and at that level of compute it should be plausible. Feel free to share any theoretical results you think demonstrate the impossibility of "DAG" intelligence and are applicable.
LLMs might never be able to crunch numbers reliably, however I expect they should be very good at identifying the right formula and the inputs for a problem ("i need the answer to x*y, where x=12938762.3 and y=902832.2332"). Then they can call a math engine (calculator or wolfram alpha or whatever) to do the actual computation. That's what humans do anyway!
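As a rough sketch of that division of labor (the CALC convention and the restricted evaluator below are made up for illustration, standing in for a real math engine):

    import re

    def run_tool_calls(model_output: str) -> str:
        """Replace 'CALC: <expr>' spans emitted by the model with computed values."""
        def evaluate(match: re.Match) -> str:
            expr = match.group(1)
            # Only allow digits, whitespace, and arithmetic operators before eval'ing.
            if not re.fullmatch(r"[\d\s\.\+\-\*\/\(\)eE]+", expr):
                return match.group(0)
            return str(eval(expr))
        return re.sub(r"CALC:\s*([^\n]+)", evaluate, model_output)

    print(run_tool_calls("The answer is CALC: 12938762.3 * 902832.2332"))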
A system then samples from that distribution, typically with randomness, and some optimizations used when running them introduce additional randomness, but it's important to understand that the models themselves are not random.
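A minimal sketch of that split (numbers are arbitrary): the logits for a given prefix are a fixed function of the input; randomness only enters when a token is drawn from the resulting distribution.

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])             # same prefix -> same logits, every time
    probs = np.exp(logits) / np.exp(logits).sum()  # deterministic softmax

    greedy = int(np.argmax(probs))                         # deterministic decoding
    sampled = int(np.random.choice(len(probs), p=probs))   # stochastic decoding
    print(probs, greedy, sampled)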
It's best to assume that the relationship between input and output of an LLM is not deterministic, similar to something like using a Google search API.
To the extent we've already found that to be the case, it's perhaps the weirdest part of this whole "paradigm shift."
But from time to time, doing this does require doing arithmetic correctly (to correctly add two exponents or whatever), so it would be nice to be able to trust that.
I imagine there are other uses for basic arithmetic too: QA applications over data that quotes statistics, and such.
It sounds weird, but try writing your problem in LaTeX - I don't know why, but I've found a couple of models to be incredibly capable at solving mathematical problems if you write them in LaTeX.
UPD: Found the paper: - https://huggingface.co/papers/2502.09741 - https://fouriernumber.github.io/
In the paper mentioned, a “number” is a single sort-of “token” with a numeric value, so the network deals with numbers as real numbers, separately from their character representation. All the math happens directly on the “number value”. In the majority of current models, numbers are handled as sequences of characters.
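The linked paper's exact construction is more involved, but here's a hedged sketch of the general idea (sin/cos features of the numeric value at several scales; the periods and sizes here are arbitrary):

    import numpy as np

    def numeric_embedding(x: float, periods=(1, 10, 100, 1000, 10000)) -> np.ndarray:
        # Encode the value of x directly, rather than as a sequence of digit tokens.
        feats = []
        for T in periods:
            feats += [np.sin(2 * np.pi * x / T), np.cos(2 * np.pi * x / T)]
        return np.array(feats)

    print(numeric_embedding(1234.0).round(3))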
To draw an analogy: our human brains have specialized regions.
Why not implement a part of the AI brain that's not neural nets, but instead circuitry specialized to math?
Maybe a dumb question since I'm a layperson!
Of course, the extra rules have to be logically consistent with the base S and K combinators, otherwise you will get wrong results. But if the inconsistent rule is complicated enough to be used only infrequently, you will still get correct results most of the time.
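A tiny sketch of rewriting with just the two base rules (the term representation here is made up for illustration): K a b -> a, and S f g x -> (f x)(g x).

    # A term is "S", "K", a variable name, or a 2-tuple (function, argument).
    def reduce_once(t):
        if isinstance(t, tuple):
            f, x = t
            if isinstance(f, tuple) and f[0] == "K":          # K a b -> a
                return f[1], True
            if isinstance(f, tuple) and isinstance(f[0], tuple) and f[0][0] == "S":
                return ((f[0][1], x), (f[1], x)), True        # S f g x -> (f x)(g x)
            for i, sub in enumerate(t):                       # otherwise reduce a subterm
                new, changed = reduce_once(sub)
                if changed:
                    return ((new, x) if i == 0 else (f, new)), True
        return t, False

    def normalize(t, limit=1000):
        for _ in range(limit):
            t, changed = reduce_once(t)
            if not changed:
                break
        return t

    # S K K behaves as the identity: ((S K) K) x -> x
    print(normalize(((("S", "K"), "K"), "x")))  # prints: x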
Which brings me to LLMs and transformers. I posit that transformers are essentially learned systems of rules that are applied to a somewhat fuzzily known set of combinators (programs), each represented by a token (the term being represented by the embedding vector). However, the rules learned are not necessarily consistent (as happens in the source data), so you get an occasional logical error (I don't want to call it hallucination because it's a different phenomenon from the nondeterminism and extrapolation of LLMs).
This explains the collapse from the famous paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin... One infrequent but inconsistent rule is enough to poison the well, due to the logical principle of explosion. It also clearly cannot be completely fixed with more training data.
(There is also an analogy to Terry Tao's stages of mathematical thinking: https://terrytao.wordpress.com/career-advice/theres-more-to-... Pre-rigorous corresponds to a somewhat random set of likely inconsistent logical rules, rigorous to a small set of obviously consistent rules, like only S and K, and post-rigorous to a large set of rules that have been vetted for consistency.)
What is the "solution" to this? Well, I think during training you somehow need to make sure that the transformer rules learned by the LLM are logically consistent for the strictly logical fragment of the human language that is relevant to logical and programming problems. Which is admittedly not an easy task (I doubt it's even possible within the NN framework).
Let's say that we have 15k unique tokens (going by modern open models). Let's also say that we have an embedding dimensionality of 1k. This implies that we have a maximum of 1k degrees of freedom (or rank) on our output. The model is able to pick any single one of the 15k tokens as the top token, but the expressivity of the _probability distribution_ is inherently limited to 1k unique linear components.
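A minimal sketch of that bottleneck, with the sizes above used purely for illustration (W and h are random stand-ins for the unembedding matrix and a final hidden state):

    import numpy as np

    d, vocab = 1_000, 15_000
    W = np.random.randn(vocab, d)     # output (unembedding) projection
    h = np.random.randn(d)            # final hidden state for one position
    logits = W @ h                    # 15k logits, but confined to a d-dimensional subspace
    # Whatever h is, the logit vector is a linear combination of W's 1k columns,
    # so the family of reachable distributions has at most ~1k degrees of freedom.
    print(np.linalg.matrix_rank(W))   # <= 1000, never the full vocab size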
12288 dimensions (GPT3 size) can fit more than 40 billion nearly perpendicular vectors.
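A quick numerical check of that claim (sample size and seed are arbitrary): random directions in high dimensions are nearly orthogonal, with pairwise cosines concentrated near zero.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 12288, 1000                               # GPT-3 width, a modest sample
    V = rng.normal(size=(n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit vectors
    cos = V @ V.T
    off_diag = np.abs(cos[~np.eye(n, dtype=bool)])
    print(off_diag.max(), off_diag.mean())           # both small, on the order of 1/sqrt(d)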
However, I'm talking about the probability distribution of tokens.
Detecting and preventing unargmaxable outputs in bottlenecked neural networks, Andreas Grivas (2024)
If I remember correctly, that's not true because of the nonlinearities, which provide the model with more expressivity. The transformation from 15k to 1k is rarely an affine map; it's usually highly non-linear.
The real bitter lesson in AI is that we don't really know what we're doing. We're hacking on models looking for architectures that train well but we don't fully understand why they work. Because we don't fully understand it, we can't design anything optimal or know how good a solution can possibly get.
Well, technically, that's not true: The entire idea behind complexity theory is that there are some tasks that you can't throw more hardware at - at least not for interesting problem sizes or remotely feasible amounts of hardware.
I wonder if we'll reach a similar situation in AI where "throw more context/layers/training data at the problem" won't help anymore and people will be forced to care more about understanding again.
More precisely, I think producing a good fast merge of ca. 5 lists was a problem I didn't have good answers for, but maybe I was too fixated on a streaming solution and didn't apply enough tricks.
Also, solution testing is mandatory. Luckily, you can ask an RNG for that, too, as long as you have tests for the testers already written.
Maybe the hope is that you won't have to manually map the universal algorithm to your specific problem and can just train the transformer to figure it out instead, but there are few proofs that transformers can solve all problems in some complexity class through training instead of manual construction.
Of course, DeepSeek was forced to take the optimisation approach but got to the end in time to stake a claim. So YMMV.
When all you have is a hammer... It makes a lot of sense that a transformation layer that makes the tokens more semantically relevant will help optimize the entire network after it and increase the effective size of your context window. And one of the main immediate obstacles stopping those models from being intelligent is context window size.
On the other hand, the current models already cost something on the order of the median country's GDP to train, and they are nowhere close to that in value. The saying that "if brute force didn't solve your problem, you didn't apply enough force" is intended to be taken as a joke.
https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)
Models are expensive, but they're not that expensive.
[0] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nomi...
The largest economy (US) has a GDP of $27.7 trillion.
The smallest economy (Tuvalu) has a GDP of $62.3 million.
The 48 billion number represents the middle point where half of all countries have larger GDPs and half have smaller GDPs.
Note: 1946 CPI = 19.5, 2025 CPI = 321.465, which makes for an increase by a factor of about 16.49.
CPI{2025} / CPI{1946} * Price{1946} = Price{2025}
to obtain the price adjusted for inflation?
That is the only way I was able to arrive at the same number you got: $6,594,153.846. TIL.
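For the record, a quick script that reproduces those numbers (the $400,000 starting figure is back-solved from the quoted result; it isn't stated above):

    cpi_1946, cpi_2025 = 19.5, 321.465
    price_1946 = 400_000
    factor = cpi_2025 / cpi_1946        # ~16.49, the "increase" above
    price_2025 = price_1946 * factor
    print(factor, price_2025)           # ~16.49, ~6,594,153.85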
If the article is correct, and this is the best way to make them, their price will explode.
Step 2: ?
Step 3: Profit.
It’s not enough to have the biggest model, or the best model per dollar spent; you still need to figure out how to make money with it. It’s not clear that vastly increased expenditure will produce a good ROI.
Is this really true?
This is just blatantly false.
> According to AI Index estimates, the training costs of state-of-the-art AI models have reached unprecedented levels. For example, OpenAI’s GPT-4 used an estimated $78 million worth of compute to train, while Google’s Gemini Ultra cost $191 million for compute.
https://hai.stanford.edu/ai-index/2024-ai-index-report
No need to even open up the actual report to find that. Just scroll down the page to read the 'key takeaways'.
Specifically, they made tokens for runs of 4, 8, 12, or 16 spaces, or something like that.
I assume you started programming some time this millennium? That's the only way I can explain this "take".
Can someone (who knows about LLMs) explain why the r's in strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less informed people, that got picked up?
In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.
I'm not sure about what you mean about them not "seeing" the tokens. They definitely receive a representation of each token as input.
Please take another look at my original comment. I was being precise about the distinction between what's structurally possible to generalize vs memorize.
Count the number of Rs in this sequence: [496, 675, 15717]
Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
Human: Which is the easier of these formulas
1. x = SQRT(4)
2. x = SQRT(123567889.987654321)
Computer: They're both the same.
[496, 675, 15717] is the GPT-4 representation of the tokens. In order to determine which letters the token represents, it needs to learn the relationship between "str" and [496]. It can learn the representation (since it can spell it out as "S-T-R" or "1. S, 2. T, 3. R" or whatever) but it adds an extra step.
The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?
It seems like the longer context length makes the trade-off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers it does appear that math is significantly worse when the model doesn't have access to individual digits (early Llama math results, for example). Once they changed the digit tokenization, the math performance improved.
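An easy way to see what the model actually receives is OpenAI's tiktoken package (the IDs and splits depend on the encoding; this assumes the GPT-4-era cl100k_base):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                              # a short list of integer token IDs
    print([enc.decode([i]) for i in ids])   # the substrings those IDs stand for
    # The character-level structure ("how many r's?") is not visible in the IDs;
    # the model has to learn each token's spelling as a separate association.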
GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://news.ycombinator.com/item?id=39728870 )
More recent research:
https://huggingface.co/spaces/huggingface/number-tokenizatio...
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903
https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...
https://twitter.com/yuntiandeng/status/1836114401213989366
If anything I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring use of COT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained in the right way can do arithmetic, where are the papers solving the "count R in strawberry" problem?
IME Reddit would scream "tokenization" at the strawberry meme until blue in the face, assuring themselves that better tokenization meant the problem would be solved. Meanwhile, RLHF'ers were/are being paid en masse to solve the problem by correcting thousands of these "counting"/perfect-syntax prompts and problems. To me, since RLHF work was being paid to tackle these problems, it couldn't be a simple tokenization problem. If there were a tokenization bottleneck whose fixing would solve the problem, we would not be getting paid so much money to RLHF syntax-perfect prompts (think of Sudoku-type games and heavy syntax-based problems).
No, the reason models are better at these problems now is RLHF. And before you say, well, now models have learned how to count in general, I say we just need to widen the abstraction a tiny bit and the models will fail again. And this will be the story of LLMs forever: they will never take the lead on their own, and it's not how humans process information, but it can still be useful.
As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. But of course you can't do this, you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.
Actually, I asked Grok this question a while ago when probing how well it could count vowels in a word. It got it right by listing every letter individually. I then asked it to count without listing the letters and it was a couple of letters off. I asked it how it was counting without listing the letters and its answer was pretty fascinating, with a seeming awareness of its own internal processes:
Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.
When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.
> Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
https://www.anthropic.com/news/tracing-thoughts-language-mod...
It seems to be about as useful as asking a person how their hippocampus works: they might be able to make something up, or repeat a vaguely remembered bit of neuroscience, but they don't actually have access to their own hippocampus' internal workings, so if they're correct it's by accident.
Concretely: we learn a medium-sized model that takes a partial tokenization and outputs a probability distribution over the endpoints of the next token (say we let token lengths range from 1 to 64 bytes, so the model outputs 64 logits). Then we do a beam search to find the, say, 4 most likely tokenizations. Then we run the transformer on all four tokenizations, and we take the expected value of the loss to be the final loss.
If we train this on prompt-response pairs, so that it only has to learn what to say and doesn't have to predict the context, then it could learn to skim boring stuff by patching it into ~64 byte tokens. Or more if we want.
And ofc we'd use a short context byte level transformer to encode/decode tokens to vectors. Idk this idea is kinda half baked.
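For what it's worth, here's a rough, runnable sketch of that loop with stand-in components (the segmenter is a random stub; the beam width, max token length, and scoring are all placeholders):

    import numpy as np

    MAX_LEN, BEAM = 64, 4

    def segmenter_logits(prefix: bytes) -> np.ndarray:
        # Stand-in for the medium-sized model: one logit per candidate token length.
        rng = np.random.default_rng(len(prefix))
        return rng.normal(size=MAX_LEN)

    def beam_segment(data: bytes):
        beams = [([], 0.0)]  # (list of token byte-strings, log-prob so far)
        while any(sum(map(len, toks)) < len(data) for toks, _ in beams):
            candidates = []
            for toks, lp in beams:
                consumed = sum(map(len, toks))
                if consumed == len(data):
                    candidates.append((toks, lp))
                    continue
                logits = segmenter_logits(data[:consumed])
                logprobs = logits - np.logaddexp.reduce(logits)
                for length in range(1, min(MAX_LEN, len(data) - consumed) + 1):
                    candidates.append((toks + [data[consumed:consumed + length]],
                                       lp + logprobs[length - 1]))
            beams = sorted(candidates, key=lambda b: -b[1])[:BEAM]
        return beams

    for toks, lp in beam_segment(b"some boring boilerplate text"):
        print(round(lp, 2), toks)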
I'm a total noob in ML. I just had to vent something for not understanding this stuff and realizing that knowing physics doesn't mean you can grok ML mechanics.
Maybe there could be something like a mixture-of-experts but with a thousand experts and each has its own tokenization.
Fortunately I don't actually understand this stuff, so I am going to go ahead and congratulate myself on my brilliant ideas and let the geniuses work out the details. :P
There’s no reason to assume it’s the best solution. It might be the case that a better tokenization scheme is needed for math, reasoning, video, etc models.
That said, the hand-coded nature of tokenization certainly seems in dire need of a better solution, something that can be learned end to end. And it looks like we are getting closer with every iteration.
So any system that predicts the optimization with a general solver can scale better than heuristic or constrained-space solvers.
Until recently, there have been no general solvers at that scale.
> As it's been pointed out countless times - if the trend of ML research could be summarised, it'd be the adherence to The Bitter Lesson - opt for general-purpose methods that leverage large amounts of compute and data over crafted methods by domain experts
But we're only 1 sentence in, and this is already a failure of science communication at several levels.
1. The sentence structure and grammar are simply horrible
2. This is condescending: "pointed out countless times" - has it?
3. The reference to Sutton's essay is oblique, easy to miss
4. Outside of AI circles, "Bitter Lesson" is not very well known. If you didn't already know about it, this doesn't help.
If I remember correctly, GPT3.5's tokenizer treated Cyrillic as individual characters, and GPT3.5 was pretty good at Russian.
Maybe if you have infinite compute you don't worry about software design. Meanwhile in the real world...
Not only that, but where did all these compute-optimized solutions come from? Oh yeah, millions of man-hours of optimizing and testing algorithmic solutions. So unless you are some head-in-the-clouds tenured professor, just keep on doing your optimizations and job as usual.
I’m hoping someday that dude releases an essay called The Cold Comfort. But it’s impossible to predict when or who it will help, so don’t wait for it.
That is why it's called bitter; it isn't a fun realization.
Of course, instead of the beach one could spend those Y months improving the algorithms... but it's never wise to bid against yourself if you don't have to.
A corollary is that to maximize your beach time you should work on the biggest N possible, neatly explaining the popularity of AI startups.
Agreed that overfitting the bitter lesson often leads to slopping piles of compute and hardware at problems that could just be deterministic.
This is all explained in the original essay: http://www.incompleteideas.net/IncIdeas/BitterLesson.html