> As a consequence of this succinctness, we show that basic verification problems for transformers, such as emptiness and equivalence, are provably intractable: specifically, EXPSPACE-complete.
If you were hoping to formally prove the correctness of a large transformer, it turns out that you're going to need an exponentially larger amount of space to do your verification, more than you could possibly afford.
> As a consequence of this succinctness, we show that basic verification problems for transformers, such as emptiness and equivalence, are provably intractable: specifically, EXPSPACE-complete.
This doesn't imply you can't get aid from a LLM to e.g. implement a function that has a formal specification (an application I think is very promising), but surely it has some profound implications on how much of a large system can be understood by a LLM at once, without supervision.
Transformers Are Inherently Succinct (2025) - https://news.ycombinator.com/item?id=48014197 - May 2026 (9 comments)
Authors used LTL (linear temporal logic) to express, basically, non-reduced non-ordered binary decision diagrams. Or just binary decision diagrams, BDDs.
BDDs are almost guaranteed to have exponential size because they do not employ reduction (sharing of common expressions). Reduced BDDs are more succinct and reduced ordered BDDs are even more succinct.
Also, transformers in the paper are constructed, not trained. Training any model to express some truth table is very hard. They also did not perform comparison with, say, Kolmogorov-Arnold representation, which is also universal approximator.
So this paper is not as deep as one may think it is.
Just nit picking a bit:
> Training any model to express some truth table is very hard
What kind of models are you including here? Truth tables can be modeled in regular code very easily and reliably. And I’m sure there are many deterministic models that could do the same. Are you talking about LLMs in particular or a certain category/type of models?
Reduced Ordered BDDs are likely not as succinct as an arbitrary BDD.
The famous Hidden Weight Bit problem can be more succinctly expressed in arbitrary BDDs (changing the order and revisiting nodes), but are provably EXPSPACE in ROBDDs.
-------
We study ROBDDs because they uniquely reduce to a canonical form. All functions have exactly one (!!!!) ROBDD.
BDDs in general however are really arbitrary, as arbitrary as any codebase can basically get. That makes BDDs in general to difficult to study or do math upon.
Results based on the studies of arbitrary BDDs do NOT apply to the simpler, easier to understand world of ROBDDs. And vice versa.
It's exactly as related to real models as computer science is to real computers.
> doesn't that mean we're approaching optimality?
No.Transformers are Markov chains [1]. Somewhere around this fascinating site [2] I read that stateful models have an advantage. Author provided an example, a state machine with two states A and B, where at state A transitions are to state A (output 0) and to state B (output 1) with equal probability and at state B the transition is always to state A and output is always 1.
For this state machine just one bit of memory can make an optimal prediction that ones always go in pairs, whereas Markov chain will approximate this prediction and never reach optimality.
[1] https://arxiv.org/abs/2410.02724
[2] https://bactra.org/If you include all the information the LLM uses to produce the next token as part of the state, then of course the LLM is a Markov chain.
So would be any other process for sampling continuations of a text, with finite memory.
i.e. the blowup is only exponential in one direction.
Edit: Actually nevermind. If UHAT could be compiled into LTL w/ polynomial overhead then that would also work for the languages that have exponential overhead in LTL but since they don't there is a strict separation.