Until those are addressed, closed-system nondeterminism doesn't really help except in cases where a lookup table would do just as well. You can't use "correct" unit tests or evaluation sets to prove anything about inputs you haven't tested.
If you obtained exactly the same output for a given input prompt regardless of context, that would mean the context is being ignored, which is indistinguishable from the session not maintaining any context at all, i.e. each prompt landing in a brand-new empty context.
Now what some people want is requirements like:
- Different wording of a prompt with exactly the same meaning should not change the output; e.g. whether you say "What is the capital of France" or "What is France's capital", the answer should be verbatim identical.
- Prior context should not change responses in ways that don't interact with the context. For instance, if the prompt is "what is 2 + 2", the answer should always be the same, unless the context instructs the LLM that 2 + 2 is to be five.
These kinds of requirements betray a misunderstanding of what these LLMs are.
"The context is the input" betrays a misunderstanding of what (artificial) intelligence systems are aiming for.
We have observed situations where agentic LLM traces on verifiable problems with deterministic (greedy) decoding lead to either completely correct or completely wrong solutions depending on the minutes on the clock, which happen to be printed as incidental output of some tool the LLM used.
I think there may be some mild fixes available for current models. For example, it is worrying that the attention mechanism can never fully disregard any token in the input, because the softmax will always assign a weight > 0 everywhere (and the NN has no way of setting a logit to -infinity). This makes it extremely difficult for the LLM to reliably ignore any part of the context.
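A minimal illustration of that softmax point (toy numbers, PyTorch): any finite logit gets a strictly positive weight, and only an explicit -inf mask yields exactly zero.

  import torch

  logits = torch.tensor([4.0, 0.0, -10.0])
  print(torch.softmax(logits, dim=-1))        # last weight is tiny (~8e-7) but still > 0

  masked = torch.tensor([4.0, 0.0, float('-inf')])
  print(torch.softmax(masked, dim=-1))        # last weight is exactly 0.0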
However, Yann LeCun actually offers some persuasive arguments that autoregressive decoding has limitations and that we may need something better.
I see this a lot. I kinda' doubt the "simple" part, but even beyond that, is there any evidence that a statistical predictor can't be a universal answering machine? I think there's plenty of evidence that our thinking is at least partially statistical prediction (e.g. when you see a black sheep you don't think "at least one side of this sheep is black"; you fully expect it to be black on both sides).
I'm not saying that LLMs _are_ universal answering machines. I'm wondering why people question that they are/they can become one, based on the argument that "fundamentally they are statistical predictors". So they are. So what?
If it does, statistical predictors can't help you because they're not always correct or even meaningful (correlation does not imply causation).
If it doesn't then, by all means, enjoy your infinite monkeys
They do not. Refusing to bend your requirements to a system that can't satisfy them is not evidence of misunderstanding the system.
And if you tack on "with X 9s of reliability" then it is something LLMs can do. And in the real world every system has a reliability factor like that.
There are going to be false positives: text that is subtly different from a previous response is misidentified as a duplicate such that the previous response is substituted for it, frustrating the user.
Why and how is this a problem?
If 'preceding context' doesn't cause different results, it means you can simply discard the context. Why do I want that? It's not how I expect a tool to work (I expect vim to respond differently to my input after I switch to insert mode). It's absolutely not how I expect intelligence to work either. It sounds like the most extreme form of confirmation bias.
This is a common AI benchmark and has been for years before GPT-2 even existed. LLMs need to not get distracted by irrelevant facts and there are tests that measure this. It's the motivation for attention mechanisms, which are the breakthrough that enabled LLMs to scale up.
What LLMs need is the ability to guarantee semantically-equivalent outputs for all semantically-equivalent inputs, but that's very different from "determinism" as we understand it from other algorithms.
If you take an LLM that makes 10 tool calls in a row for an evaluation, any reduction in unpredictable drift is welcome. Same applies to running your prompt through DSPy Optimizer. [0] Countless other examples. Basically any situation where you are in control of the prompt, the token level input to the LLM, so there's no fuzziness.
In this case, if you've eliminated token-level fuzziness and can guarantee that you're not introducing it from your own end, you can basically map out a much more reliable tree or graph structure of your system's behavior.
[0]: https://dspy.ai/#2-optimizers-tune-the-prompts-and-weights-o...
why use an ambiguous natural language for a specific technical task? i get that it's a cool trick but surely they can come up with another input method by now?
Since I'm really looking to sample only the top ~10 tokens, and I mostly test on CPU-based inference of 8B models, there's probably not a lot of worry about getting a different order of the top tokens based on hardware implementation, but I'm still going to take a look at it eventually and build in guard conditions against any choice that would be changed by an epsilon of precision loss.
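Roughly the kind of guard I have in mind, as a sketch (the helper and the epsilon are made up, not from any library): only trust the top pick if it beats the runner-up by more than some margin, so an epsilon of precision loss can't flip the choice.

  import numpy as np

  def stable_top_choice(logits: np.ndarray, k: int = 10, eps: float = 1e-3):
      top = np.argsort(logits)[::-1][:k]               # indices of the top-k logits
      margin = float(logits[top[0]] - logits[top[1]])  # gap between best and runner-up
      if margin <= eps:
          # Ranking could plausibly flip on different hardware; use a fixed
          # tie-break (lowest token id) instead of trusting the float order.
          return int(min(top[0], top[1]))
      return int(top[0])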
This nonlinear and chaotic behavior, regardless of the implementation details of the black box, makes LLMs seem nondeterministic. But an LLM is just a pseudo-random number generator with a probability distribution.
(As I am writing this on my iPhone with text completion, I can see this nondeterministic behavior)
Today we have an extremely hacky workaround by ensuring that at least the desired chunk from the RAG is selected, but it's far from ideal and our code is not well written (a temporary POC written by AI that has been there for quite some months now ...)
If I want to convert "how do I x" to `api.howTo("x")` it is very important that I get the exact same result every time.
For many applications, non-determinism implies "useless". This has been a long standing issue with LDA topic models. In particular in the legal, financial and regulatory domains, if a method is not deterministic, it may be illegal to use it or it may lead to follow-on requirements that one does not want (e.g. all screens shown to humans must be preserved to be able to go back and reconstruct what exactly happened to a particular user in a particular second).
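For what it's worth, for LDA specifically the usual way to pin this down (a sketch, not the poster's setup) is to fix the seed; on the same library version and hardware, repeated runs then reproduce exactly:

  from sklearn.decomposition import LatentDirichletAllocation
  from sklearn.feature_extraction.text import CountVectorizer

  docs = ["contract breach penalty", "loan interest rate", "contract penalty clause"]
  X = CountVectorizer().fit_transform(docs)

  lda_a = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
  lda_b = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
  assert (lda_a.components_ == lda_b.components_).all()  # identical topics both runs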
If you're old enough, you might remember Danny Hillis' Thinking Machines from the late 80s. I wish they had chosen a different name (I say this for nostalgic reasons, having been in front of one of those cubes glowing with red LEDs back in the late 80s at MIT's AI Lab, since renamed to CSAIL). Feynman did some amazing work on that, too: https://longnow.org/ideas/richard-feynman-and-the-connection...
In the U.S., the “THINKING MACHINES” trademarks were owned by Thinking Machines Corporation (the company Hillis co-founded), not Hillis personally, and those registrations were cancelled in 1998–1999.
The company itself went bankrupt in 1994 and its assets were dispersed (e.g., to Sun Microsystems, later Oracle).
There’s a new, pending USPTO application for “THINKING MACHINES” filed in 2025 by Thinking Machines Lab Inc., the company founded by Mira Murati.
There's a similar situation in other scientific disciplines. People want source code and data so they can reproduce results - that basically tells you someone didn't cheat and they documented everything. But it does not tell you whether a real phenomenon was observed.
It's much more interesting to know if roughly the same cause and effect relationships exist so we can predict behavior.
Concretely, there are studies showing that e.g. randomly capitalizing letters can lead to completely different responses from an LLM. That speaks to a fragility that doesn't have anything to do with deterministic reproduction.
Discussions of this type are going to eventually morph into better understanding of how to accept ambiguity and randomness in language, and further shape it with other larger sub-patterns beyond the little proto-grammars that the QKV projection matrices extract.
If I ask the same model the same question I should be able to deterministically get the same answer.
Now if we phrase the same question slightly differently we would expect to get a slightly different answer.
You wouldn't get this from an LLM though; a tiny change in starting point produces a massive change in output. It's a chaotic system.
LLM: 1
“Language ambiguity with determinism”? Sure I can juxtapose the terms but if it’s semantically inconsistent, then what we mean by that is not a deterministic, definitive thing. You’re chasing your tail on this ‘goal’.
Determinism: if a model is given the exact same request/prompt twice, its two responses will also be identical, whether or not the consistent response qualifies as correct.
The two concepts are very different.
(Ambiguous vs. precise prompt) x (Deterministic vs. Non-deterministic model) = 4 different scenarios.
A model itself can be non-deterministic without being ambiguous. If you know exactly how it functions, why it is non-deterministic (batch sensitive for instance), that is not an ambiguous model. Its operation is completely characterized. But it is non-deterministic.
An ambiguous model would simply be a model whose operation was not characterized. A black box model, for instance. A black box model can be deterministic and yet ambiguous.
Ambiguity is what happens when you change the prompt slightly, e.g. by adding a word: "Give an example of a single dice roll". Now as a human our expectation would be that this is the same question and should thus (in a deterministic system) receive the same answer. But to an LLM it may not be.
Yes, and thanks. That was my intended point - but you point out a better example. Slightly different prompts may also produce highly varied responses.
(My subsequent comments on ambiguous models was in case I was misinterpreting the comment I was replying to. I also generally think of ambiguity as a property of input. Either way, ambiguity is not the same as non-deterministic.)
A perfectly acceptable answer.
If it answers 1 every time it's still a perfectly acceptable answer.
What is the reasoning behind these schemes? The hope that bits of the properties of legendary companies will rub off onto the new venture?
As if naming the next best venture PARC will inevitably create a breakthrough in networking just by the arrangement of four letters.
“We are building a machine that will be proud of us” was their corporate motto. And that was in 1983.
One of those Machines is on view at the Computer History Museum in Mountain View. Back then, they could be ordered in “Darth Vader Black”, no kidding here. You can also see a couple of them (the CM-5) as the stereotypical supercomputer in the original Jurassic Park.
More here: https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...
I'm honored to see that Mira and co. appreciated the feedback I gave on this very topic 7 months ago here :D
> You don't need RNG since the whole transformer is an extremely large floating-point arithmetic unit.

A wild guess: how about the source of non-determinism coming from the fact that, on the HW level, tensor execution order is not guaranteed, and therefore (T0 * T1) * T2 can produce slightly different results than T0 * (T1 * T2) due to rounding errors?
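You can see the rounding-error effect in one line, since floating-point arithmetic is not associative (plain Python, contrived numbers):

  a, b, c = 1e16, -1e16, 1.0
  print((a + b) + c)   # 1.0
  print(a + (b + c))   # 0.0 -- the 1.0 is absorbed by -1e16 before the cancellation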
Nondeterminism is what currently keeps me from working with other developers.
As I wrote in "Prompt Coding" [1], these days I am not looking for good code. I am looking for prompts that create good code. But how do you share prompts among developers when they produce different code every time? You cannot simply state "Here, I found a prompt that makes gpt-5-2025-08-07 output a solution with all the desired attributes".
Similar with images. At the moment, for most image models, you cannot outsource the task of writing prompts that create the desired images. Because most image models will not create the same image when given the same prompt and parameters.
A deterministic LLM isn't going to behave appreciably differently from a non deterministic one if your input or context varies by even a tiny bit (pun intended) each time.
i'm hoping that it becomes more useful as models improve and become more reliable at producing working code (though determinism would be great for improving prompts).
Really? If you include the seed as one of the parameters most produce pixel identical output.
E.g. "Generate deterministic images" https://cloud.google.com/vertex-ai/generative-ai/docs/image/...
It took me ages to get the prediction for the second token after "hello" to match the prediction for the second token when running the model on the string "hello world", despite the fact that I was using a causal model. I tried all kinds of things before discovering that `quantized: false` was the important setting.
I don't fully understand what you said, but I guess higher-probability logits are encoded with fewer bits. If your text is the LLM output, then you may only need a bit or two per token?
In terms of performance, I've not done any serious testing, but e.g. the wikipedia article on volcanos compresses to about 20% using GPT2. I've seen other strings compress even further.
The big issue is that while encoding is not unreasonable, decoding any significant amount of data is incredibly slow, since I'm doing a model run for every token in the output. It's bad enough that the scheme is probably unworkable as it is. I'm thinking about changing my code so that it streams out the tokens as it decodes them, so you're not just left there waiting for ages.
I supervised a student's project whose goal was exactly that: implement compression with LLMs using AC.
Since AC is optimal, if your LLM has an average cross entropy x on some dataset, you can expect that the compression will compress data using x nats per token on average!
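Back-of-the-envelope version of that claim, with made-up numbers (x nats per token converts to x / ln 2 bits per token):

  import math

  cross_entropy_nats = 2.3                            # hypothetical average cross entropy per token
  bits_per_token = cross_entropy_nats / math.log(2)   # nats -> bits, ~3.3 here
  raw_bits_per_token = 32                             # assume ~4 bytes of raw text per token
  print(f"{bits_per_token:.2f} bits/token, ~{raw_bits_per_token / bits_per_token:.1f}x compression")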
> For example, you might observe that asking ChatGPT the same question multiple times provides different results.
Even with 0.0 temperature, due to MoE models routing at the batch level, you're very unlikely to get a deterministic batch.
> Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
The router also leaks batch-level information across sequences.
I don’t think this is correct: MoE routing happens on a per-token basis. It can be non-deterministic and batch-related if you try to balance out your expert load within a batch, but that's a performance optimization (just like everything in the blog post) and not the way the models are trained to work.
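To make the "per token" part concrete, here's a toy top-k router sketch (not any particular model's code): each token's expert choice comes from its own router logits, independent of whatever else happens to be in the batch; batch-level load balancing is layered on top as an optimization.

  import torch

  def route(hidden, w_router, k=2):
      # hidden: [num_tokens, d_model], w_router: [d_model, num_experts]
      probs = torch.softmax(hidden @ w_router, dim=-1)   # per-token expert distribution
      weights, experts = torch.topk(probs, k, dim=-1)    # top-k experts chosen per token
      return experts, weights

  hidden = torch.randn(4, 16)     # 4 tokens
  w_router = torch.randn(16, 8)   # 8 experts
  experts, _ = route(hidden, w_router)
  print(experts)                  # a token's choice doesn't depend on which batch it sits in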
What I would have loved is a discussion around collectives/multi-node setups. And showing how to get determinism at low performance penalty for multi-node reduction collectives.
Focus on correctness, not determinism.
Compilers can also reorder operations, but in practice this is rarely an issue because kernels typically synchronize frequently, which limits the compiler's ability to reorder things. This isn't to say it doesn't happen, but even when it does, the code a given compiler generates is generally run-to-run identical; you'd only see a change if the compiler itself changed.
With revisions, you're trying to ensure a consistent floating point environment where the operations used are deterministic, and used in the same order with the same inputs. The best way to do that is to use operations that adhere to a mostly deterministic standard like IEEE-754.
Valid point. Floating point summation is not associative.
  import torch

  A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
  B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
  ref = torch.mm(A, B)
  for _ in range(1000):
      assert (torch.mm(A, B) - ref).abs().max().item() == 0
I'm sort of surprised that Torch doesn't have some kind of lazy evaluation thing to avoid computing anything here. I thought that was one of the nice things about all these fancy frameworks (if I wanted the computer to actually do silly things when I asked it to, I would use BLAS directly, right?).

What would you hope to achieve by making this case lazy? If you wanted these to run in parallel on a multi-GPU system, you would use the appropriate parallel interface.
.abs().max().item()
of something that can be identified as definitionally zero.

The parallel interface, which is async, is probably what you're looking for.
If evaluation is lazy, then the subtraction operator gets fed two unevaluated matrix multiplies.
If it's a dumb subtraction operator, this gives us no benefit. Eventually it evaluates both and then subtracts. And it has some extra overhead like you said.
But if it's a smart subtraction operator, it can realize that both parameters are the same equation, and then it can return all 0s without evaluating anything.
And even better than just skipping the matrix math, "all 0s" can be a stub object that takes O(1) time to set up. And then .abs().max() will be instant too.
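A toy sketch of that idea (nothing to do with how Torch actually works): lazy nodes record the expression, and subtracting two structurally identical matmul nodes short-circuits to a zero stub without doing any math.

  class MatMul:
      def __init__(self, a_id, b_id):
          self.key = ('mm', a_id, b_id)   # record the expression, don't evaluate it

  class Zeros:
      """O(1) stand-in for an all-zero result; abs/max are trivially zero."""
      def abs_max(self):
          return 0.0

  def lazy_sub(x, y):
      if isinstance(x, MatMul) and isinstance(y, MatMul) and x.key == y.key:
          return Zeros()   # same op on the same inputs -> definitionally zero
      raise NotImplementedError("otherwise, actually evaluate both sides")

  print(lazy_sub(MatMul('A', 'B'), MatMul('A', 'B')).abs_max())   # 0.0, no matmul ever runs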
So even if some progress is achieved now, I think that in the foreseeable future this will remain a dead end.
Great post.
I hope all the model providers adopt this.
TL;DR
Seed your PRNGs and call torch.use_deterministic_algorithms(True) to get the deterministic kernels. They may be slightly slower, but in practice, you probably will not notice.
Note that results will still differ between different drivers and GPUs. It would be great if NVIDIA tried harder in that regard.
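Roughly what that looks like for a single GPU in PyTorch (a sketch; the cuBLAS workspace variable is needed for deterministic matmuls and must be set before CUDA is initialized):

  import os, random
  import numpy as np
  import torch

  os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # required for deterministic cuBLAS matmuls
  random.seed(0)
  np.random.seed(0)
  torch.manual_seed(0)                        # seeds CPU and all CUDA devices
  torch.use_deterministic_algorithms(True)    # error out if a nondeterministic kernel is hit
  torch.backends.cudnn.benchmark = False      # autotuning could pick different kernels per run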
I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago.
Run two of these with the same prompts and same seed and you get the same results.
Obviously in GPU clusters with different hardware things get more complicated.
"I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago" looks like you're using llama-cpp in that repo. This is about vllm serving many requests at once, at long sequence lengths.
> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
Your situation isn't really comparable.
Seems a buried lede is that on-policy RL is unlocked by bitwise identical results between training and sampling. I'm not an expert here but my understanding is that this would allow for stronger guarantees about deployment/training alignment for the RL training that the labs already do.
I don't fully understand the BigMath example though. They show that off-policy RLVR requires off-policy correction, which avoids divergence, but is suboptimal because it results in noisy rewards. Then they say "we fixed the sampler and trainer numerical mismatch, which allows for on-policy RL, look how much better it is." It's not clear to me whether this is an artificial example that deliberately uses different trainer/sampler setups, or if it's actually impossible to have the same numerics between trainer/sampler without their fixes (even if we use same batch size, no atomics, etc.).
I've seen this play out dozens of times. So many startups that have come and gone in the Bay Area were composed of extremely talented individuals, but almost all of them failed.
This is literally one of the most knowledgeable people on the topic. I think you are the one that hasn't peeled enough layers to connect with what they are saying.
If you say so.
> the author has nothing to do with the original comment
Except for the part of the comment that was assuming the author had no idea how this all works, has only used LLMs through API and has never run a local model, you mean?
Not really: LLMs give you a distribution over possible next tokens. You are free to then sample from this distribution how you want. There is no need to hack the RNG or whatever; for example, you can simply take a greedy approach and always output the most likely token, in which case the LLM becomes deterministic (mathematically). This is equivalent to setting the temperature to 0.
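A minimal sketch of that greedy step, assuming a HuggingFace-style model object (the names here are hypothetical): argmax over the final logits, no RNG anywhere.

  import torch

  def greedy_next_token(model, input_ids):
      with torch.no_grad():
          logits = model(input_ids).logits[:, -1, :]   # distribution over the next token
      return torch.argmax(logits, dim=-1)              # temperature -> 0 limit: always the top token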