The reason "temperature" is called such is because softmax is mathematically identical to the Boltzmann distribution [1] from thermodynamics, which describes the probability distribution of energy states of an ensemble of particles in equilibrium. In terminology more well understood by ML folks, the particles' energies will be distributed as the softmax of their negative energies divided by their temperatures (in Kelvin). Units are scaled by the Boltzmann constant (k_B).
Setting an LLM's temperature to zero is mathematically the same thing as cooling an ensemble of particles to absolute zero: in physics, the particles are all forced into their lowest energy state; in LLMs, the model is forced to deterministically predict the single most likely logit/token.
Now to draw another analogy for what happens at high temperatures: the reason a heating element glows red when it is hot is that the expectation value (mean) of energy under this softmax distribution goes up with temperature, and once the energy is high enough, the particles start shedding energy as photons energetic enough to be in the visible spectrum. Incandescent bulbs with tungsten filaments are even hotter than that heating element and glow white: as the temperature T rises further, the softmax distribution's mean energy moves higher and the distribution flattens out, roughly covering the whole visible spectrum more uniformly. In the case of the bulb, photons of all sorts of wavelengths are being spewed out; that's white light. Likewise, if you set an LLM's temperature to an absurdly high number, it spews out a very wide spectrum of mostly nonsense tokens.
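To see the "mean energy rises with temperature" claim numerically, here is a tiny illustration with made-up, evenly spaced energy levels:

    import numpy as np

    def mean_energy(energies, temp):
        # Expected energy under the softmax/Boltzmann distribution at a given temperature.
        logits = -np.asarray(energies) / temp
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return float(p @ energies)

    E = np.arange(10.0)                  # illustrative energy levels 0..9
    for T in (0.1, 1.0, 10.0, 100.0):
        print(T, mean_energy(E, T))      # the mean rises toward the flat-distribution average (4.5)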
"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1, it technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."
Why is this a "pseudo-probability distribution?"
This is exactly the sense that it comes up for old school LMs and why it appears in thermodynamics.
Of course it is entirely possible that newfangled ML people use it without understanding that it is derived from first principles - i.e. see article.
https://docs.pytorch.org/docs/2.11/generated/torch.nn.CrossE...
You have a relatively small dictionary of tokens; each one gets a neural network score that goes into the final token prediction layer, and those scores are trained with a log-softmax (i.e. the function above) to predict the next token.
This is exactly how anyone in any field does conditional multinomial/categorical distributions (i.e. picking one of a bunch of distinct tokens), and AFAIK it is what LLMs generally use as the loss function on their output layer, though I have not deeply investigated all of them, since this has been how you do it since time immemorial.
I am extremely confused by all of the people screaming it's not a probability distribution?!?!?
I have seen computer vision tasks use binomial training objectives (one-vs-all) and then use the multinomial only at inference time, and it would be fair to say that is not a probability distribution induced by training (though it is technically a probability distribution in the sense that it is >= 0 and sums to 1).
But AFAIK the token-prediction LLMs I am aware of use the softmax for the probability in their loss function, i.e. they maximize the log softmax.
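A minimal PyTorch sketch of that loss for next-token prediction (the batch size and vocabulary size are made up; this just shows that maximizing the log softmax of the correct token is the same thing as minimizing cross entropy):

    import torch
    import torch.nn.functional as F

    vocab_size, batch = 50_000, 4
    logits = torch.randn(batch, vocab_size)           # scores from the final prediction layer
    targets = torch.randint(0, vocab_size, (batch,))  # the actual next tokens

    # Negative mean log-softmax of the correct token...
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs[torch.arange(batch), targets].mean()

    # ...is exactly PyTorch's cross-entropy loss on the raw logits.
    ce = F.cross_entropy(logits, targets)
    print(torch.allclose(nll, ce))  # True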
The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.
It's true that the PyTorch API conflates cross entropy and softmax, but they are separate concepts.
Actually I believe that most of the time, even after softmax, sampling is way too permissive, occasionally accepting low-quality candidates. We have all had the experience of seeing frontier LLMs sometimes put a word in a different language that is really off-putting and almost impossible to explain, or make other odd errors in just a single word of the output: most of the time this is not what the model wanted to say, but sampling casually selecting a low-quality token. I believe a better approach is to have a strong filter on which candidates are acceptable, like in the example here: https://antirez.com/news/142
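For the flavor of it, a minimal sketch of that kind of candidate filter (the temperature value and the relative-probability cutoff below are my own illustrative choices, not the scheme from the linked post):

    import numpy as np

    def filtered_sample(logits, temperature=0.7, min_ratio=0.1, rng=None):
        # Sample a token, but only from candidates whose probability is at least
        # `min_ratio` times the probability of the best candidate.
        rng = rng or np.random.default_rng()
        z = np.asarray(logits, dtype=float) / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        keep = p >= min_ratio * p.max()   # drop the long tail of low-quality candidates
        p = np.where(keep, p, 0.0)
        p /= p.sum()
        return int(rng.choice(len(p), p=p))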
If you take the instances of softmax you find in training/inference (and there turn out to be a few) and swap in other things like entmax or sparsemax, you see across-the-board improvements. And top-1 is often just the best answer anyway; there's a reason temp=0 is the way to go when you're doing tool calls. Do you really want creative Unicode tokens when writing bash commands? From what I can tell, most of the time softmax is the worst answer that works.
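For reference, a minimal numpy sketch of sparsemax (the Euclidean projection of the logits onto the probability simplex, following Martins & Astudillo, 2016); this is just one of the mentioned alternatives, not a claim about what any particular model ships with:

    import numpy as np

    def sparsemax(z):
        # Projects z onto the probability simplex; low-scoring candidates get exactly zero mass.
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]
        k = np.arange(1, len(z) + 1)
        cumsum = np.cumsum(z_sorted)
        support = 1 + k * z_sorted > cumsum      # which sorted entries stay in the support
        tau = (cumsum[support][-1] - 1) / k[support][-1]
        return np.maximum(z - tau, 0.0)

    print(sparsemax([2.0, 1.0, -1.0]))  # [1., 0., 0.]: the tail is cut off entirely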
Training on cost-to-go loss is good enough. Perfect cost-to-go eliminates the need for global algorithms and allows local decision making. Given “natural” datasets it is probably the best thing to attempt to learn. The fact that probabilistic graphical models never really worked proves it somewhat.
I am vaguely aware of some of this, but would love to study it more; I don't quite understand what this is all about. (I do see how LLMs can attend to all prior tokens, so you don't have the single point of failure that HMMs have, which is what makes Viterbi decoding more necessary there.)
How do you identify what the model wanted to say?
On a tangential note, I keep noticing "why x matters" and "it's crucial here", which just remind me of Claude. Recently Claude has been gaslighting me on complex problems with such statements, and seeing them in an article is low-key infuriating at this point. I can't trust Claude anymore on the most complex problems, where it sometimes gets the answer right but completely misses the point and introduces huge, complex blocks of code and logic with precisely "why it matters" and "this is crucial here".
This sycophancy is a serious problem and exploits a weakness in the human psyche (flattery) which may be easier for the RLHF to find reward in than genuinely correct responses.
From a theory POV you get softmax-like distributions (Gibbs distributions) by trying to balance following some energy E(x) against the entropy of the distribution. In essence, the softmax is the answer to "I want to follow the maximum of a function E(x), but I need to maintain some level of uncertainty".
The balancing coefficient between entropy and picking the maximum of the function is called "temperature" (following the behavior of particles in a physical system: The colder the system, the lower the chance of having particles randomly walk away from the minimal energy state).
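To make that concrete, a sketch of the standard maximum-entropy formulation (written with a score $x_i$ to be maximized, i.e. a negative energy, to match the LLM convention):

    \max_p \; \sum_i p_i x_i \; + \; T \Big( -\sum_i p_i \log p_i \Big)
    \quad \text{subject to} \quad \sum_i p_i = 1, \; p_i \ge 0

Setting the derivative of the Lagrangian to zero gives $\log p_i = x_i/T + \text{const}$, i.e. $p_i = \exp(x_i/T) / \sum_j \exp(x_j/T)$, which is exactly softmax(x/T).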
Specifically, the temperature enters as
softmax(x/temp)
If you take temp -> 0, the softmax approaches an argmax (with temp = 0 being a literal argmax). If you increase the temperature, you move closer to the "random fluctuations" regime, leaving more room for sampling x values that are not the maximum of x. (This is why, e.g., LLMs become deterministic as you decrease temp -> 0.)
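A quick numerical illustration (the logit values are arbitrary):

    import numpy as np

    def softmax(x, temp=1.0):
        z = np.asarray(x, dtype=float) / temp
        z -= z.max()                    # stability shift; doesn't change the result
        e = np.exp(z)
        return e / e.sum()

    x = [2.0, 1.0, 0.5]
    print(softmax(x, temp=0.01))   # ~[1, 0, 0]: effectively an argmax
    print(softmax(x, temp=1.0))    # the plain softmax
    print(softmax(x, temp=100.0))  # ~uniform: plenty of room for non-maximal samples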
Using a base other than e implicitly changes the temperature:
N^x = exp(ln(N) x)
The normalization works the same, since you are still dividing a positive value N^x by the sum over all alternatives, sum_i(N^x_i), which is a normalization by design.
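A quick check of that identity (arbitrary logits, base N = 3):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    x = np.array([0.2, 1.5, -0.7])
    N = 3.0
    base_N = N**x / np.sum(N**x)                        # "softmax" computed in base N
    print(np.allclose(base_N, softmax(np.log(N) * x)))  # True: same as temperature 1/ln(N)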
Also, so long as the function is non-negative for all inputs and positive for at least one you'll always get a valid probability distribution.
The sketch of the justification is something like this. We first need a function that maps from (-inf, inf) to a unique positive value, and then we need to normalize the resulting values. Setting aside the normalizing step, we imagine an f(x) that needs to satisfy the following properties:
1. It should be strictly positive, so that we can normalize it into a (0, 1) probability.
2. It should preserve the relative ordering of the logits to allow them to be interpreted as scores. Thus $f(x)$ should be monotonically increasing.
3. It should be continuous and differentiable everywhere, since we are interested in learning through this function via backpropagation.
4. It should have shift-invariance with respect to the input, as we don't want the model to have to learn some preferred logit-space where there is a stronger learning signal. For example, applying softmax on the values `(-1, 1, 3, 5)` would yield the same result as applying it to `(9, 11, 13, 15)`. This property can also be restated as a "scale invariance of probability ratios", where the ratio between $f(x)$ and $f(x+c)$ for a given $c$ is a constant. One useful interpretation of this property is that the learning domain or "gradient-learning surface" is stable, and high-magnitude initializations won't impede the learning process.
Taken at face value, these properties pin down the exponential, i.e. e^x (up to the choice of base, which as noted above is just a change of temperature). The last property is actually pretty debatable, because in the context of machine learning we do have a "preferred logit-space", namely close to zero, for numerical stability. But there are other ways to enforce this in a post-hoc manner (e.g. weight initialization, normalization layers, etc.).
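A quick check of property 4, using the values from that example:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    x = np.array([-1.0, 1.0, 3.0, 5.0])
    print(np.allclose(softmax(x), softmax(x + 10.0)))  # True: shifting every logit changes nothing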
Another property that uniquely justifies e^x, and thus softmax, is IIA (independence of irrelevant alternatives), which states that the odds for two classes, p_i / p_j, depend only on the logits/inputs for i and j; an irrelevant class k has no impact. For example, for Softmax([5, 7, 1]) and Softmax([5, 7, 10]), the resulting odds for the first two values (p_i/p_j) should be the same in both distributions, regardless of the third value.
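And a quick check of IIA with those same numbers:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    p = softmax(np.array([5.0, 7.0, 1.0]))
    q = softmax(np.array([5.0, 7.0, 10.0]))
    print(p[0] / p[1], q[0] / q[1])  # identical odds: the third logit is irrelevant to this ratio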
Finally, if the "desired properties" approach is not satisfying, a more theoretical route for justifying the form of the softmax uses the framework of maximum entropy (E. T. Jaynes published this in 1957 to justify the Boltzmann distribution).
TL;DR: softmax is not the only way to map unnormalized values to a probability distribution, but it can be justified through axiomatic properties.
(1) one could say that the exponential shows up from the Boltzmann distribution, but then the same question applies.
Luckily, there are more axiomatic reasons for why softmax is the preferred way to map inputs to a probability distribution.
Often there isn't any more to it than that. For example, the entire justification for least-squares error measurement is that it has convenient derivatives.
Not really, softmax transforms logits (logarithms of probabilities) into probabilities.
Probabilities → logits → back again.
Start with p = [0.6, 0.3, 0.1]. Logits = log(p) = [-0.51, -1.20, -2.30]. Softmax(logits) = original p.
NNs prefer to output logits because they are linear and range over (-inf, +inf).
(also, log(p) is not the formal definition of a logit)
In your example you'll also get the original `p` with just `exp(logits)`. Softmax normalizes the output to sum to one, so it can output a probability vector even if the input is _not_ simply `log(p)`.
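A quick check of both points with the numbers from above:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    p = np.array([0.6, 0.3, 0.1])
    logits = np.log(p)
    print(np.exp(logits))          # already recovers p; no softmax needed when the input is log(p)
    print(softmax(logits + 42.0))  # softmax also recovers p, even after an arbitrary shift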