The reason "temperature" is called such is because softmax is mathematically identical to the Boltzmann distribution [1] from thermodynamics, which describes the probability distribution of energy states of an ensemble of particles in equilibrium. In terminology more well understood by ML folks, the particles' energies will be distributed as the softmax of their negative energies divided by their temperatures (in Kelvin). Units are scaled by the Boltzmann constant (k_B).
Setting an LLM's temperature to zero is mathematically the same thing as cooling an ensemble of particles to absolute zero: in physics, the particles are all forced into their lowest energy state; in LLMs, the model is forced to deterministically predict the single most likely logit/token.
Now to draw another analogy for what happens at high temperatures: the reason a heating element glows red when it is hot is that the expectation value (mean) of energy under this softmax distribution goes up with temperature, and once the energy is high enough, the particles start shedding energy as photons energetic enough to be in the visible spectrum. Incandescent bulbs with tungsten filaments are even hotter than that heating element and glow white: as the temperature T rises further, the softmax distribution's mean energy moves higher and the distribution flattens out, roughly covering the whole visible spectrum more uniformly. In the case of the bulb, photons of all sorts of wavelengths are being spewed out; that's white light. Likewise, if you set an LLM's temperature to an absurdly high number, it spews out a very wide spectrum of mostly nonsense tokens.
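To see the "mean energy rises with temperature" claim numerically, here is a tiny illustration with made-up, evenly spaced energy levels:

    import numpy as np

    def mean_energy(energies, temp):
        # Expected energy under the softmax/Boltzmann distribution at a given temperature.
        logits = -np.asarray(energies) / temp
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return float(p @ energies)

    E = np.arange(10.0)                  # illustrative energy levels 0..9
    for T in (0.1, 1.0, 10.0, 100.0):
        print(T, mean_energy(E, T))      # the mean rises toward the flat-distribution average (4.5)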
"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1, it technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."
Why is this a "pseudo-probability distribution?"
This is exactly the sense that it comes up for old school LMs and why it appears in thermodynamics.
Of course it is entirely possible that newfangled ML people use it without understanding that it is derived from first principles - i.e. see article.
https://docs.pytorch.org/docs/2.11/generated/torch.nn.CrossE...
You have a relatively small dictionary of tokens; each one gets a neural network score that goes into the final token prediction layer, and those scores are trained with a log-softmax (i.e. the function above) to predict the next token.
This is exactly how anyone in any field does conditional multinomial/categorical distributions (i.e. picking one of a bunch of distinct tokens), and AFAIK it is what LLMs generally use as the loss function on their output layer, though I have not deeply investigated all of them, since this has been how you do it since time immemorial.
I am extremely confused by all of the people screaming it's not a probability distribution?!?!?
I have seen computer vision tasks use binomial training objectives (one-vs-all) and then use the multinomial only at inference time, and it would be fair to say that is not a probability distribution induced by training (though it is technically a probability distribution in the sense that it is >= 0 and sums to 1).
But AFAIK the token-prediction LLMs I am aware of use the softmax for the probability in their loss function, i.e. they maximize the log softmax.
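A minimal PyTorch sketch of that loss for next-token prediction (the batch size and vocabulary size are made up; this just shows that maximizing the log softmax of the correct token is the same thing as minimizing cross entropy):

    import torch
    import torch.nn.functional as F

    vocab_size, batch = 50_000, 4
    logits = torch.randn(batch, vocab_size)           # scores from the final prediction layer
    targets = torch.randint(0, vocab_size, (batch,))  # the actual next tokens

    # Negative mean log-softmax of the correct token...
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs[torch.arange(batch), targets].mean()

    # ...is exactly PyTorch's cross-entropy loss on the raw logits.
    ce = F.cross_entropy(logits, targets)
    print(torch.allclose(nll, ce))  # True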
The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.
It's true that the PyTorch API conflates cross entropy and softmax, but they are separate concepts.
Actually I believe that most of the time, even after softmax, sampling is way too permissive, occasionally accepting low-quality candidates. We have all had the experience of seeing frontier LLMs sometimes put a word in a different language that is really off-putting and almost impossible to explain, or make other odd errors in just a single word of the output: most of the time this is not what the model wanted to say, but sampling casually selecting a low-quality token. I believe a better approach is to have a strong filter on which candidates are acceptable, like in the example here: https://antirez.com/news/142
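For the flavor of it, a minimal sketch of that kind of candidate filter (the temperature value and the relative-probability cutoff below are my own illustrative choices, not the scheme from the linked post):

    import numpy as np

    def filtered_sample(logits, temperature=0.7, min_ratio=0.1, rng=None):
        # Sample a token, but only from candidates whose probability is at least
        # `min_ratio` times the probability of the best candidate.
        rng = rng or np.random.default_rng()
        z = np.asarray(logits, dtype=float) / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        keep = p >= min_ratio * p.max()   # drop the long tail of low-quality candidates
        p = np.where(keep, p, 0.0)
        p /= p.sum()
        return int(rng.choice(len(p), p=p))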
If you take the instances of softmax you find in training/inference (and there turn out to be a few) and swap in other things like entmax or sparsemax, you see across-the-board improvements. And top-1 is often just the best answer anyway; there's a reason temp=0 is the way to go when you're doing tool calls. Do you really want creative Unicode tokens when writing bash commands? From what I can tell, most of the time softmax is the worst answer that works.
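For reference, a minimal numpy sketch of sparsemax (the Euclidean projection of the logits onto the probability simplex, following Martins & Astudillo, 2016); this is just one of the mentioned alternatives, not a claim about what any particular model ships with:

    import numpy as np

    def sparsemax(z):
        # Projects z onto the probability simplex; low-scoring candidates get exactly zero mass.
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]
        k = np.arange(1, len(z) + 1)
        cumsum = np.cumsum(z_sorted)
        support = 1 + k * z_sorted > cumsum      # which sorted entries stay in the support
        tau = (cumsum[support][-1] - 1) / k[support][-1]
        return np.maximum(z - tau, 0.0)

    print(sparsemax([2.0, 1.0, -1.0]))  # [1., 0., 0.]: the tail is cut off entirely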
Training on cost-to-go loss is good enough. Perfect cost-to-go eliminates the need for global algorithms and allows local decision making. Given “natural” datasets it is probably the best thing to attempt to learn. The fact that probabilistic graphical models never really worked proves it somewhat.
I am vaguely aware of some of this, but would love to study it more; I don't quite understand what this is all about. (I do see how LLMs can attend to all prior tokens, so you don't have the single point of failure that HMMs have, which is what makes Viterbi decoding more necessary there.)
How do you identify what the model wanted to say?
On a tangential note, I keep noticing "why x matters" and "it's crucial here", which just remind me of Claude. Recently Claude has been gaslighting me on complex problems with such statements, and seeing them in an article is low-key infuriating at this point. I can't trust Claude anymore on the most complex problems, where it sometimes gets the answer right but completely misses the point and introduces huge, complex blocks of code and logic with precisely "why it matters" and "this is crucial here".
This sycophancy is a serious problem and exploits a weakness in the human psyche (flattery) which may be easier for the RLHF to find reward in than genuinely correct responses.
From a theory POV you get softmax-like distributions (Gibbs distributions) by trying to balance following some energy E(x) against the entropy of the distribution. In essence, the softmax is the answer to "I want to follow the maximum of a function E(x), but I need to maintain some level of uncertainty".
The balancing coefficient between entropy and picking the maximum of the function is called "temperature" (following the behavior of particles in a physical system: The colder the system, the lower the chance of having particles randomly walk away from the minimal energy state).
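To make that concrete, a sketch of the standard maximum-entropy formulation (written with a score $x_i$ to be maximized, i.e. a negative energy, to match the LLM convention):

    \max_p \; \sum_i p_i x_i \; + \; T \Big( -\sum_i p_i \log p_i \Big)
    \quad \text{subject to} \quad \sum_i p_i = 1, \; p_i \ge 0

Setting the derivative of the Lagrangian to zero gives $\log p_i = x_i/T + \text{const}$, i.e. $p_i = \exp(x_i/T) / \sum_j \exp(x_j/T)$, which is exactly softmax(x/T).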
Specifically, the temperature enters as
softmax(x/temp)
If you take temp -> 0, the softmax approaches an argmax (with temp = 0 being a literal argmax). If you increase the temperature, you move closer to the "random fluctuations" regime, leaving more room for sampling x values that are not the maximum of x. (This is why, e.g., LLMs become deterministic as you decrease temp -> 0.)
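A quick numerical illustration (the logit values are arbitrary):

    import numpy as np

    def softmax(x, temp=1.0):
        z = np.asarray(x, dtype=float) / temp
        z -= z.max()                    # stability shift; doesn't change the result
        e = np.exp(z)
        return e / e.sum()

    x = [2.0, 1.0, 0.5]
    print(softmax(x, temp=0.01))   # ~[1, 0, 0]: effectively an argmax
    print(softmax(x, temp=1.0))    # the plain softmax
    print(softmax(x, temp=100.0))  # ~uniform: plenty of room for non-maximal samples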
Using a base other than e implicitly changes the temperature:
N^x = exp(ln(N) x)
The normalization works the same, since you are still dividing a positive value N^x by the sum over all alternatives, sum_i(N^x_i), which is a normalization by design.
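A quick check of that identity (arbitrary logits, base N = 3):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    x = np.array([0.2, 1.5, -0.7])
    N = 3.0
    base_N = N**x / np.sum(N**x)                        # "softmax" computed in base N
    print(np.allclose(base_N, softmax(np.log(N) * x)))  # True: same as temperature 1/ln(N)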
Also, so long as the function is non-negative for all inputs and positive for at least one you'll always get a valid probability distribution.
The sketch of the justification is something like this. We first need a function that maps from (-inf, inf) to a unique positive value, and then we need to normalize the resulting values. Setting aside the normalizing step, we imagine an f(x) that needs to satisfy the following properties:
1. It should be strictly positive, so that we can normalize it into a (0, 1) probability.
2. It should preserve the relative ordering of the logits to allow them to be interpreted as scores. Thus $f(x)$ should be monotonically increasing.
3. It should be continuous and differentiable everywhere, since we are interested in learning through this function via backpropagation.
4. It should have shift-invariance with respect to the input, as we don't want the model to have to learn some preferred logit-space where there is a stronger learning signal. For example, applying softmax on the values `(-1, 1, 3, 5)` would yield the same result as applying it to `(9, 11, 13, 15)`. This property can also be restated as a "scale invariance of probability ratios", where the ratio between $f(x)$ and $f(x+c)$ for a given $c$ is a constant. One useful interpretation of this property is that the learning domain or "gradient-learning surface" is stable, and high-magnitude initializations won't impede the learning process.
Taken at face value, these properties pin down the exponential, i.e. e^x (up to the choice of base, which as noted above is just a change of temperature). The last property is actually pretty debatable, because in the context of machine learning we do have a "preferred logit-space", namely close to zero, for numerical stability. But there are other ways to enforce this in a post-hoc manner (e.g. weight initialization, normalization layers, etc.).
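A quick check of property 4, using the values from that example:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    x = np.array([-1.0, 1.0, 3.0, 5.0])
    print(np.allclose(softmax(x), softmax(x + 10.0)))  # True: shifting every logit changes nothing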
Another property that uniquely justifies e^x, and thus softmax, is IIA (independence of irrelevant alternatives), which states that the odds for two classes, p_i / p_j, depend only on the logits/inputs for i and j; an irrelevant class k has no impact. For example, for Softmax([5, 7, 1]) and Softmax([5, 7, 10]), the resulting odds for the first two values (p_i/p_j) should be the same in both distributions, regardless of the third value.
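And a quick check of IIA with those same numbers:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    p = softmax(np.array([5.0, 7.0, 1.0]))
    q = softmax(np.array([5.0, 7.0, 10.0]))
    print(p[0] / p[1], q[0] / q[1])  # identical odds: the third logit is irrelevant to this ratio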
Finally, if the "desired properties" approach is not satisfying, a more theoretical route for justifying the form of the softmax uses the framework of maximum entropy (E. T. Jaynes published this in 1957 to justify the Boltzmann distribution).
TL;DR: softmax is not the only way to map unnormalized values to a probability distribution, but it can be justified through axiomatic properties.
(1) one could say that the exponential shows up from the Boltzmann distribution, but then the same question applies.
Luckily, there are more axiomatic reasons for why softmax is the preferred way to map inputs to a probability distribution.
Often there isn't any more to it than that. For example, the entire justification for least-squares error measurement is that it has convenient derivatives.
Not really, softmax transforms logits (logarithms of probabilities) into probabilities.
Probabilities → logits → back again.
Start with p = [0.6, 0.3, 0.1]. Logits = log(p) = [-0.51, -1.20, -2.30]. Softmax(logits) = original p.
NNs prefer to output logits because they are linear and range over (-inf, +inf).
(also, log(p) is not the formal definition of a logit)
In your example you'll also get the original `p` with just `exp(logits)`. Softmax normalizes the output to sum to one, so it can output a probability vector even if the input is _not_ simply `log(p)`.
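A quick check of both points with the numbers from above:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    p = np.array([0.6, 0.3, 0.1])
    logits = np.log(p)
    print(np.exp(logits))          # already recovers p; no softmax needed when the input is log(p)
    print(softmax(logits + 42.0))  # softmax also recovers p, even after an arbitrary shift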