Let's say you're a company providing an internet connection to a business. The business trusts you, so the bits over the wire are only compressed, not encrypted, and you know the compression scheme the business uses to send its bits to you. You're charging the business a premium for the line you manage, but you also lease that line yourself, so it's in your interest to compress what they give you as well as possible to make a profit.
Say the business's compression scheme is imperfect. They built a Huffman code from their (imperfect) model of the tokens they send; call the model q(x) (that is, they think token x shows up with probability q(x)). You've determined the true distribution, p(x) (token x actually shows up with probability p(x)).
The business's tokens show up with probability p(x), but they encode token x with about -lg(q(x)) bits (code lengths are roughly the negative log of the modeled probability, idealizing away Huffman's whole-bit rounding), giving an average token size of:
-\sum_x p(x) \lg q(x)
If you then use an optimal Huffman encoding built from p(x), you will send tokens with an average bit length of: -\sum_x p(x) \lg p(x)
How many bits, on average, do you save? Just the difference: (-\sum_x p(x) \lg q(x)) - (-\sum_x p(x) \lg p(x)) = \sum_x p(x) \lg(p(x)/q(x))
Which is the Kullback-Leibler divergence, D_{KL}(p||q). To me, this is a much more intuitive explanation. I made a blog post about it [0], if anyone cares.
[0] https://mechaelephant.com/dev/Kullback-Leibler-Divergence.ht...
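Here's a quick numeric check of that identity (a Python sketch; the three-token distribution is made up for illustration):

    import math

    # Made-up example: true token frequencies p vs. the business's model q.
    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    q = {"a": 0.25, "b": 0.25, "c": 0.5}

    # Average bits per token paying for a code built from q: -sum_x p(x) lg q(x)
    cost_theirs = -sum(p[x] * math.log2(q[x]) for x in p)
    # Average bits per token with the optimal code for p: -sum_x p(x) lg p(x)
    cost_yours = -sum(p[x] * math.log2(p[x]) for x in p)
    # The KL divergence, computed directly.
    kl = sum(p[x] * math.log2(p[x] / q[x]) for x in p)

    print(cost_theirs - cost_yours)  # ~0.3145 bits saved per token
    print(kl)                        # ~0.3145, the same number

(Real Huffman codes round code lengths up to whole bits, so the identity holds exactly only in the idealized limit, e.g. with arithmetic coding.)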
Apologies for the snark, but I can't fathom how someone who is aware of the definition of KL could fail to see the likelihood in it.
We can also apply the concept between two subjective distributions. If I'm indifferent to sports teams (very broad distribution) and you're a rabid fan of A (sharp, narrow distribution), then it might take you a long time to express a point in a way I'll understand – but conversely I might be able to express "team B is good actually" in a way that just does not compute for you.
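To make that concrete (a Python sketch; the numbers are invented): take a uniform "indifferent" distribution over four teams against a sharply peaked "fan" distribution, and measure the divergence both ways.

    import math

    def kl(p, q):
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    me  = [0.25, 0.25, 0.25, 0.25]  # indifferent: uniform over four teams
    fan = [0.97, 0.01, 0.01, 0.01]  # rabid fan of team A

    print(kl(me, fan))  # ~2.99 bits: my broad view looks very surprising under yours
    print(kl(fan, me))  # ~1.76 bits: your sharp view is less surprising under mine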
Looks like a contradiction. If you identify reality with a probability distribution (rather than just plain facts), then that requires a "true" objective probability distribution.
> If I'm indifferent to sports teams (very broad distribution) and you're a rabid fan of A (sharp, narrow distribution), then it might take you a long time to express a point in a way I'll understand – but conversely I might be able to express "team B is good actually" in a way that just does not compute for you.
That sounds far too vague for me.
This collapse in variety matches what some studies I've seen have shown: "sloppification" is not present in the base model and is only introduced during the RL phase.
Minimising*
I don't think this particular interpretation actually makes sense or would explain why KL divergence is not symmetric.
First of all, the "difference" between P and Q would be the same regardless of whether P, Q, or some other distribution is the "true" one.
For example, assume we have a coin and P(Heads)=0.4 and Q(Heads)=0.6. Now the difference between the two distributions is clearly the same irrespective of whether P, Q or neither is "true". So this interpretation doesn't explain why the KL divergence is asymmetric.
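For what it's worth, here's that coin checked numerically (Python sketch). Amusingly, for this particular pair the two directions even come out equal, because Q(Heads) = 1 - P(Heads) makes the two weighted log-ratio terms mirror each other; that's a quirk of the example, not a general property:

    import math

    def kl(ph, qh):
        # KL divergence between two coins, given their P(Heads).
        return sum(p * math.log2(p / q)
                   for p, q in [(ph, qh), (1 - ph, 1 - qh)])

    print(kl(0.4, 0.6))  # ~0.117 bits
    print(kl(0.6, 0.4))  # ~0.117 bits, equal either way for this pair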
Second, there are plausible cases where it arguably doesn't even make sense to speak of a "true" distribution in the first place.
For example, consider the probability that there was once life on Mars. Assume P(Life)=0.4 and Q(Life)=0.6. What would it even mean for P to be "true"? P and Q could simply represent the subjective beliefs of two different people, without any requirement of assuming that one of these probabilities could be "correct".
Clearly the KL divergence can still be calculated and presumably sensibly interpreted even in the subjective case. But the interpretations in this article don't help us here since they require objective probabilities where one distribution is the "true" one.
I don't think this is the case in general, because D_{KL}(P||Q) weights the log probability ratio by P(x), whereas D_{KL}(Q||P) weights it by Q(x).
So let's think it through with an example. Say P is the true distribution of English word frequencies and Q is the output of a model that's attempting to estimate it.
Say the model overestimates the frequency of some uncommon word (e.g. "ancillary"). D_{KL}(P||Q) weights the log-ratio by P(x), the actual frequency, so that error contributes little and the divergence stays small. But the model thinks the word is common, so when we take D_{KL}(Q||P), which weights by Q(x), the model's estimated frequency, the same error is weighted heavily and D_{KL}(Q||P) comes out large.
That's why it's not symmetric: the expectation is taken under the first distribution, so the "direction" of the error matters.
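A tiny version of that word-frequency example in Python (the frequencies are invented; only the direction of the error matters):

    import math

    def kl(p, q):
        return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

    # Model Q badly overestimates the rare word "ancillary".
    P = {"the": 0.70, "of": 0.29, "ancillary": 0.01}  # "true" frequencies
    Q = {"the": 0.50, "of": 0.30, "ancillary": 0.20}  # model's estimates

    print(kl(P, Q))  # ~0.28 bits: the error is down-weighted by P("ancillary") = 0.01
    print(kl(Q, P))  # ~0.64 bits: the same error is up-weighted by Q("ancillary") = 0.20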
Still, there's no avoiding the inherent asymmetry in KL divergence. To my mind, the best we can do is to say that from P's perspective, this is how weird the distribution Q looks.
But my argument also works for any other probability distribution, e.g. P(heads)=0.5 vs Q(heads)=0.99.
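And for that pair the asymmetry does show up numerically (quick Python check, same style as the coin sketch above):

    import math

    def kl(ph, qh):
        # KL divergence between two coins, given their P(Heads).
        return sum(p * math.log2(p / q)
                   for p, q in [(ph, qh), (1 - ph, 1 - qh)])

    print(kl(0.5, 0.99))  # ~2.33 bits
    print(kl(0.99, 0.5))  # ~0.92 bits: very different in the two directions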
> Still, there's no avoiding the inherent asymmetry in KL divergence.
I wasn't suggesting otherwise; I was talking about his interpretation.
David MacKay's book hand-holds a little more than Cover and Thomas, although its remit is more than just information theory.
(off-topic: here's my own "recommend everywhere" book, "Attacking Faulty Reasoning" by T. Edward Damer, https://en.wikipedia.org/wiki/Attacking_Faulty_Reasoning).
Probably a stretch, but it's interesting how divergence measures keep showing up in unexpected places.