Let's say you're a company providing an internet connection to a business. The business trusts you, so the bits over the wire are only compressed, not encrypted, and you know the compression scheme the business uses to send its bits to you. You're charging the business a premium for the line you manage, but you also lease that line yourself, so it's in your interest to compress what they give you as well as possible to make a profit.
Say the business's compression scheme is imperfect. They built a Huffman code from their (imperfect) model of the tokens they send; call the model q(x) (that is, they think token x shows up with probability q(x)). You've determined the true distribution, p(x) (token x actually shows up with probability p(x)).
The business's tokens show up with probability p(x), but they encode token x with about -lg(q(x)) bits (code lengths are roughly the negative log of the modeled probability, idealizing away Huffman's whole-bit rounding), giving an average token size of:
-\sum_x p(x) \lg q(x)
If you then use an optimal Huffman encoding built from p(x), you will send tokens with an average bit length of: -\sum_x p(x) \lg p(x)
How many bits, on average, do you save? Just the difference: (-\sum_x p(x) \lg q(x)) - (-\sum_x p(x) \lg p(x)) = \sum_x p(x) \lg(p(x)/q(x))
Which is the Kullback-Leibler divergence, D_{KL}(p||q). To me, this is a much more intuitive explanation. I made a blog post about it [0], if anyone cares.
[0] https://mechaelephant.com/dev/Kullback-Leibler-Divergence.ht...
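Here's a quick numeric check of that identity (a Python sketch; the three-token distribution is made up for illustration):

    import math

    # Made-up example: true token frequencies p vs. the business's model q.
    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    q = {"a": 0.25, "b": 0.25, "c": 0.5}

    # Average bits per token paying for a code built from q: -sum_x p(x) lg q(x)
    cost_theirs = -sum(p[x] * math.log2(q[x]) for x in p)
    # Average bits per token with the optimal code for p: -sum_x p(x) lg p(x)
    cost_yours = -sum(p[x] * math.log2(p[x]) for x in p)
    # The KL divergence, computed directly.
    kl = sum(p[x] * math.log2(p[x] / q[x]) for x in p)

    print(cost_theirs - cost_yours)  # ~0.3145 bits saved per token
    print(kl)                        # ~0.3145, the same number

(Real Huffman codes round code lengths up to whole bits, so the identity holds exactly only in the idealized limit, e.g. with arithmetic coding.)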
Apologies for the snark, but I can't fathom how someone who is aware of the definition of KL could fail to see the likelihood in it.
We can also apply the concept between two subjective distributions. If I'm indifferent to sports teams (very broad distribution) and you're a rabid fan of A (sharp, narrow distribution), then it might take you a long time to express a point in a way I'll understand – but conversely I might be able to express "team B is good actually" in a way that just does not compute for you.
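To make that concrete (a Python sketch; the numbers are invented): take a uniform "indifferent" distribution over four teams against a sharply peaked "fan" distribution, and measure the divergence both ways.

    import math

    def kl(p, q):
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    me  = [0.25, 0.25, 0.25, 0.25]  # indifferent: uniform over four teams
    fan = [0.97, 0.01, 0.01, 0.01]  # rabid fan of team A

    print(kl(me, fan))  # ~2.99 bits: my broad view looks very surprising under yours
    print(kl(fan, me))  # ~1.76 bits: your sharp view is less surprising under mine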
Looks like a contradiction. If you identify reality with a probability distribution (rather than just plain facts), then that requires a "true" objective probability distribution.
> If I'm indifferent to sports teams (very broad distribution) and you're a rabid fan of A (sharp, narrow distribution), then it might take you a long time to express a point in a way I'll understand – but conversely I might be able to express "team B is good actually" in a way that just does not compute for you.
That sounds far too vague for me.
This collapse in variety matches what some studies I've seen have shown: "sloppification" is not present in the base model and is only introduced during the RL phase.
Minimising*
I don't think this particular interpretation actually makes sense or would explain why KL divergence is not symmetric.
First of all, the "difference" between P and Q would be the same regardless of whether P, Q, or some other distribution is the "true" one.
For example, assume we have a coin and P(Heads)=0.4 and Q(Heads)=0.6. Now the difference between the two distributions is clearly the same irrespective of whether P, Q or neither is "true". So this interpretation doesn't explain why the KL divergence is asymmetric.
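For what it's worth, here's that coin checked numerically (Python sketch). Amusingly, for this particular pair the two directions even come out equal, because Q(Heads) = 1 - P(Heads) makes the two weighted log-ratio terms mirror each other; that's a quirk of the example, not a general property:

    import math

    def kl(ph, qh):
        # KL divergence between two coins, given their P(Heads).
        return sum(p * math.log2(p / q)
                   for p, q in [(ph, qh), (1 - ph, 1 - qh)])

    print(kl(0.4, 0.6))  # ~0.117 bits
    print(kl(0.6, 0.4))  # ~0.117 bits, equal either way for this pair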
Second, there are plausible cases where it arguably doesn't even make sense to speak of a "true" distribution in the first place.
For example, consider the probability that there was once life on Mars. Assume P(Life)=0.4 and Q(Life)=0.6. What would it even mean for P to be "true"? P and Q could simply represent the subjective beliefs of two different people, without any requirement of assuming that one of these probabilities could be "correct".
Clearly the KL divergence can still be calculated and presumably sensibly interpreted even in the subjective case. But the interpretations in this article don't help us here since they require objective probabilities where one distribution is the "true" one.
I don't think this is the case in general, because D_{KL}(P||Q) weights the log probability ratio by P(x), whereas D_{KL}(Q||P) weights it by Q(x).
So let's think it through with an example. Say P is the true distribution of English word frequencies and Q is the output of a model that's attempting to estimate it.
Say the model overestimates the frequency of some uncommon word (e.g. "ancillary"). D_{KL}(P||Q) weights the log-ratio by P(x), the actual frequency, so that error contributes little and the divergence stays small. But the model thinks the word is common, so when we take D_{KL}(Q||P), which weights by Q(x), the model's estimated frequency, the same error is weighted heavily and D_{KL}(Q||P) comes out large.
That's why it's not symmetric: the expectation is taken under the first distribution, so the "direction" of the error matters.
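A tiny version of that word-frequency example in Python (the frequencies are invented; only the direction of the error matters):

    import math

    def kl(p, q):
        return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

    # Model Q badly overestimates the rare word "ancillary".
    P = {"the": 0.70, "of": 0.29, "ancillary": 0.01}  # "true" frequencies
    Q = {"the": 0.50, "of": 0.30, "ancillary": 0.20}  # model's estimates

    print(kl(P, Q))  # ~0.28 bits: the error is down-weighted by P("ancillary") = 0.01
    print(kl(Q, P))  # ~0.64 bits: the same error is up-weighted by Q("ancillary") = 0.20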
Still, there's no avoiding the inherent asymmetry in KL divergence. To my mind, the best we can do is to say that from P's perspective, this is how weird the distribution Q looks.
But my argument also works for any other probability distribution, e.g. P(heads)=0.5 vs Q(heads)=0.99.
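And for that pair the asymmetry does show up numerically (quick Python check, same style as the coin sketch above):

    import math

    def kl(ph, qh):
        # KL divergence between two coins, given their P(Heads).
        return sum(p * math.log2(p / q)
                   for p, q in [(ph, qh), (1 - ph, 1 - qh)])

    print(kl(0.5, 0.99))  # ~2.33 bits
    print(kl(0.99, 0.5))  # ~0.92 bits: very different in the two directions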
> Still, there's no avoiding the inherent asymmetry in KL divergence.
I wasn't suggesting otherwise; I was talking about his interpretation.
David MacKay's book hand-holds a little more than Cover and Thomas, although its remit is more than just information theory.
(off-topic: here's my own "recommend everywhere" book, "Attacking Faulty Reasoning" by T. Edward Damer, https://en.wikipedia.org/wiki/Attacking_Faulty_Reasoning).
Probably a stretch, but it's interesting how divergence measures keep showing up in unexpected places.