There's something wrong with KL divergence

It seems like KL is not doing a good job of approximating this toy example, or am I missing something?

Just to answer this: I've come to the conclusion that there's something wrong with the KL. Here's why.

Verbatim from the KLDivLoss docs:

As with NLLLoss, the input given is expected to contain log-probabilities
and is not restricted to a 2D Tensor. The targets are interpreted as probabilities by default,
but could be considered as log-probabilities with log_target set to True.

The above plots were generated with P = F.log_softmax(*) and Q = F.softmax(*). The KL is then computed as F.kl_div(P, Q, reduction='batchmean'), exactly as the docs describe.
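Concretely, this is a minimal sketch of that setup (the random logits here are just placeholders standing in for my actual toy distributions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# placeholder logits, not the actual toy example
logits_p = torch.randn(8, 5)   # distribution being fit
logits_q = torch.randn(8, 5)   # target distribution

P = F.log_softmax(logits_p, dim=-1)  # log-probabilities (the "input")
Q = F.softmax(logits_q, dim=-1)      # probabilities (the "target")

loss = F.kl_div(P, Q, reduction='batchmean')
print(loss)
```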

The problem is that P = F.log_softmax(*) basically destroys the approximation of Q, and the KL cannot recover.

The same problem persists even with the flag log_target=True, where both P and Q are transformed via F.log_softmax(*).
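That variant looks like this (again a sketch with placeholder logits):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits_p = torch.randn(8, 5)
logits_q = torch.randn(8, 5)

# both arguments as log-probabilities, with log_target=True
P_log = F.log_softmax(logits_p, dim=-1)
Q_log = F.log_softmax(logits_q, dim=-1)

loss = F.kl_div(P_log, Q_log, reduction='batchmean', log_target=True)
print(loss)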

And here’s the correct result when both P and Q are just outputs from F.softmax(*).


Could one of the PyTorch devs verify this and correct it if that's the case?