Torch.nn.functional.kl_div result different from TF/Scipy implementation

I thought torch.nn.functional.kl_div should compute the KL divergence as defined on the Kullback–Leibler divergence Wikipedia page (the same as scipy.stats.entropy and tf.keras.losses.KLDivergence), but I cannot get the same result from a simple example. Does anyone know why?

from scipy.stats import entropy
import tensorflow as tf

# SciPy: KL divergence of [0.5, 0.5] from [0.7, 0.3]
entropy([0.5, 0.5], [0.7, 0.3])
# 0.08717669357238891

# TensorFlow gives the same value
kl_tf = tf.keras.losses.KLDivergence()
y_true = [0.5, 0.5]
y_pred = [0.7, 0.3]
kl_tf(y_true, y_pred)
# <tf.Tensor: shape=(), dtype=float32, numpy=0.08717668>
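
For reference, both values match the textbook definition KL(p || q) = sum_i p_i * log(p_i / q_i); here is a quick manual check with numpy (not part of the original snippet):

import numpy as np

p = np.array([0.5, 0.5])  # true distribution
q = np.array([0.7, 0.3])  # approximating distribution
print(np.sum(p * np.log(p / q)))  # 0.08717669357238891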

Those two above give the same KL divergence value. But when I tried to use torch.nn.functional.kl_div, the result is not the same.

import torch

y_true = [0.5, 0.5]
y_pred = [0.7, 0.3]
torch.nn.functional.kl_div(torch.FloatTensor(y_true), torch.FloatTensor(y_pred), reduction='sum')
# tensor(-1.1109)

So if we look at the documentation for kl_div, we are instructed to see KLDivLoss for details.

There it says:

the input given is expected to contain log-probabilities and is not restricted to a 2D Tensor. The targets are interpreted as probabilities by default, but could be considered as log-probabilities with log_target set to True.

In other words, the first argument should be log probs.
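For example, here is a minimal sketch of a call that should reproduce the SciPy/TF value. Note that, compared to scipy.stats.entropy(p, q), the arguments also go in the other order: the first argument is the log of the approximating distribution, and the target (true) distribution goes second:

import torch
import torch.nn.functional as F

y_true = [0.5, 0.5]
y_pred = [0.7, 0.3]
# input = log of the approximating distribution, target = the true distribution
F.kl_div(torch.tensor(y_pred).log(), torch.tensor(y_true), reduction='sum')
# tensor(0.0872)  -- matches scipy.stats.entropy([0.5, 0.5], [0.7, 0.3])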

Personally, I’d not use FloatTensor; it has been superseded as the preferred way to create tensors (by torch.tensor) for longer than it ever was the recommended way to create them.
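
For illustration (a minimal sketch, not from the original post), the torch.tensor factory covers the same use case:

import torch

# preferred: torch.tensor infers the dtype from the data, or takes dtype= explicitly
y_pred = torch.tensor([0.7, 0.3], dtype=torch.float32)
# legacy constructor, still works but no longer the recommended way
y_pred_legacy = torch.FloatTensor([0.7, 0.3])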

Best regards

Thomas

Thanks – that makes sense. I must have missed the log-probability part, although it seems odd that there is no option to pass probabilities directly as the input.

And thanks for pointing out the FloatTensor issue. I guess torch.tensor() is the recommended way now.

So the definition that the input is log probs has grown historically, but as a rule of thumb, probabilities very close to 1 are difficult to work with in floating point (try evaluating 1 - 1e-20; the point at which the result rounds back to exactly 1 is reached much sooner for float than for double). Now if you know that your probabilities are close to 1 but not close to 0, you can avoid this by working with 1 - p consistently, but using log-probabilities as the default representation of probabilities seems like a good idea in general.
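
A quick illustration of that point (not from the original discussion), comparing float32 and float64 near 1:

import torch

# 1 - 1e-8 rounds to exactly 1.0 in float32 (machine epsilon ~1.2e-7), but not in float64
p32 = torch.tensor(1 - 1e-8, dtype=torch.float32)
p64 = torch.tensor(1 - 1e-8, dtype=torch.float64)
print(p32 == 1.0)  # tensor(True)  -- the gap to 1 has been lost
print(p64 == 1.0)  # tensor(False) -- float64 still resolves it
# the same probability expressed as a log-probability is unproblematic even in float32:
print(torch.tensor(-1e-8, dtype=torch.float32))  # tensor(-1.0000e-08)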

Now this is of distinct practical relevance and crops up quite frequently.

So to me, it makes a lot of sense to work with log probs whenever actually doing something with them, but I would agree that it might be better to also mention this explicitly in the kl_div documentation (personally, I try to name such parameters log_probs or similar, but I guess that’s not trivial to change here).

Best regards

Thomas