KL divergence gives different results from TensorFlow

I’ve noticed that the PyTorch implementation of KL divergence yields different results from the TensorFlow implementation. The results differ significantly (0.20 and 0.14), and I was curious what the reason could be. Below is a small example. Any help would be much appreciated.

import tensorflow as tf
import numpy as np
import torch
from torch.distributions.kl import kl_divergence
tf.enable_eager_execution()

preds = np.array([1.9417487e-03, 9.9999997e-10, 5.8252434e-03, 9.9999997e-10, 3.8834962e-03,
 8.1553400e-02, 3.6893204e-01, 5.2427185e-01, 7.7669914e-03, 5.8252434e-03,
 9.9999997e-10]).astype('float32')

labels = np.array([1.0362695e-02, 9.9999997e-10, 9.9999997e-10, 9.9999997e-10, 3.1088084e-02,
 9.0673573e-02, 3.4974092e-01, 5.1036268e-01, 2.5906744e-03, 9.9999997e-10,
 5.1813480e-03]).astype('float32')

preds_tf = tf.distributions.Categorical(probs=tf.convert_to_tensor(preds))
labels_tf = tf.distributions.Categorical(probs=tf.convert_to_tensor(labels))
tf_res = tf.distributions.kl_divergence(preds_tf, labels_tf)

preds_torch = torch.distributions.Categorical(probs=torch.from_numpy(preds))
labels_torch = torch.distributions.Categorical(probs=torch.from_numpy(labels))
torch_res = kl_divergence(preds_torch, labels_torch)

print(tf_res.numpy(), torch_res.item())

Hi,
I have not read the source code of the distributions package, but from what I know of the C++ source code, I prefer to use torch.nn.functional.kl_div to calculate the divergence.

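Based on the source code, you should provide log-probabilities for the labels, which go in as the first argument (the input) of kl_div; the second argument (the target) is given as plain probabilities.
Notice the argument order: PyTorch computes KL(a || b) as kl_div(b.log(), a), so you need the following code to get the same result as TensorFlow.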

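import torch.nn.functional as F

preds_torch = torch.from_numpy(preds)
labels_torch = torch.from_numpy(labels)
# kl_div(input, target): input is log-probabilities, target is probabilities
out = F.kl_div(labels_torch.log(), preds_torch, reduction='sum')
print(out.item())  # 0.2038460671901703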

Also, it is equivalent to:

out = (preds_torch * (preds_torch / labels_torch).log()).sum()
print(out.item())
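
To make the argument order concrete, here is a small NumPy sketch. `kl_div_sum` is a hypothetical stand-in I wrote for illustration, not a PyTorch function; it assumes the pointwise term `target * (log(target) - input)` with a sum reduction, and the two toy distributions are made up.

```python
import numpy as np

def kl_div_sum(log_input, target):
    """Hypothetical NumPy stand-in for F.kl_div(log_input, target, reduction='sum')."""
    # Assumed pointwise term: target * (log(target) - input), then summed
    return float(np.sum(target * (np.log(target) - log_input)))

# Two made-up distributions over two outcomes
a = np.array([0.5, 0.5])
b = np.array([0.9, 0.1])

kl_a_b = kl_div_sum(np.log(b), a)  # KL(a || b): b goes in as log-probs, a as the target
kl_b_a = kl_div_sum(np.log(a), b)  # KL(b || a): arguments swapped

print(kl_a_b, kl_b_a)  # the two values differ because KL is not symmetric
```

Swapping the two arguments changes the result, so it is worth checking which direction each library computes before comparing numbers.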

In the end, I am still not sure about the distributions package. I will check it out and let you know if you are interested.

Good luck
Nik

@razvanc92

I just found the solution using the distributions package too.
As I mentioned in my previous post, the labels should be given as log-probabilities, so we need the following:

preds_torch = torch.distributions.Categorical(probs=torch.from_numpy(preds))
labels_torch = torch.distributions.Categorical(logits=torch.from_numpy(np.log(labels)))
torch_res = kl_divergence(preds_torch, labels_torch)

Note that for the target (labels_torch) we pass logits rather than probs, and we supply log(labels) rather than labels itself.
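
A possible reason the `probs=` and `logits=` constructions disagree, offered as an assumption and not confirmed anywhere in this thread: when `Categorical` is built from `probs`, PyTorch appears to clamp the probabilities to the dtype's machine epsilon (about 1.19e-7 for float32) before taking the log, so the 1e-9 entries in `labels` are treated as larger than they are. The NumPy sketch below only emulates that clamping; `kl_exact` and `kl_clamped` are hypothetical helpers, not PyTorch functions.

```python
import numpy as np

# Arrays copied from the question
preds = np.array([1.9417487e-03, 9.9999997e-10, 5.8252434e-03, 9.9999997e-10,
                  3.8834962e-03, 8.1553400e-02, 3.6893204e-01, 5.2427185e-01,
                  7.7669914e-03, 5.8252434e-03, 9.9999997e-10])
labels = np.array([1.0362695e-02, 9.9999997e-10, 9.9999997e-10, 9.9999997e-10,
                   3.1088084e-02, 9.0673573e-02, 3.4974092e-01, 5.1036268e-01,
                   2.5906744e-03, 9.9999997e-10, 5.1813480e-03])

def kl_exact(p, q):
    # Plain KL(p || q) with no clamping
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kl_clamped(p, q, eps=np.finfo(np.float32).eps):
    # Assumption: probabilities are clamped to [eps, 1 - eps] before the log
    log_p = np.log(np.clip(p, eps, 1 - eps))
    log_q = np.log(np.clip(q, eps, 1 - eps))
    return float(np.sum(p * (log_p - log_q)))

exact = kl_exact(preds, labels)
clamped = kl_clamped(preds, labels)
print(exact, clamped)
```

On these arrays the unclamped value comes out near 0.20 and the clamped one near 0.15, which is in the same range as the two numbers reported at the top of the thread. If this is indeed the cause, passing `logits=log(labels)` helps simply because the tiny entries are never clamped.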

Good luck
Nik

Could you also help me with the differences between tf/pytorch and numpy? It seems to work fine when the input is 2D, but not when the input has more than two dimensions. For example, I’m now trying a 4D array where the distributions lie along the last axis. This is my implementation:

 np.mean(np.sum(preds * np.log(preds / labels), axis=-1))

Thanks in advance.
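
For inputs with more than two dimensions, one thing worth checking is the reduction. This is only a guess, since the thread does not pin down the cause: `F.kl_div` with `reduction='mean'` averages over every element, `'batchmean'` divides the total by the first dimension only, and the NumPy line above averages one KL value per distribution. The sketch below uses random 4D data with a made-up shape to show that these three numbers differ.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (2, 3, 4, 5)  # hypothetical shape: distributions of size 5 on the last axis

def random_dists(shape):
    x = rng.random(shape) + 1e-3  # keep every entry strictly positive
    return x / x.sum(axis=-1, keepdims=True)  # normalise along the last axis

preds = random_dists(shape)
labels = random_dists(shape)

pointwise = preds * np.log(preds / labels)

# The formula from the post: one KL value per distribution, then the mean
per_dist_mean = np.mean(np.sum(pointwise, axis=-1))
# Divide by the number of elements, as reduction='mean' would
elementwise_mean = pointwise.sum() / pointwise.size
# Divide by the first dimension only, as reduction='batchmean' would
batch_mean = pointwise.sum() / shape[0]

# Flattening to 2D first gives the same per-distribution mean
flat = np.mean(np.sum(pointwise.reshape(-1, shape[-1]), axis=-1))

print(per_dist_mean, elementwise_mean, batch_mean, flat)
```

If the 2D case matches and the 4D case does not, comparing against these three variants should show which normalisation the other library is applying.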

@razvanc92 Sorry for the late reply; I was dealing with a bunch of problems.
To be frank, I could not get the same output for randomly generated numbers using either nn.functional.kl_div or the formula itself. Could you post your last message as a separate question?
And please mention me there too, so I can follow what is really happening.
Maybe other, more experienced users can help us too.