KL divergence gives different results from TensorFlow

razvanc92 · September 26, 2019, 2:11pm

I’ve noticed that the PyTorch implementation of KL divergence yields different results from the TensorFlow implementation. The results differ significantly (0.20 for TensorFlow and 0.14 for PyTorch) and I was curious what the reason could be. Below you can find a small example. Any help would be much appreciated.

import tensorflow as tf
import numpy as np
import torch
from torch.distributions.kl import kl_divergence
tf.enable_eager_execution()

preds = np.array([1.9417487e-03, 9.9999997e-10, 5.8252434e-03, 9.9999997e-10, 3.8834962e-03,
 8.1553400e-02, 3.6893204e-01, 5.2427185e-01, 7.7669914e-03, 5.8252434e-03,
 9.9999997e-10]).astype('float32')

labels = np.array([1.0362695e-02, 9.9999997e-10, 9.9999997e-10, 9.9999997e-10, 3.1088084e-02,
 9.0673573e-02, 3.4974092e-01, 5.1036268e-01, 2.5906744e-03, 9.9999997e-10,
 5.1813480e-03]).astype('float32')

preds_tf = tf.distributions.Categorical(probs=tf.convert_to_tensor(preds))
labels_tf = tf.distributions.Categorical(probs=tf.convert_to_tensor(labels))
tf_res = tf.distributions.kl_divergence(preds_tf, labels_tf)

preds_torch = torch.distributions.Categorical(probs=torch.from_numpy(preds))
labels_torch = torch.distributions.Categorical(probs=torch.from_numpy(labels))
torch_res = kl_divergence(preds_torch, labels_torch)

print(tf_res.numpy(), torch_res.item())

Nikronic · September 26, 2019, 9:24pm

Hi,
I have not read the source code of the distributions package, but based on what I know of the C++ source code, I prefer using the torch.nn.functional.kl_div function to calculate the divergence.

https://github.com/pytorch/pytorch/blob/35fed93b1ef05175143f883c6f89f06c6dd9429b/aten/src/ATen/native/Loss.cpp#L71

Tensor kl_div(const Tensor& input, const Tensor& target, int64_t reduction) {
  auto zeros = at::zeros_like(target);
  auto output_pos = target * (at::log(target) - input);
  auto output = at::where(target > 0, output_pos, zeros);
  return apply_loss_reduction(output, reduction);
}

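Based on the source code, kl_div expects its first argument (input) as log-probabilities, while the second argument (target) is given as plain probabilities.
Notice that the argument order is reversed relative to the usual notation: F.kl_div(b.log(), a) computes KL(a || b). So you need to use the following code to get the same result as TensorFlow.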

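import torch.nn.functional as F

preds_torch = torch.from_numpy(preds)
labels_torch = torch.from_numpy(labels)
# First argument: log-probabilities of labels; second: probabilities of preds
out = F.kl_div(labels_torch.log(), preds_torch, reduction='sum')
print(out.item())  # 0.2038460671901703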

Also, it is equivalent to:

out = (preds_torch * (preds_torch / labels_torch).log()).sum()
print(out.item())
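
As a torch-free sanity check, here is a minimal sketch that mirrors the element-wise rule from the quoted C++ code, `target * (log(target) - input)` where `target > 0`, with a sum reduction, and compares it with the plain formula. The helper name `kl_div_sum` is made up for this illustration and is not a PyTorch function.

```python
import math

preds = [1.9417487e-03, 9.9999997e-10, 5.8252434e-03, 9.9999997e-10, 3.8834962e-03,
         8.1553400e-02, 3.6893204e-01, 5.2427185e-01, 7.7669914e-03, 5.8252434e-03,
         9.9999997e-10]
labels = [1.0362695e-02, 9.9999997e-10, 9.9999997e-10, 9.9999997e-10, 3.1088084e-02,
          9.0673573e-02, 3.4974092e-01, 5.1036268e-01, 2.5906744e-03, 9.9999997e-10,
          5.1813480e-03]

def kl_div_sum(log_input, target):
    """Hypothetical helper: sum of target * (log(target) - input) where target > 0."""
    total = 0.0
    for x, t in zip(log_input, target):
        if t > 0:
            total += t * (math.log(t) - x)
    return total

# input = log(labels), target = preds, which gives KL(preds || labels)
log_labels = [math.log(q) for q in labels]
via_kl_div_rule = kl_div_sum(log_labels, preds)

# The same quantity written as sum(p * log(p / q))
via_formula = sum(p * math.log(p / q) for p, q in zip(preds, labels))

print(via_kl_div_rule, via_formula)  # both approximately 0.2038
```

Both numbers agree with the 0.2038 printed by F.kl_div above, which supports reading F.kl_div(b.log(), a) as KL(a || b).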

In the end, I am really not sure about the distributions package yet. I will check it out and let you know if you are interested.

Good luck
Nik

Nikronic · September 26, 2019, 9:32pm

@razvanc92

I just found a solution using the distributions package too.
As I mentioned in the previous post, the labels should be passed as log-probabilities, so based on that we need the following:

preds_torch = torch.distributions.Categorical(probs=torch.from_numpy(preds))
labels_torch = torch.distributions.Categorical(logits=torch.from_numpy(np.log(labels)))
torch_res = kl_divergence(preds_torch, labels_torch)

Note that for the target (labels_torch) we pass logits instead of probs, and we supply log(labels) rather than labels itself.
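
One possible reason the probs/logits choice matters, offered as an assumption and not confirmed anywhere in this thread: when a Categorical is built from probs, PyTorch appears to floor the probabilities at the float32 machine epsilon (about 1.19e-7) before taking logs, so the 1e-9 entries in labels are treated as much larger than they are. The sketch below reproduces that effect in plain Python. The names `kl` and `FLOAT32_EPS` are invented for this illustration.

```python
import math

preds = [1.9417487e-03, 9.9999997e-10, 5.8252434e-03, 9.9999997e-10, 3.8834962e-03,
         8.1553400e-02, 3.6893204e-01, 5.2427185e-01, 7.7669914e-03, 5.8252434e-03,
         9.9999997e-10]
labels = [1.0362695e-02, 9.9999997e-10, 9.9999997e-10, 9.9999997e-10, 3.1088084e-02,
          9.0673573e-02, 3.4974092e-01, 5.1036268e-01, 2.5906744e-03, 9.9999997e-10,
          5.1813480e-03]

# float32 machine epsilon (2**-23); assumed here to be the floor applied to probs
FLOAT32_EPS = 2.0 ** -23

def kl(p, q, floor=0.0):
    """KL(p || q); both vectors are floored at `floor` before taking logs."""
    total = 0.0
    for pi, qi in zip(p, q):
        log_p = math.log(max(pi, floor))
        log_q = math.log(max(qi, floor))
        total += pi * (log_p - log_q)
    return total

kl_exact = kl(preds, labels)                        # no flooring
kl_floored = kl(preds, labels, floor=FLOAT32_EPS)   # 1e-9 entries raised to eps

print(round(kl_exact, 4), round(kl_floored, 4))
```

Under this assumption the unfloored value is about 0.20, matching TensorFlow, while the floored value comes out close to the 0.14 reported for PyTorch. Almost all of the gap comes from the two positions where preds is about 5.8e-3 and labels is 1e-9.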

Good luck
Nik

razvanc92 · September 27, 2019, 12:35pm

Could you also help me with the differences between tf/pytorch and numpy? It seems to work fine when the input is 2-D, but not when the input has more than two dimensions. For example, I’m now trying a 4-D array where the distributions are on the last axis. This is my implementation:

 np.mean(np.sum(preds * np.log(preds / labels), axis=-1))

Thanks in advance.
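
One plausible cause of the N-d mismatch, not confirmed in this thread, is the reduction. In F.kl_div, reduction='sum' adds all terms, 'batchmean' divides that sum by the size of the first dimension only, and 'mean' divides by the total number of elements. For 2-D input, 'batchmean' coincides with a mean over distributions, but for 4-D input none of the three does. The sketch below uses numpy only; `kl_last_axis` and the random test data are made up for illustration.

```python
import numpy as np

def kl_last_axis(p, q):
    """KL(p || q) for distributions stored along the last axis.

    Terms with p == 0 contribute 0; assumes q > 0 wherever p > 0.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    mask = p > 0
    safe_p = np.where(mask, p, 1.0)  # avoid log(0) in masked-out positions
    safe_q = np.where(mask, q, 1.0)
    terms = np.where(mask, p * (np.log(safe_p) - np.log(safe_q)), 0.0)
    return terms.sum(axis=-1)

# Illustrative 4-D data: shape (2, 3, 4, 5), each last-axis slice sums to 1
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5), size=(2, 3, 4))
q = rng.dirichlet(np.ones(5), size=(2, 3, 4))

per_dist = kl_last_axis(p, q)            # shape (2, 3, 4)
mean_over_dists = per_dist.mean()        # np.mean(np.sum(..., axis=-1))
total = per_dist.sum()                   # analogous to reduction='sum'
batchmean_like = total / p.shape[0]      # analogous to reduction='batchmean'
elementwise_mean = total / p.size        # analogous to reduction='mean'

print(mean_over_dists, batchmean_like, elementwise_mean)
```

To compare against the numpy expression in the question, the sum-reduced value has to be divided by the number of distributions (the product of all leading dimensions), not by the batch size or the element count.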

Nikronic · September 29, 2019, 4:57pm

@razvanc92 Sorry for the late reply, I was dealing with a bunch of problems.
To be frank, I could not get the same output for randomly generated numbers using either F.kl_div or the formula itself. Can you post your last message as a separate question?
And please mention me there too, so I can understand what is really happening.
Maybe other, more experienced users can help us too.