# KL divergence loss pytorch implementation

According to the theory, the KL divergence is the difference between the cross entropy (of the inputs and targets) and the entropy (of the targets).

I compared the KL-div loss implementation in pytorch against a custom implementation based on the above theory, but the results are not the same, and I am not sure why there is a difference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

m = nn.Softmax(dim=1)
i_tensor_before_softmax = torch.Tensor([0.1, 0.2, 0.4, 0.3])
i_tensor = m(i_tensor_before_softmax.view(-1, 4))
o_tensor_before_softmax = torch.Tensor([0.7, 0.1, 0.1, 0.1])
o_tensor = m(o_tensor_before_softmax.view(-1, 4))

kl_loss = nn.KLDivLoss(reduction="batchmean")
loss = nn.CrossEntropyLoss()
kl_output = kl_loss(input=F.log_softmax(i_tensor_before_softmax, dim=-1), target=o_tensor)
cross_ent = loss(input=i_tensor_before_softmax.view(-1, 4), target=o_tensor)
ent = loss(input=o_tensor_before_softmax.view(-1, 4), target=o_tensor)
kl_output_using_ce = cross_ent - ent
print(kl_output, kl_output_using_ce)
```

output: `tensor(0.0179) tensor(0.0716)`

Help me out in identifying the issue.
Thanks,
Krishnan

Hi Krishnan!

The short story is that your results differ by a factor of four because
`KLDivLoss` takes the mean over the class dimension, while
`CrossEntropyLoss` sums.

Your theory is correct: the KL divergence is indeed the cross entropy minus the entropy of the target distribution.

For what are presumably historical reasons (`CrossEntropyLoss` did not
always support probabilistic targets), `KLDivLoss` computes what it calls
the “pointwise” KL divergence, while pytorch does not use a notion of
“pointwise” cross entropy.
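To make that concrete, here is a small sketch (mine, not from your code) showing that `KLDivLoss` with `reduction = 'none'` returns exactly the per-class terms `target * (log target - input)`:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.1, 0.2, 0.4, 0.3])
target = torch.softmax(torch.tensor([0.7, 0.1, 0.1, 0.1]), dim=-1)
log_p = F.log_softmax(logits, dim=-1)

# pointwise KL divergence: one term per class, target * (log target - log p)
pointwise = target * (target.log() - log_p)
kl_none = torch.nn.KLDivLoss(reduction='none')(log_p, target)
assert torch.allclose(pointwise, kl_none)
```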

Let’s call the first dimension of your tensors the “batch” dimension and
the second the “class” dimension. So in your example your tensors have
shape `[nBatch = 1, nClass = 4]`.

Note that the `input` you pass to `kl_loss` is one-dimensional, of shape
`[4]` (you never `.view()` it to `[1, 4]`). `KLDivLoss`’s `reduction` of
`'batchmean'` sums the pointwise-KL-divergence tensor and then divides by
`input.size(0)`, which here is four — exactly as if it had taken the mean
over the `nClass` dimension.

`CrossEntropyLoss`, by definition, sums over the `nClass` dimension, not
dividing by four (and then, with its default `reduction = mean`, takes the
trivial mean over the `nBatch` dimension).
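Written out by hand (a sketch with illustrative names, and requiring a pytorch recent enough that `CrossEntropyLoss` accepts probabilistic targets), that is:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.1, 0.2, 0.4, 0.3]])  # shape [nBatch = 1, nClass = 4]
target = torch.softmax(torch.tensor([[0.7, 0.1, 0.1, 0.1]]), dim=1)

# cross entropy with probabilistic targets: sum over the class dimension,
# then the (here trivial) mean over the batch dimension
manual = -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
builtin = torch.nn.CrossEntropyLoss()(logits, target)
assert torch.allclose(manual, builtin)
```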

Therefore your two results differ by a factor of four.
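You can check the factor of four directly. This sketch (with illustrative names) multiplies the `'mean'`-reduced `KLDivLoss` by `nClass = 4` and recovers the cross-entropy difference:

```python
import torch
import torch.nn.functional as F

logits_i = torch.tensor([0.1, 0.2, 0.4, 0.3])
logits_o = torch.tensor([0.7, 0.1, 0.1, 0.1])
target = torch.softmax(logits_o, dim=-1)

kl_mean = torch.nn.KLDivLoss(reduction='mean')(F.log_softmax(logits_i, dim=-1), target)
ce = torch.nn.CrossEntropyLoss()(logits_i.view(1, 4), target.view(1, 4))
ent = torch.nn.CrossEntropyLoss()(logits_o.view(1, 4), target.view(1, 4))
# the 'mean' reduction divided by nClass = 4, so multiplying by 4 recovers CE - H
assert torch.allclose(4 * kl_mean, ce - ent)
```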

Consider:

```
>>> import torch
>>> print (torch.__version__)
1.13.0
>>>
>>> m = torch.nn.Softmax (dim=1)
>>> i_tensor_before_softmax = torch.Tensor ([0.1, 0.2, 0.4, 0.3])
>>> i_tensor = m (i_tensor_before_softmax.view (-1,4))
>>> o_tensor_before_softmax = torch.Tensor ([0.7, 0.1, 0.1, 0.1])
>>> o_tensor = m (o_tensor_before_softmax.view (-1,4))
>>>
>>> import torch.nn.functional as F
>>> kl_loss = torch.nn.KLDivLoss (reduction = 'batchmean')
>>> loss = torch.nn.CrossEntropyLoss()
>>> kl_output = kl_loss (input = F.log_softmax (i_tensor_before_softmax, dim=-1), target = o_tensor)
>>> cross_ent = loss (input = i_tensor_before_softmax.view (-1,4), target = o_tensor)
>>> ent = loss (input=o_tensor_before_softmax.view (-1,4), target = o_tensor)
>>> kl_output_using_ce = cross_ent - ent
>>> print (kl_output, kl_output_using_ce)
tensor(0.0179) tensor(0.0716)
>>>
>>> kl_loss = torch.nn.KLDivLoss (reduction = 'none')
>>> loss = torch.nn.CrossEntropyLoss (reduction = 'none')
>>> kl_output_unr = kl_loss (input = F.log_softmax (i_tensor_before_softmax, dim=-1), target = o_tensor)
>>> cross_ent_unr = loss (input = i_tensor_before_softmax.view (-1,4), target = o_tensor)
>>> ent_unr = loss (input=o_tensor_before_softmax.view (-1,4), target = o_tensor)
>>> kl_output_using_ce_unr = cross_ent_unr - ent_unr
>>> print (kl_output_unr)
tensor([[ 0.2151, -0.0271, -0.0686, -0.0478]])
>>> print (kl_output_using_ce_unr)
tensor([0.0716])
>>> print (kl_output_unr.mean(), kl_output_unr.sum(), kl_output_using_ce_unr)
tensor(0.0179) tensor(0.0716) tensor([0.0716])
```
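If you want the two computations to agree, one option is to have `KLDivLoss` sum instead of average. A sketch (again with illustrative names; with `nBatch = 1`, `reduction = 'sum'` and a per-batch sum coincide):

```python
import torch
import torch.nn.functional as F

logits_i = torch.tensor([0.1, 0.2, 0.4, 0.3])
logits_o = torch.tensor([0.7, 0.1, 0.1, 0.1])
target = torch.softmax(logits_o, dim=-1)

# summing the pointwise KL divergence matches cross entropy minus entropy
kl_sum = torch.nn.KLDivLoss(reduction='sum')(F.log_softmax(logits_i, dim=-1), target)
ce = torch.nn.CrossEntropyLoss()(logits_i.view(1, 4), target.view(1, 4))
ent = torch.nn.CrossEntropyLoss()(logits_o.view(1, 4), target.view(1, 4))
assert torch.allclose(kl_sum, ce - ent)
```

For `nBatch > 1` you would instead pass a properly shaped `[nBatch, nClass]` input and use `reduction = 'batchmean'`, which sums over classes and averages over the batch.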

Best.

K. Frank
