I am using torch.nn.functional.kl_div() to compute the KL divergence between the outputs of two networks. However, its output does not seem consistent with the definition of KL divergence.

For example, let's assume the normalized pred = torch.Tensor([[0.2, 0.8]]) and target = torch.Tensor([[0.1, 0.9]]).

Then the output of F.kl_div() would be:

F.kl_div(pred, target, reduction='sum', log_target=False) -> -1.0651

or

F.kl_div(pred, target, reduction='sum', log_target=True) -> 0.1354

However, if I calculate the KL divergence according to the definition:

(pred * torch.log(pred/target)).sum() —> 0.0444
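For reference, here is a minimal, self-contained script that reproduces all three numbers above:

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([[0.2, 0.8]])
target = torch.tensor([[0.1, 0.9]])

# F.kl_div with probabilities passed directly as both arguments
a = F.kl_div(pred, target, reduction='sum', log_target=False)

# F.kl_div with the target flagged as being in log-space
b = F.kl_div(pred, target, reduction='sum', log_target=True)

# KL divergence computed manually from the definition
c = (pred * torch.log(pred / target)).sum()

print(a.item())  # -1.0651
print(b.item())  # 0.1354
print(c.item())  # 0.0444
```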

Does anyone know the reason for the difference? (The torch version is 1.8.)