KL divergence loss

I'm trying to implement the KL divergence loss, but I always get nan.

p = torch.randn(100, 100)
q = torch.randn(100, 100)
kl_loss = torch.nn.KLDivLoss(reduction='sum')(p.log(), q)
# output: nan

p_soft = F.softmax(p, dim=1)
q_soft = F.softmax(q, dim=1)
kl_loss = torch.nn.KLDivLoss(reduction='sum')(p_soft.log(), q_soft)
# output: 96.7017
Do we always have to pass the distributions (p, q) through a softmax function?
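For context on the nan (a minimal check, not from the original post): torch.randn samples from a standard normal, so p contains negative entries, and log() of a negative number yields nan, which then propagates through the loss.

```python
import torch

# randn samples from N(0, 1), so p almost surely contains negative entries.
p = torch.randn(100, 100)

# log() of a negative float returns nan in PyTorch (it does not raise).
print(torch.isnan(p.log()).any())  # tensor(True)
```

Passing p through softmax first makes every entry positive (and each row sum to 1), which is why the second snippet produces a finite value.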

According to the docs:

As with NLLLoss, the input given is expected to contain log-probabilities and is not restricted to a 2D Tensor. The targets are given as probabilities (i.e. without taking the logarithm).

Your code snippet looks alright. I would recommend using log_softmax instead of softmax().log(), as the former approach is numerically more stable.
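To illustrate the stability difference (a small sketch of my own, with a deliberately extreme logit gap): plain softmax underflows to 0 and log(0) becomes -inf, while log_softmax uses the log-sum-exp trick internally and stays finite.

```python
import torch
import torch.nn.functional as F

# A logit gap of 200 makes exp(-200) underflow to 0 in float32.
logits = torch.tensor([[0.0, -200.0]])

unstable = F.softmax(logits, dim=1).log()  # softmax underflows to 0, log(0) = -inf
stable = F.log_softmax(logits, dim=1)      # log-sum-exp trick keeps the result finite

print(torch.isinf(unstable).any())  # tensor(True)
print(torch.isinf(stable).any())    # tensor(False)
```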


As I understand it, KL divergence is supposed to return a numerical value representing the distance between two probability distributions in feature space. However, this is what I got using log_softmax…

>>> k = torch.rand(256)
>>> k1 = k.clone()
>>> F.kl_div(F.log_softmax(k, 0), k1, reduction="none").mean()

On the other hand, when I use a simple log, the answer is zero, which is expected. Can you let me know what I should use when comparing two layers in PyTorch with KL divergence?

The target should be given as probabilities:

k = torch.rand(256)
k1 = k.clone()
F.kl_div(F.log_softmax(k, 0), F.softmax(k1, 0), reduction="none").mean()
> tensor(6.2333e-10)

Thank you! I was missing that :sweat_smile:

Thank you @ptrblck
There is something I don’t understand.
Let’s assume that: shape(target) = shape(input) = (batch_size, N)

Should the log_softmax / softmax be applied over dimension 0?
F.kl_div(F.log_softmax(logits, dim = 0), F.softmax(target, dim = 0), reduction="none").mean()

Or over dimension 1?
F.kl_div(F.log_softmax(logits, dim = 1), F.softmax(target, dim = 1), reduction="none").mean()

Personally, I think dimension 1 (N here plays the role of the number of classes in a classification problem).

You would apply the log_softmax in the class dimension, so usually in dim=1.
Note that my example is not really a representative one; I was just reusing the posted code.
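A small sketch of the dim=1 recipe for the (batch_size, N) case discussed above; the concrete shapes and the "batchmean" reduction are my additions:

```python
import torch
import torch.nn.functional as F

batch_size, N = 8, 10        # N acts like the number of classes
logits = torch.randn(batch_size, N)
target = torch.randn(batch_size, N)

loss = F.kl_div(
    F.log_softmax(logits, dim=1),  # input: log-probabilities, one distribution per row
    F.softmax(target, dim=1),      # target: probabilities, one distribution per row
    reduction="batchmean",         # sum over all elements, then divide by batch_size
)
print(loss.item() >= 0)  # True: KL divergence is non-negative
```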


The following issue is relevant for people using kl_div or its nn module, as its current behaviour is wrong.
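The issue link itself is missing here, so this is my reading of the behaviour in question: kl_div's "mean" reduction divides the summed loss by the total number of elements rather than by the batch size, so it does not return the mathematical KL divergence; the "batchmean" reduction does. A quick check:

```python
import torch
import torch.nn.functional as F

logp = F.log_softmax(torch.randn(4, 6), dim=1)  # batch of 4, 6 classes
q = F.softmax(torch.randn(4, 6), dim=1)

sum_loss = F.kl_div(logp, q, reduction="sum")
mean_loss = F.kl_div(logp, q, reduction="mean")            # divides by 4 * 6 = 24 elements
batchmean_loss = F.kl_div(logp, q, reduction="batchmean")  # divides by the batch size 4

print(torch.allclose(mean_loss, sum_loss / 24))      # True
print(torch.allclose(batchmean_loss, sum_loss / 4))  # True
```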