## 🐛 Bug
### Executive summary
The inputs of `KLDivLoss` and `F.kl_div` are inverted: `target` should be `input` and `input` should be `target`. This matters because the KL divergence is not symmetric, so the two orderings give different results.
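A quick illustration (with made-up distributions) of why the argument order matters: the two orientations of the divergence give different values.

```python
import torch
import torch.nn.functional as F

# Made-up distributions, just to show the asymmetry
p = torch.tensor([0.1, 0.9])
q = torch.tensor([0.5, 0.5])
print(F.kl_div(q.log(), p, reduction='sum'))  # KL(p || q)
print(F.kl_div(p.log(), q, reduction='sum'))  # KL(q || p), a different value
```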
### Further points
`torch.nn.functional.kl_div()` inverts the positional arguments for the source (P) and target (Q) distributions, and the expansion of the logarithm of the ratio shown in the documentation is incorrect. This leads to various problems, up to and including negative KL divergences (as reported in https://github.com/pytorch/pytorch/issues/32520), a Jensen-Shannon divergence that is not bounded by [0, 1] bits (or [0, log2(e) ≈ 1.4427] nats), and eventually NaNs when computing the Jensen-Shannon metric (the square root of the JSD), since KL underflow makes the JSM take the square root of negative numbers.
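To make the negative-divergence failure mode concrete, here is a minimal sketch (probability vectors chosen arbitrarily) of passing plain probabilities instead of log-probabilities as the first argument, which is what the documented formula `y*(log y − x)` suggests is fine:

```python
import torch
import torch.nn.functional as F

# Arbitrary probability vectors for illustration
p = torch.tensor([0.4, 0.6])
q = torch.tensor([0.5, 0.5])

# Passing probabilities (not log-probabilities) as the first argument:
# per the documented formula y*(log y - x), the result can be negative,
# which is impossible for a true KL divergence.
print(F.kl_div(p, q, reduction='sum'))  # prints a negative value
```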
The [documentation](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) seems wrong too, which partly explains the errors observed:
$$\ell(x, y) = L = \{l_1, \dots, l_N\}, \qquad l_n = y_n \cdot (\log y_n - x_n)$$

which should actually be

$$\ell(x, y) = L = \{l_1, \dots, l_N\}, \qquad l_n = x_n \cdot (\log x_n - \log y_n) = x_n \log x_n - x_n \log y_n$$

because the (mean-reduced) KL divergence is

$$\ell(x, y) = \frac{1}{N} \sum_n x_n \log \frac{x_n}{y_n}, \qquad l_n = x_n \log \frac{x_n}{y_n} = x_n (\log x_n - \log y_n) = x_n \log x_n - x_n \log y_n$$

Now, taking into account that $x_n$ and $y_n$ are inverted and that $x_n$ is the output of `F.log_softmax`, write $x_n = \log w_n$. The documented formula then becomes

$$l_n = y_n \cdot (\log y_n - x_n) = y_n \cdot (\log y_n - \log w_n)$$

and we fall back on the correct expression, but with the $x_n$ and $y_n$ terms inverted (more exactly, one is swapped with the antilog of the other). Any reader will read the documented formula as $\mathrm{KL}(x \parallel y)$, while it in fact computes $\mathrm{KL}(y \parallel w)$ with $x = \log w$.
The naming is also misleading: this is not a KL divergence as usually written, since `F.log_softmax` has to be computed on the first argument beforehand. I understand the numerical stability issues in computing the log of `F.softmax()`, but that should be made explicit in the documentation; the reader should be forewarned, because this issue is not trivial to grasp and deserves to be made very clear.
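A minimal check of this reading (tensor values are arbitrary): with `input = log(w)` and `target = y`, `F.kl_div` returns exactly the elementwise terms y·(log y − log w), i.e. the terms of KL(y || w):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([0.1, 0.2, 0.7])   # target distribution (arbitrary values)
w = torch.tensor([0.3, 0.3, 0.4])   # "input" distribution, passed in log-space

lhs = F.kl_div(w.log(), y, reduction='none')
rhs = y * (y.log() - w.log())       # elementwise terms of KL(y || w)
print(torch.allclose(lhs, rhs))     # True
```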
## To Reproduce
Steps to reproduce the behavior:
```python
import torch
import torch.nn.functional as F

def jensen_shannon_metric(p, q, dim=-1):  # p, q: raw logits along `dim`
    M = 0.5 * (p + q)
    JSD = 0
    # the original snippet called .log() on F.log_softmax(M, dim=dim) (a double log); fixed here
    JSD += 0.5 * F.kl_div(F.log_softmax(M, dim=dim), F.softmax(p, dim=dim), None, None, 'none')
    JSD += 0.5 * F.kl_div(F.log_softmax(M, dim=dim), F.softmax(q, dim=dim), None, None, 'none')
    return torch.sqrt(JSD)
```
## Expected behavior
The first two positional arguments are in the wrong order, matching the mathematical error in the documentation. For the general Kullback-Leibler divergence the call should read something like

`F.kl_div(F.softmax(p, dim=dim), F.log_softmax(q, dim=dim), None, None, 'none')`

and, in the context of the JSD specifically:

`F.kl_div(F.softmax(p, dim=dim), F.log_softmax(M, dim=dim), None, None, 'none')`

The second question is why we need to pass the output of `F.log_softmax` as the first (which should in fact be the second) positional argument at all. I understand that `F.log_softmax` is numerically more stable than computing `F.softmax(M, dim=dim).log()` directly, but couldn't that log be taken internally as part of a routine treatment of that term? Otherwise this is not a KL divergence, and the name is misleading.
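For comparison, a sketch of the "textbook" KL(P || Q) written directly on raw logits (the function and argument names below are illustrative, not an existing API); this is the orientation the calls above are meant to express:

```python
import torch.nn.functional as F

# Textbook KL(P || Q) = sum_n p_n * (log p_n - log q_n),
# with p_logits / q_logits raw scores along `dim` (illustrative names).
def kl_pq(p_logits, q_logits, dim=-1):
    p = F.softmax(p_logits, dim=dim)
    return (p * (F.log_softmax(p_logits, dim=dim)
                 - F.log_softmax(q_logits, dim=dim))).sum(dim)
```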
## Proposed fix for KL underflow
The KL numerical underflow issue is actually documented in Cover & Thomas (2006), section 2.3 "Relative Entropy and Mutual Information", p. 45: "we use the convention that 0 log 0/0 = 0 and the convention (based on continuity arguments) that 0 log 0/q = 0 and p log p/0 = ∞." Therefore a post-processing fix is:
```python
D_pM = p * (torch.log(p) - F.log_softmax(M, dim=dim))
D_qM = q * (torch.log(q) - F.log_softmax(M, dim=dim))
D_pM[torch.isnan(D_pM)] = 0
D_qM[torch.isnan(D_qM)] = 0
```
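Putting this together, a sketch of the full Jensen-Shannon metric with the convention applied (the function name is illustrative; it assumes `p` and `q` are already probability distributions along `dim`, so it uses `torch.log(M)` directly instead of `F.log_softmax(M)` as above):

```python
import torch

def jensen_shannon_metric_fixed(p, q, dim=-1):
    # p, q: probability distributions along `dim` (assumption for this sketch)
    M = 0.5 * (p + q)
    D_pM = p * (torch.log(p) - torch.log(M))
    D_qM = q * (torch.log(q) - torch.log(M))
    # apply the 0*log(0/0) = 0 and 0*log(0/q) = 0 conventions
    D_pM[torch.isnan(D_pM)] = 0
    D_qM[torch.isnan(D_qM)] = 0
    JSD = 0.5 * (D_pM.sum(dim) + D_qM.sum(dim))
    # clamp guards the square root against tiny negative values from round-off
    return torch.sqrt(torch.clamp(JSD, min=0))
```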
## Environment
```
Collecting environment information...
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Famille
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 457.63
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] numpydoc==1.1.0
[pip3] pytorch-metric-learning==0.9.98
[pip3] torch==1.7.0 (UPDATED TO 1.8.1 problem remains)
[pip3] torchaudio==0.7.0
[pip3] torchvision==0.8.1
[conda] blas 2.16 mkl conda-forge
[conda] cudatoolkit 10.2.89 hb195166_8 conda-forge
[conda] libblas 3.8.0 16_mkl conda-forge
[conda] libcblas 3.8.0 16_mkl conda-forge
[conda] liblapack 3.8.0 16_mkl conda-forge
[conda] liblapacke 3.8.0 16_mkl conda-forge
[conda] mkl 2020.1 216
[conda] numpy 1.19.1 py37hae9e721_0 conda-forge
[conda] numpydoc 1.1.0 py_1 conda-forge
[conda] pytorch 1.7.0 py3.7_cuda102_cudnn7_0 pytorch
[conda] pytorch-metric-learning 0.9.98 pyh39e3cac_0 metric-learning
[conda] torchaudio 0.7.0 py37 pytorch
[conda] torchvision 0.8.1 py37_cu102 pytorch
```
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @brianjo @mruberry @albanD @walterddr @rgommers @heitorschueroff