Dear Community,

I found that some people use the Kullback-Leibler divergence loss for training their models in PyTorch when the target is a one-hot vector.

I actually tried it myself, and it works well. However, I do not understand **why** it works. As given in the documentation, the loss over the output layer (of length n) is, to my understanding:

```
loss = \sum_{i=1}^{n} y_i * log(y_i / x_i)
```

with i = 1…n, y being the label and x being the probabilities estimated by the model. But that assumes y_i > 0 for all i; otherwise the loss would be undefined due to log(0), and a one-hot label contains zeros everywhere except at the true class. Therefore I do not understand how I get good results using the KLDiv loss.
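My current guess is that the sum simply skips terms where y_i = 0, following the convention 0 · log 0 = 0 (since lim_{t→0} t·log t = 0). In plain Python (this is just a sketch of that convention, not PyTorch's actual code), that would look like:

```python
import math

def kl_div(y, x):
    """sum_i y_i * log(y_i / x_i), with the convention that
    terms where y_i == 0 contribute 0 instead of producing log(0)."""
    return sum(yi * math.log(yi / xi) for yi, xi in zip(y, x) if yi > 0)

# one-hot label and some made-up model probabilities
y = [0.0, 1.0, 0.0]
x = [0.1, 0.7, 0.2]

print(kl_div(y, x))     # only the y_i = 1 term survives
print(-math.log(x[1]))  # cross-entropy of the true class: same value
```

Under that convention, with a one-hot label only the true-class term survives, so the KL loss reduces to -log(x_true), i.e. ordinary cross-entropy, which would explain why training works.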

Can someone explain what's going on inside PyTorch's KLDiv function? Is the label smoothed so that no zeros occur anymore?

I am thankful for any advice.

Cheers,

Dennis

edit: I don't know why the LaTeX formatting is not working, maybe someone can fix it