LogSoftmax vs Softmax

I’d assume that nn.LogSoftmax would give the same performance as nn.Softmax given that it is simply log on top, but it seems to provide much better results.
Is there any explanation to this?

I would say that this could be due to numerical stability reasons. It is related to (but not the same as) the negative log likelihood, where the multiplications become a summation once you work in log space. In both cases you can prevent numerical over-/underflow.
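As a quick illustration (the tensor values here are just made up to provoke the issue), applying `torch.log` on top of a separate softmax can underflow to `-inf`, while `F.log_softmax` keeps finite values:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[10.0, 1000.0, -1000.0]])

# Two-step version: the small probabilities underflow to exactly 0
# in float32, so the subsequent log() returns -inf.
naive = torch.log(F.softmax(x, dim=1))

# Fused version: computed directly in log space, so the values stay finite.
fused = F.log_softmax(x, dim=1)

print(naive)  # tensor([[-inf, 0., -inf]])
print(fused)  # tensor([[ -990., 0., -2000.]])
```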

The conversion from softmax to log softmax is basically

softmax(x_i) = e^{x_i} / sum_k e^{x_k}

log_softmax(x_i) = log(e^{x_i}) - log(sum_k e^{x_k}) = x_i - log(sum_k e^{x_k})

So, you can see that this could be numerically more stable, since you don’t have the division by the sum of exponentials (which can easily over- or underflow) there.
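As a sketch of how the remaining log(sum_k e^{x_k}) term can be computed stably (my own illustration of the log-sum-exp trick, not necessarily how PyTorch implements it internally), you can subtract the maximum logit before exponentiating:

```python
import torch

def manual_log_softmax(x, dim=-1):
    # Subtract the max logit so the largest exponent is exp(0) = 1,
    # which avoids overflow; the shift cancels out mathematically.
    shifted = x - x.max(dim=dim, keepdim=True).values
    # log_softmax(x_i) = shifted_i - log(sum_k e^{shifted_k})
    return shifted - shifted.exp().sum(dim=dim, keepdim=True).log()

x = torch.tensor([[10.0, 1000.0, -1000.0]])
print(manual_log_softmax(x))         # tensor([[ -990., 0., -2000.]])
print(torch.log_softmax(x, dim=-1))  # matches the built-in
```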
