Justification for LogSoftmax being better than Log(Softmax)

Hi KFrank!

Thanks a lot for the code example you gave; it gave me a much better understanding of this issue. I'm sharing my results and interpretation below for you and others.

Trying alpha=100 gives:

tensor([-200., -100.,    0.], dtype=torch.float64)  # log + softmax
tensor([-200., -100.,    0.])  # logsoftmax

Trying alpha=1000 gives:

tensor([-inf, -inf, 0.])  # log + softmax
tensor([-2000., -1000.,     0.])  # logsoftmax
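For completeness, here is a minimal sketch of the kind of comparison that produces these numbers. It is my reconstruction, not KFrank's exact code, and the base logits `x = [-2., -1., 0.]` are an assumption:

```python
import torch

# Hypothetical base logits; KFrank's exact values may differ.
x = torch.tensor([-2.0, -1.0, 0.0], dtype=torch.float64)

for alpha in (100.0, 1000.0):
    logits = alpha * x
    # Two-step version: softmax first, then log.
    # Once the softmax outputs underflow to 0, log returns -inf.
    print(torch.log(torch.softmax(logits, dim=0)))
    # Fused version: computed directly in log space, so no underflow.
    print(torch.log_softmax(logits, dim=0))
```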

This immediately suggests that if we apply log and softmax separately, then as soon as an output of softmax underflows to (or rounds to) zero, the log yields negative infinity.

For an even more succinct example, where the input to log underflows to zero (exp is just one way to achieve this):

torch.log(torch.exp(torch.tensor([-2000.])))  # tensor([-inf])

but if we adopt the log-softmax idea, the answer is clearly just -2000, because the log and the exp cancel analytically before any underflow can happen.
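To spell out why the fused version stays finite: log-softmax is computed from the algebraically simplified form, so the tiny exp value never has to be materialized on its own:

$$
\log \operatorname{softmax}(x)_i \;=\; \log \frac{e^{x_i}}{\sum_j e^{x_j}} \;=\; x_i - \log \sum_j e^{x_j},
$$

and in the one-element toy case this reduces to $\log(e^{-2000}) = -2000$ exactly.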

*Numerical overflow might not be relevant in this context, though, since it is ruled out by the max trick in the softmax implementation (see the snippet below).
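For reference, here is the overflow side of the story: a naive softmax built from raw exp does overflow on large positive inputs, while torch.softmax (which subtracts the max internally, as noted above) does not. The input value 1000 is just an illustration:

```python
import torch

logits = torch.tensor([1000.0, 0.0])  # large value chosen to overflow a naive exp

naive = torch.exp(logits) / torch.exp(logits).sum()  # exp(1000) -> inf, so inf/inf -> nan
stable = torch.softmax(logits, dim=0)                # max is subtracted internally

print(naive)   # tensor([nan, 0.])
print(stable)  # tensor([1., 0.])
```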

Merry Christmas,
Zhihan