Hi KFrank!
Thanks a lot for the code example you gave; it gave me a much better understanding of this issue. I’m sharing my results and interpretations below for you and others.
Try alpha=100
gives:
tensor([-200., -100., 0.], dtype=torch.float64) # log + softmax
tensor([-200., -100., 0.]) # logsoftmax
Try alpha=1000
gives:
tensor([-inf, -inf, 0.]) # log + softmax
tensor([-2000., -1000., 0.]) # logsoftmax
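For anyone who wants to reproduce this, here is a minimal sketch of the comparison. I'm assuming an input of the form `[-alpha, 0., alpha]`, which is consistent with the outputs above; your original example may differ slightly.

```python
import torch

def compare(alpha):
    x = torch.tensor([-alpha, 0., alpha], dtype=torch.float64)
    # log and softmax applied separately: for large alpha the small softmax
    # entries underflow to 0, and log(0) then returns -inf
    print(torch.log(torch.softmax(x, dim=0)))
    # fused log_softmax: stays in log space, so there is no intermediate underflow
    print(torch.log_softmax(x, dim=0))

compare(100.)   # both print [-200., -100., 0.]
compare(1000.)  # separate prints [-inf, -inf, 0.], fused prints [-2000., -1000., 0.]
```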
This immediately suggests to me that, if we apply log and softmax separately, then once an output of softmax underflows to zero, log yields negative infinity.
For an even more succinct example, where the input of log is very close to zero (exp is just one way to achieve this):
torch.log(torch.exp(torch.tensor([-2000.]))) # -inf
but if we adopt the log-softmax idea then the answer is clearly just -2000.
*Numerical overflow might not be relevant in this context though, since the max-trick in the softmax implementation already rules it out; the problem here is really underflow.
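To make that last point concrete, the max-trick rewrites log-softmax as x_i - max(x) - log(sum_j exp(x_j - max(x))), so the largest exponent is exp(0) = 1 and nothing can overflow; and because we never take the log of the tiny individual terms, the -2000 answer comes out directly. A sketch of that identity (my own illustration of the standard log-sum-exp trick, not necessarily how PyTorch implements it internally):

```python
import torch

def log_softmax_maxtrick(x, dim=0):
    # Shift by the max so the largest exponent is exp(0) = 1: no overflow possible.
    shifted = x - x.max(dim=dim, keepdim=True).values
    # log_softmax(x) = shifted - log(sum(exp(shifted)));
    # the sum is dominated by the max term, so underflow in the tiny terms is harmless
    return shifted - shifted.exp().sum(dim=dim, keepdim=True).log()

x = torch.tensor([-2000., -1000., 0.])
print(log_softmax_maxtrick(x))       # tensor([-2000., -1000., 0.])
print(torch.log_softmax(x, dim=0))   # same result
```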
Merry Christmas,
Zhihan