Error occurs when using "softmax" and "log" to replace "log_softmax"

When I use

```
x = F.softmax(x, dim=1)
x = torch.log(x)
```

to replace

```
x = F.log_softmax(x, dim=1)
```

something goes wrong. But if I use “F.log_softmax” directly, everything works. I then subtracted one result from the other and found that they are not exactly the same. I want to know how to split “F.log_softmax” into “F.softmax” and “torch.log” correctly, because I need the result of “F.softmax”. Thank you very much!
(PS: I had already subtracted the maximum to prevent softmax overflow, but training still reported an error.)

It might have to do with the log-sum-exp trick applied in F.log_softmax.
I carried out the following experiment:

import torch.nn as nn, torch, torch.nn.functional as F
from math import exp, log
x = torch.randn(5); x
tensor([ 2.229876756668091,  0.264560282230377, -0.100190632045269,
         0.228291451931000, -0.119905993342400])
a = torch.softmax(x, dim=0); a
tensor([0.681239962577820, 0.095449574291706, 0.066277280449867,
        0.092049762606621, 0.064983405172825])
torch.log(a)
tensor([-0.383840680122375, -2.349157094955444, -2.713908195495605,
        -2.385426044464111, -2.733623266220093])
F.log_softmax(x, dim=0)
tensor([-0.383840620517731, -2.349157094955444, -2.713907957077026,
        -2.385425806045532, -2.733623266220093])

We see a small difference in the values obtained:

torch.log(a) - F.log_softmax(x, dim=0)
tensor([-5.960464477539062e-08,  0.000000000000000e+00, -2.384185791015625e-07,
        -2.384185791015625e-07,  0.000000000000000e+00])

The first case (for the 3rd value) is equivalent to

log(exp(x[2])/(exp(x[0]) + exp(x[1]) + exp(x[2]) + exp(x[3]) + exp(x[4])))

while the second case is equivalent to the following (x[0] is the largest value here, so it is the one subtracted),

x[2] - x[0] - log((exp(0) + exp(x[1] - x[0]) + exp(x[2] - x[0]) + exp(x[3] - x[0]) + exp(x[4] - x[0])))

I think this is done to avoid taking the exponential of a large number.
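To make the two expressions above concrete, here is a minimal sketch in plain Python using only `math.exp` and `math.log`. The function names `log_softmax_naive` and `log_softmax_shifted` are made up for this illustration, and this is not PyTorch's actual implementation, only the same algebra written out. The input `[1000.0, 0.0]` is an invented extreme case.

```python
from math import exp, log

def log_softmax_naive(xs):
    # log(exp(x_i) / sum_j exp(x_j)), computed literally
    total = sum(exp(v) for v in xs)
    return [log(exp(v) / total) for v in xs]

def log_softmax_shifted(xs):
    # x_i - m - log(sum_j exp(x_j - m)), with m = max(xs)
    m = max(xs)
    log_total = log(sum(exp(v - m) for v in xs))
    return [v - m - log_total for v in xs]

# Values from the experiment above: both forms agree up to round-off.
x = [2.229876756668091, 0.264560282230377, -0.100190632045269,
     0.228291451931000, -0.119905993342400]
naive = log_softmax_naive(x)
shifted = log_softmax_shifted(x)
max_gap = max(abs(a - b) for a, b in zip(naive, shifted))

# Invented extreme input: exp(1000.0) does not fit in a float.
big = [1000.0, 0.0]
shifted_big = log_softmax_shifted(big)
try:
    log_softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(max_gap, shifted_big, naive_overflowed)
```

With moderate inputs the two versions differ only in the last few bits, which matches the tiny differences seen in the subtraction above. The shifted version only ever exponentiates numbers that are zero or negative, so it cannot overflow.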

Hello lidaiyu!

You most likely have two separate issues here.

log (softmax()) can have numerical issues. That’s why
you are better off using log_softmax(). It is mathematically
the same, but numerically different, and, in fact, numerically
more stable.

However, the result of the subtraction that you posted in your
screenshot (Note, please post text instead of screenshots for
textual information.) shows that the two differ only by a small
round-off error. This is to be expected.

Try converting your tensor a to DoubleTensor
(dtype = torch.float64) before performing the rest of
the computation, and the discrepancy will get smaller (although
not necessarily go to zero).
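As a rough check of this suggestion, the sketch below repeats the comparison in both precisions. It reuses the five values printed earlier in the thread and converts the input `x` (rather than `a`) to float64, so the whole computation runs in double precision; the variable names are only for illustration.

```python
import torch
import torch.nn.functional as F

# The five values from the experiment earlier in the thread (float32 by default).
x32 = torch.tensor([2.229876756668091, 0.264560282230377, -0.100190632045269,
                    0.228291451931000, -0.119905993342400])
x64 = x32.double()  # same values, stored as torch.float64

def max_gap(x):
    # Largest absolute difference between log(softmax(x)) and log_softmax(x).
    diff = torch.log(F.softmax(x, dim=0)) - F.log_softmax(x, dim=0)
    return diff.abs().max().item()

gap32 = max_gap(x32)
gap64 = max_gap(x64)
print(gap32, gap64)
```

The float32 gap should be on the order of the 1e-7 differences shown above, and the float64 gap should be many orders of magnitude smaller, though not necessarily exactly zero.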

My assumption is that you are using log_softmax() for training,
but need softmax() for some other purpose. If this is the case,
keep using log_softmax() for training, and compute softmax()
separately. (Or you can re-exponentiate log_softmax() to
recover softmax().)
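A minimal sketch of the re-exponentiation idea, assuming a batch of logits with classes along `dim=1`. The extreme values in `logits` are invented to make the difference visible.

```python
import torch
import torch.nn.functional as F

# Invented logits with a huge spread, one row of three classes.
logits = torch.tensor([[1000.0, 0.0, -1000.0]])

log_probs = F.log_softmax(logits, dim=1)  # use this for the loss
probs = log_probs.exp()                   # softmax recovered from log_softmax

# For comparison: taking log after softmax hits log(0) for the tiny entries.
naive_log = torch.log(F.softmax(logits, dim=1))

print(log_probs)
print(probs)
print(naive_log)
```

Here `log_probs` stays finite for every class, while `naive_log` contains `-inf` wherever softmax underflowed to zero, and an `-inf` in the loss is one common way for NaNs to show up in gradients.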

Good luck.

K. Frank


Thank you for your reply! It is really helpful and gives me a deeper understanding of the details of “F.log_softmax”. I printed the last linear layer’s parameters when the error occurred, and the parameters are ‘nan’. Something may have overflowed when executing ‘F.softmax’ or ‘log’; I will add ‘assert’ statements to debug the code.
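One possible shape for such assertions, as a sketch only: `checked_log_softmax` is a hypothetical helper name, and where the checks belong in the real model is an assumption.

```python
import torch
import torch.nn.functional as F

def checked_log_softmax(logits, dim=1):
    # Hypothetical debugging helper: fail early if NaN/inf is already present
    # in the input, or appears in the output.
    assert torch.isfinite(logits).all(), "non-finite values in logits"
    out = F.log_softmax(logits, dim=dim)
    assert torch.isfinite(out).all(), "non-finite values after log_softmax"
    return out

# Usage on a small made-up batch: uniform logits give log(1/3) everywhere.
out = checked_log_softmax(torch.zeros(2, 3))
print(out)
```

If the first assertion fires, the NaN came from an earlier layer or from the parameters, not from the softmax step itself. `torch.autograd.set_detect_anomaly(True)` may also help locate the operation whose backward pass first produces NaN.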

Hello KFrank!
Thank you for your comprehensive reply! It is really helpful. I am sorry that my question was unclear and hard to read. I think the small round-off error is acceptable and is not the cause of the error. Something may have overflowed, or some other error may have occurred, when executing ‘F.softmax’ or ‘log’ (although I have subtracted the maximum to prevent softmax overflow). I will continue debugging to find the real reason. In fact, I need to apply an attention map to the result of ‘F.softmax’ to calculate a new probability distribution, and then use the new probability distribution to calculate the loss. I will try to re-exponentiate log_softmax() to recover softmax(). Thank you again for your reply, it helps me a lot.
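For what it is worth, one way this could be done without ever taking `log` of a softmax output is to stay in log space. The sketch below assumes that "applying an attention map" means an elementwise multiplication by strictly positive weights followed by renormalization, which the post does not actually specify. `reweighted_log_probs`, the example tensors, and the use of `F.nll_loss` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reweighted_log_probs(logits, attention, dim=1):
    # Assumed operation: new_p is proportional to softmax(logits) * attention.
    # attention must be strictly positive so that its log is finite.
    log_probs = F.log_softmax(logits, dim=dim)
    scores = log_probs + torch.log(attention)
    # Renormalize in log space so the new distribution sums to 1.
    return scores - torch.logsumexp(scores, dim=dim, keepdim=True)

# Made-up example: one sample, three classes.
logits = torch.tensor([[2.0, 0.5, -1.0]])
attention = torch.tensor([[0.2, 0.5, 0.3]])
target = torch.tensor([1])

new_log_probs = reweighted_log_probs(logits, attention)
loss = F.nll_loss(new_log_probs, target)  # expects log-probabilities
print(new_log_probs.exp(), loss)
```

If the new probabilities themselves are needed, `new_log_probs.exp()` recovers them, in the same way as re-exponentiating log_softmax().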