Using log-softmax with cross entropy

Hello, I am doing some tests with different loss functions. Usually we use log-softmax + nll loss, or just cross-entropy loss on the raw model output, but I found that log-softmax + cross-entropy sometimes gives better results. I know this combination is not correct, because it effectively applies the log scaling twice, and the backward pass may have problems. However, for some datasets, no matter what learning rate I use for log-softmax + nll loss, it is still worse than log-softmax + cross-entropy. I am wondering what the reason or potential problem behind this result is, and also whether I can use log-softmax + cross-entropy. Thanks.
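
For concreteness, the three setups I am comparing look roughly like this (a simplified sketch, not my actual training code):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)        # raw model output
target = torch.randint(5, (8,))

# setup 1: log-softmax + nll loss
loss1 = F.nll_loss(F.log_softmax(logits, dim=1), target)

# setup 2: plain cross-entropy on the raw output
loss2 = F.cross_entropy(logits, target)

# setup 3: log-softmax + cross-entropy (the questionable combination)
loss3 = F.cross_entropy(F.log_softmax(logits, dim=1), target)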

Hi J_B!

This doesn’t make sense, as the two are the same.

Roughly speaking, the softmax() inside the second log_softmax() undoes the
log() of the first (and its output is already normalized), so two
log_softmax()s in a row are the same as just one:

>>> import torch
>>> torch.__version__
'1.13.0'
>>> t = torch.arange (10.0)
>>> t
tensor([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
>>> t.log_softmax (0)
tensor([-9.4586, -8.4586, -7.4586, -6.4586, -5.4586, -4.4586, -3.4586, -2.4586,
        -1.4586, -0.4586])
>>> t.log_softmax (0).log_softmax (0)
tensor([-9.4586, -8.4586, -7.4586, -6.4586, -5.4586, -4.4586, -3.4586, -2.4586,
        -1.4586, -0.4586])
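
To see why, note that the output of log_softmax() already consists of
log-probabilities whose exponentials sum to one, so its logsumexp() is (up
to round-off) zero and a second log_softmax() subtracts essentially nothing.
A minimal sketch of that check (not part of the session above):

import torch

t = torch.arange (10.0)
lp = t.log_softmax (0)                 # log-probabilities

# exp (lp) sums to one, so logsumexp (lp) is (about) zero ...
print (torch.logsumexp (lp, 0))

# ... and the second log_softmax() subtracts that near-zero value,
# changing (essentially) nothing
print (torch.allclose (lp.log_softmax (0), lp))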

Best.

K. Frank

Hi K. Frank,

Thanks for your reply. I ran the same test as yours and got the same results. But I am not sure about the backward pass: since log-softmax + cross-entropy actually applies log-softmax twice, is the gradient still the same as for log-softmax + nll? Thanks again for your help.

Hi J_B!

Yes, it’s the same function in the forward pass so you’ll get the same
gradient from the backward pass (up to possible floating-point round-off
error).

But it’s easy enough to test this sort of thing:

>>> import torch
>>> print (torch.__version__)
1.13.0
>>>
>>> _ = torch.manual_seed (2022)
>>>
>>> lin = torch.nn.Linear (3, 5)
>>> input = torch.randn (7, 3)
>>> target = torch.randint (5, (7,))
>>>
>>> # log_softmax() + nll_loss()
>>> pred = lin (input)
>>> lossA = torch.nn.functional.nll_loss (pred.log_softmax (1), target)
>>> lossA
tensor(1.9197, grad_fn=<NllLossBackward0>)
>>> lossA.backward()
>>> lin.weight.grad
tensor([[-0.0151,  0.2068, -0.0595],
        [-0.0683, -0.3701, -0.2444],
        [ 0.0995, -0.0853,  0.1083],
        [ 0.0341,  0.2123,  0.0165],
        [-0.0503,  0.0362,  0.1791]])
>>> lin.bias.grad
tensor([ 0.0741, -0.1813,  0.1026,  0.0301, -0.0255])
>>>
>>> lin.zero_grad()
>>>
>>> # log_softmax() + cross_entropy()
>>> pred = lin (input)
>>> lossB = torch.nn.functional.cross_entropy (pred.log_softmax (1), target)
>>> lossB
tensor(1.9197, grad_fn=<NllLossBackward0>)
>>> lossB.backward()
>>> lin.weight.grad
tensor([[-0.0151,  0.2068, -0.0595],
        [-0.0683, -0.3701, -0.2444],
        [ 0.0995, -0.0853,  0.1083],
        [ 0.0341,  0.2123,  0.0165],
        [-0.0503,  0.0362,  0.1791]])
>>> lin.bias.grad
tensor([ 0.0741, -0.1813,  0.1026,  0.0301, -0.0255])
>>>
>>> lin.zero_grad()
>>>
>>> # plain cross_entropy()
>>> pred = lin (input)
>>> lossC = torch.nn.functional.cross_entropy (pred, target)
>>> lossC
tensor(1.9197, grad_fn=<NllLossBackward0>)
>>> lossC.backward()
>>> lin.weight.grad
tensor([[-0.0151,  0.2068, -0.0595],
        [-0.0683, -0.3701, -0.2444],
        [ 0.0995, -0.0853,  0.1083],
        [ 0.0341,  0.2123,  0.0165],
        [-0.0503,  0.0362,  0.1791]])
>>> lin.bias.grad
tensor([ 0.0741, -0.1813,  0.1026,  0.0301, -0.0255])

If you do seem to be getting different results (other than possible round-off
error) with your two methods, then you’ve got some sort of bug hiding
somewhere that you should track down.
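
If you want to compare the two methods programmatically rather than by eye,
something along the following lines (a sketch, not part of the session above)
would do it, comparing the resulting gradients with torch.allclose():

import torch

torch.manual_seed (2022)

lin = torch.nn.Linear (3, 5)
input = torch.randn (7, 3)
target = torch.randint (5, (7,))

# gradient of the weights from log_softmax() + nll_loss()
torch.nn.functional.nll_loss (lin (input).log_softmax (1), target).backward()
gradA = lin.weight.grad.clone()

# gradient of the weights from plain cross_entropy()
lin.zero_grad()
torch.nn.functional.cross_entropy (lin (input), target).backward()
gradB = lin.weight.grad.clone()

# the two should agree (up to floating-point round-off)
print (torch.allclose (gradA, gradB))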

Best.

K. Frank

Hi K. Frank,

Really appreciate your reply. So does this mean that if the model output first goes through log-softmax, then the loss function can be either nll or cross-entropy, and it doesn't matter because they are actually the same (given that the input to the loss is the log-softmax output)? Am I right? Thanks again.