When to use Softmax and CrossEntropyLoss together

I am trying to understand why some code works even though, when training, we are not supposed to apply softmax to the output and then use CrossEntropyLoss to train the model.
Here it seems that softmax is used on the output and CrossEntropyLoss is the loss function, and the model still gives good results.
Why is that? Are there cases where we can use the two together?
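For reference, my understanding of the recommended setup is that the model returns raw logits and CrossEntropyLoss is applied directly to them; a minimal sketch (the toy model, shapes, and names are just illustrative):

import torch
import torch.nn as nn

model = nn.Linear(20, 5)            # toy classifier head; returns raw logits, no softmax
criterion = nn.CrossEntropyLoss()   # expects raw logits and class indices

x = torch.randn(8, 20)
y = torch.randint(0, 5, (8,))
loss = criterion(model(x), y)       # no softmax anywhere in the forward pass
loss.backward()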

I think there is no reason to combine softmax and CrossEntropyLoss in the referenced repo.

In most cases, CrossEntropyLoss is essentially LogSoftmax followed by NLLLoss applied in sequence.
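We can verify that equivalence directly (a minimal sketch; the random logits and targets are only for illustration):

>>> import torch
>>> import torch.nn as nn
>>> logits = torch.randn(3, 5)                    # raw model outputs (no softmax)
>>> target = torch.tensor([1, 0, 4])              # class indices
>>> ce = nn.CrossEntropyLoss()(logits, target)
>>> nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
>>> torch.allclose(ce, nll)
True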
We can easily check what happens when a tensor gets softmaxed twice.

>>> softmax = nn.Softmax(dim=0)
>>> a = torch.rand(10)
>>> a
tensor([0.9364, 0.9396, 0.1604, 0.8733, 0.0822, 0.5598, 0.1300, 0.7800, 0.5182,
        0.9108])
>>> b = softmax(a)
>>> b
tensor([0.1342, 0.1347, 0.0618, 0.1260, 0.0571, 0.0921, 0.0599, 0.1148, 0.0884,
        0.1309])
>>> bb = softmax(b)
>>> bb
tensor([0.1034, 0.1035, 0.0962, 0.1026, 0.0958, 0.0992, 0.0960, 0.1014, 0.0988,
        0.1031])

The values in the doubly-softmaxed tensor are heavily flattened (there is no dominant class at all).
https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss

This is what I thought, and I was surprised that the model worked. Do you know why that is? I saw another post where someone said that the values might become too similar after using softmax together with the cross entropy loss function.

Of course the model can still work with a double softmax, but
I guess the model output before the softmax might have somewhat dramatic values like

out = torch.tensor([1000., 0.02, 0.1, ..., 0.0015, ..., 0.00001])  # indicates class 0

This would probably lead to unstable learning :(
Here is a sample test:

>>> softmax = nn.Softmax(dim=0)
>>> a = torch.tensor([1000, 0.1, 0.1, 0.2, 0.5])
>>> a
tensor([1.0000e+03, 1.0000e-01, 1.0000e-01, 2.0000e-01, 5.0000e-01])
>>> b = softmax(a)
>>> b
tensor([1., 0., 0., 0., 0.])
>>> bb = softmax(b)
>>> bb
tensor([0.4046, 0.1488, 0.1488, 0.1488, 0.1488])

I see, so in cases like that with extreme values that give weird results, applying softmax twice helps to tackle it.
Also, I assume this can arise even when the values are normalised, correct?

I think one use case could be something similar to Temperature softmax? i.e. you don’t want your model to be very confident in predicting something, so you flatten the predicted distribution.
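For instance, dividing the logits by a temperature T > 1 before the softmax flattens the distribution (a rough sketch; printed values are approximate):

>>> import torch
>>> logits = torch.tensor([4.0, 1.0, 0.5])
>>> torch.softmax(logits, dim=0)          # temperature T = 1
tensor([0.9259, 0.0461, 0.0280])
>>> torch.softmax(logits / 2.0, dim=0)    # temperature T = 2 flattens the distribution
tensor([0.7159, 0.1597, 0.1244])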

I see, that seems to agree with thecho7.

I agree.
I think normalization cannot solve the problem.
As @shivammehta007 mentioned, if we need a smoothed label, I’d rather use Temperature softmax than double softmax lol.
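For example, even a tensor that is already normalised to sum to 1 gets pushed toward uniform by an extra softmax (a quick sketch; the printed values are rounded):

>>> import torch
>>> p = torch.tensor([0.7, 0.1, 0.1, 0.05, 0.05])   # already a probability distribution
>>> torch.softmax(p, dim=0)                          # applying softmax again flattens it
tensor([0.3183, 0.1747, 0.1747, 0.1662, 0.1662])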

So, is everything clear?


Yes, things seem to make sense.