When to use Softmax and CrossEntropyLoss together

I am trying to understand why some code works even though, during training, we are not supposed to apply softmax to the model output and then train with CrossEntropyLoss.
Here it seems that softmax is used as the output layer and CrossEntropyLoss as the loss function, and the model still gives good results.
Why is that? Are there cases where we can use the two together?

I think there is no reason to combine softmax and CrossEntropyLoss in the referenced repo.

CrossEntropyLoss already combines LogSoftmax and NLLLoss in sequence, so it expects raw logits as input.
We can easily check what happens when a tensor gets softmaxed twice.

>>> import torch
>>> import torch.nn as nn
>>> loss = nn.Softmax(dim=0)
>>> a = torch.rand(10)
>>> a
tensor([0.9364, 0.9396, 0.1604, 0.8733, 0.0822, 0.5598, 0.1300, 0.7800, 0.5182,
>>> b = loss(a)
>>> b
tensor([0.1342, 0.1347, 0.0618, 0.1260, 0.0571, 0.0921, 0.0599, 0.1148, 0.0884,
>>> bb = loss(b)
>>> bb
tensor([0.1034, 0.1035, 0.0962, 0.1026, 0.0958, 0.0992, 0.0960, 0.1014, 0.0988,

The values in the resulting tensor are heavily flattened (there is no dominant class at all).
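To put a number on the flattening, here is a rough sketch in plain Python (standard library only, so it does not depend on a PyTorch version). The helper names and the example logits are made up for illustration; `cross_entropy` follows the LogSoftmax-then-NLLLoss description above for a single sample, and is not PyTorch's actual implementation.

```python
import math

def log_softmax(xs):
    # Numerically stable log-softmax: shift by the max before exponentiating.
    m = max(xs)
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]

def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]

def cross_entropy(inputs, target):
    # LogSoftmax followed by NLLLoss, for one sample.
    return -log_softmax(inputs)[target]

# Hypothetical raw model outputs; class 0 is the correct class.
logits = [4.0, 1.0, 0.5, 0.2]
target = 0

loss_on_logits = cross_entropy(logits, target)          # intended usage
loss_on_probs = cross_entropy(softmax(logits), target)  # softmax applied twice overall

print(f"loss on raw logits:      {loss_on_logits:.4f}")
print(f"loss on softmax outputs: {loss_on_probs:.4f}")
```

With these logits the model is already fairly confident about class 0, yet the loss computed on the softmax outputs comes out noticeably higher, because the second softmax squeezes the probabilities back together.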

This is what I thought too, and I was surprised that the model worked. Do you know why this is the case? I saw another post saying that the values can become too similar when softmax is used together with the cross-entropy loss function.

Of course the model can still work with a double softmax, but
I guess the model output before the softmax then needs to take fairly extreme values, like

out = torch.tensor([1000., 0.02, 0.1, ..., 0.0015, ..., 0.00001]) # indicates class 0

That probably makes training unstable :(
Here is a sample test:

>>> loss = nn.Softmax(dim=0)
>>> a = torch.tensor([1000, 0.1, 0.1, 0.2, 0.5])
>>> a
tensor([1.0000e+03, 1.0000e-01, 1.0000e-01, 2.0000e-01, 5.0000e-01])
>>> b = loss(a)
>>> b
tensor([1., 0., 0., 0., 0.])
>>> bb = loss(b)
>>> bb
tensor([0.4046, 0.1488, 0.1488, 0.1488, 0.1488])
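The 0.4046 in the output above is as confident as a double softmax can get with 5 classes: the second softmax only ever sees inputs between 0 and 1, and the best case is an exact one-hot vector. A small plain-Python sketch of that ceiling follows (the function names are made up for illustration):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def max_prob_after_double_softmax(num_classes):
    # Best case: the first softmax is exactly one-hot, so the second
    # softmax sees [1, 0, ..., 0] and returns e / (e + K - 1) for the top class.
    return math.e / (math.e + num_classes - 1)

one_hot = [1.0, 0.0, 0.0, 0.0, 0.0]
top = softmax(one_hot)[0]
bound = max_prob_after_double_softmax(5)

# The cross-entropy loss can never fall below this value with a double softmax.
loss_floor = -math.log(bound)

print(f"top probability: {top:.4f}")
print(f"ceiling for K=5: {bound:.4f}")
print(f"loss floor:      {loss_floor:.4f}")
```

So even a perfectly trained model cannot drive the loss to zero in this setup, and the ceiling gets lower as the number of classes grows.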

I see, so in cases like this, where the outputs are extreme, applying softmax twice helps to tame them.
Also, I assume this can arise even when the values are normalised, correct?

I think one use case could be something similar to temperature softmax, i.e. you don't want your model to be very confident in predicting something, so you flatten the predicted distribution.
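For reference, a minimal plain-Python sketch of temperature softmax, assuming the usual definition where the logits are divided by a temperature T before the softmax (in PyTorch this would be along the lines of `torch.softmax(logits / T, dim=-1)`). The function name and example values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # T > 1 flattens the distribution, T < 1 sharpens it, T = 1 is plain softmax.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.2]  # hypothetical raw model outputs

sharp = softmax_with_temperature(logits, temperature=1.0)
smooth = softmax_with_temperature(logits, temperature=4.0)

print("T=1:", [round(p, 4) for p in sharp])
print("T=4:", [round(p, 4) for p in smooth])
```

Unlike a second softmax, the amount of flattening is controlled by a single explicit knob, and setting T back to 1 recovers the ordinary softmax.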

I see, that seems to agree with thecho7.

I agree.
I think normalization cannot solve the problem.
As @shivammehta007 mentioned, if we need a smoothed label, I’d rather use Temperature softmax than double softmax lol.

So, is everything clear?


Yes, things seem to make sense.