Loss criterion for classification task (argmax(softmax) vs. label or softmax vs. label)

So I am using the cross entropy loss function in my CNN classification task. But I am not sure whether it is appropriate to compare my labels, which are integers starting from 0 (e.g. 0, 1, 2), with the outputs, which are softmax values in the range 0 to 1.
If we compare these two to find the loss, won’t that be really inaccurate? Should I apply argmax to my outputs first, to convert them to integers, before comparing with my actual labels? Thanks in advance

Hi L!

PyTorch’s CrossEntropyLoss takes the raw output of your model,
that is, the output of your model’s final Linear layer without any
following softmax() (or other “activation”). These are to be understood
as the unnormalized log-probabilities of each of the three classes.
(CrossEntropyLoss has log_softmax() built into it.)

CrossEntropyLoss knows how to compare your integer class labels
with the (unnormalized log-probabilities) outputs of your model.

Just to emphasize, you should not have any sort of softmax()
between your model’s last Linear layer and CrossEntropyLoss.
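
To make that concrete, here is a minimal sketch (with made-up shapes, logits,
and labels) of feeding the raw Linear-layer output and the integer class
labels straight into CrossEntropyLoss:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()

    # hypothetical batch of 4 samples and 3 classes
    logits = torch.randn(4, 3, requires_grad=True)   # raw output of the final Linear layer, no softmax()
    labels = torch.tensor([0, 2, 1, 2])              # integer class labels, starting from 0

    loss = criterion(logits, labels)   # log_softmax() is applied internally
    loss.backward()                    # gradients flow back through the raw logits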

No, independent of the previous discussion, those integers are discrete
and therefore “not differentiable.” So using argmax() in this way will
“break the computation graph” and prevent backpropagation. (You
would certainly do this to compute a performance metric like accuracy,
but not to compute a loss function that needs to be differentiable in
order to backpropagate.)
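
Here is a quick sketch (again with made-up tensors) of why argmax() is fine
for a metric like accuracy but cannot sit on the path to the loss:

    import torch

    logits = torch.randn(4, 3, requires_grad=True)   # pretend these are your model's outputs
    labels = torch.tensor([0, 2, 1, 2])

    preds = logits.argmax(dim=-1)    # integer class predictions
    print(preds.requires_grad)       # False -- argmax() detaches from the computation graph

    accuracy = (preds == labels).float().mean()   # fine as a performance metric, not as a loss
    print(accuracy.item())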

Best.

K. Frank

Hi Frank, thank you for answering my question. So if I need to feed the output of my Linear layer directly into my CrossEntropyLoss function, when should I apply the softmax? Or do I not actually need the softmax to find out which class I should classify into?

Hi L!

Correct, you do not need softmax() to predict a specific class. This
is because we (usually) predict a specific class by taking the specific
class for which the predicted probability – the result of softmax() – is
largest.

But softmax() doesn’t change the relative ordering of its inputs.
That is, letting pred be the output of your model (and thus the
unnormalized log-probabilities predicted by your model),
pred.softmax(-1).argmax() == pred.argmax(). So you can
simply apply argmax() to the output of your model to get the
specific predicted class without first applying softmax() (but you
can apply softmax() – it doesn’t hurt anything except for taking
a tiny bit of extra time).
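
For example, a small sketch with random logits:

    import torch

    pred = torch.randn(5, 3)   # hypothetical unnormalized log-probabilities

    # softmax() is monotonic, so the largest logit stays the largest probability
    print(torch.equal(pred.softmax(-1).argmax(-1), pred.argmax(-1)))   # True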

Best.

K. Frank


I have printed out the output right after nn.Linear and realized that it is a negative number and a positive number for classification of 2 classes. Will CrossEntropyLoss work fine with negative values?

Hi L!

The output of your final Linear layer should be understood as
unnormalized log-probabilities. What would a negative number
mean in this context?

What happens when you pass a negative number through
log_softmax(), as CrossEntropyLoss does internally? What
happens when you pass a negative number through softmax()?
How should the result of that be interpreted?

Let pred be the output of your final Linear layer.
What is the difference between pred.softmax(-1)
and (pred - 10).softmax(-1)?

Best.

K. Frank

Yeah, that’s why I am not sure where the negative values come from. The output before I apply the softmax is, for example, tensor([6.0575, -5.3307]), and after I apply softmax() it is tensor([9.9999e-01, 1.1327e-05]). log_softmax() gives me tensor([-1.1325e-05, -1.1389e+01]). But the result of argmax() is the same throughout. And the result of (pred - 10).softmax(-1) is the same as pred.softmax(-1). Thank you.
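
For reference, those observations can be reproduced with a quick check like
this (using the example logits quoted above):

    import torch

    pred = torch.tensor([6.0575, -5.3307])   # example output of the final Linear layer

    print(pred.softmax(-1))       # tensor([9.9999e-01, 1.1327e-05])
    print(pred.log_softmax(-1))   # tensor([-1.1325e-05, -1.1389e+01])

    # softmax() depends only on the differences between the logits, so shifting
    # every logit by a constant leaves the probabilities unchanged ...
    print(torch.allclose(pred.softmax(-1), (pred - 10).softmax(-1)))   # True

    # ... and argmax() is the same for logits, probabilities, and log-probabilities
    print(pred.argmax(), pred.softmax(-1).argmax(), pred.log_softmax(-1).argmax())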