What classification loss should I choose when I have used a softmax function?

As the title says: my model has to output the result of softmax, and I want to apply a loss to it.
I found that NLLLoss expects the output of log_softmax. If I just take the log of the softmax result, is that right?
As for nn.CrossEntropyLoss(), its input must not already have softmax applied.
Could you please tell me which loss I should choose?

nn.CrossEntropyLoss combines log_softmax and NLLLoss, which means you should not apply softmax to your network output.
So you are not required to apply softmax since the criterion takes care of it.

If you want to use softmax at the end, then you should apply log after that (as you mentioned above) and use NLLLoss as the criterion.
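As a rough check of the options described above, the sketch below compares the three routes on made-up data. The tensor shapes, seed, and variable names are assumptions chosen only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # made-up batch: 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 2])   # class indices

# Route 1: raw logits straight into cross_entropy (no softmax in the model)
loss_ce = F.cross_entropy(logits, target)

# Route 2: log_softmax followed by nll_loss
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Route 3: softmax, then log, then nll_loss (the setup asked about)
loss_manual = F.nll_loss(torch.log(F.softmax(logits, dim=1)), target)

print(loss_ce.item(), loss_nll.item(), loss_manual.item())
```

For well-behaved inputs like these, the three values should agree to within floating-point tolerance.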

If I do that, wouldn’t back propagation be a problem?

Doing what would cause a problem during backprop?

I don’t know whether using softmax and log separately, instead of log_softmax, will cause a problem during backpropagation.

I don’t think it will cause any problem. It’s still the same as using log_softmax. Maybe you can test your custom function to make sure it is consistent with log_softmax.

Hi Raghul and Chunchun!

Just to clarify:

log(softmax()) is mathematically the same as log_softmax(),
but they differ numerically. softmax() calculates exponentials that
can “blow numbers up.” The log() then undoes this, but the damage
can already be done. So log(softmax()) can be numerically unstable,
leading to reduced precision and nans, and can cause problems.

log_softmax() (largely) avoids this by reorganizing the calculation
so that the intermediate blow-up doesn’t occur. (That’s why PyTorch
(and other packages) include it as a separate function.)
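A small sketch of the numerical difference, using an artificially extreme pair of logits (the values are an assumption picked to make the effect visible). In this particular example the damage shows up as a probability that rounds to exactly zero, so the following log() returns -inf.

```python
import torch

# Made-up logits with a very large gap between the two classes
logits = torch.tensor([[1000.0, 0.0]])

# softmax followed by log: the small probability becomes exactly 0 in
# float32, and log(0) is -inf
naive = torch.log(torch.softmax(logits, dim=1))

# log_softmax: the same quantity computed without the intermediate step
stable = torch.log_softmax(logits, dim=1)

print(naive)   # contains -inf
print(stable)  # stays finite
```

A loss built on the first result can turn into inf or nan for such inputs, while the second stays usable.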

There is usually no reason to use softmax(). Just feed the last linear
layer of your network (that you would have fed into softmax()) into
cross_entropy() as your loss function (or use log_softmax()
followed by nll_loss()).

If somebody forces you to use softmax() then you’re stuck, and
have to deal with the potential numerical instability of softmax()
followed by log().

Good luck!

K. Frank


Thanks a lot for your answer.
That’s very clear, but I must use a layer that outputs probabilities.
Maybe I can use sigmoid + BCELoss?

Hello Chunchun!

In general, there is no particular need to use probabilities to feed
into your loss function.

If your use case requires probabilities for some other reason,
perhaps you could explain why you need them and what you
need to use them for.

For training, you should use (based on what you’ve said so far)
a linear layer that outputs numbers from -inf to +inf (that are
to be understood as logits) fed into cross_entropy() as your
loss function. This will all be part of “autograd” and you will
back-propagate through it.

Then, if you need actual probabilities for some other reason,
take the outputs of your linear layer and, inside a
with torch.no_grad(): block so you don’t affect your gradient
calculation, run them through softmax() to convert the logits
to the probabilities you want.
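The workflow described above might look roughly like the following. The single nn.Linear layer, the feature and class counts, and the random data are placeholders assumed for the sketch, not anything from the original question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(5, 3)               # placeholder model: 5 features, 3 classes
x = torch.randn(4, 5)                 # made-up inputs
target = torch.tensor([0, 2, 1, 2])   # made-up class indices

# Training path: logits go directly into cross_entropy and are part of autograd
logits = model(x)
loss = F.cross_entropy(logits, target)
loss.backward()

# Probabilities for other uses, computed outside the gradient calculation
with torch.no_grad():
    probs = F.softmax(logits, dim=1)

print(probs.sum(dim=1))  # each row should sum to (approximately) 1
```

The probabilities here are a side output only; the loss never sees them.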

BCELoss (binary cross-entropy) is, in essence, the special two-class
case of the multi-class cross_entropy() loss.
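One way to see the two-class relationship: softmax over the pair of logits (0, z) gives sigmoid(z) for the second class, so the two losses should match. The sketch below checks this on made-up values; padding with a zero logit is an assumption made only for the comparison.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(4)                         # one logit per sample (made-up)
target_idx = torch.tensor([1, 0, 1, 0])    # class indices for cross_entropy
target_float = target_idx.float()          # same labels as floats for BCE

# Two-class cross-entropy with logits (0, z) for classes (0, 1)
two_class_logits = torch.stack([torch.zeros_like(z), z], dim=1)
loss_multi = F.cross_entropy(two_class_logits, target_idx)

# Binary cross-entropy on the single logit z
loss_binary = F.binary_cross_entropy_with_logits(z, target_float)

print(loss_multi.item(), loss_binary.item())
```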

sigmoid() --> BCELoss has the same numerical problems as
softmax() --> log() --> nll_loss(). If you are performing a
binary (two-class) classification problem, you will want to feed
the (single) output of your last linear layer into
binary_cross_entropy_with_logits() (BCEWithLogitsLoss).
(This is the binary analog of cross_entropy() (CrossEntropyLoss).)

And again, if you need the actual probability (which you don’t for
training), you would run the output of your last linear layer through
sigmoid() (under with torch.no_grad():) to get the probability.
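For the binary case, a minimal sketch along the same lines. The random logits stand in for the single output of a last linear layer, and the targets are made up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, requires_grad=True)   # stand-in for the last layer's output
target = torch.tensor([1.0, 0.0, 1.0, 0.0])   # binary labels as floats

# Preferred for training: the loss takes logits directly
loss_logits = F.binary_cross_entropy_with_logits(logits, target)
loss_logits.backward()

# For comparison only: sigmoid followed by binary_cross_entropy
loss_probs = F.binary_cross_entropy(torch.sigmoid(logits), target)

# Probabilities for other uses, kept out of the gradient calculation
with torch.no_grad():
    probs = torch.sigmoid(logits)

print(loss_logits.item(), loss_probs.item(), probs)
```

On moderate inputs like these the two losses should agree closely; the logits version is the one expected to hold up better when the logits become large in magnitude.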

Good luck!

K. Frank


Thank you very much!