What classification loss should I choose when I have used a softmax function?

Hello Chunchun!

In general, there is no particular need to use probabilities to feed
into your loss function.

If your use case requires probabilities for some other reason,
perhaps you could explain why you need them and what you
need to use them for.

For training, you should use (based on what you’ve said so far)
a linear layer that outputs numbers from -inf to +inf (that are
to be understood as logits) fed into cross_entropy() as your
loss function. This will all be part of “autograd” and you will
back-propagate through it.
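
For concreteness, here is a minimal sketch of that setup (the layer
sizes, batch size, and variable names are made up for illustration):

```python
import torch
import torch.nn.functional as F

num_features, num_classes, batch_size = 10, 5, 4    # hypothetical sizes

model = torch.nn.Linear(num_features, num_classes)  # last layer emits raw logits
inputs = torch.randn(batch_size, num_features)
targets = torch.randint(num_classes, (batch_size,)) # integer class labels

logits = model(inputs)                    # values in (-inf, +inf); no softmax here
loss = F.cross_entropy(logits, targets)   # applies log_softmax + nll_loss internally
loss.backward()                           # autograd back-propagates through it all
```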

Then, if you need actual probabilities for some other reason,
take the outputs of your linear layer and, under
with torch.no_grad(): (so you don’t affect your gradient
calculation), run them through softmax() to convert the logits
to the probabilities you want.
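
As a sketch (continuing from the snippet above, where logits is the
output of the final Linear layer):

```python
with torch.no_grad():                  # keep this out of the gradient computation
    probs = F.softmax(logits, dim=1)   # each row now sums to 1
```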

BCELoss (binary cross-entropy) is, in essence, the special two-class
case of the multi-class cross_entropy() loss.
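
You can check this equivalence numerically. In this toy example
(values chosen arbitrarily), the class-0 logit is pinned at zero, so
the single binary logit z plays the role of the class-1 logit:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([1.5])    # a single logit for the "positive" class
y = torch.tensor([1.0])    # binary target

bce = F.binary_cross_entropy(torch.sigmoid(z), y)

# the same loss via the multi-class path: logits [0, z], integer target 1
ce = F.cross_entropy(torch.tensor([[0.0, 1.5]]), torch.tensor([1]))

print(bce.item(), ce.item())   # both come out to about 0.2014
```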

Using sigmoid() followed by BCELoss has the same numerical
problems as using softmax() followed by log() followed by
nll_loss(). If you are performing a
binary (two-class) classification problem, you will want to feed
the (single) output of your last linear layer into
binary_cross_entropy_with_logits() (BCEWithLogitsLoss).
(This is the binary analog of cross_entropy() (CrossEntropyLoss).)
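
A minimal sketch of the binary case (again, the sizes and names here
are just for illustration):

```python
import torch
import torch.nn.functional as F

binary_head = torch.nn.Linear(10, 1)      # single output for the two-class problem
inputs = torch.randn(4, 10)
targets = torch.randint(2, (4,)).float()  # binary targets as 0.0 / 1.0 floats

logit = binary_head(inputs).squeeze(1)    # raw logit; no sigmoid before the loss
loss = F.binary_cross_entropy_with_logits(logit, targets)
loss.backward()
```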

And again, if you need the actual probability (which you don’t for
training), you would run the output of your last linear layer through
sigmoid() (under with torch.no_grad():) to get the probability.
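
Continuing the binary sketch above:

```python
with torch.no_grad():              # probabilities only; gradients are unaffected
    prob = torch.sigmoid(logit)    # probability of the "positive" class
```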

Good luck!

K. Frank
