Why have a model output raw logits instead of softmax?

A general question: if I were training a classification model that I then wanted to use for inference, wouldn’t it always be preferable to have a softmax layer at the end, instead of a regular linear layer whose output is fed into a cross-entropy loss function?

Wouldn’t the former make the network’s outputs easier to interpret, and also simplify inference? All you would need is a forward call and nothing else, unlike the latter case where you have to post-process the network’s output with some extra code.

I think it’s because logit.argmax() equals softmax(logit).argmax() during inference.

But a logit is just a tensor. Why would that hold? Does something change when you put the model in eval() mode?
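Nothing about eval() is needed for this: softmax is a strictly monotonic function of each logit, so it never changes which entry is largest. A minimal sketch with made-up logit values:

```python
import torch

# softmax preserves the ordering of the logits, so argmax is unchanged.
# This is a plain tensor property -- model.eval() plays no role here.
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1, 3.0, -2.0]])
probs = torch.softmax(logits, dim=1)

print(logits.argmax(dim=1))  # tensor([0, 1])
print(probs.argmax(dim=1))   # tensor([0, 1])
```

(eval() only affects layers such as Dropout and BatchNorm; it does not alter how the final Linear layer or softmax compute their outputs.)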

Hi Vishak!

No. For reasons of numerical stability, it is better to have your model
output the logits from its final Linear layer so that PyTorch can use the
log-sum-exp trick, either in CrossEntropyLoss or in LogSoftmax.
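To see the stability issue concretely, here is a small sketch with an exaggerated logit value: taking log(softmax(x)) directly underflows to -inf, while log_softmax, which applies the log-sum-exp trick internally, stays finite.

```python
import torch

# An extreme logit to provoke underflow in the naive computation.
logits = torch.tensor([[1000.0, 0.0]])

# Naive: softmax first, then log. e^0 / e^1000 underflows to 0,
# and log(0) is -inf.
naive = torch.log(torch.softmax(logits, dim=1))

# Stable: log_softmax computes log(p) without forming p explicitly.
stable = torch.log_softmax(logits, dim=1)

print(naive)   # tensor([[0., -inf]])
print(stable)  # tensor([[0., -1000.]])
```

CrossEntropyLoss performs the same stable computation internally, which is why it expects raw logits rather than probabilities.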

Yes, looking at probabilities rather than logits could be easier to
understand. But it’s not worth the potential damage to the numerical
stability of the computation. It’s also easy enough to run the
logits through softmax() outside of the model (and loss function) if
you want to look at probabilities.

If you were only going to use this version of your model for inference
(that is, not train it), you could include softmax() in your model. You
would still have the potential for reduced numerical stability, but it’s
less likely to cause trouble than when you’re training.

But, again, if you really need probabilities for something (and usually
you don’t), it’s really not that hard or confusing to add a call to
softmax() outside of your model.
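For illustration, a minimal sketch of that pattern, using a hypothetical one-layer classifier (the 4 input features and 3 classes are made up):

```python
import torch
import torch.nn as nn

# Hypothetical classifier: its last layer outputs raw logits.
model = nn.Sequential(nn.Linear(4, 3))
model.eval()

x = torch.randn(2, 4)
with torch.no_grad():
    logits = model(x)                     # raw logits -- what CrossEntropyLoss wants
    preds = logits.argmax(dim=1)          # class predictions need no softmax
    probs = torch.softmax(logits, dim=1)  # probabilities, only when you need them

print(probs.sum(dim=1))  # each row sums to 1
```

The model itself stays softmax-free, so the same module works unchanged for both training (with CrossEntropyLoss) and inference.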


K. Frank


That was great. Thanks!