A general question. If I were training a classification model that I would then want to use for inference, wouldn't it always be preferable to have a softmax layer at the end, instead of a regular linear layer whose output is fed into a cross-entropy loss function?
Wouldn't the former make the network's outputs easier to interpret, and also simplify inference, since all you need is a forward call, unlike in the latter case where you have to post-process the network's output with some extra code?
No. For reasons of numerical stability, it is better to have your model
output the logits from its final Linear layer so that PyTorch can use the log-sum-exp trick, either in CrossEntropyLoss or in LogSoftmax.
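To see the stability issue concretely, here is a small sketch (the extreme logit value is made up purely for illustration). Taking log (softmax (...)) by hand underflows to -inf, while log_softmax, which applies the log-sum-exp trick internally, stays finite:

```python
import torch

logits = torch.tensor([[1000.0, 0.0]])  # exaggerated logits for illustration
target = torch.tensor([0])

# Naive two-step version: softmax underflows exp(-1000) to an exact 0,
# so the subsequent log produces -inf for that class.
naive_log_probs = torch.log(torch.softmax(logits, dim=1))

# Stable version: log_softmax computes the same quantity with the
# log-sum-exp trick and returns a finite -1000.0 instead.
stable_log_probs = torch.log_softmax(logits, dim=1)

# CrossEntropyLoss applies log_softmax internally, so feeding it raw
# logits is stable as well.
loss = torch.nn.CrossEntropyLoss()(logits, target)
```

This is why CrossEntropyLoss documents that it expects raw logits, not probabilities.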
Yes, looking at probabilities rather than logits could be easier to
understand. But it’s not worth the potential damage to the numerical
stability of the computation. It’s also easy enough to run the
logits through softmax() outside of the model (and loss function) if
you want to look at probabilities.
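For example (with a hypothetical toy classifier, just to show the pattern), you keep the model's output as logits and convert only when you want to look at probabilities:

```python
import torch
import torch.nn as nn

# Hypothetical toy classifier whose final Linear layer outputs raw logits.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

x = torch.randn(2, 4)
with torch.no_grad():
    logits = model(x)                     # what you feed to CrossEntropyLoss
    probs = torch.softmax(logits, dim=1)  # only for human inspection
```

The model itself stays unchanged; softmax() is applied outside of it, after the fact.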
If you were only going to use this version of your model for inference
(that is, not train it), you could include softmax() in your model. You
would still have the potential for reduced numerical stability, but it’s
less likely to cause trouble than when you’re training.
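If you do go that route, one way to sketch it (assuming a hypothetical already-trained logits-producing model) is to wrap the trained model together with a Softmax layer for the inference-only copy:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a model you have already trained on logits.
trained = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

# Inference-only wrapper: appending Softmax is tolerable here because no
# loss (and no backward pass) will be computed through it.
inference_model = nn.Sequential(trained, nn.Softmax(dim=1)).eval()

with torch.no_grad():
    probs = inference_model(torch.randn(2, 4))  # rows sum to 1
```

The original trained model is untouched, so you can still use it (logits and all) if you ever need to fine-tune.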
But, again, if you really need probabilities for something (and usually
you don’t), it’s really not that hard or confusing to add a call to softmax() outside of your model.