I am new to PyTorch, and while going through the MNIST example I saw that in the last layer we had provided no activation in the forward function. Would there be any difference if I added a softmax activation function at the output layer?
The last non-linearity depends on the loss function you are using.
For a classical multi-class classification use case, you could use e.g. nn.CrossEntropyLoss as your criterion. This loss function expects logits (not probabilities!), since F.log_softmax will be applied internally.
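For illustration, here is a minimal sketch (the layer sizes and architecture are placeholders, not the actual MNIST example) where the forward pass returns raw logits and the softmax is left to the criterion:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # raw logits, no activation here

model = Net()
criterion = nn.CrossEntropyLoss()  # applies log_softmax internally

x = torch.randn(8, 784)               # dummy batch
target = torch.randint(0, 10, (8,))   # dummy class indices
loss = criterion(model(x), target)
```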
If you apply softmax manually in combination with nn.CrossEntropyLoss, your model will most likely not learn properly or will get stuck after a few iterations.
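One way to see the problem: once the logits have been squashed to probabilities, the internal softmax inside the criterion maps them to a nearly uniform distribution, so the loss stays in a narrow band around log(num_classes) no matter how confident the model is. A quick sketch with dummy data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 10) * 5       # dummy, fairly confident logits
target = torch.randint(0, 10, (8,))

loss_ok = nn.CrossEntropyLoss()(logits, target)
loss_wrong = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), target)

# loss_wrong is confined to a narrow band near log(10) ~ 2.3,
# so the gradient signal is heavily dampened
print(loss_ok, loss_wrong)
```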
Alternatively, you could also use F.log_softmax with nn.NLLLoss, which is equivalent to passing raw logits to nn.CrossEntropyLoss.
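A quick check of that equivalence with dummy data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 10)            # dummy model output
target = torch.randint(0, 10, (8,))

loss_ce = nn.CrossEntropyLoss()(logits, target)
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

print(torch.allclose(loss_ce, loss_nll))  # True (up to floating point)
```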