Activation function for last layer

I am new to PyTorch, and while going through the MNIST example I saw that no activation was applied to the last layer in the forward function. Would there be any difference if I added a softmax activation function at the output layer?

The last non-linearity depends on the loss function you are using.
For a classical multi-class classification use case, you could use e.g. nn.CrossEntropyLoss as your criterion. This loss function expects logits (not probabilities!), since F.log_softmax is applied internally.
If you apply softmax manually in combination with nn.CrossEntropyLoss, your model most likely won't learn properly or will get stuck after a few iterations.
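Here is a minimal sketch of that setup (the layer sizes and batch are made up for illustration): the model's last layer returns raw logits, and nn.CrossEntropyLoss handles the normalization internally.

```python
import torch
import torch.nn as nn

# Hypothetical classifier for flattened 28x28 MNIST images.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # no softmax here -- outputs are raw logits
)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 784)               # dummy batch of inputs
target = torch.randint(0, 10, (32,))   # dummy class indices

logits = model(x)                      # unnormalized scores
loss = criterion(logits, target)       # log_softmax + NLL applied internally
loss.backward()
```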

Alternatively, you could apply F.log_softmax in the model and use nn.NLLLoss as your criterion, which is equivalent to passing raw logits to nn.CrossEntropyLoss.
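You can verify the equivalence numerically with random logits and targets (shapes chosen arbitrarily here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(32, 10)
target = torch.randint(0, 10, (32,))

loss_ce = nn.CrossEntropyLoss()(logits, target)
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

print(torch.allclose(loss_ce, loss_nll))  # True
```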
