Is softmax mandatory in a neural network?

I was wondering whether softmax is a must-have in a multi-class (more than 2 classes) classification neural network. I was reading some Stack Overflow topics and saw some people saying it's necessary to have softmax at the last layer, and others saying it's not necessary to include it because CrossEntropyLoss applies softmax itself, so I am not sure whether it really is necessary. (Here is the link to the discussion: pytorch - Do I need to apply the Softmax Function ANYWHERE in my multi-class classification Model? - Stack Overflow)
As far as I know, what softmax does is rescale the outputs so that each lies between 0 and 1 and they sum to 1. I would appreciate it if you could clear up the following points:

  1. Is it a must to have a softmax in the last layer in the case of multi-class classification?
  2. If we have softmax in the last layer, would it affect the calculation of the cross-entropy loss?
  3. If we don't include it in the last layer, how would it affect the accuracy calculation and classification in general? Thanks for your answers in advance.

As you said, the softmax function turns the raw outputs of a net (logits) into a probability distribution summing to 1. For multi-class classification this is required as long as you expect the model to predict a single class, as you would typically calculate the loss with a negative log likelihood loss function (NLLLoss). Conceptually, the expected (target) distribution is a one-hot vector (whose sum is 1), and NLLLoss measures the difference between it and your model's prediction (normalized with softmax; note that PyTorch's NLLLoss actually expects log-probabilities, i.e. the output of LogSoftmax, and integer class indices as targets).

Logits can be negative, and their sum can take any value. The exponential in the softmax formula maps every logit to a positive value, and the normalization makes the results sum to 1, which is very convenient for the vast majority of cases.
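To make this concrete, here is a minimal sketch (the tensor values are made up) showing softmax turning all-negative logits into a valid probability distribution:

```python
import torch

# Hypothetical logits for a 3-class problem; all negative
logits = torch.tensor([-1.0, -2.0, -3.0])

# softmax exponentiates (making everything positive) then normalizes
probs = torch.softmax(logits, dim=-1)
print(probs)        # all values positive
print(probs.sum())  # tensor(1.)
```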

Now, multi-label classification can instead be tackled with a sigmoid activation function, which also maps negative logits to positive values, but independently for each class: the results will not sum to 1. With softmax, a slight difference between two logit values can result in probabilities with a large gap (as they are jointly normalized), whereas a sigmoid would produce close probabilities in the same situation.
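A small illustration of that contrast (the logit values are arbitrary): the same pair of close logits is spread apart by softmax but stays close under sigmoid:

```python
import torch

# Two logits that differ only slightly (illustrative values)
logits = torch.tensor([2.0, 2.5])

# softmax normalizes jointly, so the small gap is amplified
probs_softmax = torch.softmax(logits, dim=-1)
print(probs_softmax)  # roughly [0.378, 0.622]

# sigmoid squashes each logit independently; both stay close and high
probs_sigmoid = torch.sigmoid(logits)
print(probs_sigmoid)  # roughly [0.881, 0.924]; the sum exceeds 1
```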

  1. I would say that sigmoid is preferred when you expect high probabilities for several classes at once (e.g. medical applications where an X-ray can present several diseases); otherwise (i.e. most cases), if you always expect a single class, softmax is the way to go.
  2. CrossEntropyLoss is the combination of LogSoftmax and NLLLoss, which means you don't need a softmax layer or to apply softmax at the output of your model; the loss function takes the logits and handles it.
  3. During training, if you want to compute an accuracy value you will probably apply softmax to your model's output (logits) and then pick classes with a decoding procedure (argmax, top-k or nucleus sampling). Here is an example.
from torch import argmax, softmax
from torch.nn import CrossEntropyLoss

criterion = CrossEntropyLoss()
y = model(x)  # y are logits, no activation was applied
loss = criterion(y, target)  # target holds the expected class indices
# accuracy is computed by applying softmax then argmax on y (logits),
# then comparing with the target tensor / expected result
acc = (argmax(softmax(y, dim=-1), dim=-1) == target).sum().item() / target.numel()
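As a quick check of point 2, here is a small sketch (logits and targets are random illustrative values) confirming that cross-entropy on raw logits matches NLLLoss applied after LogSoftmax:

```python
import torch
import torch.nn.functional as F

# Illustrative random logits (batch of 4, 5 classes) and class-index targets
torch.manual_seed(0)
logits = torch.randn(4, 5)
target = torch.randint(0, 5, (4,))

# CrossEntropyLoss on raw logits...
ce = F.cross_entropy(logits, target)
# ...equals NLLLoss applied to the log-softmax of the same logits
nll = F.nll_loss(F.log_softmax(logits, dim=-1), target)
print(torch.allclose(ce, nll))  # True
```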

NIT for 3:
torch.argmax will return the same class predictions for logits and probabilities (softmax is monotonic), so F.softmax is not needed in this case.
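A quick sketch of that nit (with random illustrative logits): argmax gives identical predictions before and after softmax:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)  # illustrative batch of 4 samples, 5 classes

pred_logits = torch.argmax(logits, dim=-1)
pred_probs = torch.argmax(F.softmax(logits, dim=-1), dim=-1)

# softmax is strictly increasing, so the largest logit stays the largest
print(torch.equal(pred_logits, pred_probs))  # True
```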