VGG output layer - no softmax?

This may seem like an extremely stupid question, but I was curious about something:

  1. In other implementations of VGG, the last layer is always put through softmax; however, in the torchvision implementation here, the last layer is the following:

nn.Linear(4096, num_classes),

There is no softmax layer after this, yet the VGG documentation states that the last layer, “prob”, is a softmax layer.

I was wondering how this was possible and why this works. I realize this might be a more general math question, but I couldn’t think of anywhere else to post it, so I posted here.



I didn’t double check, but it feels to me that it is trained with softmax. It might just have been removed in the model zoo because the 4096-dimensional feature vector is more interesting in most use cases of a pretrained model. I could still be totally wrong though. :slight_smile:


Thanks for your reply. I figured that it’s trained with softmax and that it’s probably mostly used for transfer learning, so the last layer(s) will be thrown away anyway.

However, what confused me is its total absence. If what you say is true and it’s been left off intentionally with the knowledge that it will mostly be used for transfer learning, then that does, at the very least, tell me that my understanding is correct and that I can stop thinking I’m an idiot :).

Upon digging into the details, it appears that the models are trained with CrossEntropyLoss, which has softmax built in (it combines LogSoftmax and NLLLoss, so it expects raw logits as input).

Imagenet training script:
Criterion defined at:
Loss defined at:
CrossEntropyLoss doc:
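To make the "softmax built in" point concrete, here is a minimal sketch checking that `nn.CrossEntropyLoss` on raw logits gives the same value as log-softmax followed by negative log-likelihood. The tensor shapes and target values are made up for illustration and are not taken from the ImageNet training script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 4 samples, 10 classes (raw, unnormalized scores).
logits = torch.randn(4, 10)
# Hypothetical ground-truth class indices for the 4 samples.
targets = torch.tensor([1, 0, 7, 3])

# CrossEntropyLoss is fed the raw logits directly.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

ce_matches_manual = torch.allclose(loss_ce, loss_manual)
print(ce_matches_manual)
```

This is why adding an extra softmax layer to the model before this loss would normalize the outputs twice during training.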



Again, thank you very much for your response. I feel there’s still a gap in my understanding, however. I understand that they are trained with CrossEntropyLoss which includes Softmax, but I’m trying to understand the output from the network when a test image is passed through it. In that particular case, wouldn’t the output be different from VGG output, which has the softmax layer as part of the architecture? Yes, we will throw this away in the transfer learning use case, but as a hypothetical, say I wanted to just use VGG as a classifier; in that case, I wouldn’t get pseudo-probabilities that sum to 1, correct? If that is the case, is the main reason the softmax layer was left off merely because no one would be using it?

And once more, thank you for the help in understanding this.



In case you want to use the VGG network to classify new samples, you can just call argmax on the logits to get the most likely class. The softmax won’t change the classification.
However, if you need the probabilities, you can always call softmax on the net’s output.
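A small sketch of both options, using hand-written logits as a stand-in for the network's output (in practice the tensor would come from something like `model(images)`, which is assumed here rather than shown):

```python
import torch

# Stand-in for the model output: 2 samples, 3 classes (raw logits).
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1, 3.2, -0.7]])

# Option 1: classify directly from the logits.
pred_from_logits = logits.argmax(dim=1)

# Option 2: convert to probabilities first, then classify.
probs = torch.softmax(logits, dim=1)
pred_from_probs = probs.argmax(dim=1)

# Softmax is monotonic within each row, so the predicted class is unchanged.
same_prediction = torch.equal(pred_from_logits, pred_from_probs)
print(pred_from_logits.tolist(), same_prediction)
```

Only the second option gives values that sum to 1 per sample; the predicted class is the same either way.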


Hi @ptrblck,

Thank you very much! That clears up everything for me.

The reason this is done is that you only need an explicit softmax layer at inference time. During training, you don’t need to apply softmax in the model: you pass the raw logits to the loss, and CrossEntropyLoss handles the normalization internally. This way the model avoids a redundant softmax computation.


I tried to apply softmax to the output, but I only get correct probabilities for a single record, not for the whole test data.
This returns the probabilities correctly for a single record: torch.softmax(output[0], dim=0).
But this one is not working: torch.softmax(output, dim=1). It returns some other values instead of probabilities.

If your output is returned as [batch_size, nb_classes] (which would be the default for a classification use case), then softmax(output, dim=1) is the right approach, since the sum along dim1 will be 1.
Each row (which corresponds to a sample in the batch) will contain the probabilities for each class.
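As a quick check, here is a sketch with a made-up `output` tensor of shape [batch_size=2, nb_classes=3]. It compares the batched call against the per-sample call, and also shows what `dim=0` would do on the batch:

```python
import torch

# Made-up logits: 2 samples (rows), 3 classes (columns).
output = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])

# Normalize over the class dimension: one probability vector per row.
probs = torch.softmax(output, dim=1)
row_sums = probs.sum(dim=1)

# Per-sample call on a 1D slice; should match the first row above.
first_row = torch.softmax(output[0], dim=0)
rows_match = torch.allclose(probs[0], first_row)

# dim=0 on the 2D batch normalizes across samples instead (columns sum to 1),
# which is usually not what you want for classification.
across_batch = torch.softmax(output, dim=0)
col_sums = across_batch.sum(dim=0)

print(row_sums, rows_match, col_sums)
```

If the batched result looks wrong, it is worth printing `output.shape` to confirm the class dimension really is dim 1.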


Yeah understood, Thank u so much👍 @ptrblck