Is it OK to use sigmoid in the output layer together with PyTorch's cross-entropy loss?

Let’s say I have a multi-layer neural network doing multi-class image segmentation, where each pixel belongs to one of N classes.

Now, let’s say the last layer has Sigmoid activation units, and the loss used is PyTorch’s cross entropy, which already includes softmax inside it. My question is that I am getting better performance when I add the Sigmoid to the last layer compared to removing it.
I am not sure if it is OK to use Sigmoid, or whether it even makes sense in this situation. Any idea why I may get better results with the Sigmoid?

Hi Mary!

Let me answer this carefully:

I can’t rule out the possibility that you have an unusual, perverse use
case for which adding Sigmoid as the final layer improves performance.
If your model trains well and you are also getting good results on your
validation and / or test datasets, then use whatever works best for your
use case.

Having said that, including Sigmoid as your final layer (when using
pytorch’s CrossEntropyLoss as your loss criterion) is almost certainly
a mistake. Doing so won’t fully “break” your model, but it limits how
well your model can perform.

As you note, CrossEntropyLoss includes softmax() (actually
log_softmax()) internally. But when you add the Sigmoid your model
now only outputs values between zero and one and, when converted
to probabilities by CrossEntropyLoss’s internal softmax(), the
probabilities seen internally by CrossEntropyLoss will no longer be
able to span their full range from zero to one. So, in particular, your
model can no longer predict your correct class with 100% probability.

Consider this two-class illustration:

>>> import torch
>>> t = torch.arange (1., 11).unsqueeze (-1).repeat (1, 2)
>>> t[:, 0] *= -1
>>> t
tensor([[ -1.,   1.],
        [ -2.,   2.],
        [ -3.,   3.],
        [ -4.,   4.],
        [ -5.,   5.],
        [ -6.,   6.],
        [ -7.,   7.],
        [ -8.,   8.],
        [ -9.,   9.],
        [-10.,  10.]])
>>> t.softmax (1)
tensor([[1.1920e-01, 8.8080e-01],
        [1.7986e-02, 9.8201e-01],
        [2.4726e-03, 9.9753e-01],
        [3.3535e-04, 9.9966e-01],
        [4.5398e-05, 9.9995e-01],
        [6.1442e-06, 9.9999e-01],
        [8.3153e-07, 1.0000e+00],
        [1.1254e-07, 1.0000e+00],
        [1.5230e-08, 1.0000e+00],
        [2.0612e-09, 1.0000e+00]])
>>> t.sigmoid().softmax (1)
tensor([[0.3865, 0.6135],
        [0.3183, 0.6817],
        [0.2880, 0.7120],
        [0.2761, 0.7239],
        [0.2716, 0.7284],
        [0.2699, 0.7301],
        [0.2693, 0.7307],
        [0.2691, 0.7309],
        [0.2690, 0.7310],
        [0.2690, 0.7310]])
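
Incidentally, the ~0.73 cap in the last rows can be computed in closed form: after Sigmoid every logit lies in [0, 1], so the best softmax() can do over N classes is assign logit 1 to the favored class and 0 to the rest. A quick sketch (the helper name is mine, not from the thread):

```python
import math

import torch

def max_prob_after_sigmoid(n_classes):
    # After Sigmoid every logit lies in [0, 1], so the best case for
    # softmax is logit 1 for the favored class and 0 for the rest:
    # e^1 / (e^1 + (N - 1) * e^0) = e / (e + N - 1).
    return math.e / (math.e + n_classes - 1)

print(max_prob_after_sigmoid(2))   # ~0.7311, the cap seen above

# Cross-check against an actual softmax of the extreme sigmoid outputs.
t = torch.tensor([0.0, 1.0])
print(t.softmax(0)[1].item())      # same ~0.7311
```

Note that the cap gets worse as N grows, so for segmentation with many classes the Sigmoid squashes the achievable probabilities even further.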

When you pass the output of your model through Sigmoid, you
convert the output of your model (which should be understood as
unnormalized log-probabilities) into things that look like probabilities.
As you can see, it is easy to predict some class as having 100%
probability if you don’t have the Sigmoid. But when you insert the
Sigmoid, the best you can do (in the two-class case) is to predict
your favored class as having about 73% probability. With the Sigmoid
your model can never learn to predict the correct class with near-100%
probability.

So it is very unlikely that your added Sigmoid improves your model’s
best performance.
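
For reference, here is a minimal sketch of the standard setup (the toy model and shapes are illustrative, not your actual network): the final layer emits raw logits with no activation, and CrossEntropyLoss consumes them directly, including per-pixel segmentation targets of shape (N, H, W):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_classes = 5
# Toy "segmentation" model: the last layer is a plain Conv2d -- no Sigmoid.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, n_classes, 1),  # raw per-pixel logits
)

x = torch.randn(2, 3, 8, 8)                      # batch of images
target = torch.randint(0, n_classes, (2, 8, 8))  # per-pixel class ids

logits = model(x)                                # shape (N, C, H, W)
loss = nn.CrossEntropyLoss()(logits, target)
print(loss.item())
```

CrossEntropyLoss accepts the (N, C, H, W) logits and (N, H, W) integer targets directly, so no activation (and no one-hot encoding) is needed at the output.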

Check your model and training code for bugs. Check your results on
a validation dataset as well as on the dataset you train with. Also try
training significantly longer because if you don’t train long enough, you
won’t be able to see the Sigmoid-version’s performance top out.

Your goal, of course, is not to overfit, but to verify the point I’m making:
try overfitting by training a whole lot on a smallish subset of your training
data. You should see that with the Sigmoid you can’t overfit your model
to the point that it predicts your ground-truth target class with 100%
probability.
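
To make the experiment concrete, here is a small self-contained sketch (a toy linear classifier rather than a segmentation net): with the extra Sigmoid the training loss can never drop below -log(e / (e + 1)) ≈ 0.313, while without it the loss can be driven toward zero:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny fixed "dataset" to overfit: 4 samples, 2 classes.
x = torch.randn(4, 4)
y = torch.tensor([0, 1, 0, 1])

def overfit(use_sigmoid):
    layers = [nn.Linear(4, 2)]
    if use_sigmoid:
        layers.append(nn.Sigmoid())  # the problematic extra layer
    model = nn.Sequential(*layers)
    opt = torch.optim.SGD(model.parameters(), lr=0.5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(2000):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

print("without Sigmoid:", overfit(False))  # heads toward zero
print("with Sigmoid:", overfit(True))      # stuck above ~0.313
```

The with-Sigmoid loss plateaus at the bound no matter how long you train, which is the "top out" behavior described above.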


K. Frank


I will remove it, since what you said makes sense. I am working on an unsupervised task, so I don’t have train/validation splits.