Hi Mary!

Let me answer this carefully:

I can’t rule out the possibility that you have an unusual, perverse use case for which adding `Sigmoid` as the final layer improves performance. If your model trains well and you are also getting good results on your validation and / or test datasets, then use whatever works best for your use case.

Having said that, including `Sigmoid` as your final layer (when using pytorch’s `CrossEntropyLoss` as your loss criterion) is almost certainly a mistake. Doing so won’t fully “break” your model, but it limits how well your model can perform.
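
For concreteness, here is a minimal sketch of the recommended pattern: the final layer is a plain `Linear` (no `Sigmoid`), and its raw-logit output is fed directly to `CrossEntropyLoss`. The sizes and architecture here are made up for illustration, not your actual model:

```
import torch

# hypothetical sizes, purely for illustration
n_features, n_classes, batch_size = 8, 2, 4

model = torch.nn.Sequential(
    torch.nn.Linear(n_features, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, n_classes),     # final layer -- raw logits, no Sigmoid
)

loss_fn = torch.nn.CrossEntropyLoss()   # applies log_softmax() internally

x = torch.randn(batch_size, n_features)
target = torch.randint(n_classes, (batch_size,))

loss = loss_fn(model(x), target)        # logits go in unmodified
```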

As you note, `CrossEntropyLoss` includes `softmax()` (actually `log_softmax()`) internally. But when you add the `Sigmoid` your model now only outputs values between zero and one and, when converted to probabilities by `CrossEntropyLoss`’s internal `softmax()`, the probabilities seen internally by `CrossEntropyLoss` will no longer be able to span their full range from zero to one. So, in particular, your model can no longer predict your correct class with 100% probability.
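
You can verify the “includes `log_softmax()`” point directly: `CrossEntropyLoss` applied to raw logits matches `NLLLoss` applied to `log_softmax()` of those same logits. A quick sanity check (the tensors here are arbitrary stand-ins):

```
>>> import torch
>>> logits = torch.randn(4, 2)        # arbitrary stand-in for model output
>>> target = torch.randint(2, (4,))   # arbitrary class labels
>>> log_probs = torch.nn.functional.log_softmax(logits, dim=1)
>>> torch.allclose(torch.nn.functional.cross_entropy(logits, target),
...                torch.nn.functional.nll_loss(log_probs, target))
True
```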

Consider this two-class illustration:

```
>>> import torch
>>> torch.__version__
'2.1.1'
>>> t = torch.arange (1., 11).unsqueeze (-1).repeat (1, 2)
>>> t[:, 0] *= -1
>>> t
tensor([[ -1.,   1.],
        [ -2.,   2.],
        [ -3.,   3.],
        [ -4.,   4.],
        [ -5.,   5.],
        [ -6.,   6.],
        [ -7.,   7.],
        [ -8.,   8.],
        [ -9.,   9.],
        [-10.,  10.]])
>>> t.softmax (1)
tensor([[1.1920e-01, 8.8080e-01],
        [1.7986e-02, 9.8201e-01],
        [2.4726e-03, 9.9753e-01],
        [3.3535e-04, 9.9966e-01],
        [4.5398e-05, 9.9995e-01],
        [6.1442e-06, 9.9999e-01],
        [8.3153e-07, 1.0000e+00],
        [1.1254e-07, 1.0000e+00],
        [1.5230e-08, 1.0000e+00],
        [2.0612e-09, 1.0000e+00]])
>>> t.sigmoid().softmax (1)
tensor([[0.3865, 0.6135],
        [0.3183, 0.6817],
        [0.2880, 0.7120],
        [0.2761, 0.7239],
        [0.2716, 0.7284],
        [0.2699, 0.7301],
        [0.2693, 0.7307],
        [0.2691, 0.7309],
        [0.2690, 0.7310],
        [0.2690, 0.7310]])
```

When you pass the output of your model through `Sigmoid`, you convert the output of your model (which should be understood as unnormalized log-probabilities) to things that look like probabilities. As you can see, it is easy to predict some class as having 100% probability if you don’t have the `Sigmoid`. But when you insert the `Sigmoid`, the best you can do (in the two-class case) is to predict your favored class as having about 73% probability. With the `Sigmoid` your model can never learn to predict the correct class with near-100% certainty.
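
That 73% is not specific to this example. `Sigmoid` squashes every logit into (0, 1), so, in the two-class case, the gap between the two values seen by `softmax()` is always less than one, and the favored class’s probability is bounded by the softmax of the extreme pair (0, 1):

```
>>> import torch
>>> torch.tensor([0., 1.]).softmax(0)
tensor([0.2689, 0.7311])
```

Equivalently, the per-sample cross-entropy loss can never fall below about 0.31 (that is, -log (0.7311)), no matter how long you train.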

So it is very unlikely that your added `Sigmoid` improves your model’s best performance.

Check your model and training code for bugs. Check your results on a validation dataset as well as on the dataset you train with. Also try training significantly longer, because if you don’t train long enough, you won’t be able to see the `Sigmoid`-version’s performance top out. Your goal, of course, is not to overfit, but, to verify the point I’m making, try overfitting by training a whole lot on a smallish subset of your training data. You should see that with the `Sigmoid` you *can’t* overfit your model to the point that it predicts your ground-truth target class with 100% probability.
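
If it helps, here is a minimal sketch of that overfitting experiment. The architecture, sizes, learning rate, and number of steps are placeholders, not your actual setup. With `use_sigmoid = False` the loss on the memorized samples should fall toward zero, while with `use_sigmoid = True` it should plateau near 0.31 (the two-class floor of -log (0.7311) noted above):

```
import torch

torch.manual_seed(0)

# a smallish batch of made-up data to deliberately overfit
x = torch.randn(32, 8)
target = torch.randint(2, (32,))

for use_sigmoid in (False, True):
    layers = [torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)]
    if use_sigmoid:
        layers.append(torch.nn.Sigmoid())   # the problematic final layer
    model = torch.nn.Sequential(*layers)

    loss_fn = torch.nn.CrossEntropyLoss()
    opt = torch.optim.Adam(model.parameters(), lr=0.01)

    for _ in range(2000):                   # train "a whole lot"
        opt.zero_grad()
        loss = loss_fn(model(x), target)
        loss.backward()
        opt.step()

    print(f'use_sigmoid = {use_sigmoid}: final loss = {loss.item():.4f}')
```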

Best.

K. Frank