Hi Mary!
Let me answer this carefully:
I can’t rule out the possibility that you have an unusual, perverse use
case for which adding Sigmoid as the final layer improves performance.
If your model trains well and you are also getting good results on your
validation and / or test datasets, then use whatever works best for your
use case.
Having said that, including Sigmoid as your final layer (when using
pytorch’s CrossEntropyLoss as your loss criterion) is almost certainly
a mistake. Doing so won’t fully “break” your model, but it limits how
well your model can perform.
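As a concrete sketch of the pattern in question (with made-up layer
sizes, not your actual model):

import torch

# hypothetical sizes, just for illustration
in_features, num_classes = 16, 2

# the pattern in question: a Sigmoid as the final layer ...
model_with_sigmoid = torch.nn.Sequential (
    torch.nn.Linear (in_features, num_classes),
    torch.nn.Sigmoid(),
)

# ... versus the usual pattern: raw logits straight into CrossEntropyLoss
model_plain = torch.nn.Linear (in_features, num_classes)

criterion = torch.nn.CrossEntropyLoss()
x = torch.randn (8, in_features)
target = torch.randint (num_classes, (8,))

loss_with_sigmoid = criterion (model_with_sigmoid (x), target)   # limited, as explained below
loss_plain = criterion (model_plain (x), target)                 # the standard usage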
As you note, CrossEntropyLoss includes softmax() (actually
log_softmax()) internally. But when you add the Sigmoid, your model
now only outputs values between zero and one, and when those values
are converted to probabilities by CrossEntropyLoss’s internal
softmax(), the probabilities seen internally by CrossEntropyLoss can
no longer span their full range from zero to one. So, in particular,
your model can no longer predict your correct class with 100%
probability.
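As a quick aside, before the illustration: the claim that
CrossEntropyLoss applies log_softmax() internally can be checked
directly, because CrossEntropyLoss is just log_softmax() followed by
NLLLoss (this little check is mine, not from your code):

import torch

torch.manual_seed (0)
logits = torch.randn (4, 3)           # four samples, three classes
target = torch.randint (3, (4,))

loss_ce = torch.nn.functional.cross_entropy (logits, target)
loss_manual = torch.nn.functional.nll_loss (logits.log_softmax (1), target)

print (torch.allclose (loss_ce, loss_manual))   # True, the same computation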
Consider this two-class illustration:
>>> import torch
>>> torch.__version__
'2.1.1'
>>> t = torch.arange (1., 11).unsqueeze (-1).repeat (1, 2)
>>> t[:, 0] *= -1
>>> t
tensor([[ -1.,   1.],
        [ -2.,   2.],
        [ -3.,   3.],
        [ -4.,   4.],
        [ -5.,   5.],
        [ -6.,   6.],
        [ -7.,   7.],
        [ -8.,   8.],
        [ -9.,   9.],
        [-10.,  10.]])
>>> t.softmax (1)
tensor([[1.1920e-01, 8.8080e-01],
        [1.7986e-02, 9.8201e-01],
        [2.4726e-03, 9.9753e-01],
        [3.3535e-04, 9.9966e-01],
        [4.5398e-05, 9.9995e-01],
        [6.1442e-06, 9.9999e-01],
        [8.3153e-07, 1.0000e+00],
        [1.1254e-07, 1.0000e+00],
        [1.5230e-08, 1.0000e+00],
        [2.0612e-09, 1.0000e+00]])
>>> t.sigmoid().softmax (1)
tensor([[0.3865, 0.6135],
        [0.3183, 0.6817],
        [0.2880, 0.7120],
        [0.2761, 0.7239],
        [0.2716, 0.7284],
        [0.2699, 0.7301],
        [0.2693, 0.7307],
        [0.2691, 0.7309],
        [0.2690, 0.7310],
        [0.2690, 0.7310]])
When you pass the output of your model through Sigmoid, you convert
the output of your model (which should be understood as unnormalized
log-probabilities) into things that merely look like probabilities.
As you can see, it is easy to predict some class as having 100%
probability if you don’t have the Sigmoid. But when you insert the
Sigmoid, the best you can do (in the two-class case) is to predict
your favored class as having about 73% probability. With the Sigmoid
your model can never learn to predict the correct class with
near-100% certainty.
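The same limit shows up in the loss itself. Here is a small follow-up
sketch (mine, not from your code): in the two-class case the
Sigmoid-version’s CrossEntropyLoss can never fall below about
-log (0.7311), roughly 0.3133, no matter how confident the logits are:

import torch

# an extremely confident two-class prediction of class 1
logits = torch.tensor ([[-100., 100.]])
target = torch.tensor ([1])

# raw logits: the loss can be driven essentially to zero
loss_plain = torch.nn.functional.cross_entropy (logits, target)

# sigmoid-squashed outputs: sigmoid() maps the logits to roughly [0., 1.],
# so the internal softmax() sees a spread of at most one and the loss
# bottoms out around -log (e / (1 + e)), about 0.3133
loss_squashed = torch.nn.functional.cross_entropy (logits.sigmoid(), target)

print (loss_plain.item(), loss_squashed.item())   # approximately 0.0 and 0.3133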
So it is very unlikely that your added Sigmoid improves your model’s
best performance.
Check your model and training code for bugs. Check your results on
a validation dataset as well as on the dataset you train with. Also
try training significantly longer, because if you don’t train long
enough, you won’t be able to see the Sigmoid-version’s performance
top out.
Your goal, of course, is not to overfit, but, to verify the point I’m
making, try overfitting by training a whole lot on a smallish subset
of your training data (a sketch of such a check follows below). You
should see that with the Sigmoid you can’t overfit your model to the
point that it predicts your ground-truth target class with 100%
probability.
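Here is one way such an overfitting check could look. The model, data,
and sizes here are made-up placeholders, not your actual code; the
data is constructed to be linearly separable so that, without the
Sigmoid, the target-class probability really can approach 100%:

import torch

torch.manual_seed (0)
in_features, num_classes, n_small = 16, 2, 32

# stand-in for your model, with the Sigmoid in question as its final
# layer; delete the Sigmoid to see the difference
model = torch.nn.Sequential (
    torch.nn.Linear (in_features, num_classes),
    torch.nn.Sigmoid(),
)

# stand-in for a smallish subset of your training data (separable by
# construction: the class is the sign of the first feature)
small_inputs = torch.randn (n_small, in_features)
small_targets = (small_inputs[:, 0] > 0).long()

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD (model.parameters(), lr = 0.1)

# train "a whole lot" on the small subset
for step in range (5000):
    optimizer.zero_grad()
    loss = criterion (model (small_inputs), small_targets)
    loss.backward()
    optimizer.step()

# with the Sigmoid, the predicted probability of the target class
# plateaus near 73%; without it, it can approach 100%
with torch.no_grad():
    probs = model (small_inputs).softmax (dim = 1)
    print (probs[torch.arange (n_small), small_targets].max())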
Best.
K. Frank