Hi Amit!
The short answer is that softmax() and sigmoid() are used for different things. It's not that one is true and the other false, or that one is more stable and the other less stable – they're just different.
Let me give you my perspective on this. (I haven’t looked at the links you posted.)
For some context, my number-one rule is that the output of a neural network means
whatever you train it to mean.
Let's say that you have a problem with three classes and the final layer of your network is a Linear with out_features = 3 (not followed by any additional "activation" layer). The output of your network will be three numbers that run from -inf to inf.
What does the output of such a network mean? Whatever you train it to mean.
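As a concrete sketch, such a network might look like the following (the input and hidden dimensions here are made-up placeholders):

```python
import torch

# a toy network whose final layer is a Linear with out_features = 3
# (the dimensions 8 and 16 are made-up placeholders)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 3),   # no "activation" layer after this
)

x = torch.randn(5, 8)         # a batch of five inputs
raw_output = model(x)         # shape [5, 3], values anywhere in (-inf, inf)
```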
If you pass that output through softmax(), you will get three probabilities (therefore each between 0.0 and 1.0) that sum to 1.0. This would be used for a multi-class problem where your input is in exactly one class and the result of softmax() is the set of probabilities for the input being in each of the three classes. (Because the input is in exactly one of the three classes, these probabilities sum to one.)
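For example, with some arbitrary made-up logit values:

```python
import torch

logits = torch.tensor([1.5, -0.3, 0.8])   # arbitrary raw network output
probs = torch.softmax(logits, dim=0)

print(probs)         # each value is between 0.0 and 1.0
print(probs.sum())   # tensor(1.) -- the probabilities sum to one
```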
On the other hand, if you pass your output through sigmoid(), you will get three independent probabilities, that is, each probability is between zero and one, but they don't sum to anything in particular. (They might sum to anything from zero to three.) This would be used for a so-called multi-label problem where the input can independently be in or not be in any of the three classes. That is, the input might be in none of the classes, in all three classes, or in just one or two of the classes.
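Passing the same made-up values through sigmoid() instead:

```python
import torch

logits = torch.tensor([1.5, -0.3, 0.8])   # the same made-up raw output
probs = torch.sigmoid(logits)

print(probs)         # three independent values, each between 0.0 and 1.0
print(probs.sum())   # unconstrained -- anywhere from 0.0 to 3.0 for three classes
```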
In the multi-class case you would typically use CrossEntropyLoss (which has log_softmax() built into it) and the output of your network (which is the input to CrossEntropyLoss) would typically be interpreted as unnormalized log-probabilities (that are converted into probabilities by softmax()), because you've trained your network to predict unnormalized log-probabilities.
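A minimal sketch of the multi-class setup with integer class labels (the batch size and random values are made up):

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()

logits = torch.randn(5, 3)         # raw network output for a batch of five
targets = torch.randint(3, (5,))   # integer class labels, each 0, 1, or 2

loss = loss_fn(logits, targets)    # log_softmax() is applied internally
```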
In the multi-label case, you would typically use BCEWithLogitsLoss (which has logsigmoid() built into it) and the output of your network would be interpreted as logits (that are converted into probabilities by sigmoid()), because you've trained your network to predict logits.
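And a corresponding sketch of the multi-label setup (again with a made-up batch size and random values):

```python
import torch

loss_fn = torch.nn.BCEWithLogitsLoss()

logits = torch.randn(5, 3)                   # raw network output for a batch of five
targets = torch.randint(2, (5, 3)).float()   # multi-hot labels, each 0.0 or 1.0

loss = loss_fn(logits, targets)              # sigmoid() is applied internally
```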
Last, if you use integer categorical class labels as your ground-truth target when training with CrossEntropyLoss, you will be training your network to predict a single class – that is, you will be training your network to predict 100% probability for the single correct class and 0% for the others. If your network trains well, it will typically predict a value that (after being converted by softmax()) is very close to 1.0 for one class and values that (after being converted) are very close to 0.0 for the other classes.
But if you use floating-point probabilistic "soft labels" as your ground-truth target for CrossEntropyLoss, you will train your network to predict a probability distribution for your classes. For example, you might predict 25% for classes A and B and 50% for class C. This would be an exactly-correct prediction for an input whose ground-truth target was [0.25, 0.25, 0.50].
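Here's a sketch of that soft-label case (note that probabilistic floating-point targets for CrossEntropyLoss require PyTorch 1.10 or later):

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()

target = torch.tensor([[0.25, 0.25, 0.50]])   # probabilistic "soft label"

# an exactly-correct prediction: logits whose softmax() reproduces the target
logits = target.log()
print(torch.softmax(logits, dim=1))   # tensor([[0.2500, 0.2500, 0.5000]])

# note that the loss is not zero even for an exactly-correct prediction --
# with soft labels its minimum is the entropy of the target (about 1.0397)
print(loss_fn(logits, target))
```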
Note, these comments should also answer the question in your other post.
Best.
K. Frank