Classic cross-entropy loss measures the mismatch between
two (discrete) probability distributions. So, for the binary case,
you compare (Q(“no” state), Q(“yes” state)) with (P(“no” state),
P(“yes” state)), where P(“no” state) is the actual (“ground
truth”) probability that your sample is in the “no” state, while
Q(“no” state) is your model’s prediction of this probability.
(As probabilities, they are all between 0 and 1, and P(“no”) +
P(“yes”) = 1, and similarly for the Qs.)
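In formula form, this cross-entropy is -(P(“no”) log(Q(“no”)) + P(“yes”) log(Q(“yes”))). A tiny sketch with made-up probabilities:

```python
import torch

# made-up example probabilities, just for illustration
q = torch.tensor([0.25, 0.75])   # model's prediction: (Q("no"), Q("yes"))
p = torch.tensor([0.00, 1.00])   # ground truth: (P("no"), P("yes")), i.e. the sample is "yes"

# cross-entropy = -sum over states of P(state) * log (Q(state))
print(-(p * q.log()).sum())      # tensor(0.2877), i.e. -log (0.75)
```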
PyTorch’s CrossEntropyLoss has a built-in Softmax that converts
your model’s predicted “strengths” (relative log-odds-ratios)
into probabilities that sum to one. It also, in effect, one-hots
your integer class labels so that (in the binary case) label = 1
turns into P(“no”) = 0 and P(“yes”) = 1. It then calculates the cross-entropy of these
two probability distributions.
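A minimal sketch of this (the “strengths” below are made-up numbers):

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()

# made-up "strengths" for the "no" and "yes" states, for a batch of one sample
strengths = torch.tensor([[0.5, 1.6]])
label = torch.tensor([1])                      # integer class label: 1 means "yes"

# the built-in Softmax turns the strengths into (Q("no"), Q("yes")), and
# label = 1 is treated as (P("no"), P("yes")) = (0, 1)
print(loss_fn(strengths, label))
print(-strengths.softmax(dim=1)[0, 1].log())   # the same value, computed by hand
```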
BCELoss calculates this same cross-entropy, but it knows that
it’s the binary case, so you only give it one of the two
probabilities, Q(“yes”), and you can understand the 0 and 1
labels as simply being the values of P(“yes”).
This is illustrated by running code along the following lines:
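(The probabilities below are arbitrary illustrative values; they just show BCELoss agreeing with the cross-entropy computed from both states explicitly.)

```python
import torch

q_yes = torch.tensor([0.75, 0.10, 0.60])   # model's predictions, Q("yes"), for three samples
p_yes = torch.tensor([1.00, 0.00, 1.00])   # labels, understood as P("yes")

# BCELoss only needs Q("yes") and P("yes")
print(torch.nn.BCELoss()(q_yes, p_yes))

# the same cross-entropy, computed from both states explicitly
q = torch.stack([1 - q_yes, q_yes], dim=1)   # (Q("no"), Q("yes")) per sample
p = torch.stack([1 - p_yes, p_yes], dim=1)   # (P("no"), P("yes")) per sample
print(-(p * q.log()).sum(dim=1).mean())      # the same value
```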
Just as CrossEntropyLoss has a built-in Softmax (to convert
“strengths” to probabilities), BCEWithLogitsLoss has a built-in
logistic function (Sigmoid) to convert the “strength” of the “yes”
state into the probability Q(“yes”). More precisely, the “strength”
is the log-odds-ratio of the “yes” state, also called the “logit”.
That is, BCEWithLogitsLoss expects logit(Q(“yes”)) as its input,
and the built-in Sigmoid converts it back to Q(“yes”).
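A sketch of this (again with made-up numbers): passing logit(Q(“yes”)) to BCEWithLogitsLoss gives the same result as applying Sigmoid yourself and then using BCELoss.

```python
import torch

logits = torch.tensor([1.2, -0.7, 2.5])   # made-up values of logit (Q("yes"))
p_yes = torch.tensor([1.0, 0.0, 1.0])     # labels, i.e. P("yes")

# BCEWithLogitsLoss applies Sigmoid internally to recover Q("yes")
print(torch.nn.BCEWithLogitsLoss()(logits, p_yes))

# equivalent (but less numerically stable): apply Sigmoid yourself, then BCELoss
print(torch.nn.BCELoss()(logits.sigmoid(), p_yes))   # the same value (up to round-off)
```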