Why does BCEWithLogitsLoss compute a different value from CrossEntropyLoss?


I am trying out nn.BCEWithLogitsLoss now, and this is my code:

    import torch
    import torch.nn as nn

    logits = torch.randn(1, 2, 4, 4)
    label = torch.randint(0, 2, (1, 4, 4))

    criteria_ce = nn.CrossEntropyLoss()
    loss_ce = criteria_ce(logits, label)

    criteria_bce = nn.BCEWithLogitsLoss()
    # one-hot encode the labels along the class dimension
    lb_one_hot = torch.zeros_like(logits).scatter_(1, label.unsqueeze(1), 1)
    loss_bce = criteria_bce(logits, lb_one_hot)

    print(loss_ce, loss_bce)

In theory, these two losses should have the same value, since they are both binary classification losses. Why are the actual loss values different, with one being 2.01 and the other 0.77?

Hi coincheung!

First the what:

BCEWithLogitsLoss expects a single (real) number per sample
that indicates the “strength” of that sample being in the “1”
state (the “yes” state, if you will).

To recover the loss you get with CrossEntropyLoss you need to
pass in the difference of your state-1 and state-0 strengths.

This code performs the calculation I think you want:
(For simplicity, I’ve removed two of your dimensions; the labels
are now a vector of five samples, with labels.shape = [5].)

    import torch
    import torch.nn as nn

    preds = torch.randn(5, 2)
    labels = torch.randint(0, 2, (5,))

    # the "strength" of the "yes" state relative to the "no" state
    logits = preds[:, 1] - preds[:, 0]
    bcelogitsloss = nn.BCEWithLogitsLoss()(logits, labels.float())

    celoss = nn.CrossEntropyLoss()(preds, labels)

    print(bcelogitsloss, celoss)

Now the why:

Classic cross-entropy loss measures the mismatch between
two (discrete) probability distributions. So, for the binary case,
you compare (Q(“no” state), Q(“yes” state)) with (P(“no” state),
P(“yes” state)), where P(“no” state) is the actual (“ground
truth”) probability that your sample is in the “no” state, while
Q(“no” state) is your model’s prediction of this probability.

(As probabilities, they are all between 0 and 1, and P(“no”) +
P(“yes”) = 1, and similarly for the Qs.)

Pytorch’s CrossEntropyLoss has a built-in Softmax that converts
your model’s predicted “strengths” (relative log-odds-ratios)
into probabilities that sum to one. It also one-hots your labels
so that (in the binary case) label = 1 turns into P(“no”) = 0,
and P(“yes”) = 1. It then calculates the cross-entropy of these
two probability distributions.
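To make the “built-in Softmax plus one-hot” description concrete, here is a minimal sketch (using preds and labels of the same shapes as above) that reproduces CrossEntropyLoss by hand — Softmax the “strengths,” then take the negative log-probability of the true label, averaged over the samples:

```python
import torch
import torch.nn as nn

preds = torch.randn(5, 2)
labels = torch.randint(0, 2, (5,))

# built-in: Softmax + one-hot + cross-entropy, all in one
celoss = nn.CrossEntropyLoss()(preds, labels)

# by hand: convert "strengths" to probabilities, then take the
# negative log-probability of the true label, averaged over samples
probs = nn.Softmax(dim=1)(preds)
manual = -torch.log(probs[torch.arange(5), labels]).mean()

print(celoss, manual)
```

The two printed values agree up to floating-point round-off.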

BCELoss calculates this same cross-entropy, but it knows that
it’s the binary case, so you only give it one of the two
probabilities, Q(“yes”), and you can understand the 0 and 1
labels as simply being the values of P(“yes”).

This is illustrated by further running the following code:

    softmaxs = nn.Softmax(dim=1)(preds)
    bcesoftmaxloss = nn.BCELoss()(softmaxs[:, 1], labels.float())
    print(bcesoftmaxloss)

Just as CrossEntropyLoss has a built-in Softmax (to convert
“strengths” to probabilities), BCEWithLogitsLoss has a built-in
logistic function (Sigmoid) to convert the “strength” of the “yes”
state into the probability Q(“yes”). More precisely, the “strength”
is the log-odds-ratio of the “yes” state, also called the “logit”.
That is, BCEWithLogitsLoss expects logit(Q(“yes”)) as its input,
and the built-in Sigmoid converts it back to Q(“yes”).
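This equivalence can be checked directly: applying Sigmoid to the difference of the two “strengths” gives the same Q(“yes”) as taking the Softmax of both strengths and keeping the “yes” column. A minimal sketch:

```python
import torch
import torch.nn as nn

preds = torch.randn(5, 2)

# log-odds-ratio ("logit") of the "yes" state
logits = preds[:, 1] - preds[:, 0]

# Sigmoid of the logit difference ...
q_yes_sigmoid = torch.sigmoid(logits)

# ... equals the Softmax probability of the "yes" state
q_yes_softmax = nn.Softmax(dim=1)(preds)[:, 1]

print(torch.allclose(q_yes_sigmoid, q_yes_softmax, atol=1e-6))
```

This is just the algebraic identity sigmoid(z1 - z0) = exp(z1) / (exp(z0) + exp(z1)), i.e., the two-class Softmax.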

Best regards.

K. Frank