BCE Loss vs Cross Entropy

kliu · September 25, 2020, 3:33am

Hi all,

I am wondering what loss to use for a specific application.

I am trying to predict a binary image. For example, given some inputs, a simple two-layer neural net with ReLU activations after each layer outputs a 2x2 matrix such as [[0.01, 0.9], [0.1, 0.2]]. This prediction is compared to a ground-truth 2x2 image like [[0, 1], [1, 1]], and the network's task is to get as close to it as possible.

I am currently using BCE loss with logits over the prediction and the ground truth, but I was wondering if I should be using the plain BCE loss instead (no logits). If neither of these is suitable, should I be using CrossEntropyLoss? I am not quite sure what the difference between these is in this context.

Any help is appreciated - thanks in advance

KFrank · September 25, 2020, 2:05pm

Hi kliu!

If I understand your use case, you should start with
BCEWithLogitsLoss as your loss function (and only change
to something else if you have a good reason and testing shows
that the change is for the better).

Just to be clear, for BCEWithLogitsLoss, the last layer of your
network should be a Linear, and it should not be followed by
an activation function (e.g., neither ReLU nor Sigmoid).

Note, the values in your example prediction (all between 0.0 and
1.0) are not values you would typically get from a final Linear
layer. They could occur, but in general, you would expect values
running from -inf to inf (that would be interpreted as raw-score logits).

Yes, you should be using BCEWithLogitsLoss.

Sigmoid followed by BCELoss is mathematically equivalent to
BCEWithLogitsLoss, but numerically less stable.
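
As a rough illustration of the setup described above, here is a minimal sketch. The layer sizes, the random input, and the variable names are assumptions made up for the example; only the 2x2 ground truth comes from the original question. It also checks that Sigmoid followed by BCELoss gives the same value for moderate logits.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes: 8 input features, 16 hidden units, 2x2 = 4 output pixels.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),  # final layer outputs raw logits: no ReLU or Sigmoid after it
)

inputs = torch.randn(1, 8)  # one made-up input sample
# Ground truth from the question, flattened to match the 4 output logits.
target = torch.tensor([[0.0, 1.0], [1.0, 1.0]]).reshape(1, 4)

logits = model(inputs)  # shape (1, 4), unbounded real values
loss = nn.BCEWithLogitsLoss()(logits, target)

# Mathematically the same loss, written as Sigmoid followed by BCELoss.
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), target)

print(loss.item(), loss_two_step.item())
```

To turn the logits into a binary image at inference time, one option is to threshold them at zero (equivalently, threshold the sigmoid at 0.5).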

CrossEntropyLoss (which would better be called
“CategoricalCrossEntropyWithLogitsLoss”) is essentially the same as
BCEWithLogitsLoss, but requires making some small modifications
to your network and your ground-truth labels that add a small amount
of unnecessary redundancy to your network.
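
To make the comparison concrete, here is a small sketch with made-up random logits and targets. It builds a two-class version of the problem by pairing each logit with a "class 0" logit fixed at zero, and converts the float targets to integer class labels. Under that construction the two losses should agree.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(3, 2, 2)                    # batch of 3 made-up 2x2 logit maps
target = torch.randint(0, 2, (3, 2, 2)).float()  # binary ground truth as floats

bce = F.binary_cross_entropy_with_logits(logits, target)

# Two-class version: a "class 0" channel fixed at zero, the original logit as "class 1".
two_class_logits = torch.stack([torch.zeros_like(logits), logits], dim=1)  # (N, C=2, H, W)
class_labels = target.long()  # integer class indices, shape (N, H, W)
ce = F.cross_entropy(two_class_logits, class_labels)

print(bce.item(), ce.item())
```

In a real network both channels would be learned rather than one being fixed at zero, which is presumably the redundancy mentioned above: only the difference between the two logits affects the loss.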

Best.

K. Frank

kliu · September 25, 2020, 5:33pm

Hey KFrank!

Thanks for the quick, detailed response. I was indeed having doubts about stacking the ReLU and Sigmoid - I will give this a shot today and I’ll follow up if things work out (fingers crossed!). In the meantime, do you happen to know why BCEWithLogitsLoss is numerically more stable than applying Sigmoid and BCELoss separately? I’m assuming this might be an implementation detail? Thanks for all your help.

Best,
K. Liu

KFrank · September 25, 2020, 8:41pm

Hi kliu!

Yes, it is an implementation detail. Mathematically, the two are the
same.

The issue is that Sigmoid (in particular, the so-called logistic function)
uses an exponential to map (-inf, inf) to (0.0, 1.0). But then BCELoss
turns around and uses log() to map (0.0, 1.0) back to (-inf, inf).
Mathematically, this is fine, but numerically, using floating-point
arithmetic, the exponentials can start to saturate, leading to loss
of precision, and can underflow to 0.0 and overflow to inf, leading
to infs and nans in your loss function and backpropagation.

BCEWithLogitsLoss avoids this internally by rearranging the
computation. (Note that pytorch provides a LogSigmoid function
that does the analogous computation internally.)
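
One common way to do such a rearrangement is sketched below in plain Python. This is an illustration of the idea, not PyTorch's actual source, and the function names are made up for the example. For a logit z and target y, the loss can be rewritten as max(z, 0) - z*y + log(1 + exp(-|z|)), so the exponential only ever sees non-positive arguments.

```python
import math

def naive_bce(z, y):
    """Sigmoid followed by plain binary cross entropy, computed literally."""
    p = 1.0 / (1.0 + math.exp(-z))  # exp() can overflow for very negative z
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))  # log(0) is possible

def stable_bce(z, y):
    """Same quantity, rearranged so exp() only sees non-positive arguments."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

for z, y in [(2.0, 1.0), (40.0, 0.0), (-1000.0, 1.0)]:
    try:
        naive = naive_bce(z, y)
    except (ValueError, OverflowError) as err:
        naive = f"failed ({type(err).__name__})"
    print(f"z={z}, y={y}: naive={naive}, stable={stable_bce(z, y)}")
```

For the moderate logit the two versions agree. For z = 40 the sigmoid rounds to exactly 1.0 in double precision, so the naive version ends up taking the log of zero, and for z = -1000 the exponential overflows. The rearranged version returns finite values in both cases.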

A similar issue arises when feeding the results of Softmax to a
plain cross-entropy loss. Pytorch doesn’t even offer a plain
cross-entropy function. Instead, pytorch’s CrossEntropyLoss
requires logits as its inputs, and, in effect, applies the Softmax
internally.
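
A quick way to see this is to compare CrossEntropyLoss on raw logits with a log-softmax followed by the negative log-likelihood loss. The logits and labels below are made up for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 3)           # 4 made-up samples, 3 classes, raw scores
labels = torch.tensor([0, 2, 1, 2])  # integer class labels

ce = F.cross_entropy(logits, labels)
# Explicit two-step version: log-softmax, then negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(ce.item(), manual.item())
```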

Best.

K. Frank

kliu · September 26, 2020, 5:17am

Hey K. Frank,

I see, that makes sense. I’ll be sure to check out the source code at some point. In other good news - your answer worked out for me. Thanks for all of your help! Have a great weekend.

Best,
K. Liu