Difference between BCELoss and BCEWithLogitsLoss when using large values


From my understanding, BCEWithLogitsLoss should yield the same results as BCELoss composed with a sigmoid layer; the only difference is that the former (BCEWithLogitsLoss) is numerically more stable.
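(For context, the usual argument for the fused version is the standard log-sum-exp rewrite, which never evaluates log(sigmoid(x)) directly. A minimal sketch of that rewrite, not PyTorch's exact internal code:)

```python
import torch

def naive_bce(x, z):
    # BCELoss(sigmoid(x), z) computed naively: for large x, sigmoid(x) rounds
    # to 1.0 in float32, so log(1 - p) becomes log(0) = -inf.
    p = torch.sigmoid(x)
    return -(z * torch.log(p) + (1 - z) * torch.log(1 - p))

def stable_bce_with_logits(x, z):
    # Algebraically equivalent form that never exponentiates a large
    # positive number: max(x, 0) - x*z + log(1 + exp(-|x|))
    return torch.clamp(x, min=0) - x * z + torch.log1p(torch.exp(-x.abs()))

x = torch.tensor(100.0)
z = torch.tensor(0.0)
print(naive_bce(x, z))               # -> tensor(inf)
print(stable_bce_with_logits(x, z))  # -> tensor(100.), the correct loss
```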

However, when I test their behavior, I get significantly different results as soon as I deal with logits with values around 1e2.

Minimal example:

import torch
from torch import nn

preds = torch.rand(10)
preds[0] = 1e2

labels = torch.zeros(10)

criterion = nn.BCELoss()
print(criterion(nn.Sigmoid()(preds), labels)) #outputs tensor(3.5969)
criterion = nn.BCEWithLogitsLoss()
print(criterion(preds, labels)) #outputs tensor(10.8338)

I am using pytorch 1.2.0.

Could someone please tell me whether I am doing anything wrong or this behavior is to be expected?

Thanks in advance!


You are saturating the sigmoid with such a high number.
Have a look at this small test:

import torch
import torch.nn.functional as F

labels = torch.zeros([1])

# steps must be passed explicitly in recent PyTorch versions
for x in torch.linspace(10, 100, steps=100):
    # True once float32 sigmoid has rounded to exactly 1.0
    print((torch.sigmoid(x) - 1.) == 0, ' for ', x)
    x = x.view(1)
    ce = F.binary_cross_entropy(torch.sigmoid(x), labels)
    ce_logit = F.binary_cross_entropy_with_logits(x, labels)
    err = torch.abs(ce - ce_logit)
    print('error ', err)

As you can see, the first print statement returns True for logits of roughly 17 and above, which means the limited floating point precision rounds torch.sigmoid(x) to exactly 1.
At that same point the error between the two losses jumps up by a large margin.
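To pin down where this kicks in, you can scan integer logits for the first value at which float32 sigmoid rounds to exactly 1.0 (a small sketch; the exact cutoff depends on the float format and rounding mode, so no precise value is hard-coded here):

```python
import torch

# Scan for the smallest integer logit where float32 sigmoid rounds to exactly 1.0.
# Past this point, log(1 - sigmoid(x)) is log(0) = -inf, so BCELoss on sigmoid
# outputs can no longer track the true loss, while
# binary_cross_entropy_with_logits keeps working on the raw logit.
for x in range(10, 40):
    if torch.sigmoid(torch.tensor(float(x))) == 1.0:
        print(f"float32 sigmoid first saturates to 1.0 at logit x = {x}")
        break
```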