Hi!
From my understanding, BCEWithLogitsLoss should yield the same results as BCELoss applied to sigmoid outputs, and the only difference between the two is that the former (BCEWithLogitsLoss) is more numerically stable.
However, when I test their behavior, I get significantly different results as soon as the logits include values around 1e2 (100) or larger.
Minimal example:
import torch
from torch import nn
preds = torch.rand(10)
preds[0] = 1e2
labels = torch.zeros(10)
criterion = nn.BCELoss()
print(criterion(nn.Sigmoid()(preds), labels))  # outputs tensor(3.5969)
criterion = nn.BCEWithLogitsLoss()
print(criterion(preds, labels))  # outputs tensor(10.8338)
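To isolate the arithmetic for the single large logit, here is a plain-Python sketch (no torch) of the two formulations for one element. The helper names `bce_naive` and `bce_stable` are hypothetical and are not PyTorch functions. The stable version is the usual log-sum-exp rearrangement, which I am assuming is close to what BCEWithLogitsLoss does. I have not checked how either loss is implemented in 1.2.0.

```python
import math

def bce_naive(logit, label):
    """BCE as sigmoid followed by log, with no safeguards (hypothetical helper)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def bce_stable(logit, label):
    """Same quantity rearranged so that no log ever receives 0 (hypothetical helper)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# For a moderate logit the two forms agree up to rounding.
print(bce_naive(0.5, 0.0), bce_stable(0.5, 0.0))

# For a logit of 100 the sigmoid rounds to exactly 1.0 ...
print(1.0 / (1.0 + math.exp(-100.0)) == 1.0)

# ... so the stable form still returns the true loss of about 100,
print(bce_stable(100.0, 0.0))

# ... while the naive form ends up evaluating log(0).
try:
    bce_naive(100.0, 0.0)
except ValueError as err:
    print("naive form fails:", err)
```

If this is what happens here, then once the sigmoid output has rounded to 1, any sigmoid-then-log loss has to substitute some finite value for log(0), and that substitute need not equal the true per-element loss of about 100. With the mean taken over 10 elements, a true loss of 100 on that one element contributes 10 to the result, which is consistent with the 10.8338 I get from BCEWithLogitsLoss.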
I am using PyTorch 1.2.0.
Could someone please tell me whether I am doing something wrong, or whether this behavior is expected?
Thanks in advance!