BCELoss vs BCEWithLogitsLoss

torch.argmax is used in multi-class classification to compute the prediction from a model output of shape [batch_size, nb_classes, *], where the argmax is taken over the nb_classes dimension.

Passing the logits through a softmax before calling torch.argmax won’t make a difference, since the maximum logit value will also have the highest probability.
Passing the logits through a sigmoid and calling torch.argmax sounds wrong, since in that case you should apply a threshold instead.
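
A quick sanity check of the softmax claim (a minimal sketch with random logits):

import torch

output = torch.randn(4, 3)  # [batch_size, nb_classes] logits
preds_logits = torch.argmax(output, dim=1)
preds_probs = torch.argmax(torch.softmax(output, dim=1), dim=1)
print(torch.equal(preds_logits, preds_probs))  # True: softmax is monotonic per row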

It depends on your use case and the model output.
For a binary classification you can define the model output in the shape [batch_size, 1] and output the logits (you would use nn.BCEWithLogitsLoss in this case). To get the predicted class you can use a threshold on the logits or the probability after passing the logits to a sigmoid function.
Here is a small code example:

output = model(input) # output shape is `[batch_size, 1]` and contains logits
output_prob = torch.sigmoid(output) # calculate probabilities
pred = output_prob > 0.5 # apply threshold to get class predictions

Alternatively, you can treat the binary classification as a two-class multi-class classification, where the model output would have the shape [batch_size, 2] (you would use nn.CrossEntropyLoss in this case).
To get the predictions you would use preds = torch.argmax(output, dim=1) on the logits or on the softmax output.
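
A minimal sketch of this two-class setup, following the same pseudo-style as the snippet above (model, input, and target are placeholders):

import torch
import torch.nn as nn

output = model(input)  # output shape is `[batch_size, 2]` and contains logits
loss = nn.CrossEntropyLoss()(output, target)  # target contains class indices 0 or 1
preds = torch.argmax(output, dim=1)  # identical for logits and softmax output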

torch.exp is used to compute the probabilities after you’ve applied e.g. log_softmax on the logits.
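
For example (a small sketch, reusing the two-class output from above):

import torch.nn.functional as F

log_probs = F.log_softmax(output, dim=1)  # log-probabilities, e.g. for nn.NLLLoss
probs = torch.exp(log_probs)  # recover probabilities in [0, 1]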


Thank you! These are really helpful!

@ptrblck

  1. If my metric is just log loss and not any other regular metric like accuracy or precision, which loss is the right choice: BCELoss or BCEWithLogitsLoss?

  2. If I use plain BCELoss and don’t use a sigmoid anywhere, will the model converge to the targets (0 and 1)?

  1. I would always recommend using nn.BCEWithLogitsLoss and passing the raw logits to this criterion, instead of applying a sigmoid and using nn.BCELoss, for better numerical stability. Besides the numerical stability there won’t be any difference (see the equivalence check after the error example below).

  2. You will get an error if you try to pass model outputs which are not in the range [0, 1], as seen here:

criterion = nn.BCELoss()
output = torch.randn(10, 1) * 10
target = torch.randint(0, 2, (10, 1)).float()

print(output.min(), output.max())
> tensor(-13.4234) tensor(14.9071)

loss = criterion(output, target)
> RuntimeError: all elements of input should be between 0 and 1
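
To illustrate point 1, here is a small equivalence check (a sketch with random logits; the two formulations differ only in numerical stability):

criterion_logits = nn.BCEWithLogitsLoss()
criterion_probs = nn.BCELoss()

logits = torch.randn(10, 1)
target = torch.randint(0, 2, (10, 1)).float()

loss_logits = criterion_logits(logits, target)
loss_probs = criterion_probs(torch.sigmoid(logits), target)

print(torch.allclose(loss_logits, loss_probs))
> True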

Dear Vahid,

If the sigmoid is part of BCEWithLogitsLoss then I would expect the output to be between 0 and 1 (probabilities), but I am getting negative numbers. Can that be so?

Best,
Alice

Hi @ptrblck, I also have a similar question like @Alice_NL.

I’m using the BCEWithLogitsLoss. My model’s output layer is nn.Linear(n_features, 1). I’m also applying a method to ignore padding tokens (as described here) since some of my instances are “dummy” instances.

I am running into negative values of the loss. I don’t pass the model outputs through a sigmoid since this is done internally in the loss (as explained here). The target tensor is cast to .float() as advised here (otherwise an error is raised).

Maybe sharing an example will help.

criterion = nn.BCEWithLogitsLoss(reduction="none")
predictions = model(text)
predictions.flatten()  # now they look like this: 
[-0.0697, -0.1014, -0.1710, -0.1756, -0.2617, -0.1669, -0.0434,  0.0425,
         0.1301,  0.3244,  0.2333,  0.5780,  0.6034,  0.7815,  0.8425,  0.9130,
         1.1673,  1.1997,  1.2309,  1.1993,  1.2654,  1.4185,  1.6314,  1.7687,
         1.9572,  2.0371,  2.0445,  2.0647,  2.2613,  2.1460,  2.3093,  2.2494,
         2.1804,  2.1032,  2.0195,  1.7516,  1.5498,  1.2483,  0.9180,  0.8675,
         0.8975,  0.7890,  0.8383,  0.8216,  0.8925,  1.0214,  0.9266,  1.0895,
         0.9227,  0.9671,  0.7545,  0.8215,  0.8538,  0.5958,  0.5385,  0.6271,
         0.5543,  0.5031,  0.5726,  0.6811,  0.6685,  0.7003,  0.7954,  0.6352,
         0.9142,  0.7911,  0.8525,  1.0150,  0.9878,  1.1784,  1.1077,  1.0028,
         1.0299,  1.1480,  0.9583,  1.0223,  0.8234,  0.5116,  1.2303,  1.3809,
         1.2653,  1.2630,  1.2284,  1.2188,  1.0241,  1.1120,  0.9463,  0.7682,
         0.9089,  0.7657,  0.9760,  0.9888,  0.9637,  0.9657,  1.0535,  1.1614,
         0.9324,  0.9215,  0.9468,  0.8493,  0.9579,  0.9594,  0.8854,  0.5773,
         0.5589,  0.5986,  0.4733,  0.6161,  0.5088,  0.4822,  0.6536,  0.6084,
         0.6348,  0.6546,  0.5932,  0.6005,  0.3710, -0.4232, -0.3314, -0.1482,
        -0.2120, -0.0500,  0.0352,  0.0487,  0.1709,  0.2060,  0.3851,  0.3964,
         0.4985,  0.5045,  0.7283,  0.6332,  0.7792,  0.8024,  0.9375,  0.9375,
         0.9937,  0.8789,  1.0017,  1.0397,  0.9286,  0.9806,  0.8421,  0.7172,
         0.7602,  0.6581,  0.6481,  0.6109,  0.4873,  0.4827,  0.4468,  0.4083,
         0.3280,  0.3109,  0.3444,  0.2514]

tags = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1,
        1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

tag_pad_idx = 2
bce_loss = criterion(predictions, tags.float())
loss_mask = tags != tag_pad_idx
bce_loss_masked = bce_loss.where(loss_mask, torch.tensor(0.0))

mean_loss = bce_loss_masked.sum() / loss_mask.sum()

The mean_loss will be tensor(-0.0083, device='cuda:2', grad_fn=<DivBackward0>).

The target values should be in [0, 1], while your target contains values outside of this range (2), so an undefined loss would be expected. I’m not familiar with your use case, but if you are working on a multi-class classification, you might want to use nn.CrossEntropyLoss instead.

I see, thank you!
I actually have 2 real classes (0 and 1), while the third class (2) just marks the instances that are padding sentences since I am doing document-level classification of individual sentences. In this case, is this a binary or multi-class problem?

I wanted to ignore the dummy instances by zeroing out their loss, but I can see this isn’t the best approach.

You could try to keep your current workflow using these (invalid) padding target indices, create the unreduced loss, filter out the padding losses, and reduce it afterwards.
Using nn.CrossEntropyLoss might be easier, as it provides an argument to ignore specific class indices.
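
A small sketch of the ignore_index approach, reusing the names from your example (the model output would need the shape [num_sentences, 2] for the two real classes):

criterion = nn.CrossEntropyLoss(ignore_index=2)  # padding class 2 is skipped
logits = model(text)  # [num_sentences, 2]
loss = criterion(logits, tags)  # averaged over the non-padding entries only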

Hello @ptrblck!
Thank you for your explanation on the same.
My question is: since nn.BCEWithLogitsLoss already applies softmax during its calculation, why have you chosen to use sigmoid for the probability calculation? Could you please provide more information on this?

nn.BCEWithLogitsLoss applies sigmoid (or log_sigmoid) internally, not softmax, as the latter would return all ones for a binary classification output of shape [batch_size, 1].
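
You can verify this degenerate softmax behavior directly:

logits = torch.randn(4, 1)
print(torch.softmax(logits, dim=1))  # all ones, since each row sums to 1 over a single entry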

My bad, thank you for the clarification. Just to clarify: the output I received for binary classification using nn.BCEWithLogitsLoss was tensor([-1.7795, -1.5024, -1.3843], device='cuda:0', grad_fn=<Unique2Backward>).
Given that the sigmoid function returns values between 0 and 1, could you please clarify in which case this could happen?

I guess you’ve printed the model output and thus the input to nn.BCEWithLogitsLoss?
If so, then you have printed the logits, which are not bounded to a specific range and can contain any value in [-Inf, Inf]. nn.BCEWithLogitsLoss applies the activation function internally (so not visible to you) and will return the loss value. If you want to get the probabilities, use torch.sigmoid(model_output), but don’t pass these values to the criterion.
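
Note that thresholding these probabilities at 0.5 is equivalent to thresholding the raw logits at 0.0, since sigmoid(0) = 0.5:

logits = torch.randn(6, 1)
probs = torch.sigmoid(logits)
print(torch.equal(logits > 0.0, probs > 0.5))
> True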


You will get the loss between your predicted values and the label values.

BCEWithLogitsLoss combines a sigmoid layer and BCELoss in a single class (reference): it applies the sigmoid and then calculates the loss using BCELoss.

I’m trying to run a multi-label classifier and I have used nn.BCEWithLogitsLoss as my model’s loss. But when I want to use accuracy_score(output_labels, input_labels) I get this error:
ValueError: Classification metrics can’t handle a mix of binary and multilabel-indicator targets.
What should I do?

I don’t know what the inputs to this method look like in your case, but this code snippet works for a multilabel classification:

from sklearn.metrics import accuracy_score

# multilabel-indicator format: one row per sample, one column per label
output = torch.tensor([[0., 1., 1., 0.],
                       [0., 1., 0., 1.]])
target = torch.tensor([[0., 1., 0., 1.],
                       [0., 1., 0., 1.]])
accuracy_score(target, output)  # sklearn expects (y_true, y_pred)

and computes the accuracy as described in the docs:

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
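
A single mismatching label therefore makes the whole sample count as wrong. With the tensors above, the first sample differs from its target in two labels while the second matches exactly, so the subset accuracy is 0.5; partially correct samples get no credit:

print(accuracy_score(target, output))
> 0.5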

I am implementing a UNET with a binary mask (0 = black background, 1 = white mask). I realized the black pixels outnumber the white pixels in my training set, so I am planning to update my custom loss function.

class CustomBCELoss:

    def __init__(self):
        self.bce = nn.BCELoss()

    def __call__(self, yhat, ys):
        yhat = torch.sigmoid(yhat)  # BCELoss expects probabilities
        valid = (ys == 1) | (ys == 0)  # keep only pixels with valid 0/1 targets

        if bool(valid.any()):
            return self.bce(yhat[valid], ys[valid])
        else:
            return None

This is the new loss function, where I am using nn.BCEWithLogitsLoss and the pos_weight parameter to balance the data.

class CustomBCELossLogits:
    def __init__(self):
        self.bceLogit = nn.BCEWithLogitsLoss()

    def __call__(self, yhat, ys):
        valid = (ys == 1) | (ys == 0)
        weight = torch.tensor([3, 1])
        weights = weight.to('cuda')
        if bool(valid.any()):
            return self.bceLogit(yhat[valid], ys[valid], pos_weight=weights)
        else:
            return None

  1. Is this approach correct? @ptrblck Please suggest.
  2. I am getting an error with the pos_weight parameter. I have picked the weights as [3, 1], meaning the white pixels should be weighted 3 times higher.

I don’t know the class frequencies, but refer to the docs to set the pos_weight:

For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3. The loss would act as if the dataset contains 3×100 = 300 positive examples.

The error is raised since pos_weight should be passed as an argument to the class initialization, not the forward method. If you want to use it in a functional way you could use F.binary_cross_entropy_with_logits instead.
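
A minimal sketch of both options (the 3x weighting is taken from your example; the shapes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

pos_weight = torch.tensor([3.0])  # weight the positive (white) pixels 3x

# option 1: pass pos_weight to the class initialization
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
yhat = torch.randn(4, 1, 8, 8)  # e.g. UNET logits
ys = torch.randint(0, 2, (4, 1, 8, 8)).float()
loss = criterion(yhat, ys)

# option 2: functional API, if the weight has to be passed per call
loss_f = F.binary_cross_entropy_with_logits(yhat, ys, pos_weight=pos_weight)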

Thank you for your input.

@ptrblck Hi, could you please tell me why the threshold was set to 0.0 instead of 0.5, if we go by the range of the sigmoid function? Please tell me what I am missing. If 0.0 should be the threshold when applying BCEWithLogitsLoss, should I keep the same threshold in both the training and testing parts of my model?