BCELoss vs BCEWithLogitsLoss

Not necessarily, if you don’t need the probabilities.
To get the predictions from logits, you could apply a threshold (e.g. out > 0.0) for a binary or multi-label classification use case with nn.BCEWithLogitsLoss and torch.argmax(output, dim=1) for a multi-class classification with nn.CrossEntropyLoss.

On the other hand, if you need to print or process the probabilities, you need to apply sigmoid, softmax or exp depending on the model output.
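
As a rough sketch of the two cases above (the tensors are made-up values, not from any real model): thresholding the logits at 0.0 gives the same predictions as thresholding the sigmoid probabilities at 0.5, and argmax works directly on multi-class logits.

```python
import torch

# Binary / multi-label case: hypothetical logits of shape [batch_size, 1].
binary_logits = torch.tensor([[-2.0], [-0.5], [0.3], [4.0]])
pred_from_logits = binary_logits > 0.0
pred_from_probs = torch.sigmoid(binary_logits) > 0.5
print(torch.equal(pred_from_logits, pred_from_probs))

# Multi-class case: hypothetical logits of shape [batch_size, nb_classes].
multiclass_logits = torch.tensor([[0.2, 1.7, -0.4],
                                  [3.1, 0.0, 0.5]])
class_pred = torch.argmax(multiclass_logits, dim=1)
print(class_pred)
```

The probabilities only need to be computed when you actually want to print or process them.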

Hi all,

Thanks for your detailed answers. They are very helpful.

Just to make sure that I understand this correctly: your m can be some other model as well, and it does not need to contain nn.Sigmoid. Whatever the model outputs, we have to apply a sigmoid before passing it into nn.BCELoss().

Am I right?

Thanks
Regards
Pranavan

Yes, that’s true: the values passed to nn.BCELoss() must have gone through a sigmoid. Just make sure the sigmoid is applied only once. If you apply it explicitly before the loss, the last layer of the model should not have any activation of its own (in other words, it should be a plain linear layer).


Thanks a lot for your reply. I understand. In the case of nn.BCELoss(), if I use a sigmoid layer as the final layer in the model m, does that mean I do not have to pass the output through another sigmoid? Or is this not an ideal setting for training?

Thanks

Sure, you definitely do not want to apply two sigmoid activation functions, since that encourages vanishing gradients. The logits are in theory in the range (-inf, +inf), but after applying one sigmoid their range changes to (0, 1), and that squashed range is the input of the second sigmoid, whose output is then confined to roughly (0.5, 0.73).
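
To make the range argument concrete, here is a small plain-Python sketch (the logit values are arbitrary examples): after two sigmoids, even very different logits end up squeezed into a narrow band.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

logits = [-20.0, -5.0, 0.0, 5.0, 20.0]   # arbitrary example logits
once = [sigmoid(x) for x in logits]      # values in (0, 1)
twice = [sigmoid(p) for p in once]       # values squeezed between 0.5 and sigmoid(1)

for x, p, q in zip(logits, once, twice):
    print(f"logit={x:6.1f}  one sigmoid={p:.4f}  two sigmoids={q:.4f}")
```

With so little spread left in the outputs, the loss can barely tell confident predictions apart from uncertain ones.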


Hi, I have a question about model evaluation.

I am building a BERT model for binary classification with a linear classifier (nn.Linear) as the last layer. To evaluate the model, I need to calculate precision/recall/F1 as well as get the probabilities. The output is logits.
I see online that sometimes people directly use torch.argmax(output, dim=1) as the predicted value. Sometimes people pass the logits to a sigmoid or softmax first before doing the argmax to get the prediction.
How do I know if I should apply sigmoid, softmax or exp? I tried sigmoid and softmax separately. If I pass the logits to softmax first, the result still doesn’t look like probability values (it can be negative). If I pass the logits to sigmoid, the values look more like probabilities (2 elements, both within the range [0, 1]), but they don’t sum to 1.

torch.argmax is used for a multi-class classification to compute the prediction from a model output in the shape [batch_size, nb_classes, *], where the argmax is called on the nb_classes-dimension.

Passing the logits to a softmax before calling torch.argmax won’t make a difference, since the largest logit will also have the highest probability.
Passing the logits to sigmoid and calling torch.argmax sounds wrong, since in this case you should use a threshold.
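
A quick way to convince yourself of the softmax point is the following sketch with made-up logits: softmax is monotonic within each row, so the argmax is the same before and after.

```python
import torch

# Hypothetical logits of shape [batch_size, nb_classes].
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [-0.3, 0.1, 3.2],
                       [-4.0, -2.5, -3.0]])

probs = torch.softmax(logits, dim=1)          # rows sum to 1
pred_logits = torch.argmax(logits, dim=1)     # prediction from raw logits
pred_probs = torch.argmax(probs, dim=1)       # prediction from probabilities

print(pred_logits, pred_probs)
```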

It depends on your use case and the model output.
For a binary classification you can define the model output in the shape [batch_size, 1] and output the logits (you would use nn.BCEWithLogitsLoss in this case). To get the predicted class you can use a threshold on the logits or the probability after passing the logits to a sigmoid function.
Here is a small code example:

output = model(input) # output shape is `[batch_size, 1]` and contains logits
output_prob = torch.sigmoid(output) # calculate probabilities
pred = output_prob > 0.5 # apply threshold to get class predictions

Alternatively, you can treat the binary classification as a multi-class classification, where the model output would be [batch_size, 2] (you would use nn.CrossEntropyLoss in this case).
To get the predictions you would use preds = torch.argmax(output, dim=1) on the logits or the softmax output.

torch.exp is used to compute the probabilities after you’ve applied e.g. log_softmax on the logits.
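
A short sketch of the torch.exp case, using made-up logits: log_softmax returns log-probabilities (which are zero or negative), and exponentiating them recovers the same probabilities that softmax would give.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits of shape [batch_size, 2].
logits = torch.tensor([[1.5, -0.5],
                       [-2.0, 0.25]])

log_probs = F.log_softmax(logits, dim=1)   # log-probabilities, all <= 0
probs = torch.exp(log_probs)               # back to probabilities in [0, 1]

print(log_probs)
print(probs, probs.sum(dim=1))
```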


Thank you! These are really helpful!

@ptrblck
  1. If my metric is just log loss, not any of the other regular metrics like accuracy or precision, which loss is the right choice: BCELoss or BCEWithLogitsLoss?

  2. If I use plain BCELoss and don’t use a sigmoid anywhere, will the model converge to the targets?

The targets are 0 and 1.

  1. I would always recommend using nn.BCEWithLogitsLoss and passing raw logits to this criterion, instead of applying a sigmoid and using nn.BCELoss, for better numerical stability. Besides that, there won’t be any difference.

  2. You will get an error if you pass model outputs that are not in the range [0, 1], as seen here:

import torch
import torch.nn as nn

criterion = nn.BCELoss()
output = torch.randn(10, 1) * 10  # raw values, not squashed into [0, 1]
target = torch.randint(0, 2, (10, 1)).float()

print(output.min(), output.max())
> tensor(-13.4234) tensor(14.9071)

loss = criterion(output, target)
> RuntimeError: all elements of input should be between 0 and 1
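
To illustrate the numerical stability point from the first answer, here is a plain-Python sketch. Both functions are hypothetical helpers written for this example, not PyTorch's internal code: the first applies a sigmoid and then takes logs, the second uses one common stable formulation that works directly on the logit.

```python
import math

def naive_bce(logit: float, target: float) -> float:
    # Sigmoid first, then the log terms: breaks once the sigmoid saturates.
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def stable_bce_with_logits(logit: float, target: float) -> float:
    # Algebraically equivalent form that never evaluates log(0).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits both versions agree.
print(naive_bce(2.0, 1.0), stable_bce_with_logits(2.0, 1.0))

# For a large logit with the wrong target, the sigmoid rounds to exactly 1.0.
try:
    naive_bce(50.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True
print(naive_failed, stable_bce_with_logits(50.0, 0.0))
```

The stable version still reports a large but finite loss for the badly wrong prediction, which is the behavior you want during training.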

Dear Vahid,

If the sigmoid is part of BCEWithLogitsLoss, then I would expect the output to be between 0 and 1 (probabilities). However, I am getting negative numbers. Can that be right?

Best,
Alice

Hi @ptrblck, I have a question similar to the one from @Alice_NL.

I’m using the BCEWithLogitsLoss. My model’s output layer is nn.Linear(n_features, 1). I’m also applying a method to ignore padding tokens (as described here) since some of my instances are “dummy” instances.

I am running into negative values of the loss. I don’t pass the model outputs through a sigmoid since this is done internally in the loss (as explained here). The target tensor is converted with .float() as advised here (otherwise an error is raised).

Maybe sharing an example will help.

criterion = nn.BCEWithLogitsLoss(reduction="none")
predictions = model(text)
predictions.flatten()  # now they look like this: 
[-0.0697, -0.1014, -0.1710, -0.1756, -0.2617, -0.1669, -0.0434,  0.0425,
         0.1301,  0.3244,  0.2333,  0.5780,  0.6034,  0.7815,  0.8425,  0.9130,
         1.1673,  1.1997,  1.2309,  1.1993,  1.2654,  1.4185,  1.6314,  1.7687,
         1.9572,  2.0371,  2.0445,  2.0647,  2.2613,  2.1460,  2.3093,  2.2494,
         2.1804,  2.1032,  2.0195,  1.7516,  1.5498,  1.2483,  0.9180,  0.8675,
         0.8975,  0.7890,  0.8383,  0.8216,  0.8925,  1.0214,  0.9266,  1.0895,
         0.9227,  0.9671,  0.7545,  0.8215,  0.8538,  0.5958,  0.5385,  0.6271,
         0.5543,  0.5031,  0.5726,  0.6811,  0.6685,  0.7003,  0.7954,  0.6352,
         0.9142,  0.7911,  0.8525,  1.0150,  0.9878,  1.1784,  1.1077,  1.0028,
         1.0299,  1.1480,  0.9583,  1.0223,  0.8234,  0.5116,  1.2303,  1.3809,
         1.2653,  1.2630,  1.2284,  1.2188,  1.0241,  1.1120,  0.9463,  0.7682,
         0.9089,  0.7657,  0.9760,  0.9888,  0.9637,  0.9657,  1.0535,  1.1614,
         0.9324,  0.9215,  0.9468,  0.8493,  0.9579,  0.9594,  0.8854,  0.5773,
         0.5589,  0.5986,  0.4733,  0.6161,  0.5088,  0.4822,  0.6536,  0.6084,
         0.6348,  0.6546,  0.5932,  0.6005,  0.3710, -0.4232, -0.3314, -0.1482,
        -0.2120, -0.0500,  0.0352,  0.0487,  0.1709,  0.2060,  0.3851,  0.3964,
         0.4985,  0.5045,  0.7283,  0.6332,  0.7792,  0.8024,  0.9375,  0.9375,
         0.9937,  0.8789,  1.0017,  1.0397,  0.9286,  0.9806,  0.8421,  0.7172,
         0.7602,  0.6581,  0.6481,  0.6109,  0.4873,  0.4827,  0.4468,  0.4083,
         0.3280,  0.3109,  0.3444,  0.2514]

tags = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1,
        1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

tag_pad_idx = 2
bce_loss = criterion(predictions, tags.float())
loss_mask = tags != tag_pad_idx
bce_loss_masked = bce_loss.where(loss_mask, torch.tensor(0.0))

mean_loss = bce_loss_masked.sum() / loss_mask.sum()

The mean_loss will be tensor(-0.0083, device='cuda:2', grad_fn=<DivBackward0>).

The target values should be in [0, 1], while your target contains values outside of this range (2), so an undefined loss would be expected. I’m not familiar with your use case, but if you are working on a multi-class classification, you might want to use nn.CrossEntropyLoss instead.

I see, thank you!
I actually have 2 real classes (0 and 1), while the third class (2) just marks the instances that are padding sentences since I am doing document-level classification of individual sentences. In this case, is this a binary or multi-class problem?

I wanted to ignore the dummy instances by zeroing out their loss, but I can see this isn’t the best approach.

You could try to keep your current workflow using these (invalid) padding target indices, create the unreduced loss, filter out the padding losses, and reduce it afterwards.
Using nn.CrossEntropyLoss might be easier, as it provides an argument to ignore specific class indices.
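
Both suggestions could look roughly like the sketch below. The logits and tags are made-up values, and the [N, 2] output in the second option is an assumption about how the model would be changed, not part of the original code.

```python
import torch
import torch.nn as nn

tag_pad_idx = 2  # padding marker, as in the post above

# Hypothetical per-sentence logits and tags (2 marks padding).
logits = torch.tensor([0.8, -1.2, 0.3, 2.0, -0.5])
tags = torch.tensor([1, 0, 2, 1, 2])
valid = tags != tag_pad_idx

# Option 1: unreduced BCE loss, keep only the non-padding entries, then reduce.
unreduced = nn.BCEWithLogitsLoss(reduction="none")(logits, tags.float())
bce_loss = unreduced[valid].mean()

# Option 2: two logits per sentence (hypothetical [N, 2] output) and ignore_index.
two_class_logits = torch.tensor([[0.1, 0.9], [1.5, -0.3], [0.2, 0.4],
                                 [-1.0, 1.0], [0.6, 0.0]])
ce_loss = nn.CrossEntropyLoss(ignore_index=tag_pad_idx)(two_class_logits, tags)

print(bce_loss, ce_loss)
```

In option 1 the padding entries are dropped rather than zeroed, so the invalid target value 2 never contributes to the reduced loss.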

Hello @ptrblck!
Thank you for your explanation.
My question is: since nn.BCEWithLogitsLoss already applies softmax during its calculation, why have you chosen to use sigmoid for the probability calculation? Could you please provide more information on this?

nn.BCEWithLogitsLoss applies sigmoid (or log_sigmoid) internally, not softmax, as the latter would return a result of all ones for a binary classification output in the shape [batch_size, 1].
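
The "all ones" behavior is easy to check with a small sketch (the logits are arbitrary example values): softmax normalizes over a single element per row, while sigmoid gives a usable probability.

```python
import torch

# Hypothetical binary-classification logits of shape [batch_size, 1].
logits = torch.tensor([[-3.0], [0.0], [2.5]])

softmax_out = torch.softmax(logits, dim=1)   # each row has one element, so it becomes 1.0
sigmoid_out = torch.sigmoid(logits)          # per-sample probability of the positive class

print(softmax_out.flatten())
print(sigmoid_out.flatten())
```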

My bad. Thank you for the clarification. So just to clarify, the output that I had received for binary classification using nn.BCEWithLogitsLoss was tensor([-1.7795, -1.5024, -1.3843], device='cuda:0', grad_fn=<Unique2Backward>).
Given that the sigmoid function returns values between 0 and 1, could you please clarify in which case these negative values could occur?

I guess you’ve printed the model output and thus the input to nn.BCEWithLogitsLoss?
If so, then you have printed the logits, which are not bounded to a specific range and can take any value in (-Inf, Inf). nn.BCEWithLogitsLoss applies the activation function internally (so it is not visible to you) and returns the loss value. If you want to get the probabilities, use torch.sigmoid(model_output), but don’t pass these values to the criterion.


You will get the loss between your predicted values and the label values, not probabilities.

The BCEWithLogitsLoss function combines a sigmoid layer and BCELoss in a single step (reference).
In effect, it applies the sigmoid and then calculates the loss as BCELoss would.
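
As a small check of that description, the sketch below reuses the three logits printed earlier in the thread together with made-up targets. For moderate logits like these, the combined criterion and the explicit sigmoid + BCELoss route give the same loss.

```python
import torch
import torch.nn as nn

logits = torch.tensor([-1.7795, -1.5024, -1.3843])  # model output from the post above
target = torch.tensor([0.0, 1.0, 0.0])              # hypothetical targets

loss_with_logits = nn.BCEWithLogitsLoss()(logits, target)

probs = torch.sigmoid(logits)                       # probabilities, for inspection only
loss_two_step = nn.BCELoss()(probs, target)

print(probs)
print(loss_with_logits, loss_two_step)
```

The negative numbers are just the logits; the probabilities and the loss are both non-negative.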