BCELoss vs BCEWithLogitsLoss

torch.argmax is used in multi-class classification to compute the prediction from a model output of shape [batch_size, nb_classes, *], where the argmax is taken over the nb_classes dimension.

Passing the logits through a softmax before calling torch.argmax won’t make a difference, since the maximum logit value will also have the highest probability.
Passing the logits through a sigmoid and calling torch.argmax sounds wrong, since in that case you should apply a threshold instead.
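
A quick sanity check of the softmax claim (a minimal sketch with random logits):

import torch

output = torch.randn(4, 3)  # [batch_size, nb_classes] logits
preds_logits = torch.argmax(output, dim=1)
preds_probs = torch.argmax(torch.softmax(output, dim=1), dim=1)
print(torch.equal(preds_logits, preds_probs))  # True: softmax is monotonic per row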

It depends on your use case and the model output.
For a binary classification you can define the model output in the shape [batch_size, 1] and output the logits (you would use nn.BCEWithLogitsLoss in this case). To get the predicted class you can use a threshold on the logits or the probability after passing the logits to a sigmoid function.
Here is a small code example:

output = model(input) # output shape is `[batch_size, 1]` and contains logits
output_prob = torch.sigmoid(output) # calculate probabilities
pred = output_prob > 0.5 # apply threshold to get class predictions

Alternatively, you can treat the binary classification as a two-class multi-class classification, where the model output would have the shape [batch_size, 2] (you would use nn.CrossEntropyLoss in this case).
To get the predictions you would use preds = torch.argmax(output, dim=1) on the logits or on the softmax output.
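
A minimal sketch of this two-class setup, following the same pseudo-style as the snippet above (model, input, and target are placeholders):

import torch
import torch.nn as nn

output = model(input)  # output shape is `[batch_size, 2]` and contains logits
loss = nn.CrossEntropyLoss()(output, target)  # target contains class indices 0 or 1
preds = torch.argmax(output, dim=1)  # identical for logits and softmax output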

torch.exp is used to compute the probabilities after you’ve applied e.g. log_softmax on the logits.
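
For example (a small sketch, reusing the two-class output from above):

import torch.nn.functional as F

log_probs = F.log_softmax(output, dim=1)  # log-probabilities, e.g. for nn.NLLLoss
probs = torch.exp(log_probs)  # recover probabilities in [0, 1]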


Thank you! These are really helpful!

@ptrblck

  1. If my metric is just log loss and not any other regular metric like accuracy or precision, which loss is the right choice: BCELoss or BCEWithLogitsLoss?

  2. If I use plain BCELoss and don’t use a sigmoid anywhere, will the model converge to the targets (0 and 1)?

  1. I would always recommend using nn.BCEWithLogitsLoss and passing the raw logits to this criterion, instead of applying a sigmoid and using nn.BCELoss, for better numerical stability. Besides the numerical stability there won’t be any difference (see the equivalence check after the error example below).

  2. You will get an error if you try to pass model outputs which are not in the range [0, 1], as seen here:

criterion = nn.BCELoss()
output = torch.randn(10, 1) * 10
target = torch.randint(0, 2, (10, 1)).float()

print(output.min(), output.max())
> tensor(-13.4234) tensor(14.9071)

loss = criterion(output, target)
> RuntimeError: all elements of input should be between 0 and 1
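
To illustrate point 1, here is a small equivalence check (a sketch with random logits; the two formulations differ only in numerical stability):

criterion_logits = nn.BCEWithLogitsLoss()
criterion_probs = nn.BCELoss()

logits = torch.randn(10, 1)
target = torch.randint(0, 2, (10, 1)).float()

loss_logits = criterion_logits(logits, target)
loss_probs = criterion_probs(torch.sigmoid(logits), target)

print(torch.allclose(loss_logits, loss_probs))
> True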

Dear Vahid,

If the sigmoid is part of BCEWithLogitsLoss then I would expect the output to be between 0 and 1 (probabilities), but I am getting negative numbers. Can that be so?

Best,
Alice

Hi @ptrblck, I also have a similar question like @Alice_NL.

I’m using the BCEWithLogitsLoss. My model’s output layer is nn.Linear(n_features, 1). I’m also applying a method to ignore padding tokens (as described here) since some of my instances are “dummy” instances.

I am running into negative values of the loss. I don’t pass the model outputs through a sigmoid since this is done internally in the loss (as explained here). The target tensor is cast to .float() as advised here (otherwise an error is raised).

Maybe sharing an example will help.

criterion = nn.BCEWithLogitsLoss(reduction="none")
predictions = model(text)
predictions.flatten()  # now they look like this: 
[-0.0697, -0.1014, -0.1710, -0.1756, -0.2617, -0.1669, -0.0434,  0.0425,
         0.1301,  0.3244,  0.2333,  0.5780,  0.6034,  0.7815,  0.8425,  0.9130,
         1.1673,  1.1997,  1.2309,  1.1993,  1.2654,  1.4185,  1.6314,  1.7687,
         1.9572,  2.0371,  2.0445,  2.0647,  2.2613,  2.1460,  2.3093,  2.2494,
         2.1804,  2.1032,  2.0195,  1.7516,  1.5498,  1.2483,  0.9180,  0.8675,
         0.8975,  0.7890,  0.8383,  0.8216,  0.8925,  1.0214,  0.9266,  1.0895,
         0.9227,  0.9671,  0.7545,  0.8215,  0.8538,  0.5958,  0.5385,  0.6271,
         0.5543,  0.5031,  0.5726,  0.6811,  0.6685,  0.7003,  0.7954,  0.6352,
         0.9142,  0.7911,  0.8525,  1.0150,  0.9878,  1.1784,  1.1077,  1.0028,
         1.0299,  1.1480,  0.9583,  1.0223,  0.8234,  0.5116,  1.2303,  1.3809,
         1.2653,  1.2630,  1.2284,  1.2188,  1.0241,  1.1120,  0.9463,  0.7682,
         0.9089,  0.7657,  0.9760,  0.9888,  0.9637,  0.9657,  1.0535,  1.1614,
         0.9324,  0.9215,  0.9468,  0.8493,  0.9579,  0.9594,  0.8854,  0.5773,
         0.5589,  0.5986,  0.4733,  0.6161,  0.5088,  0.4822,  0.6536,  0.6084,
         0.6348,  0.6546,  0.5932,  0.6005,  0.3710, -0.4232, -0.3314, -0.1482,
        -0.2120, -0.0500,  0.0352,  0.0487,  0.1709,  0.2060,  0.3851,  0.3964,
         0.4985,  0.5045,  0.7283,  0.6332,  0.7792,  0.8024,  0.9375,  0.9375,
         0.9937,  0.8789,  1.0017,  1.0397,  0.9286,  0.9806,  0.8421,  0.7172,
         0.7602,  0.6581,  0.6481,  0.6109,  0.4873,  0.4827,  0.4468,  0.4083,
         0.3280,  0.3109,  0.3444,  0.2514]

tags = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1,
        1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

tag_pad_idx = 2
bce_loss = criterion(predictions, tags.float())
loss_mask = tags != tag_pad_idx
bce_loss_masked = bce_loss.where(loss_mask, torch.tensor(0.0))

mean_loss = bce_loss_masked.sum() / loss_mask.sum()

The mean_loss will be tensor(-0.0083, device='cuda:2', grad_fn=<DivBackward0>).

The target values should be in [0, 1], while your target contains values outside of this range (2), so an undefined loss would be expected. I’m not familiar with your use case, but if you are working on a multi-class classification, you might want to use nn.CrossEntropyLoss instead.

I see, thank you!
I actually have 2 real classes (0 and 1), while the third class (2) just marks the instances that are padding sentences since I am doing document-level classification of individual sentences. In this case, is this a binary or multi-class problem?

I wanted to ignore the dummy instances by zeroing out their loss, but I can see this isn’t the best approach.

You could try to keep your current workflow using these (invalid) padding target indices, create the unreduced loss, filter out the padding losses, and reduce it afterwards.
Using nn.CrossEntropyLoss might be easier, as it provides an argument to ignore specific class indices.
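
A small sketch of the ignore_index approach, reusing the names from your example (the model output would need the shape [num_sentences, 2] for the two real classes):

criterion = nn.CrossEntropyLoss(ignore_index=2)  # padding class 2 is skipped
logits = model(text)  # [num_sentences, 2]
loss = criterion(logits, tags)  # averaged over the non-padding entries only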

Hello @ptrblck!
Thank you for your explanation on the same.
My question is: since nn.BCEWithLogitsLoss already applies softmax during its calculation, why have you chosen to use sigmoid for the probability calculation? Could you please provide more information on this?

nn.BCEWithLogitsLoss applies sigmoid (or log_sigmoid) internally, not softmax, as the latter would return all ones for a binary classification output of shape [batch_size, 1].
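
You can verify this degenerate softmax behavior directly:

logits = torch.randn(4, 1)
print(torch.softmax(logits, dim=1))  # all ones, since each row sums to 1 over a single entry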

My bad, thank you for the clarification. Just to clarify: the output I received for binary classification using nn.BCEWithLogitsLoss was tensor([-1.7795, -1.5024, -1.3843], device='cuda:0', grad_fn=<Unique2Backward>).
Given that the sigmoid function returns values between 0 and 1, could you please clarify in which case this could happen?

I guess you’ve printed the model output and thus the input to nn.BCEWithLogitsLoss?
If so, then you have printed the logits, which are not bounded to a specific range and can contain any value in [-Inf, Inf]. nn.BCEWithLogitsLoss applies the activation function internally (so not visible to you) and will return the loss value. If you want to get the probabilities, use torch.sigmoid(model_output), but don’t pass these values to the criterion.
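
Note that thresholding these probabilities at 0.5 is equivalent to thresholding the raw logits at 0.0, since sigmoid(0) = 0.5:

logits = torch.randn(6, 1)
probs = torch.sigmoid(logits)
print(torch.equal(logits > 0.0, probs > 0.5))
> True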


You will get the loss between your predicted values and the label values.

BCEWithLogitsLoss combines a sigmoid layer and BCELoss in a single class (reference): it applies the sigmoid and then calculates the loss using BCELoss.

I’m trying to run a multi-label classifier and I have used nn.BCEWithLogitsLoss as my model’s loss. But when I want to use accuracy_score(output_labels, input_labels) I get this error:
ValueError: Classification metrics can’t handle a mix of binary and multilabel-indicator targets.
What should I do?

I don’t know what the inputs to this method look like in your case, but this code snippet works for a multilabel classification:

from sklearn.metrics import accuracy_score

# multilabel-indicator format: one row per sample, one column per label
output = torch.tensor([[0., 1., 1., 0.],
                       [0., 1., 0., 1.]])
target = torch.tensor([[0., 1., 0., 1.],
                       [0., 1., 0., 1.]])
accuracy_score(target, output)  # sklearn expects (y_true, y_pred)

and computes the accuracy as described in the docs:

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
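
A single mismatching label therefore makes the whole sample count as wrong. With the tensors above, the first sample differs from its target in two labels while the second matches exactly, so the subset accuracy is 0.5; partially correct samples get no credit:

print(accuracy_score(target, output))
> 0.5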

I am implementing a UNET with a binary mask (0 = black background, 1 = white mask). I realized the black pixels outnumber the white pixels in my training set, so I am planning to update my custom loss function.

class CustomBCELoss:

    def __init__(self):
        self.bce = nn.BCELoss()

    def __call__(self, yhat, ys):
        yhat = torch.sigmoid(yhat)  # BCELoss expects probabilities
        valid = (ys == 1) | (ys == 0)  # keep only pixels with valid 0/1 targets

        if bool(valid.any()):
            return self.bce(yhat[valid], ys[valid])
        else:
            return None

This is the new loss function, where I am using nn.BCEWithLogitsLoss and the pos_weight parameter to balance the data.

class CustomBCELossLogits:
    def __init__(self):
        self.bceLogit = nn.BCEWithLogitsLoss()

    def __call__(self, yhat, ys):
        valid = (ys == 1) | (ys == 0)
        weight = torch.tensor([3, 1])
        weights = weight.to('cuda')
        if bool(valid.any()):
            return self.bceLogit(yhat[valid], ys[valid], pos_weight=weights)
        else:
            return None

  1. Is this approach correct? @ptrblck Please suggest.
  2. I am getting an error with the pos_weight parameter. I have picked the weights as [3, 1], meaning the white pixels should be weighted 3 times higher.

I don’t know the class frequencies, but refer to the docs to set the pos_weight:

For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3. The loss would act as if the dataset contains 3×100 = 300 positive examples.

The error is raised since pos_weight should be passed as an argument to the class initialization, not the forward method. If you want to use it in a functional way you could use F.binary_cross_entropy_with_logits instead.
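
A minimal sketch of both options (the 3x weighting is taken from your example; the shapes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

pos_weight = torch.tensor([3.0])  # weight the positive (white) pixels 3x

# option 1: pass pos_weight to the class initialization
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
yhat = torch.randn(4, 1, 8, 8)  # e.g. UNET logits
ys = torch.randint(0, 2, (4, 1, 8, 8)).float()
loss = criterion(yhat, ys)

# option 2: functional API, if the weight has to be passed per call
loss_f = F.binary_cross_entropy_with_logits(yhat, ys, pos_weight=pos_weight)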

Thank you for your input.

@ptrblck Hi, could you please tell me why the threshold was set to 0.0 instead of 0.5, if we go by the range of the sigmoid function? Please tell me what I am missing. If 0.0 should be the threshold when applying BCEWithLogitsLoss, should I keep the same threshold in both the training and testing parts of my model?