Weighted BCE loss with logits

aleemsidra · April 26, 2022, 2:58pm

I am dealing with an imbalanced dataset. I want to use weighted BCE loss with logits. nn.BCEWithLogitsLoss takes a pos_weight argument.
From the docs:
pos_weight (Tensor, optional) – a weight of positive examples. Must be a vector with length equal to the number of classes.

In my case I have two classes. However, when I pass a vector with a length equal to 2, the following error comes:
RuntimeError: The size of tensor a (2) must match the size of tensor b (16) at non-singleton dimension 0

Where 16 is the batch size and pos_weight vector = [1.1734028683181226, 6.7669172932330826]

Andrei_Cristea · April 26, 2022, 4:18pm

Could you share a minimal reproducible snippet that raises this error?

In particular, make sure to include a print out of the shape of your model output. My best guess is that you’re accidentally transposing the axes such that length 16 is showing up in a non-batch position.

aleemsidra · April 26, 2022, 5:07pm

make_weights_for_balanced_classes is used to generate weights. get_loss is being used to calculate the loss.

def make_weights_for_balanced_classes(labels, nclasses=2):
    count = [0] * nclasses
    for item in labels:
        count[item] += 1

    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N / float(count[i])
    return weight_per_class

def get_loss(self, pred, label):
    print("pred shape:", pred.shape)
    print("labels", label.shape)
    print("weights", self.weight)
    criterion = nn.BCEWithLogitsLoss(pos_weight=self.weight)
    if torch.cuda.is_available():
        criterion.cuda()
    return criterion(pred, label)

model output: torch.Size([16, 2])
weight vector: torch.Size([2])
pred shape: torch.Size([16])
labels torch.Size([16])
weights tensor([1.1734, 6.7669], dtype=torch.float64)
It results in error:
RuntimeError: The size of tensor a (2) must match the size of tensor b (16) at non-singleton dimension 0

Andrei_Cristea · April 26, 2022, 5:50pm

The issue is that the way you’re passing the model output and labels is not how BCEWithLogitsLoss expects them. It expects the following:

model output (before sigmoid), in your case this should have shape (16, 2)
one-hot encoded target, which should also have shape (16, 2)

So in your case it thinks you’re running it on a single sample with 16 classes, rather than a batch of 16 samples with two classes each, which is probably what you intended.

The below code runs fine:

import torch
from torch import nn

pred = torch.randn(16, 2)
label = torch.empty(16, 2).random_(2)
weight = torch.tensor([1.1734, 6.7669])

criterion = nn.BCEWithLogitsLoss(pos_weight=weight)

print("pred shape:", pred.shape)
print("labels", label.shape)
print("weights", weight)

print(criterion(pred, label))

Output:
pred shape: torch.Size([16, 2])
labels torch.Size([16, 2])
weights tensor([1.1734, 6.7669])
tensor(2.4276)

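So in your case you’ll need to one-hot encode your labels and pass the model output directly. Note that one_hot expects integer (int64) labels and returns an integer tensor, while BCEWithLogitsLoss needs a float target, so something like this:

criterion(model_output, nn.functional.one_hot(label, num_classes=2).float())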

Hope this helps!

KFrank · April 26, 2022, 9:40pm

Hi Sidra!

Is it possible that you are conflating two differing usages of the notion of
“two classes?”

In one usage – the situation I think you are dealing with – “two classes”
would typically mean the two classes in a binary-classification problem.
That is the “positive class” (“yes class,” foreground, etc.) and the “negative
class” (“no class,” background, etc.).

In such a case, the input you pass to BCEWithLogitsLoss (the output of
your model) would typically have shape [nBatch], as would the target.
The pos_weight argument passed into BCEWithLogitsLoss’s constructor
would have shape [1].

In the context of BCEWithLogitsLoss, the other usage of “number of
classes” applies to the so-called “multi-label, multi-class” case where
you have multiple classes (it could be two, but it could be more), each
of which can be “active” or “inactive” for any given sample. That is, each
sample is labelled with none, some, or all of these classes (hence
“multi-label”). This is the usage of “number of classes” in your quoted
section of the documentation.

In this situation, your input would have shape [nBatch, nClass], as
would your target. pos_weight would now have shape [nClass].

So, in the “plain-vanilla” version of binary classification, you have “two
classes” (“positive” and “negative”), but for the purposes of the shape
of pos_weight, you have “number of classes” = 1. It’s just a confusing
choice of terminology.
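
A minimal sketch of the two shape conventions described above, using random tensors. The batch size of 16 and the weight values come from this thread; the variable names and random data are only illustrative. The last part reproduces the original error, where a length-2 pos_weight meets an input of shape [nBatch]:

```python
import torch
from torch import nn

torch.manual_seed(0)
n_batch = 16

# Plain binary classification: one logit per sample, pos_weight of shape [1].
logits = torch.randn(n_batch)
target = torch.randint(0, 2, (n_batch,)).float()
binary_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([6.7669]))(logits, target)

# Multi-label case: one logit per (sample, class), pos_weight of shape [nClass].
ml_logits = torch.randn(n_batch, 2)
ml_target = torch.randint(0, 2, (n_batch, 2)).float()
ml_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.1734, 6.7669]))(ml_logits, ml_target)

# Mixing the two: a length-2 pos_weight cannot broadcast against shape [16].
mismatch_error = None
try:
    nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.1734, 6.7669]))(logits, target)
except (RuntimeError, ValueError) as err:
    mismatch_error = err

print("binary loss:", binary_loss.item())
print("multi-label loss:", ml_loss.item())
print("mismatch:", mismatch_error)
```

The first two calls return a scalar loss; the third fails because pos_weight is broadcast against the input along its last dimension.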

But what would it mean to have “two classes” in the multi-label, multi-class
setting? Consider this example:

Class A is “color” (vs. black and white) and class B is “animal” (vs. anything
else, such as a rock or a tree). None, some (i.e., one), or all (i.e., both) of
these labels can be applied to an image you are classifying.

Thus a color picture of a rock would be “class A – yes” and “class B – no,”
that is, it is a color picture, but it is not an animal. But you could also
have an image labelled with both classes, that is, “class A – yes” and
“class B – yes”, or labelled with neither class.

If you have further questions, please let us know whether you are dealing
with “plain-vanilla” binary classification or whether you are working
with a multi-label, multi-class problem.

Best.

K. Frank

aleemsidra · April 27, 2022, 10:14am

@Andrei_Cristea , I am calling the get_loss function as:

 losses = self.get_loss(infer[:,0],labels1.float()) + 0.5 * self.get_loss(infer[:, 1],labels2.float())

where infer is the model prediction. So pred.shape is torch.Size([16]), and that is what gets passed to the get_loss function. I mistakenly reported the wrong shape originally: torch.Size([16, 2]) is the raw model prediction, but the loss receives torch.Size([16]). Can you please help with this?

aleemsidra · April 27, 2022, 10:43am

@Andrei_Cristea , I am dealing with plain vanilla binary classification.

In such a case, the input you pass to BCEWithLogitsLoss (the output of
your model) would typically have shape [nBatch], as would the target.

The model output has shape torch.Size([16, 2]). I am calling get_loss() as follows:

losses = self.get_loss(infer[:,0],labels1.float()) + 0.5 * self.get_loss(infer[:, 1],labels2.float())

So, the pred.shape becomes equal to [nBatch] which is 16 in my case.

The pos_weight argument passed into BCEWithLogitsLoss’s constructor
would have shape [1].

How should the pos_weight shape be [1]? Does that mean we only pass the minority class weight and not the majority class weight?

Andrei_Cristea · April 27, 2022, 12:41pm

Per @KFrank’s insightful observation above, it sounds like what you really have is a single binary classification, whereas an output shaped [16,2] and a pos_weight shaped [2] is meant for a two-class binary classification. In writing this I’m realizing the terminology is confusing, maybe we need better terms but I think KFrank did a good job above explaining the distinction.

To be clear, it sounds like what you have is a single class, and for each image your labels say either “no” the image does not contain that class, or “yes” the image does contain that class.

In any case, in your scenario, I think you want the following:

Your model output ought to be shaped like [nBatch]. Its value should be quite negative if you want to predict “no” and quite positive if you want to predict “yes”. The reason is that this would map to 0 for “no” and 1 for “yes” once you apply sigmoid, which BCEWithLogitsLoss does. This is because sigmoid(-inf) ~= 0 and sigmoid(+inf) ~= 1.
Your label should be shaped like [nBatch] and equal to either 0 or 1, 0 for “no” and 1 for “yes” (unless you use label smoothing, in which case you can use something like 0.05 instead of 0, and 0.95 instead of 1.0)
Your pos_weight should be shaped like [1] since you only have one class. The higher the pos_weight, the bigger the weight you’ll assign, inside your loss function, to how well you did on the samples whose label is 1 (meaning “yes”). Per the docs, the purpose of this is to trade off precision and recall (depending on your specific problem, you may care more about avoiding false positives or false negatives, and this argument helps you bake that preference into your fitting).

To illustrate, look at this small example:

model_output = torch.Tensor([+5., -5., -5., -5.])  # you predict "yes, no, no, no"
targets = torch.Tensor([1., 1., 0., 0.])    # actual labels are "yes, yes, no, no"
pos_weight = torch.Tensor([2.]).float()

criterion_simple = nn.BCEWithLogitsLoss()
criterion_posweight = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

print(criterion_simple(model_output, targets))
print(criterion_posweight(model_output, targets))

Output:
tensor(1.2567)
tensor(2.5101)

When applying pos_weight we get a bigger loss, because the place where the error occurs is where the true label was “yes”.
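
To make the weighting explicit, here is a small sketch that recomputes the same numbers by hand. It assumes the per-element formula given in the BCEWithLogitsLoss docs, namely pos_weight * y * (-log(sigmoid(x))) + (1 - y) * (-log(1 - sigmoid(x))), and uses softplus as a numerically stable way to write those two log terms. The name manual is just for illustration:

```python
import torch
from torch import nn
import torch.nn.functional as F

logits = torch.tensor([5.0, -5.0, -5.0, -5.0])
targets = torch.tensor([1.0, 1.0, 0.0, 0.0])
pos_weight = torch.tensor([2.0])

# -log(sigmoid(x)) == softplus(-x) and -log(1 - sigmoid(x)) == softplus(x)
manual = pos_weight * targets * F.softplus(-logits) + (1 - targets) * F.softplus(logits)
builtin = nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction="none")(logits, targets)

print(manual)
print(builtin)
print(manual.mean())  # the default reduction is the mean over all elements
```

Only the terms where the target is 1 get multiplied by pos_weight; the terms for negative targets are untouched.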

aleemsidra · April 28, 2022, 12:56pm

@Andrei_Cristea thank you for your detailed answer.

I am still confused about what pos_weight is actually for. Does it represent the ‘weight of the minority class’ or the ‘weight of the positive class only’ (irrespective of whether it’s a majority or minority class)? From your explanation, I understood that pos_weight is for the positive class only (irrespective of whether it’s the majority or minority class). Is my understanding correct? In my case, the positive class (1) is a minority class and I want to assign more weight to it. Is pos_weight suitable for this scenario? Or should I use the weight argument instead?

weight (Tensor, optional) – a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size nbatch.

Below is the code and respective shapes of tensor, can you please look at it and advise.

def get_loss(self, pred, label):
    weights = self.weight[1]
    weights = weights.cuda().half()  # using amp, so using half precision
    print("weights", weights, weights.shape)
    print("pred", pred.shape)
    print("label", label.shape)
    criterion = nn.BCEWithLogitsLoss(pos_weight=weights)
    if torch.cuda.is_available():
        criterion.cuda()

    return criterion(pred, label)

weights tensor(6.7656, device='cuda:0', dtype=torch.float16)
pred torch.Size([16])
label torch.Size([16])

Andrei_Cristea · April 28, 2022, 5:05pm

Yes, from what you’ve described, pos_weight does seem suitable for your scenario. Below I’ll illustrate that it does indeed assign more weight to the positive class, inside the loss calculation.

Before I do that, I just want to mention that you can also achieve a similar effect by using the vanilla loss function (without pos_weight) and sampling more frequently from your desired label with a WeightedRandomSampler. Personally I prefer this approach if your objective is simply to deal with a label imbalance within the class (you could also use it to deal with class imbalance, though in your case that’s not required since you have a single class). I think the two approaches might be equivalent under some particular parameters and assumptions (someone can correct me if that’s wrong), but intuitively, if your positives are very much in the minority and you try to address that via pos_weight, I feel like your training will be very jumpy: most of your training batches won’t have any positive labels, but when one does, you’ll make a very large step in that direction. This feels less robust to me (and a bit less intuitive to think about) than just balancing out your sampling by drawing from the positives more frequently.
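
A rough sketch of the sampling approach, assuming a hypothetical dataset of 900 negatives and 100 positives with random features (none of this data comes from the thread). Each sample is weighted by the inverse frequency of its label, so positives and negatives are drawn about equally often:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced dataset: 900 negatives, 100 positives.
labels = torch.cat([torch.zeros(900), torch.ones(100)]).long()
features = torch.randn(len(labels), 4)

# Weight each sample by the inverse frequency of its label.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=16, sampler=sampler)

# Fraction of positives actually seen over one pass through the loader.
positive_fraction = torch.cat([y for _, y in loader]).float().mean().item()
print(positive_fraction)  # close to 0.5 rather than the dataset's 0.1
```

Because sampling is done with replacement, some positives are seen several times per epoch and some negatives not at all.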

Having said that, here’s an illustration that pos_weight does indeed represent a weight multiplier on the positive label.

label      = torch.Tensor([  0.,  +1.,   0.,  +1.])
prediction = torch.Tensor([-10., -10., +10., +10.])
#     the labels are      [    neg,   pos,   neg,     pos]
# model's predictions are [correct, wrong, wrong, correct]

crit_basic     = torch.nn.BCEWithLogitsLoss(reduction='none')
crit_posweight = torch.nn.BCEWithLogitsLoss(reduction='none', pos_weight=torch.Tensor([2.]))

print("losses by element:")
print("     basic", crit_basic(prediction, label))
print("pos_weight", crit_posweight(prediction, label))

Output:
losses by element:
     basic tensor([4.5418e-05, 1.0000e+01, 1.0000e+01, 4.5418e-05])
pos_weight tensor([4.5776e-05, 2.0000e+01, 1.0000e+01, 9.0835e-05])

In this setup, your model happens to be right half the time and wrong half the time. As a reminder, a model prediction of -10 means it expects a negative label, and a prediction of +10 means it expects a positive label (because sigmoid(-10) ~= 0 and sigmoid(+10) ~= +1).

You can see that your vanilla loss (excluding pos_weight) is approximately 0 when the model is right and approximately 10 when the model is wrong. It doesn’t care whether the model was wrong about the positive or the negative label; they are weighted the same.

The loss that uses pos_weight is still approximately 10 when the model is wrong about a negative label (the third element in the tensor); however, it has doubled where the model is wrong about a positive label (the second element in the tensor), going from 10 to 20. The doubling corresponds to the pos_weight passed, which is 2. Indeed, your loss function now assigns more weight to a mistake made on a positive label, but otherwise behaves the same as before.
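
On picking the value itself: as far as I recall, the BCEWithLogitsLoss docs suggest the ratio of negative to positive examples (their example is 100 positives and 300 negatives giving pos_weight = 3). That is not quite what make_weights_for_balanced_classes returns for class 1, which is N / count and therefore equals 1 + negatives / positives. A small sketch with hypothetical counts (767 negatives, 133 positives), picked only because they reproduce the weights posted above; the real dataset is not shown in this thread:

```python
from collections import Counter

# Hypothetical label list; these counts are an assumption chosen to
# reproduce the weights posted earlier in the thread.
labels = [0] * 767 + [1] * 133

counts = Counter(labels)
n_total = sum(counts.values())

# What make_weights_for_balanced_classes computes: N / count per class.
weight_per_class = [n_total / counts[c] for c in (0, 1)]

# Ratio of negatives to positives, as the docs example suggests.
pos_weight_value = counts[0] / counts[1]

print(weight_per_class)
print(pos_weight_value)
```

Either value up-weights the positives; the ratio version is the one that makes the total weight of positives and negatives equal. It would then be wrapped as torch.tensor([pos_weight_value]) before being passed to the loss.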

aleemsidra · May 3, 2022, 10:16am

@Andrei_Cristea , thank you for your detailed explanation. I was actually using weighted random sampler before, but because of the random sampling, the results come out to be quite different every time. I was looking for stability, so, I wanted to try out weighted BCE.

One last question: in my current case, the positive class is in the minority. But what if the negative class is in the minority? Can pos_weight be used for the negative class in that scenario too, or is it only for when the positive class is in the minority? The documentation is a bit confusing.

Andrei_Cristea · May 3, 2022, 12:34pm

Hello!

Regarding getting different results every time when using weighted random sampler, have you considered controlling your random seed? That would at least ensure that you can always reproduce a particular training experience.
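
A minimal sketch of what I mean, assuming made-up per-sample weights and a hypothetical helper called draw_indices. WeightedRandomSampler accepts a generator argument, so seeding that generator fixes the order in which samples are drawn:

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-sample weights: the last sample is drawn far more often.
sample_weights = [0.1] * 9 + [0.9]


def draw_indices(seed):
    """Return the indices one sampler pass yields for a given seed."""
    generator = torch.Generator().manual_seed(seed)
    sampler = WeightedRandomSampler(
        sample_weights, num_samples=20, replacement=True, generator=generator
    )
    return list(sampler)


same_seed_matches = draw_indices(0) == draw_indices(0)
print(same_seed_matches)
```

This only pins down the sampling order. Weight initialization, dropout and so on have their own randomness, which is usually controlled with torch.manual_seed at the start of the run.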

Regarding your question, I can think of two options:

Suppose you want to multiply the weight on your negative (minority) class by N. You can use a pos_weight of 1/N. Since the weights multiplicatively toggle the relative importance between the classes, using a weight between 0 and 1 will down-weight the positive class, and therefore implicitly up-weight the negative class, by whatever you’re looking for.
Simply swap the labels on your dataset at training time (make the 1s into 0s and vice-versa), use regular pos_weight (which will boost the weights of your “1” labels, which now correspond to your negative class) and then remember to add a post-processing step to your net output where you swap the labels back (since your net will now learn to predict the swapped labels!)
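
A quick sketch comparing the two options on random data (N = 4 and the tensors are arbitrary choices for illustration). Swapping the labels means the net learns the negated logit, so the second loss is evaluated on -logits and 1 - target:

```python
import torch
from torch import nn

torch.manual_seed(0)
n = 4.0  # desired boost for the negative class
logits = torch.randn(8)
target = torch.randint(0, 2, (8,)).float()

# Option 1: down-weight the positives with pos_weight = 1/N.
loss_down = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.0 / n]))(logits, target)

# Option 2: swap the labels and use pos_weight = N; a net trained on swapped
# labels would output the negated logit, so negate here to compare like with like.
loss_swap = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n]))(-logits, 1.0 - target)

print(loss_down * n, loss_swap)
```

The two losses differ only by the constant factor N, so they weight the classes identically relative to each other. Option 1 just produces a smaller overall loss (and smaller gradients), which may matter for your learning rate.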

Hope this helps!
Andrei

aleemsidra · May 4, 2022, 1:09pm

@Andrei_Cristea thank you so much for being patient and giving detailed answers.