Weights in BCEWithLogitsLoss

I have two unbalanced classes and I’d like to use this cost function, since it lets me weight the error according to class.

I’d like to know whether pos_weight relates only to class one or whether I can balance both classes. For example, say I want class zero weighted by 0.6 and class one weighted by 0.4. Or does this only work for the positive class?

pos_weight = torch.FloatTensor([.6, .4])


Yes, you are weighting the positive class using pos_weight.
However, as you are dealing with a binary use case, you can balance the recall against the precision.
pos_weight > 1 will increase the recall while pos_weight < 1 will increase the precision.

There is one point I don’t get from your answer. By applying pos_weight, can I reweight only the negative class, like this:
pos_weight = torch.FloatTensor([.6, 1])
or is it just for the positive class?

You provide just the weight for the positive class. The formula is in the docs:

l_n = -w_n * [p_n * log(sigmoid(x_n)) + (1 - t_n) * log(1 - sigmoid(x_n))]

However, this parameter can influence the recall (true positive rate / sensitivity) vs. precision (positive predictive value). You can read more about it in this Wikipedia article.

In case you have binary labels the negative classes will be influenced jointly. I’m not sure what would happen if you use soft labels.


As I see from the formula, p_n contributes nothing (is effectively zero) for negative examples? Then why do we have to provide a weight vector with length equal to the number of classes?

It’s not applied to the “negative” part of the formula, so it effectively has the value 1 there.

The formula you wrote is missing t_n, which multiplies p_n. With it, the formula for the negative class becomes:

So, as far as I can see, p_n makes no contribution for the negative class. Why, then, do we have to give p_n as a vector with length equal to the number of classes?

Yeah sorry for the typo. The t_n is missing indeed!
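
With the missing t_n restored, the per-element loss from the docs can be written out by hand. The following is a minimal sketch in plain Python, not PyTorch's actual implementation; the helper name bce_with_logits_elem is made up for illustration and w_n is assumed to be 1. It shows that p_n only scales the positive (t_n = 1) term.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits_elem(x, t, p=1.0, w=1.0):
    """Hand-written l_n = -w * [p * t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))]."""
    s = sigmoid(x)
    return -w * (p * t * math.log(s) + (1.0 - t) * math.log(1.0 - s))

x = 0.3  # an arbitrary logit, chosen only for illustration

# Negative target (t = 0): the p * t term vanishes, so p has no effect.
neg_p1 = bce_with_logits_elem(x, t=0.0, p=1.0)
neg_p3 = bce_with_logits_elem(x, t=0.0, p=3.0)

# Positive target (t = 1): the loss is scaled by p.
pos_p1 = bce_with_logits_elem(x, t=1.0, p=1.0)
pos_p3 = bce_with_logits_elem(x, t=1.0, p=3.0)

print(neg_p1 == neg_p3)
print(math.isclose(pos_p3, 3.0 * pos_p1))
```

In other words, a negative example is always weighted by 1, and pos_weight changes only how much a positive example counts relative to it.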

I think you are referring to the docs, which state:

Must be a vector with length equal to the number of classes.

Now I understand the issue and am confused as well, as I thought a scalar tensor would work, e.g. pos_weight = num_neg / num_pos. Thanks for digging deeper!

CC @velikodniy who implemented this. Am I misunderstanding the usage of pos_weight?

I’m testing the influence of the weights on model training and it doesn’t seem to be working properly. Any updates?

I’m sorry for the long delay. You aren’t misunderstanding.

If we’re using BCEWithLogitsLoss, it means we’re solving a multi-label classification task, working with a number of independent classes. For example, imagine you’re classifying photographs by the presence of a smile and of eyeglasses. These classes are independent, so you have to balance them independently too. That means putting two coefficients into pos_weight.

If we have only one class, pos_weight should be a vector with a single element.

We don’t add neg_weight because pos_weight alone is enough to make the CE formula asymmetric and balance the classes.
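
To make the smile/eyeglasses example concrete, here is a small sketch assuming a batch of 4 photos and two independent labels. The logits, targets, and weight values are invented for illustration; only torch.nn.BCEWithLogitsLoss and its pos_weight argument come from the thread.

```python
import torch
import torch.nn as nn

# Columns: [smile, eyeglasses]; all values are made up.
logits = torch.tensor([[ 1.2, -0.7],
                       [-0.3,  0.4],
                       [ 0.8, -1.5],
                       [-2.0,  0.1]])
targets = torch.tensor([[1., 0.],
                        [0., 0.],
                        [1., 1.],
                        [0., 0.]])

# One coefficient per class, here num_neg / num_pos for each column:
# smile has 2 negatives / 2 positives, eyeglasses has 3 negatives / 1 positive.
pos_weight = torch.tensor([1.0, 3.0])

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction="none")
loss = criterion(logits, targets)  # shape [4, 2]: one value per sample and class

# Unweighted reference, to see which entries were rescaled.
plain = nn.BCEWithLogitsLoss(reduction="none")(logits, targets)
ratio = loss / plain
print(ratio)
```

Only the entry with a positive eyeglasses target is scaled (by 3); every negative entry, and the whole smile column with weight 1, is unchanged. For a single class, a one-element tensor such as torch.tensor([3.0]) would be passed instead.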


My label and output both have shape [B, 3, 96, 128], where 3 is the number of classes, and I want to use pos_weight in BCEWithLogitsLoss, but I am not sure how to calculate its values. 1) It should have 3 elements, right? 2) If yes, what should they be?
positive_weights = torch.FloatTensor([?,?,?])
positive_weights=torch.reshape(positive_weights,(1, 3, 1, 1))
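
One possible way to fill in the three values, assuming the num_neg / num_pos heuristic mentioned earlier in the thread and counting pixels per channel over the training masks. The random masks tensor stands in for real labels and eps is an added guard against empty channels; neither comes from the original post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the training labels: [N, 3, 96, 128] binary masks.
masks = (torch.rand(8, 3, 96, 128) > 0.9).float()

# Count positive and negative pixels separately for each of the 3 channels.
num_pos = masks.sum(dim=(0, 2, 3))            # shape [3]
num_total = masks.numel() / masks.shape[1]    # pixels per channel
num_neg = num_total - num_pos

eps = 1e-5  # assumption: avoids division by zero if a channel has no positives
pos_weight = (num_neg / (num_pos + eps)).reshape(1, 3, 1, 1)

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 3, 96, 128)  # stand-in for the model output
loss = criterion(logits, masks)
print(pos_weight.flatten(), loss.item())
```

The reshape to (1, 3, 1, 1) is what lets the three weights broadcast along the channel dimension. A flat tensor of shape [3] would be matched against the last dimension (128) instead and would not broadcast.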

@ptrblck @velikodniy I think the docs are ambiguous about how negative examples are defined for calculating pos_weight. For a multi-hot encoding, is the number of negative examples a single int counting the all-zero labels, or a vector counting the zeros in each multi-hot label column?

In other words:

  import numpy as np
  import torch

  def pos_weights(class_counts, data):
    # Per-class negatives: samples in which that class's label column is 0.
    neg_counts = [len(data) - pos_count for pos_count in class_counts]  # <-- HERE
    weights = np.ones(len(class_counts), dtype=np.float64)
    for cdx, (pos_count, neg_count) in enumerate(zip(class_counts, neg_counts)):
      weights[cdx] = neg_count / (pos_count + 1e-5)

    return torch.as_tensor(weights, dtype=torch.float)

Or are “negative examples” the samples that have no class labels at all, i.e.:

  def negative_count(data):
    neg_count = 0
    for idx in range(len(data)):
      _, labels = data[idx]
      if sum(labels) == 0:
        neg_count += 1

    return neg_count

I think you should provide the pos_weight as a tensor containing the weights for each class.
E.g. in the doc example 64 classes are used and pos_weight is defined as:

pos_weight = torch.ones([64])  # All weights are equal to 1

Just by skimming through your code snippets, I think the first approach might be valid.
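
The first approach, counting negatives per label column, can be sketched in vectorized form as follows. The tiny labels matrix is invented for illustration, it is assumed that all multi-hot labels fit in one tensor, and the clamp is an added guard against classes with no positives.

```python
import torch

# Rows are samples, columns are classes (multi-hot); values are made up.
labels = torch.tensor([[1., 0., 1.],
                       [0., 0., 1.],
                       [0., 0., 0.],
                       [1., 0., 1.],
                       [0., 1., 0.]])

num_samples = labels.shape[0]
pos_counts = labels.sum(dim=0)          # ones per column:  [2., 1., 3.]
neg_counts = num_samples - pos_counts   # zeros per column: [3., 4., 2.]

# Assumption: clamp guards against division by zero for empty classes.
pos_weight = neg_counts / pos_counts.clamp(min=1.0)
print(pos_weight)

# For comparison, the second reading yields a single number:
# how many samples carry no label at all.
all_zero = int((labels.sum(dim=1) == 0).sum())
print(all_zero)
```

Only the per-column version produces one weight per class, which matches the shape pos_weight expects.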


Thanks for the quick response. I should’ve been clearer in my post: I wanted to know whether to use the neg_counts = ... line of the first snippet or the negative_count function in the second snippet.

I am facing the same issue in a multi-label, multi-class classification task. I have a dataset of 33000 samples, each labeled over 104 classes. It splits into 16500 samples with labels such as [1, 0, 1, 0, 0, …], [0, 1, 1, 0, 1, …], [1, 0, 0, 0] (each label has at least one 1 in it) and 16500 samples with labels such as [0, 0, 0, …], [0, 0, 0, …] (all elements are 0). When calculating pos_count for each class, pos_count_0 for class 0 is the number of 1s in the first position across all labels in my dataset, pos_count_1 for class 1 is the number of 1s in the second position, and so on. Is the pos_weight of class 0 then (33000-pos_count_0)/pos_count_0, the pos_weight of class 1 (33000-pos_count_1)/pos_count_1, and so on? I am a little confused about how neg_count and pos_count are calculated for a class.

I guess what’s missing from the document is that the last dim is the dim of M classes? The other dimensions of the score tensor are H, W, D… (can I call them geometric dimensions?) But in the case of binary classification, the last dim is also a geometric dimension. That’s why the use of BCEWithLogitsLoss is confusing.

Dear ptrblck,
Should pos_weight be computed for every batch, or should it use the num_neg / num_pos statistic over the whole dataset? Would you give any recommendation?

The pos_weight is usually computed for the complete training dataset and passed during the instantiation of the criterion.

Unfortunately, all answers are ambiguous and useless.

Hello, sorry to kickstart this thread again, but I tried using the pos_weight as described here and it doesn’t seem to be working for me. I triple checked everything and I can’t seem to find the issue. Can you confirm if my understanding is correct?

I have a binary classification problem and am using my own implementation of a U-Net (fancy CNN with a decoder rather than a fully connected layer [1505.04597] U-Net: Convolutional Networks for Biomedical Image Segmentation).

I decided to have the output prediction be of size [B, 1, 192, 192] (B = batch size) and just interpret the result as > 0.5 = class 1, else class 0 where class 1 is the class I want to predict. My ground truth data also exists as [B, 1, 192, 192] when I feed it (and the prediction) into the BCEWithLogitsLoss function, then do loss.backward() and optimizer.step(). I am also zeroing the gradient before each batch is received.


  1. Is class 1 considered the “positive class”?
  2. If yes, how does the system know that class “0” is the negative class (which is only something I interpret)? I would have thought I needed 2 channels for this rather than 1
  3. As I am working with single values at each x, y index of the 192x192 prediction, does the pos_weight parameter even work?

Can someone tell me if my understanding (based on what I said above) is flawed?

Thank you in advance.
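
A small sketch of the single-channel setup described above, under the assumption that target value 1 marks the class of interest. The shapes are shrunk, the data is random, and the weight 5.0 is arbitrary. It illustrates that one output channel is enough, since each pixel's target is either 1 (positive) or 0 (negative), and that a one-element pos_weight is applied to every positive pixel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(2, 1, 8, 8)                  # raw model output, no sigmoid
target = (torch.rand(2, 1, 8, 8) > 0.8).float()   # 1 = positive, 0 = negative

weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([5.0]), reduction="none")
plain = nn.BCEWithLogitsLoss(reduction="none")

ratio = weighted(logits, target) / plain(logits, target)

# Positive pixels are scaled by 5, negative pixels are left untouched.
print(torch.allclose(ratio[target == 1], torch.tensor(5.0)))
print(torch.allclose(ratio[target == 0], torch.tensor(1.0)))

# The criterion expects logits, so a 0.5 threshold belongs on the
# probabilities (mathematically, a 0 threshold on the logits).
pred = (torch.sigmoid(logits) > 0.5).float()
```

If the 0.5 threshold is being applied to the raw network output rather than to sigmoid(output), predictions will be biased toward class 0, which can look as if pos_weight has no effect.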
