Predictions stuck at zero when positive label (1) is only 16% of data

So, when I run the same code with a 50/50 split of the 0 and 1 labels, I get about 70% accuracy on the validation set and my validation predictions are not stuck at 0.

However, when I run the code on a dataset with an 84/16% split of labels 0 and 1, all my validation predictions end up being 0. I have tried both cross-entropy loss and BCEWithLogitsLoss with a weight vector (though I am not sure I set the weight vector correctly). Also, is weight = torch.tensor([0.84, 0.16]) correct?

How can I fix this problem?


        loss_type = 'BCEWithLogitsLoss'
        if loss_type == 'BCEWithLogitsLoss':
            self.criterion = nn.BCEWithLogitsLoss(reduction='none') # weighted loss for imbalanced dataset
            #self.criterion = nn.BCEWithLogitsLoss() 
        elif loss_type == 'CrossEntropyLoss':
            self.criterion = nn.CrossEntropyLoss() # this should work for binary classification

        if loss_type == 'BCEWithLogitsLoss':
            labels = torch.as_tensor(labels, dtype=torch.float32) # we need float labels for BCEWithLogitsLoss
            weight = torch.tensor([0.84, 0.16]) # how to decide on these weights?
            #weight = torch.tensor([0.5, 0.5])
            weight_ = weight[labels.data.view(-1).long()].view_as(labels)
            m = nn.Sigmoid()
            with torch.cuda.amp.autocast():
                loss = self.criterion(m(out[:,1]-out[:,0]), labels.cuda())
                loss_class_weighted = loss * weight_.cuda()
                loss_class_weighted = loss_class_weighted.mean()
                loss = loss_class_weighted
        elif loss_type == 'CrossEntropyLoss':
            labels = torch.as_tensor(labels)
            with torch.cuda.amp.autocast():
                loss = self.criterion(out, labels.cuda())
        pred_labels = out.data.max(1)[1]
        #pred_labels = out.argmax(dim=1)
        labels = labels.int()
        return pred_labels, labels, loss

Hi Mona!

Let me outline how to use BCEWithLogitsLoss and CrossEntropyLoss
with class weights without commenting directly on your code.

When performing binary classification, I prefer using BCEWithLogitsLoss.
Doing so more nearly “says what you mean” than CrossEntropyLoss
does, and is marginally more efficient.

It is perfectly reasonable, however, to treat binary classification as the
two-class case of multi-class classification, and use CrossEntropyLoss.

Let’s assume that the input to your model is a batch of nBatch samples.
In typical usage with BCEWithLogitsLoss the final layer of your network
will be a Linear with out_features = 1 and the output of your model
will be a batch of nBatch logit predictions with shape [nBatch, 1].

If you were to convert your logit to a probability (You don’t – this is done
internally, in effect, in BCEWithLogitsLoss.), you would get the predicted
probability of your sample being in “class-1” (the “positive” class). Your
target would be the known probability of your sample being in “class-1”
and can be exactly 0.0 or 1.0, in which case it is easy to think of this
probability as being a 0 / 1 label where 0.0 means “class-0” and 1.0
means “class-1.”

If your data is unbalanced, such as in your case where 16% of your training
samples are in “class-1,” you can use BCEWithLogitsLoss’s pos_weight
constructor argument to weight the “class-1” samples more heavily in the
calculated loss. You would typically use a weight of class-0-% / class-1-%;
thus in your case you might use pos_weight = torch.tensor ([5.25]).
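
Here is a minimal sketch of that setup (the model and the data are made
up just to show the shapes; only the pos_weight value comes from your
84 / 16 split):

    import torch
    import torch.nn as nn

    nBatch, nFeatures = 8, 10
    model = nn.Linear (nFeatures, 1)   # stand-in for a network whose final layer has out_features = 1

    # pos_weight = class-0-% / class-1-% = 0.84 / 0.16 = 5.25
    criterion = nn.BCEWithLogitsLoss (pos_weight = torch.tensor ([5.25]))

    x = torch.randn (nBatch, nFeatures)
    target = torch.randint (2, (nBatch, 1)).float()   # 0.0 / 1.0 labels, shape [nBatch, 1]

    logits = model (x)                 # raw logits, shape [nBatch, 1]
    loss = criterion (logits, target)  # sigmoid and pos_weight are applied internally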

If, instead, you choose to treat this as the two-class case of a multi-class
problem and use CrossEntropyLoss, your final layer should be a Linear
with out_features = 2, that is, separate output values for “class-0” and
“class-1”, and the output of your model would be a batch of class
predictions with shape [nBatch, 2]. (These are again logits that are, in
effect, internally converted to probabilities in CrossEntropyLoss.) Your
target will be (a batch of) integer class labels that take on the values 0
(for “class-0”) and 1 (for “class-1”), and will have shape [nBatch].

To weight your (two) classes in the loss calculation, you would use
CrossEntropyLoss’s weight constructor argument. Now, instead
of a single weight (such as BCEWithLogitsLoss’s pos_weight) for
“class-1,” you will have a weight for each of your (two) classes. You
would typically weight each class proportionally to the reciprocal of
the frequency with which it appears in your training data. So in your
case, you could use weight = torch.tensor ([1.0, 5.25]).
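
And a matching sketch for the CrossEntropyLoss version (again, the model
and the data are placeholders):

    import torch
    import torch.nn as nn

    nBatch, nFeatures = 8, 10
    model = nn.Linear (nFeatures, 2)   # separate outputs for “class-0” and “class-1”

    # per-class weights, proportional to the reciprocal class frequencies
    criterion = nn.CrossEntropyLoss (weight = torch.tensor ([1.0, 5.25]))

    x = torch.randn (nBatch, nFeatures)
    target = torch.randint (2, (nBatch,))   # integer class labels, shape [nBatch]

    logits = model (x)                 # shape [nBatch, 2]
    loss = criterion (logits, target)  # softmax / log applied internally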

Note, another approach to compensating for unbalanced training data
is to sample the underrepresented class more heavily. In your case
you could build your training batches by sampling randomly from your
training data, but sample any specific “class-1” sample 5.25 times as
often as any specific “class-0” sample. Now a given batch will contain,
on average, an equal number of “class-1” samples and “class-0” samples.
(You can use this technique with both the BCEWithLogitsLoss and the
CrossEntropyLoss approach, and you would no longer use class weights
in the loss calculation.)
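
One way to do this with a DataLoader is WeightedRandomSampler. A sketch,
where train_inputs and train_labels are hypothetical stand-ins for your
training set:

    import torch
    from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

    # hypothetical training set with roughly an 84 / 16 class split
    train_inputs = torch.randn (1000, 10)
    train_labels = (torch.rand (1000) < 0.16).long()

    # draw each “class-1” sample 0.84 / 0.16 = 5.25 times as often as each “class-0” sample
    class_weights = torch.tensor ([1.0, 5.25])
    sample_weights = class_weights[train_labels]

    sampler = WeightedRandomSampler (sample_weights, num_samples = len (sample_weights), replacement = True)
    loader = DataLoader (TensorDataset (train_inputs, train_labels), batch_size = 32, sampler = sampler)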

Best.

K. Frank

Dear K. Frank, thanks a lot for your explanation. However, I am not sure how you arrived at weight = torch.tensor ([1.0, 5.25]) if my class 0 is 84% of the data and class 1 is 16% of the data. What is the formula you used?

Thanks a bunch,
Mona

Hi Mona!

Typically, we reweight classes with the reciprocal of how often they appear.
However, CrossEntropyLoss doesn’t care about the overall scale of the
weights, as it computes the weighted average of the individual sample losses.

>>> class_frequencies = torch.tensor ([0.84, 0.16])
>>> class_weightsA = 1 / class_frequencies
>>> class_weightsA
tensor([1.1905, 6.2500])
>>> class_weightsB = class_frequencies[0] * class_weightsA
>>> class_weightsB
tensor([1.0000, 5.2500])

Here, class_weightsA and class_weightsB differ only in their overall
scale – the relative weights of “class-0” and “class-1” are the same, so
they’ll give the same result with CrossEntropyLoss.
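
Continuing the session, a quick check with some random logits and integer
labels (made up just for illustration) shows that the two give the same
loss value:

>>> import torch.nn as nn
>>> logits = torch.randn (8, 2)
>>> labels = torch.randint (2, (8,))
>>> lossA = nn.CrossEntropyLoss (weight = class_weightsA)(logits, labels)
>>> lossB = nn.CrossEntropyLoss (weight = class_weightsB)(logits, labels)
>>> torch.allclose (lossA, lossB)
True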

I chose to use class_weightsB for my example to make numerically
obvious the relationship to pos_weight in the BCEWithLogitsLoss
version.

Best.

K. Frank

Thanks for the response. Just to be clear, does this also apply to BCEWithLogitsLoss, since that is what I intend to use?

Should I use [1.0, 5.25] instead of [0.84, 0.16] for BCEWithLogitsLoss as well, or is that correct only for CrossEntropyLoss?

^ Asking this because you mentioned:

Typically, we reweight classes with the reciprocal of how often they appear.
However, CrossEntropyLoss doesn’t care about the overall scale of the
weights, as it computes the weighted average of the individual sample losses.

        if loss_type == 'BCEWithLogitsLoss':
            labels = torch.as_tensor(labels, dtype=torch.float32) # we need float labels for BCEWithLogitsLoss
            weight = torch.tensor([0.84, 0.16]) # how to decide on these weights?
            #weight = torch.tensor([0.5, 0.5])
            weight_ = weight[labels.data.view(-1).long()].view_as(labels)
            m = nn.Sigmoid()
            with torch.cuda.amp.autocast():
                loss = self.criterion(m(out[:,1]-out[:,0]), labels.cuda())
                loss_class_weighted = loss * weight_.cuda()
                loss_class_weighted = loss_class_weighted.mean()
                loss = loss_class_weighted

Hi Mona!

No, you should use neither version with BCEWithLogitsLoss. Instead,
use a single pos_weight constructor argument:

self.criterion = nn.BCEWithLogitsLoss (pos_weight = torch.tensor ([5.25]))

(This assumes that you are performing ordinary binary classification,
rather than multi-label, multi-class classification.)

Some brief further comments:

By using pos_weight, you will be telling BCEWithLogitsLoss to perform
the weighting for you. So you can also let it perform the 'mean' reduction
for you. That is, you don’t need to perform the weighted average by hand.

Just use

        if loss_type == 'BCEWithLogitsLoss':
            self.criterion = nn.BCEWithLogitsLoss (pos_weight = torch.tensor ([5.25]))
            # or equivalently
            self.criterion = nn.BCEWithLogitsLoss (pos_weight = torch.tensor ([5.25]), reduction = 'mean')

and then leave out the part where you do the weighting by hand.

Lastly, you do not want to apply Sigmoid to the output of your model. This
is already done for you, in effect, inside of BCEWithLogitsLoss.
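
Putting this together, the BCEWithLogitsLoss branch of your snippet could
shrink to something like this (a sketch, keeping your out[:, 1] - out[:, 0]
logit; note that pos_weight, or the criterion itself, needs to be on the
same device as out):

        if loss_type == 'BCEWithLogitsLoss':
            labels = torch.as_tensor(labels, dtype=torch.float32)
            with torch.cuda.amp.autocast():
                # pass the raw logit straight to the criterion; no Sigmoid and no manual weighting
                loss = self.criterion(out[:,1]-out[:,0], labels.cuda())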

Best.

K. Frank
