# BCEWithLogitsLoss pos_weight -- how to do re-weighting at inference

I am training a multilabel classifier on an imbalanced dataset, where I am using `pos_weight` in the loss.

Dataset looks like:

| label | cat1 | cat2 | cat3 |
|-------|------|------|------|
| a     | 100  | 100  | 100  |
| b     | 100  | 200  | 400  |
| c     | 100  | 400  | 800  |
| d     | 100  | 800  | 1600 |

And my corresponding loss:

```python
# One-hot encoded trinary labels, hence 12-dim output for 4 labels
nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1, 1, 1,
                                              4, 2, 1,
                                              8, 2, 1,
                                              16, 2, 1]))
```

My question is, what is the mathematically correct way to correct for the expected data distribution at actual inference time? I don't think dividing the logits directly by the weights is correct:

```python
output = model(data.to(device))
output = output.detach().cpu() / np.array([1, 1, 1,
                                           4, 2, 1,
                                           8, 2, 1,
                                           16, 2, 1])
output = output.reshape((-1, 4, 3))
output = softmax(output)
```

Hi Victor!

Think through with care what you are trying to do here -- things look a
little bit garbled.

Most likely you don't want one-hot encoding nor an eight-dimensional
output.

A *multi-label* classifier consists of a set of binary classifiers, one for each
of your four classes. Thus "a -- yes, no," "b -- yes, no," etc.

When using PyTorch and `BCEWithLogitsLoss`, you would want to have
four (floating-point) labels for each sample -- `0.0` vs. `1.0` for each of your
four classes. So `labels` should be a `FloatTensor` of shape `[nBatch, 4]`.

Your `output` would have the same type and shape and would be the
predicted logits for each of your four classes.
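A minimal sketch of that shape convention (batch size and label values here are made up for illustration):

```python
import torch
import torch.nn as nn

# four independent binary ("yes/no") classifiers, re-weighted per class
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.0, 4.0, 8.0, 16.0]))

logits = torch.randn(2, 4)                     # [nBatch, 4] predicted logits
labels = torch.tensor([[0.0, 1.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0, 0.0]])  # [nBatch, 4] float 0.0/1.0 labels
loss = loss_fn(logits, labels)                 # scalar loss
```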

To me, "inference" means you take a sample and make a prediction. For
just this operation, you wouldn't perform any re-weighting. Any re-weighting
would have occurred during training, with the goal of training your model to
make better predictions.

(You may or may not wish to use re-weighting when computing performance
metrics for your validation or test datasets. If you compute a loss function
for your validation dataset, you generally want it to be the same loss function
as the one you compute for your training set so that the two will be directly
comparable. You would then typically use the same re-weighting for your
validation-dataset loss function.)

When performing "pure" binary classification, you do not ever want to
use `softmax()`. `softmax()` might be used for a multi-class, single-label
problem.
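Concretely, for binary (or multi-label) outputs you would map each logit through `sigmoid()` independently, e.g.:

```python
import torch

logits = torch.tensor([-1.0, 0.0, 2.0])
probs = torch.sigmoid(logits)   # independent probabilities, each in (0, 1)
preds = probs > 0.5             # threshold each binary decision separately
```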

One possible point of confusion: It is possible to recast a single-label binary
problem as a two-class (single-label) problem, and train it as a multi-class
(single-label) problem (that happens to have two classes) using
`CrossEntropyLoss`. (Your two-class labels would then be the one-hot
encoded version of your original binary labels.)
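To illustrate that recasting (a sketch showing the equivalence, not a recommendation for the multi-label case): the two-class `CrossEntropyLoss` on logits `(l0, l1)` matches `BCEWithLogitsLoss` on the single logit `l1 - l0`.

```python
import torch
import torch.nn as nn

binary_labels = torch.tensor([0.0, 1.0, 1.0])   # original binary labels
two_class_logits = torch.randn(3, 2)            # [nBatch, 2] two-class logits

# CrossEntropyLoss takes integer class labels (not one-hot floats)
ce = nn.CrossEntropyLoss()(two_class_logits, binary_labels.long())

# the equivalent single-logit binary formulation
single_logit = two_class_logits[:, 1] - two_class_logits[:, 0]
bce = nn.BCEWithLogitsLoss()(single_logit, binary_labels)
# ce and bce agree up to floating-point error
```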

It looks as though this approach might be being mixed into what you are
doing. While treating a binary problem as a two-class, multi-class problem
is fully legitimate, it doesn't play nicely with a multi-label problem (even
though the multi-label problem can be looked at as being a set of binary
problems).

Best.

K. Frank

Hi K, thanks for the detailed response! One thing though -- I actually posted it this way with binary labels to simplify things. In my actual problem, the labels are ordinal categories, hence how I ended up with the multi-hot encoded vector. What would you suggest in this case of multilabel+multiclass?

Hi Victor!

You will have to explain what you mean by "multilabel+multiclass" and
illustrate it with a concrete (if contrived) example.

Best.

K. Frank

Just updated the initial post with a multilabel trinary classification example.

For something more tangible, imagine something like a video classifier where the data labels are satisfaction surveys with 1-5 stars across multiple categories. Also imagine the distribution is skewed towards 5 stars by varying degrees, sometimes by multiple orders of magnitude.

My thinking here is, loss weighting should serve to amplify the gradients so that the model can learn something beyond just biases for the final linear layer. But I don't really want to trade precision for recall in my use case either, so I want to recalibrate when I am doing inference.

Hi Victor!

Okay, I understand what you're trying to do now.

First off:

This isn't right. You are working with a not-binary, multi-class problem
(which in your example, has three classes). You do not want to be using
`BCEWithLogitsLoss` nor any kind of one-hot encoding (even though
your problem is "multi-label" in the sense that you have four three-class
classifiers that share some upstream processing).

Let's say you are rating videos across four categories (your four "labels"),
say, "plot," "dialog," "acting," and "makeup," and each rating category has
three classes, "one star," "two stars," and "three stars."

You should treat each of these categories as a multi-class classification
problem and each should have its own instance of a `CrossEntropyLoss`
`loss_fn`, each instantiated with its own `weight` argument that accounts
for the class imbalance across that specific category's number-of-stars
classes. (`weight` is `CrossEntropyLoss`'s analog of `BCEWithLogitsLoss`'s
`pos_weight` constructor argument.)
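For example, one common (though not the only) choice is inverse-frequency weights computed per category from the class counts in the table above:

```python
import torch
import torch.nn as nn

# class counts per category (rows a..d of the table in the question)
counts = {
    "a": [100, 100, 100],
    "b": [100, 200, 400],
    "c": [100, 400, 800],
    "d": [100, 800, 1600],
}

loss_fns = {}
for name, c in counts.items():
    c = torch.tensor(c, dtype=torch.float)
    weight = c.sum() / c            # inverse-frequency class weights
    loss_fns[name] = nn.CrossEntropyLoss(weight=weight)
```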

Something like this:

Your model will have a shared "backbone" that all four categories use and
four classifier "heads," one for each category. Let's say that the last layer
of your backbone is a `Linear` with `out_features = 100`. Then:

```python
# your earlier backbone layers ...
self.last_backbone_layer = torch.nn.Linear(500, 100)
# four classifier heads, one per rating category, three classes each
self.headA = torch.nn.Linear(100, 3)   # likewise headB, headC, headD
```

And `forward()` would look something like this:

```python
x = self.last_backbone_layer(x)
return self.headA(x), self.headB(x), self.headC(x), self.headD(x)
```

Then

```python
outputA, outputB, outputC, outputD = model(input)
lossA = loss_fnA(outputA, labelsA)
lossB = loss_fnB(outputB, labelsB)
lossC = loss_fnC(outputC, labelsC)
lossD = loss_fnD(outputD, labelsD)
loss_total = lossA + lossB + lossC + lossD
```

Each of the "labels" tensors, e.g., `labelsA`, will be a `LongTensor` with
shape `[nBatch]` (with no class dimension) and consist of integer categorical
class labels whose values run over the values `0`, `1`, and `2`.

Make sure that you understand how to build and train a "conventional" (that
is, not "multi-label") multi-class classifier using `CrossEntropyLoss`, using
its `weight` constructor argument to compensate for class imbalance. This
is really just the same except that you have four such classifiers that share
a common backbone.
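Putting the pieces together, an end-to-end sketch might look like the following (the layer sizes and the input feature dimension of 500 are illustrative). Note that at inference time no re-weighting is applied; each head simply gets its own `softmax()` / `argmax()`:

```python
import torch
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # shared backbone; last layer has out_features = 100
        self.backbone = nn.Sequential(nn.Linear(500, 100), nn.ReLU())
        # four classifier heads, one per rating category, three classes each
        self.heads = nn.ModuleList(nn.Linear(100, 3) for _ in range(4))

    def forward(self, x):
        x = self.backbone(x)
        return [head(x) for head in self.heads]

model = MultiHeadClassifier()

# inference: per-head probabilities and predicted star class, no re-weighting
with torch.no_grad():
    outputs = model(torch.randn(2, 500))            # four [nBatch, 3] logit tensors
    probs = [torch.softmax(out, dim=1) for out in outputs]
    preds = [out.argmax(dim=1) for out in outputs]  # class indices 0, 1, or 2
```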

Best.

K. Frank
