# BCEWithLogitsLoss pos_weight -- how to do re-weighting at inference

I am training a multilabel classifier on an imbalanced dataset, where I am using `pos_weight` in the loss.

Dataset looks like:

| label | cat1 | cat2 | cat3 |
|-------|------|------|------|
| a     | 100  | 100  | 100  |
| b     | 100  | 200  | 400  |
| c     | 100  | 400  | 800  |
| d     | 100  | 800  | 1600 |

And my corresponding loss:

```python
# One-hot encoded trinary labels, hence 12-dim output for 4 labels
nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1, 1, 1,
                                              4, 2, 1,
                                              8, 2, 1,
                                              16, 2, 1]))
```

My question is, what is the mathematically correct way to correct for the expected data distribution at actual inference time? I don't think dividing the logits directly by the weights is correct:

```python
output = model(data.to(device))
output = output.detach().cpu() / np.array([1, 1, 1,
                                           4, 2, 1,
                                           8, 2, 1,
                                           16, 2, 1])
output = output.reshape((-1, 4, 3))
output = softmax(output)
```

Hi Victor!

Think through with care what you are trying to do here -- things look a
little bit garbled.

Most likely you don't want one-hot encoding nor an eight-dimensional
output.

A *multi-label* classifier consists of a set of binary classifiers, one for each
of your four classes. Thus "a -- yes, no," "b -- yes, no," etc.

When using PyTorch and `BCEWithLogitsLoss`, you would want to have
four (floating-point) labels for each sample -- `0.0` vs. `1.0` for each of your
four classes. So `labels` should be a `FloatTensor` of shape `[nBatch, 4]`.

Your `output` would have the same type and shape and would be the
predicted logits for each of your four classes.
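A minimal sketch of that shape convention (batch size and label values here are made up for illustration):

```python
import torch
import torch.nn as nn

# four independent binary ("yes/no") classifiers, re-weighted per class
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.0, 4.0, 8.0, 16.0]))

logits = torch.randn(2, 4)                     # [nBatch, 4] predicted logits
labels = torch.tensor([[0.0, 1.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0, 0.0]])  # [nBatch, 4] float 0.0/1.0 labels
loss = loss_fn(logits, labels)                 # scalar loss
```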

To me, "inference" means you take a sample and make a prediction. For
just this operation, you wouldn't perform any re-weighting. Any re-weighting
would have occurred during training, with the goal of training your model to
make better predictions.

(You may or may not wish to use re-weighting when computing performance
metrics for your validation or test datasets. If you compute a loss function
for your validation dataset, you generally want it to be the same loss function
as the one you compute for your training set so that the two will be directly
comparable. You would then typically use the same re-weighting for your
validation-dataset loss function.)

When performing "pure" binary classification, you do not ever want to
use `softmax()`. `softmax()` might be used for a multi-class, single-label
problem.
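Concretely, for binary (or multi-label) outputs you would map each logit through `sigmoid()` independently, e.g.:

```python
import torch

logits = torch.tensor([-1.0, 0.0, 2.0])
probs = torch.sigmoid(logits)   # independent probabilities, each in (0, 1)
preds = probs > 0.5             # threshold each binary decision separately
```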

One possible point of confusion: It is possible to recast a single-label binary
problem as a two-class (single-label) problem, and train it as a multi-class
(single-label) problem (that happens to have two classes) using
`CrossEntropyLoss`. (Your two-class labels would then be the one-hot
encoded version of your original binary labels.)
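To illustrate that recasting (a sketch showing the equivalence, not a recommendation for the multi-label case): the two-class `CrossEntropyLoss` on logits `(l0, l1)` matches `BCEWithLogitsLoss` on the single logit `l1 - l0`.

```python
import torch
import torch.nn as nn

binary_labels = torch.tensor([0.0, 1.0, 1.0])   # original binary labels
two_class_logits = torch.randn(3, 2)            # [nBatch, 2] two-class logits

# CrossEntropyLoss takes integer class labels (not one-hot floats)
ce = nn.CrossEntropyLoss()(two_class_logits, binary_labels.long())

# the equivalent single-logit binary formulation
single_logit = two_class_logits[:, 1] - two_class_logits[:, 0]
bce = nn.BCEWithLogitsLoss()(single_logit, binary_labels)
# ce and bce agree up to floating-point error
```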

It looks as though this approach might be being mixed into what you are
doing. While treating a binary problem as a two-class, multi-class problem
is fully legitimate, it doesn't play nicely with a multi-label problem (even
though the multi-label problem can be looked at as being a set of binary
problems).

Best.

K. Frank

Hi K, thanks for the detailed response! One thing though -- I actually posted it this way with binary labels to simplify things. In my actual problem, the labels are ordinal categories, hence how I ended up with the multi-hot encoded vector. What would you suggest in this case of multilabel+multiclass?

Hi Victor!

You will have to explain what you mean by "multilabel+multiclass" and
illustrate it with a concrete (if contrived) example.

Best.

K. Frank

Just updated the initial post with a multilabel trinary classification example.

For something more tangible, imagine something like a video classifier where the data labels are satisfaction surveys with 1-5 stars across multiple categories. Also imagine the distribution is skewed towards 5 stars by varying degrees, sometimes by multiple orders of magnitude.

My thinking here is, loss weighting should serve to amplify the gradients so that the model can learn something beyond just biases for the final linear layer. But I don't really want to trade precision for recall in my use case either, so I want to recalibrate when I am doing inference.

Hi Victor!

Okay, I understand what you're trying to do now.

First off:

This isn't right. You are working with a not-binary, multi-class problem
(which in your example, has three classes). You do not want to be using
`BCEWithLogitsLoss` nor any kind of one-hot encoding (even though
your problem is "multi-label" in the sense that you have four three-class
classifiers that share some upstream processing).

Let's say you are rating videos across four categories (your four "labels"),
say, "plot," "dialog," "acting," and "makeup," and each rating category has
three classes, "one star," "two stars," and "three stars."

You should treat each of these categories as a multi-class classification
problem and each should have its own instance of a `CrossEntropyLoss`
`loss_fn`, each instantiated with its own `weight` argument that accounts
for the class imbalance across that specific category's number-of-stars
classes. (`weight` is `CrossEntropyLoss`'s analog of `BCEWithLogitsLoss`'s
`pos_weight` constructor argument.)
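For example, one common (though not the only) choice is inverse-frequency weights computed per category from the class counts in the table above:

```python
import torch
import torch.nn as nn

# class counts per category (rows a..d of the table in the question)
counts = {
    "a": [100, 100, 100],
    "b": [100, 200, 400],
    "c": [100, 400, 800],
    "d": [100, 800, 1600],
}

loss_fns = {}
for name, c in counts.items():
    c = torch.tensor(c, dtype=torch.float)
    weight = c.sum() / c            # inverse-frequency class weights
    loss_fns[name] = nn.CrossEntropyLoss(weight=weight)
```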

Something like this:

Your model will have a shared "backbone" that all four categories use and
four classifier "heads," one for each category. Let's say that the last layer
of your backbone is a `Linear` with `out_features = 100`. Then:

```python
# your earlier backbone layers ...
self.last_backbone_layer = torch.nn.Linear(500, 100)
# four classifier heads, one per rating category, three classes each
self.headA = torch.nn.Linear(100, 3)   # likewise headB, headC, headD
```

And `forward()` would look something like this:

```python
x = self.last_backbone_layer(x)
return self.headA(x), self.headB(x), self.headC(x), self.headD(x)
```

Then

```python
outputA, outputB, outputC, outputD = model(input)
lossA = loss_fnA(outputA, labelsA)
lossB = loss_fnB(outputB, labelsB)
lossC = loss_fnC(outputC, labelsC)
lossD = loss_fnD(outputD, labelsD)
loss_total = lossA + lossB + lossC + lossD
```

Each of the "labels" tensors, e.g., `labelsA`, will be a `LongTensor` with
shape `[nBatch]` (with no class dimension) and consist of integer categorical
class labels whose values run over the values `0`, `1`, and `2`.

Make sure that you understand how to build and train a "conventional" (that
is, not "multi-label") multi-class classifier using `CrossEntropyLoss`, using
its `weight` constructor argument to compensate for class imbalance. This
is really just the same except that you have four such classifiers that share
a common backbone.
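Putting the pieces together, an end-to-end sketch might look like the following (the layer sizes and the input feature dimension of 500 are illustrative). Note that at inference time no re-weighting is applied; each head simply gets its own `softmax()` / `argmax()`:

```python
import torch
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # shared backbone; last layer has out_features = 100
        self.backbone = nn.Sequential(nn.Linear(500, 100), nn.ReLU())
        # four classifier heads, one per rating category, three classes each
        self.heads = nn.ModuleList(nn.Linear(100, 3) for _ in range(4))

    def forward(self, x):
        x = self.backbone(x)
        return [head(x) for head in self.heads]

model = MultiHeadClassifier()

# inference: per-head probabilities and predicted star class, no re-weighting
with torch.no_grad():
    outputs = model(torch.randn(2, 500))            # four [nBatch, 3] logit tensors
    probs = [torch.softmax(out, dim=1) for out in outputs]
    preds = [out.argmax(dim=1) for out in outputs]  # class indices 0, 1, or 2
```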

Best.

K. Frank
