Multi-label model outputs only negative values while trained with BCEWithLogitsLoss()

Hi,

I am training a multi-class, multi-label model with torch.nn.BCEWithLogitsLoss() on 8M data points for 10 epochs. I have 54 classes, and a single image can have multiple labels. During training, the loss decreases nicely:

However, when I look at my trained model's outputs for the last epoch, I see that the model is outputting only negative values. For example, for one sample with the following label:

target = 
tensor([[0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.]])

I get the following output from the model:

output = 
tensor([[-1.2380, -2.3283, -2.3025, -2.1275, -2.1020, -2.3684, -3.4669, -3.4503,
         -2.1905, -1.8565, -3.4215, -3.5318, -3.5715, -4.3836, -4.5215, -6.2270,
         -3.8660, -3.7280, -4.6043, -4.7601, -9.5219, -9.4969, -9.4392, -8.0596,
         -6.0773, -5.7972, -4.2495, -4.4533, -4.2641, -4.1068, -4.9987, -4.9321,
         -7.9726, -7.4475, -4.8016, -5.6634, -6.3762, -6.0103, -6.7561, -3.3259,
         -3.8778, -6.7682, -6.5663, -4.0945, -3.0747, -5.5408, -5.6429, -5.9659,
         -5.8574, -7.6435, -7.8895, -6.6514, -6.5506, -5.0583]],
       device='cuda:0')

So if I apply a sigmoid on top of this, I won’t get any good predictions.
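
For example, thresholding the sigmoid at 0.5 (the threshold is just my usual choice, for illustration) turns every class off:

import torch

output = torch.tensor([[-1.2380, -2.3283, -3.4669]])  # a few of the logits above
probs = torch.sigmoid(output)        # all below 0.5, since every logit is negative
preds = (probs > 0.5).float()        # tensor([[0., 0., 0.]]) -- no positive predictions at all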

Hi Amir -

The short answer is that your dataset is “unbalanced,” so you
should try using the pos_weight argument when you construct
your BCEWithLogitsLoss loss criterion.

Looking at your target, and naively assuming that all positive
class labels appear about equally frequently, it appears that any
given class will be labelled positive only once in about every
nine images.

So your classifier could do a pretty good job by just predicting
negative for all of your classes all of the time.

It is the purpose of the pos_weight argument to address this
issue by weighting the less-frequent positive samples more heavily
than the more-frequent negative samples. Doing so will penalize
a model that simply tries to predict negative all the time.
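
To make this concrete, here is a rough sketch of the per-element
loss that BCEWithLogitsLoss computes when you pass pos_weight.
This just mirrors the formula in the documentation; the real
implementation is more numerically careful:

import torch

def weighted_bce_with_logits(x, y, pos_weight):
    # -[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
    p = torch.sigmoid(x)
    return -(pos_weight * y * torch.log(p) + (1.0 - y) * torch.log(1.0 - p))

With pos_weight > 1, a positive sample that the model misses costs
pos_weight times as much as it otherwise would, so always predicting
negative stops being a cheap strategy.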

It’s quite likely that some classes have positive labels more often
than others. pos_weight takes a vector of per-class weights so
that you can reweight each class separately. A common (and
reasonable) choice for the class weights is:

weights[i] = total_samples / number_of_positive_class_i_samples[i]
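
A minimal sketch of putting this together, assuming train_targets
is an (N, 54) float tensor holding the 0/1 labels of your whole
training set (the variable names, and the clamp guarding against
classes with no positives, are just my placeholders):

import torch

num_samples = train_targets.shape[0]
positives_per_class = train_targets.sum(dim=0)                  # shape (54,)
pos_weight = num_samples / positives_per_class.clamp(min=1.0)   # weights[i] as above

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)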

Best.

K. Frank

Hi @KFrank,

Thank you for the analysis. However, I wonder how you calculated this:

Looking at your target, and naively assuming that all positive
class labels appear about equally frequently, it appears that any
given class will be labelled positive only once in about every
nine images.

The target that I have here is for only one image, meaning that some classes are present in only one image!

Hi Amir!

Yes, this was not meant to be a realistic calculation. I was illustrating
an oversimplified estimate of the class weights based on the single
data point you provided.

The (unrealistic) details: Out of the 54 binary labels in your target,
six were positive. If one assumes (without any evidence) that positive
labels occur equally frequently for all 54 classes, and further assumes
(again, without any evidence) that the single target you gave is in
some sense randomly representative of all of your targets, then one
would conclude that, across your ensemble of targets, any given
class is labelled positive about one time in nine.

Under these assumptions, you would want to use the same value
of 9.0 for the pos_weight for all of your 54 classes.
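
For example, 54 / 6 = 9, so (under these assumptions) you would
construct your criterion along these lines:

import torch

pos_weight = torch.full((54,), 9.0)   # the same weight of 9.0 for every class
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)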

Of course, these assumptions are probably not correct, so you should
look at a representative sample of your training data to determine the
per-class pos_weight values.

Best.

K. Frank