Weight decay in multi-label models trained with BCEWithLogitsLoss

I have a question about weight decay on multi-label image classification models.

My training setup is as follows: I am training ResNet18 models on an image classification dataset, either as single-label or as multi-label models. For the single-label models I use PyTorch's CrossEntropyLoss as the loss function, and for the multi-label models I use BCEWithLogitsLoss. In both cases I use the Adam optimizer with weight decay.

For the single-label models, weight decay with coefficients around 0.5 works well and improves model performance. However, for the multi-label models, weight decay with coefficients around 0.5 results in a significant decrease in model performance compared to training runs without weight decay. With weight decay, the multi-label model learns to recognize only some classes and the F1-scores of all other classes are close to zero.

Since the single-label and multi-label models are trained on the same dataset, I am surprised by the different impact of weight decay in both cases. Is this due to the combination of BCEWithLogitsLoss and weight decay? Or does the weight decay coefficient simply need to be retuned for the multi-label case?

Thanks a lot!

Hi @josafatburmeister

Could you please tell us whether the multi-label model shows significantly better F1-scores with a weight decay coefficient different from 0.5?

Yes, without weight decay, the multi-label model achieves a macro-averaged F1-score of about 0.74, with F1-scores above 0.6 for all classes. As the weight decay coefficient is increased, the F1-scores decrease. Specifically, the F1-scores were as follows:

| Weight decay coefficient | Macro-average F1-score | Minimum class-wise F1-score |
|---|---|---|
| 0 | 0.74 | 0.63 |
| 0.1 | 0.67 | 0.5 |
| 0.2 | 0.6 | 0.29 |
| 0.3 | 0.55 | 0.23 |
| 0.4 | 0.49 | 0.12 |
| 0.5 | 0.41 | 0 |