I have a question about weight decay on multi-label image classification models.
My training setup is as follows: I train ResNet18 models on an image classification dataset, either as single-label or as multi-label models. For the single-label models, I use PyTorch's CrossEntropyLoss as the loss function, and for the multi-label models, I use BCEWithLogitsLoss. In both cases I use the Adam optimizer with weight decay.
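For concreteness, here is a minimal sketch of the two configurations I mean. It uses a tiny linear model and random tensors as stand-ins for ResNet18 and my dataset (those, along with the shapes, are just illustrative assumptions), but the loss functions and the optimizer with its weight_decay coefficient are exactly as in my runs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for ResNet18, purely to illustrate the two loss setups.
num_classes = 5
model = nn.Linear(16, num_classes)

# Adam with weight decay, as in my training runs. Note that in plain Adam
# the weight_decay term is an L2 penalty folded into the gradient, not the
# decoupled weight decay of AdamW.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.5)

x = torch.randn(8, 16)          # dummy batch of 8 "images"
logits = model(x)                # raw scores, shape (8, num_classes)

# Single-label case: integer class indices, softmax cross-entropy.
single_label_targets = torch.randint(0, num_classes, (8,))
ce_loss = nn.CrossEntropyLoss()(logits, single_label_targets)

# Multi-label case: independent 0/1 float targets per class,
# per-class sigmoid + binary cross-entropy.
multi_label_targets = torch.randint(0, 2, (8, num_classes)).float()
bce_loss = nn.BCEWithLogitsLoss()(logits, multi_label_targets)

# One optimizer step on the multi-label loss.
optimizer.zero_grad()
bce_loss.backward()
optimizer.step()
```

The only differences between the two runs are the loss function and the target format; everything else (model, optimizer, weight decay coefficient) is shared.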
For the single-label models, weight decay with coefficients around 0.5 works well and improves model performance. For the multi-label models, however, the same coefficients cause a significant drop in performance compared to runs without weight decay: the model learns to recognize only a few classes, and the F1-scores of all other classes are close to zero.
Since the single-label and multi-label models are trained on the same dataset, I am surprised that weight decay affects the two cases so differently. Is this due to the combination of BCEWithLogitsLoss and weight decay, or does the weight decay coefficient simply need to be retuned for the multi-label case?
Thanks a lot!