Weight decay in the optimizers is a bad idea (especially with BatchNorm)

It has been a year, but there has been no feedback on this?
I absolutely agree with @Michael_Oliver. For example, in the official ImageNet training example https://github.com/pytorch/examples/tree/master/imagenet, it seems to me that the BatchNorm weights and biases are included in the weight decay regularization.
I think the correct way to implement it should be:

optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum)
                            # weight_decay=args.weight_decay removed here

cls_loss = criterion(output, target)
reg_loss = 0
for name, param in model.named_parameters():
    if 'bn' not in name:  # assumes BatchNorm modules have 'bn' in their names
        # squared L2 norm, since the built-in weight_decay adds wd * param to the gradient
        reg_loss += param.pow(2).sum()
# add the weight decay term manually; the 0.5 factor makes the gradient exactly wd * param
loss = cls_loss + 0.5 * args.weight_decay * reg_loss

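Another option I have seen suggested (just a sketch, assuming the same model and args as above, and that BatchNorm weights/biases and plain biases are the only 1-D parameters) is to pass per-parameter-group weight_decay to the optimizer, so the decay is applied only to the non-BatchNorm parameters and nothing has to be added to the loss by hand:

# Sketch: put 1-D parameters (BN gamma/beta and biases) into a group with
# weight_decay=0, and everything else into a group with the usual decay.
decay, no_decay = [], []
for param in model.parameters():
    if not param.requires_grad:
        continue
    if param.dim() == 1:      # BN gamma/beta and biases
        no_decay.append(param)
    else:                     # conv / linear weights
        decay.append(param)

optimizer = torch.optim.SGD(
    [{'params': decay, 'weight_decay': args.weight_decay},
     {'params': no_decay, 'weight_decay': 0.0}],
    args.lr,
    momentum=args.momentum)

loss = criterion(output, target)  # no manual regularization term needed

With parameter groups the optimizer applies the decay itself, so the training loop stays unchanged.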
Please confirm whether this is the right approach, or help with a more elegant solution. Thank you.
