Weight decay in the optimizers is a bad idea (especially with BatchNorm)

I usually create a fn like add_weight_decay below. In its current form it will add all batch norm parameters and bias parameters to the no_decay list, since those are the only 1-dimensional parameters in typical conv/linear models. I use the shape check instead of looking for 'bn' strings in the parameter names because naming isn't always consistent from model to model. It's usually suggested that bias params should not be decayed either, so this handles both for me. You can still filter by name if that works for your case (a name-based variant is sketched after the function below); separating them out any other way would be a pain.

def add_weight_decay(model, weight_decay=1e-5, skip_list=()):
    decay = []
    no_decay = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # skip frozen parameters
        # 1-D params are norm weights/biases and conv/linear biases
        if len(param.shape) == 1 or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {'params': no_decay, 'weight_decay': 0.},
        {'params': decay, 'weight_decay': weight_decay}]
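
For reference, a minimal sketch of the name-based filtering mentioned above. The helper name and the 'bn'/'bias' substrings here are just my own choices, and whether they actually match depends on how your model names its parameters:

def add_weight_decay_by_name(model, weight_decay=1e-5, no_decay_keys=('bn', 'bias')):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # relies on norm/bias params actually containing these substrings in their names
        if any(key in name for key in no_decay_keys):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {'params': no_decay, 'weight_decay': 0.},
        {'params': decay, 'weight_decay': weight_decay}]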

When I create the optimizer, I put this block in front (usually all of this is wrapped in an optimizer creation factory that also picks which optimizer to create from config or cmd args):

    weight_decay = args.weight_decay
    if weight_decay and filter_bias_and_bn:
        parameters = add_weight_decay(model, weight_decay)
        # decay is now set per param group, so don't pass it to the optimizer again
        weight_decay = 0.
    else:
        parameters = model.parameters()

    if args.opt.lower() == 'sgd':
        optimizer = optim.SGD(
            parameters, lr=args.lr,
            momentum=args.momentum, weight_decay=weight_decay, nesterov=args.nesterov)
    ...
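
If you want to sanity-check the split, the resulting groups are visible on optimizer.param_groups. A quick sketch, using a torchvision ResNet purely for illustration:

import torch
import torchvision.models as models

model = models.resnet18()
param_groups = add_weight_decay(model, weight_decay=1e-4)
optimizer = torch.optim.SGD(param_groups, lr=0.1, momentum=0.9)

# First group: BN weights/biases + conv/linear biases, weight_decay 0.
# Second group: everything else, with the configured weight_decay.
for group in optimizer.param_groups:
    n_params = sum(p.numel() for p in group['params'])
    print(group['weight_decay'], n_params)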