Per-parameter learning rate and weight decay

I’m trying to apply both per-parameter learning rates and weight decay, but I get an error saying:

```
some parameters appear in more than one parameter group
```

Here’s a snippet of my code:

```python
param_optimizer = list(model.named_parameters())
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
optimizer_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    {'params': model.roberta.parameters(), 'lr': lr[0]},
    {'params': model.last_linear.parameters(), 'lr': lr[1]},
]

optimizer = AdamW(optimizer_parameters)
```

I would much appreciate your help!

The parameters overlap: the first two groups already contain all parameters via param_optimizer, while the last two groups add the same parameters again through model.roberta.parameters() and model.last_linear.parameters().

You could create dicts for all your conditions and parameter sets and check the keys for duplicates.
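For example, here is a minimal sketch of such a check, reusing `model`, `no_decay`, and the submodule names from the snippet above (the identity-based comparison is just one way to do it):

```python
# One name -> parameter dict per condition, as suggested above.
decay_params = {n: p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)}
no_decay_params = {n: p for n, p in model.named_parameters()
                   if any(nd in n for nd in no_decay)}
roberta_params = dict(model.roberta.named_parameters())
head_params = dict(model.last_linear.named_parameters())

def param_ids(d):
    # Compare by tensor identity so submodule name prefixes don't hide overlaps.
    return {id(p) for p in d.values()}

# Non-empty intersections mean the same tensors would land in two groups,
# which is exactly what triggers the "more than one parameter group" error.
print(len(param_ids(decay_params) & param_ids(roberta_params)))
print(len(param_ids(no_decay_params) & param_ids(head_params)))
```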


So my workaround was to keep the per-layer learning rates and use a single weight decay value for all parameters:

```python
optimizer_parameters = [
    # {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
    # {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    {'params': model.roberta.parameters(), 'lr': lr[0]},
    {'params': model.last_linear.parameters(), 'lr': lr[1]},
]

optimizer = AdamW(optimizer_parameters, weight_decay=0.001)
```

Would that affect performance, now that weight decay is also applied to the bias and LayerNorm parameters, or is it okay?
And thanks, btw!
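For what it's worth, here is a minimal sketch of how the two conditions could instead be combined into four disjoint groups, so each parameter appears exactly once and the bias/LayerNorm parameters keep zero weight decay. This assumes the same `model.roberta` / `model.last_linear` layout and `lr` list as above, and `torch.optim.AdamW`; the `split` helper is hypothetical:

```python
from torch.optim import AdamW  # transformers' AdamW accepts the same parameter groups

no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]

def split(module):
    # Split a module's parameters into (decayed, not decayed) lists.
    decayed, skipped = [], []
    for n, p in module.named_parameters():
        (skipped if any(nd in n for nd in no_decay) else decayed).append(p)
    return decayed, skipped

roberta_decay, roberta_skip = split(model.roberta)
head_decay, head_skip = split(model.last_linear)

optimizer_parameters = [
    {'params': roberta_decay, 'lr': lr[0], 'weight_decay': 0.001},
    {'params': roberta_skip,  'lr': lr[0], 'weight_decay': 0.0},
    {'params': head_decay,    'lr': lr[1], 'weight_decay': 0.001},
    {'params': head_skip,     'lr': lr[1], 'weight_decay': 0.0},
]

optimizer = AdamW(optimizer_parameters)
```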
