I’m trying to apply both per-parameter learning rates and weight decay; however, I get an error saying:

`some parameters appear in more than one parameter group`

Here’s a snippet of my code:
```python
param_optimizer = list(model.named_parameters())
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
optimizer_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    {'params': model.roberta.parameters(), 'lr': lr[0]},
    {'params': model.last_linear.parameters(), 'lr': lr[1]}
]
optimizer = AdamW(optimizer_parameters)
```
I would much appreciate your help!
The parameters might overlap, as you are getting all parameters in `param_optimizer`, while also using `model.roberta.parameters()` and `model.last_linear.parameters()`.
You could create dicts for all your conditions and parameter sets and check the keys for duplicates.
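Something along these lines might work (just a rough sketch, assuming `model.roberta` and `model.last_linear` cover all trainable parameters, and reusing `model`, `lr`, and `AdamW` from your snippet):

```python
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]

def make_groups(module, lr):
    # Split one sub-module's parameters into a decay and a no-decay group,
    # so every parameter ends up in exactly one group.
    named = list(module.named_parameters())
    return [
        {"params": [p for n, p in named if not any(nd in n for nd in no_decay)],
         "lr": lr, "weight_decay": 0.001},
        {"params": [p for n, p in named if any(nd in n for nd in no_decay)],
         "lr": lr, "weight_decay": 0.0},
    ]

optimizer_parameters = (
    make_groups(model.roberta, lr[0]) + make_groups(model.last_linear, lr[1])
)

# Sanity check: no parameter should appear in more than one group.
seen = set()
for group in optimizer_parameters:
    for p in group["params"]:
        assert id(p) not in seen, "parameter appears in more than one group"
        seen.add(id(p))

optimizer = AdamW(optimizer_parameters)
```

Building every group from `named_parameters()` of a single sub-module keeps the per-layer learning-rate split and the decay/no-decay split in the same, non-overlapping groups.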
So my workaround was to use the per-layer learning rates and a single weight decay value for all the parameters.
```python
optimizer_parameters = [
    # {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.001},
    # {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    {'params': model.roberta.parameters(), 'lr': lr[0]},
    {'params': model.last_linear.parameters(), 'lr': lr[1]}
]
optimizer = AdamW(optimizer_parameters, weight_decay=0.001)
```
Would that affect performance, since now I’m applying weight decay to the `bias` and `LayerNorm` params, or is it ok?
And thanks btw!