Should parameters within optimizer groups be disjoint?

In all examples with optimizer parameter groups I have seen, people split the parameters into disjoint groups (groups with no shared parameters), like this:

optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

However, one may want additional groups that dissect the set of all parameters with respect to some property and therefore overlap with the other groups. For example, one may additionally wish to have a default weight decay value, but zero weight decay for all biases and LayerNorm parameters. Would the following example, where the 3rd group contains parameters that are also contained in the 1st and 2nd groups, be valid?

optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3},
    {'params': [param for name, param in model.named_parameters()
                if any(nd in name for nd in ['bias', 'LayerNorm.weight'])],
     'weight_decay': 0.0},
], lr=1e-2, momentum=0.9, weight_decay=0.01)

If so, does the order in which the groups (dict objects) appear in the list determine the final optimizer hyperparameters which will be used for the parameters shared among different groups?

No, that should not be possible and you would get an error:

ValueError: some parameters appear in more than one parameter group
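
For instance, a minimal sketch that reproduces the error by putting the same weight tensor into two groups (the nn.Linear module here is purely illustrative):

import torch.nn as nn
import torch.optim as optim

layer = nn.Linear(4, 2)

# layer.weight appears in both groups, so the optimizer refuses to build them
# and raises the ValueError above at construction time.
optim.SGD([
    {'params': [layer.weight, layer.bias]},
    {'params': [layer.weight], 'weight_decay': 0.0},
], lr=1e-2)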

I see, thanks @ptrblck! So the only way to implement this would then be to define 4 disjoint groups: 1) biases/LayerNorm parameters in model.base, 2) the remaining parameters in model.base, 3) biases/LayerNorm parameters in model.classifier, 4) the remaining parameters in model.classifier.
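
For example, a minimal sketch of those four disjoint groups, reusing model.base and model.classifier from the snippets above (the split_params helper is just illustrative, not part of any API):

import torch.optim as optim

def split_params(module):
    # Separate a module's parameters into no-decay (biases, LayerNorm weights)
    # and decay lists, mirroring the name check from the snippet above.
    no_decay, decay = [], []
    for name, param in module.named_parameters():
        if any(nd in name for nd in ['bias', 'LayerNorm.weight']):
            no_decay.append(param)
        else:
            decay.append(param)
    return no_decay, decay

base_no_decay, base_decay = split_params(model.base)
clf_no_decay, clf_decay = split_params(model.classifier)

optimizer = optim.SGD([
    {'params': base_no_decay, 'weight_decay': 0.0},              # group 1
    {'params': base_decay},                                      # group 2
    {'params': clf_no_decay, 'lr': 1e-3, 'weight_decay': 0.0},   # group 3
    {'params': clf_decay, 'lr': 1e-3},                           # group 4
], lr=1e-2, momentum=0.9, weight_decay=0.01)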

Any suggestions on how to use a scheduler to change the learning rate for groups 1-2 together, and 3-4 together?

Can someone answer the above question, please?

I have the same situation: an optimizer with several parameter groups, each with a different learning rate. How can I use a scheduler with the different parameter groups?

Thanks

The learning rate scheduler will apply its update rule to each parameter group as seen here.
In case you want to use different schedulers for different parameter groups, you might need to create separate optimizers and separate schedulers instead.
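
As a small sketch of that behaviour, using StepLR purely as an example (the toy model and group definitions are illustrative):

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

toy_model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

optimizer = optim.SGD([
    {'params': toy_model[0].parameters()},              # uses the default lr
    {'params': toy_model[1].parameters(), 'lr': 1e-3},
], lr=1e-2, momentum=0.9)

scheduler = StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(2):
    # ... forward/backward passes would go here ...
    optimizer.step()
    scheduler.step()
    # Every group is scaled by gamma, so the two learning rates keep their
    # ratio: [1e-2, 1e-3] -> [1e-3, 1e-4] -> [1e-4, 1e-5].
    print([group['lr'] for group in optimizer.param_groups])

If the groups really need independent schedules within a single optimizer, LambdaLR also accepts a list with one lambda per parameter group, but the separate-optimizer approach above is the more general option.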
