In all examples of optimizer parameter groups I have seen, people split the parameters into disjoint groups (groups with no shared members), like this:
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
However, one might also consider additional groups that select parameters by some property and therefore overlap with the other groups. For example, one might want a default weight decay value, but zero weight decay for all biases and LayerNorm parameters. Would the following example, where the 3rd group contains parameters that also appear in the 1st and 2nd groups, be valid?
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3},
    {'params': [param for name, param in model.named_parameters()
                if any(nd in name for nd in ['bias', 'LayerNorm.weight'])],
     'weight_decay': 0.0},
], lr=1e-2, momentum=0.9, weight_decay=0.01)
If so, does the order in which the groups (dict objects) appear in the list determine the final optimizer hyperparameters that will be used for the parameters shared among different groups?
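For reference, here is a minimal self-contained sketch of the setup above. ToyModel is a hypothetical stand-in for model, and the overlapping group matches only 'bias' here since the toy model has no LayerNorm; the try/except simply surfaces whatever the optimizer does at construction time, rather than assuming an answer:

import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    # Hypothetical stand-in for the real model in the snippets above.
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

model = ToyModel()

try:
    optimizer = optim.SGD([
        {'params': model.base.parameters()},
        {'params': model.classifier.parameters(), 'lr': 1e-3},
        # Overlapping group: every bias also belongs to one of the
        # two groups above.
        {'params': [p for n, p in model.named_parameters() if 'bias' in n],
         'weight_decay': 0.0},
    ], lr=1e-2, momentum=0.9, weight_decay=0.01)
    # If construction succeeded, inspect the resolved per-group settings.
    for i, g in enumerate(optimizer.param_groups):
        print(i, g['lr'], g['weight_decay'], len(g['params']))
except ValueError as e:
    print('Rejected at construction:', e)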