Here is a minimal example to explain my question. Suppose we have an optimizer:
optimizer = optim.Adam(model.parameters(), lr=args.lr)
and another one that puts the parameters into different parameter groups but uses the same learning rate:
optimizer = optim.Adam(
    [{'params': [model.aaa]},
     {'params': [param for name, param in model.named_parameters()
                 if 'aaa' not in name]}],
    lr=args.lr)
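As a sanity check, here is a minimal sketch I would use to confirm that both optimizers cover exactly the same set of parameters (ToyModel below is a hypothetical stand-in; only the aaa attribute name comes from my actual setup):

import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical toy model for illustration only; 'aaa' stands in for the real parameter.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.aaa = nn.Parameter(torch.randn(3))
        self.linear = nn.Linear(3, 1)

model = ToyModel()

opt1 = optim.Adam(model.parameters(), lr=1e-3)
opt2 = optim.Adam(
    [{'params': [model.aaa]},
     {'params': [p for n, p in model.named_parameters() if 'aaa' not in n]}],
    lr=1e-3)

# Collect the identities of all tensors each optimizer will update.
params1 = {id(p) for g in opt1.param_groups for p in g['params']}
params2 = {id(p) for g in opt2.param_groups for p in g['params']}
print(params1 == params2)  # True: both optimizers see exactly the same tensors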
In my experiments, these two optimizers give different results. Specifically, the aaa parameter gets small gradients in the first case (it barely changes during training) but larger gradients in the second case. As far as I understand, these two optimizers should be identical apart from how the parameters are split into param groups. So why do they behave differently here?
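To rule out differing per-group hyperparameters as the cause, one could also dump each group's settings. A sketch, assuming the two optimizers from the check above are named opt1 and opt2:

for opt_name, opt in [('opt1', opt1), ('opt2', opt2)]:
    for i, group in enumerate(opt.param_groups):
        # Everything except the parameter list itself: lr, betas, eps, weight_decay, ...
        settings = {k: v for k, v in group.items() if k != 'params'}
        print(opt_name, 'group', i, settings)

In my runs the printed settings looked identical across groups, which is what makes the difference in behavior so confusing.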