Different optimization process when put parameters in different param groups?

sunshineatnoon · June 24, 2019, 7:32pm

Here is a minimal example to explain my question, suppose we have an optimizer:

optimizer = optim.Adam(model.parameters(), lr = args.lr)

and another one that put parameters in different parameter groups but using same learning rate:

optimizer = optim.Adam([{'params': model.aaa},
                        {'params': param for name, param in model.named_parameters() if 'aaaa' not in name}],
                        lr = args.lr)

In my experiments, these two optimizers give different results. Specifically, the aaa parameter in the first case gets a small gradient (doesn’t change much during training) while it gets larger gradients in the second case. But if I understand right, these two optimizers don’t have any difference except putting parameters into different param groups. So why are they behaving differently here?

smth · June 25, 2019, 5:24am

that sounds like a bug, but i cant fathom why.

Because the implementation of Adam is simple, and it just loops over param groups: https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L60

sunshineatnoon · June 25, 2019, 4:17pm

Thanks for your reply. I found if I manually add all parameters instead of using for name param in model.named_parameters(), then it works fine.

ptrblck · June 25, 2019, 4:26pm

Could there be a typo model.aaa vs 'aaaa'?