In PyTorch, we can pass parameter groups when initializing an optimizer. This is a very useful feature: for example, we can put the parameters of different layers into different groups so that each layer gets its own learning rate.
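For reference, here is a minimal sketch of that pattern; the two-layer toy model is just an assumption for illustration:

```python
import torch
import torch.nn as nn

# Toy model with two layers (hypothetical, for illustration only)
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# One optimizer, two parameter groups with different learning rates
optimizer = torch.optim.Adam([
    {'params': model[0].parameters(), 'lr': 1e-3},
    {'params': model[1].parameters(), 'lr': 1e-4},
])

# Each group keeps its own hyperparameters
lrs = [group['lr'] for group in optimizer.param_groups]
print(lrs)
```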
However, is this the only way to achieve it? If we instead create a separate optimizer for each set of parameters, is the result the same?
With a stateless optimizer like SGD there is clearly no difference. But what about an optimizer like Adam, which keeps moving averages of gradients from previous iterations? Is it still the same?
I looked into the source code of Adam, and found the following part:
for group in self.param_groups:
    for p in group['params']:
        if p.grad is None:
            continue
        grad = p.grad
        if grad.is_sparse:
            raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
        amsgrad = group['amsgrad']

        state = self.state[p]

        # State initialization
        if len(state) == 0:
            state['step'] = 0
            # Exponential moving average of gradient values
            state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
            # Exponential moving average of squared gradient values
            state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
Based on the last few lines, the moving averages are stored per parameter (state = self.state[p]) and updated within each parameter group separately, so the optimizer never uses a global view of all model parameters. I therefore think creating separate Adam optimizers for different parameter groups is equivalent; it feels like two different APIs for the same function. Am I right about that?
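One way to probe this empirically (a quick sketch, not a proof; the model and data here are made up) is to run both setups on identical inputs from identical initial weights and compare the resulting parameters:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model_a = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
model_b = copy.deepcopy(model_a)  # identical initial weights

# Setup A: one Adam with two parameter groups
opt_a = torch.optim.Adam([
    {'params': model_a[0].parameters(), 'lr': 1e-3},
    {'params': model_a[1].parameters(), 'lr': 1e-2},
])

# Setup B: two independent Adam optimizers, one per layer
opt_b0 = torch.optim.Adam(model_b[0].parameters(), lr=1e-3)
opt_b1 = torch.optim.Adam(model_b[1].parameters(), lr=1e-2)

x = torch.randn(16, 4)
for _ in range(5):
    # One step with the single grouped optimizer
    opt_a.zero_grad()
    model_a(x).sum().backward()
    opt_a.step()

    # One step with the pair of separate optimizers
    opt_b0.zero_grad()
    opt_b1.zero_grad()
    model_b(x).sum().backward()
    opt_b0.step()
    opt_b1.step()

# Because Adam keeps its state per parameter, both runs should match
all_match = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(all_match)
```

Of course this only checks one configuration; things like learning-rate schedulers or gradient clipping that operate on a whole optimizer would still behave differently across the two setups.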