Is using separate optimizers equivalent to specifying different parameter groups for the same optimizer?

In PyTorch, we can pass parameter groups when initializing an optimizer. This is a very useful feature. For example, we can put the parameters of different layers into different groups so that each layer has its own learning rate.
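To make that concrete, here is a minimal sketch of per-layer learning rates with parameter groups. The toy model and the learning-rate values are made up for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer toy model, used only for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Each dict is one parameter group. Options that a group does not set
# fall back to the defaults passed as keyword arguments.
optimizer = torch.optim.Adam(
    [
        {"params": model[0].parameters(), "lr": 1e-3},  # first layer: own lr
        {"params": model[2].parameters()},              # second layer: default lr
    ],
    lr=1e-2,
)

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```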

However, is this the only way to achieve it? If we instead create separate optimizers for different parameters, is the result the same?
If we are using a stateless optimizer such as plain SGD (without momentum), there is apparently no difference. If we are using something like Adam, which needs to keep moving averages of the gradients from previous iterations, is it still the same?

I looked into the source code of Adam, and found the following part:

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                if grad.is_sparse:
                    raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
                amsgrad = group['amsgrad']

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)

Based on the state initialization at the end of this snippet, the moving averages are stored per parameter in self.state[p], so they are computed separately for each parameter and the optimizer never takes a global view of all model parameters. So I think it’s still equivalent to create separate Adam optimizers for different parameter groups. I feel it’s just using a different API to achieve the same result. Am I right about that?

This should be the case, as the step function iterates over all groups, as you’ve already pointed out.
As a quick test you could seed the code properly and compare both approaches (different param_groups and different optimizers), which should yield the same weight update and states.
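A minimal sketch of such a test could look like the following. The toy model, the random data, the learning rates, and the number of steps are all assumptions chosen for illustration, so treat it as a sanity check for this setup rather than a general proof.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two identical copies of a hypothetical toy model.
model_a = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model_b = copy.deepcopy(model_a)

# Approach A: one Adam optimizer with two parameter groups.
opt_a = torch.optim.Adam([
    {"params": model_a[0].parameters(), "lr": 1e-3},
    {"params": model_a[2].parameters(), "lr": 1e-2},
])

# Approach B: two separate Adam optimizers with the same settings.
opts_b = [
    torch.optim.Adam(model_b[0].parameters(), lr=1e-3),
    torch.optim.Adam(model_b[2].parameters(), lr=1e-2),
]

x = torch.randn(16, 4)
y = torch.randn(16, 2)

for _ in range(5):
    opt_a.zero_grad()
    nn.functional.mse_loss(model_a(x), y).backward()
    opt_a.step()

    for opt in opts_b:
        opt.zero_grad()
    nn.functional.mse_loss(model_b(x), y).backward()
    for opt in opts_b:
        opt.step()

# Compare the updated weights of both models.
pairs = list(zip(model_a.parameters(), model_b.parameters()))
weights_match = all(torch.allclose(pa, pb) for pa, pb in pairs)

# Merge the per-parameter state of the separate optimizers and compare
# the moving averages with those of the single optimizer.
state_b = {}
for opt in opts_b:
    state_b.update(opt.state)
states_match = all(
    torch.allclose(opt_a.state[pa][key], state_b[pb][key])
    for pa, pb in pairs
    for key in ("exp_avg", "exp_avg_sq")
)

print(weights_match, states_match)
```

Both models have to start from the same weights and see the same data in the same order, otherwise the comparison says nothing about the optimizers.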
