Optimizers: good practices for handling multiple param groups


I am facing the following problem and want to solve it in the best possible way in PyTorch. The three questions I end up with are:

  • Can I add parameters to a parameter group in an optimizer?
  • Can I merge two parameter groups that use the same learning rate?
  • Do we suffer (a lot) in performance if our model has one parameter group per parameter?

These questions come from the following problem. I am training a model with a momentum-based optimizer such as Adam. If I reset the optimizer's momentum buffers, my loss starts to oscillate for a while (this is the first time I have seen this kind of behavior from resetting an optimizer's momentum).

What I am trying to do is the following. I have a deep model that I want to train layer-wise, because it is the only way I can make it learn. So I instantiate my model, train it for a while, and once it reaches a given validation error I add a new layer. This new layer is trained with a higher learning rate for a couple of epochs. After these epochs, I set the learning rate of this layer to the same value as the rest of the model and add another new layer to be trained. That layer again gets a higher learning rate, which is lowered in turn when the next layer is added.

The problem is that if I do this I end up with lots of parameter groups (in fact, one parameter group per parameter). Most of these parameter groups will end up with the same learning rate, so I would like to have all those parameters in a single parameter group that uses that learning rate. However, if I do this directly by recreating the parameter groups after each new layer is trained, I end up resetting the momentum buffers of the optimizer (and this greatly affects the stability of my loss). So I need a way to add each new layer that I fine-tune to the initial parameter group, unless having lots of parameter groups is not a problem in PyTorch.

Many thanks.

That’s an interesting use case. I’m not sure there is a clean way of adding parameters to an existing parameter group or of merging parameter groups, so I would rather create a new parameter group for each new layer.
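As a minimal sketch of that suggestion, the snippet below uses `optimizer.add_param_group` to register a new layer with its own learning rate and checks that the Adam state of the already-trained layer is left untouched. The layer sizes, learning rates, and variable names are assumptions made for illustration, not taken from the original post.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical first layer, trained for one step so Adam builds up state.
first = nn.Linear(4, 4)
optimizer = torch.optim.Adam(first.parameters(), lr=1e-3)

x = torch.randn(16, 4)
first(x).pow(2).mean().backward()
optimizer.step()
optimizer.zero_grad()

# Snapshot the first-moment buffer of the already-trained weight.
exp_avg_before = optimizer.state[first.weight]["exp_avg"].clone()

# Add a new layer as its own parameter group with a higher learning rate.
second = nn.Linear(4, 2)
optimizer.add_param_group({"params": second.parameters(), "lr": 1e-2})

# The existing state is keyed by parameter, so it should be unchanged.
state_kept = torch.equal(optimizer.state[first.weight]["exp_avg"], exp_avg_before)
num_groups = len(optimizer.param_groups)
print(state_kept, num_groups)
```

Only the parameters that have already taken a step have state entries at this point; the new layer's buffers are created on its first `step()`.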

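Regarding performance, Adam iterates over the parameters with a nested loop (the outer loop over the parameter groups, the inner loop over the parameters in each group), and I doubt that moving work from the outer loop into the inner loop will save you much time.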
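To make the loop structure concrete, here is a schematic in plain Python. It is not PyTorch's actual implementation, and `count_updates` is a hypothetical helper; it only shows that the number of per-parameter updates is the same whether the parameters sit in one group or in one group each. Newer PyTorch releases can also batch the updates within a group (the `foreach` option of `torch.optim.Adam`), so this count is only part of the picture and measuring on the real model is advisable.

```python
def count_updates(param_groups):
    """Count per-parameter updates in a group-then-parameter nested loop."""
    updates = 0
    for group in param_groups:        # outer loop: one pass per group
        for _param in group["params"]:  # inner loop: one update per parameter
            updates += 1
    return updates


# Placeholder parameter names stand in for real tensors.
params = [f"p{i}" for i in range(100)]

one_group = [{"params": params, "lr": 1e-3}]
many_groups = [{"params": [p], "lr": 1e-3} for p in params]

print(count_updates(one_group), count_updates(many_groups))
```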

Did you see any performance drops or any other issues?

Up to this moment I haven’t coded it up, as I first wanted to see which is the best way to do it. I was thinking of using an approach similar to the one you propose: create a parameter group for each new parameter and then loop over the groups whenever I want to change the learning rates.
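A rough sketch of that plan, assuming three hypothetical layers and made-up learning rates: each new layer gets its own group via `add_param_group`, and a small helper (`set_lr`, not part of PyTorch) loops over `optimizer.param_groups` to change the learning rate of selected groups.

```python
import torch
from torch import nn

# Hypothetical layers and learning rates, for illustration only.
layers = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 1)]
base_lr, warmup_lr = 1e-3, 1e-2

optimizer = torch.optim.Adam(layers[0].parameters(), lr=base_lr)
for layer in layers[1:]:
    # One parameter group per newly added layer.
    optimizer.add_param_group({"params": layer.parameters(), "lr": warmup_lr})


def set_lr(optimizer, lr, group_indices=None):
    """Set `lr` on the given groups (all groups if no indices are given)."""
    groups = optimizer.param_groups
    if group_indices is None:
        group_indices = range(len(groups))
    for i in group_indices:
        groups[i]["lr"] = lr


# Drop the second layer back to the base rate; the third keeps warming up.
set_lr(optimizer, base_lr, group_indices=[1])
current_lrs = [group["lr"] for group in optimizer.param_groups]
print(current_lrs)
```

Changing `lr` in place like this leaves the optimizer state alone, so the momentum buffers are not reset.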

I think it is the cleanest and least bug-prone option. I guess the only drawback is that we end up with a bigger loop over self.param_groups.

From my check of the docs there is no cleaner way to handle what I propose in this post, so yes, I think the safest option is to have several parameter groups in the optimizer. I will come back if I see a big drop in performance or any strange behavior.
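For completeness, here is a hedged sketch of merging two groups by editing `optimizer.param_groups` in place. This is not a documented PyTorch API; it relies on the assumption that the optimizer state is keyed by parameter rather than by group, and `merge_param_groups` is a hypothetical helper. Anything that keeps per-group bookkeeping (learning-rate schedulers, for instance) may not cope with groups disappearing, so treat it as an experiment rather than a recommendation.

```python
import torch
from torch import nn


def merge_param_groups(optimizer, dst, src):
    """Move the params of group `src` into group `dst` (hypothetical helper).

    The merged params take on the options (lr, betas, ...) of group `dst`.
    """
    groups = optimizer.param_groups
    groups[dst]["params"].extend(groups[src]["params"])
    del groups[src]


torch.manual_seed(0)
base, new = nn.Linear(4, 4), nn.Linear(4, 2)
optimizer = torch.optim.Adam(base.parameters(), lr=1e-3)
optimizer.add_param_group({"params": new.parameters(), "lr": 1e-2})


def train_step():
    optimizer.zero_grad()
    new(base(torch.randn(16, 4))).pow(2).mean().backward()
    optimizer.step()


train_step()                                  # new layer at the higher lr
merge_param_groups(optimizer, dst=0, src=1)   # fold it into the base group
train_step()                                  # continues with existing state

# The step counter of the merged parameter keeps counting across the merge.
steps_seen = float(optimizer.state[new.weight]["step"])
print(len(optimizer.param_groups), steps_seen)
```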