Optimizers: good practices for handling multiple param groups

Hello.

I am facing the following problem and I want to solve it using the best possible option in PyTorch. The questions that I end up having are:

  • Can I add parameters to a parameter group in an optimizer?
  • Can I merge two parameter groups that use the same learning rate?
  • Do we suffer (a lot) in performance if our model has one parameter group per parameter?

These questions come from the following problem. I am training a model using a momentum-based optimizer such as Adam. The issue I face is that if I reset the optimizer's momentum buffers, my loss starts to oscillate for a while (this is the first time I have seen this kind of behavior from resetting the momentum of an optimizer).

What I am trying to do is the following. I have a deep model that I want to train layer-wise, because that is the only way I can make it learn. So I instantiate my model, train it for a while, and after reaching a target validation error I add a new layer. This new layer is trained with a higher learning rate for a couple of epochs. After these epochs, I change this layer's learning rate to match the rest of the model and add another new layer to be trained. The new layer again gets a higher learning rate, which is lowered again when the next layer is added.
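Roughly, the schedule looks like this (a minimal sketch; the layer sizes and helper names are just placeholders for my real architecture):

```python
import torch
import torch.nn as nn

base_lr, warmup_lr = 1e-4, 1e-3

# Start with a single layer; the Linear layers are placeholders.
layers = nn.ModuleList([nn.Linear(16, 16)])
optimizer = torch.optim.Adam(layers.parameters(), lr=base_lr)

def add_layer(new_layer):
    """Register a newly added layer as its own param group with a higher lr."""
    layers.append(new_layer)
    optimizer.add_param_group({"params": new_layer.parameters(), "lr": warmup_lr})

def end_warmup():
    """After the warm-up epochs, drop the newest group's lr to the base lr."""
    optimizer.param_groups[-1]["lr"] = base_lr

# ... train for a while, then grow the model:
add_layer(nn.Linear(16, 16))
# ... train the new layer at the higher lr for a few epochs, then:
end_warmup()
add_layer(nn.Linear(16, 16))
```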

The problem is that if I do this I end up with lots of parameter groups (in fact, one parameter group per parameter). Most of these groups will end up using the same learning rate, so I would like to have all of those parameters in a single parameter group with that shared learning rate. However, if I do this directly by re-creating the parameter groups after each new layer is learned, I end up resetting the optimizer's momentum buffers (and this badly affects the stability of my loss). So I need a way to add each of the newly fine-tuned layers to the initial parameter group, unless having lots of parameter groups is not a problem in PyTorch.
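For concreteness, this is the kind of manipulation I would need. It is only a sketch and not an official API: it pokes directly at optimizer.param_groups and relies on the fact that Adam's exp_avg / exp_avg_sq buffers live in optimizer.state keyed by the parameter tensor rather than by the group, so rearranging the groups in place should leave the momentum untouched. It also assumes all other hyperparameters are identical across the merged groups.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
# One group per layer, as in the layer-wise setup above.
optimizer = torch.optim.Adam(
    [{"params": layer.parameters(), "lr": 1e-4} for layer in model]
)

def merge_groups_with_same_lr(optimizer):
    # Sketch only: merge groups that share a learning rate without
    # touching optimizer.state (where the momentum buffers live).
    merged = {}
    for group in optimizer.param_groups:
        lr = group["lr"]
        if lr in merged:
            merged[lr]["params"].extend(group["params"])
        else:
            merged[lr] = group
    optimizer.param_groups = list(merged.values())

merge_groups_with_same_lr(optimizer)  # now a single group with lr=1e-4
```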

Many thanks.


That’s an interesting use case. I’m not sure if there is a clean way of adding parameters to an existing parameter group or merging parameter groups, and I would rather create a new parameter group for each new layer.

Regarding the performance, Adam will use a nested loop (the outer over the param groups, the inner over each group’s parameters), and I doubt that moving work from the outer into the inner loop will save you much time.
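For reference, the loop roughly has this shape (paraphrased, not the actual source):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Paraphrased shape of Adam's step(): the outer loop runs once per param
# group, the inner loop once per parameter, so the total per-parameter
# work is the same whether the parameters live in one group or in many.
for group in optimizer.param_groups:
    for p in group["params"]:
        if p.grad is None:
            continue
        state = optimizer.state[p]  # exp_avg / exp_avg_sq buffers live here
        # ... compute the Adam update for p from group["lr"], group["betas"], state ...
```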

Did you see any performance drops or any other issues?


Up to this moment I haven’t coded it up, as I wanted to first see which is the best way to do it. I was thinking of using a similar approach to the one you propose, which is to create a parameter group per new parameter and then loop over them if I want to change the learning rates.

I think it is the cleanest and most bug-safe option. I guess the only downside is that we end up with a bigger loop over self.param_groups.
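I.e., something like this (with one group per layer, as discussed above):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
optimizer = torch.optim.Adam(
    [{"params": layer.parameters(), "lr": 1e-3} for layer in model]
)

# Changing the learning rates later is just a loop over the groups.
new_lr = 1e-4
for group in optimizer.param_groups:
    group["lr"] = new_lr
```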

From my check of the docs there is no cleaner way to handle what I propose in this post, so yes, I think the safest option is to have several groups in the optimizer. I will come back if I see a big drop in performance or any strange behavior.

Hey, I’ve encountered the same question and went with the one param group per parameter approach only to discover that it has terrible performance, at least with the fused AdamW in PyTorch 2.0.

I had 68 parameters, each with its own group and one epoch (on a toy model) took me 13 minutes. Then I implemented a method to keep parameters with the same settings together and this got me down to only 4 groups and the training time went down to 5 minutes.
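The grouping logic is roughly the following (simplified; the settings_for policy below is only an illustration, not my real code):

```python
import torch
import torch.nn as nn
from collections import defaultdict

def settings_for(name, param):
    # Illustrative policy: lower lr for the first block, no weight decay on biases.
    lr = 1e-4 if name.startswith("0.") else 1e-3
    wd = 0.0 if name.endswith("bias") else 1e-2
    return lr, wd

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16), nn.Linear(16, 4))

# Bucket parameters by their settings instead of one group per parameter.
buckets = defaultdict(list)
for name, p in model.named_parameters():
    buckets[settings_for(name, p)].append(p)

param_groups = [
    {"params": params, "lr": lr, "weight_decay": wd}
    for (lr, wd), params in buckets.items()
]
# 4 groups here instead of 6 per-parameter groups; on CUDA you can also
# pass fused=True to use the fused AdamW from the discussion above.
optimizer = torch.optim.AdamW(param_groups)
```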

I wanted to share this here since I found this topic when searching whether it’s safe to have a separate group for each parameter.

That’s an interesting observation, as it seems the (faster) foreach approach cannot handle the param groups of your setup efficiently. Let me add @crcrpar as the code owner to chime in if that’s expected.
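For context, my understanding is that the multi-tensor (foreach) and fused implementations batch the per-parameter work within each param group, so one group per parameter removes most of that batching. Both can also be requested explicitly when creating the optimizer:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)

# foreach=True selects the multi-tensor implementation; fused=True
# additionally requires the parameters to live on a CUDA device.
opt_foreach = torch.optim.AdamW(model.parameters(), lr=1e-3, foreach=True)
# opt_fused = torch.optim.AdamW(model.cuda().parameters(), lr=1e-3, fused=True)
```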