I am facing the following problem and I want to solve it using the best possible option in PyTorch. The questions I end up having are:
- Can I add parameters to a parameter group in an optimizer?
- Can I merge two parameter groups that use the same learning rate?
- Do we suffer (a lot) in performance if our model has one parameter group per parameter?
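To make the first two questions concrete, here is roughly what I mean (toy layers, made-up names; I am not sure this is the supported way to do it):

```python
import torch

# Start with an optimizer over one layer, then try to fold a second
# layer's parameters into the same (single) parameter group in place.
old_layer = torch.nn.Linear(4, 4)
new_layer = torch.nn.Linear(4, 4)

opt = torch.optim.Adam(old_layer.parameters(), lr=1e-3)
opt.param_groups[0]["params"].extend(new_layer.parameters())

# Still one group, now holding both layers' weight and bias tensors.
print(len(opt.param_groups), len(opt.param_groups[0]["params"]))  # → 1 4
```

Is mutating `param_groups[0]["params"]` like this legal, or is there an official API for merging/extending groups?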
These questions come from the following problem. I am training a model with a momentum-based optimizer such as Adam. The issue is that if I reset the momentum buffers, my loss function starts to oscillate for a while (this is the first time I have seen this kind of behavior from resetting an optimizer's momentum).
What I am trying to do is the following. I have a deep model that I want to train layer-wise, because it is the only way I can make it learn. So I instantiate my model, train it for a while, and after reaching a target validation error I add a new layer. This new layer is trained with a higher learning rate for a couple of epochs. After these epochs, I lower that layer's learning rate to match the rest of the model and add another new layer to be trained. The new layer again gets a higher learning rate, which is lowered in turn when the next layer is added.
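In code, the procedure looks roughly like this (layer sizes and helper names are made up for illustration):

```python
import torch
import torch.nn as nn

base_lr, boost_lr = 1e-3, 1e-2  # shared rate vs. rate for the newest layer

layers = nn.ModuleList([nn.Linear(8, 8)])
opt = torch.optim.Adam(layers.parameters(), lr=base_lr)

def add_layer(layers, opt):
    new_layer = nn.Linear(8, 8)
    layers.append(new_layer)
    # Each new layer starts in its own group with the boosted rate.
    opt.add_param_group({"params": new_layer.parameters(), "lr": boost_lr})

def settle_last_layer(opt):
    # After a couple of epochs, drop back to the shared rate. The extra
    # group remains, which is how the groups pile up over time.
    opt.param_groups[-1]["lr"] = base_lr

add_layer(layers, opt)
settle_last_layer(opt)
add_layer(layers, opt)
# One group per growth step: the original group plus one per added layer.
print([g["lr"] for g in opt.param_groups])  # → [0.001, 0.001, 0.01]
```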
The problem is that if I do this I end up with lots of parameter groups (in fact, one parameter group per parameter). Most of these groups end up with the same learning rate, so I would like to have all of those parameters in a single group sharing that rate. However, if I do this directly by re-creating the parameter groups after each new layer is trained, I end up resetting the optimizer's momentum buffers (and this badly affects the stability of my loss). So I need a way to add each newly fine-tuned layer to the initial parameter group, unless having lots of parameter groups is not a problem in PyTorch.
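To show the momentum reset I mean, here is a toy version of "re-creating the parameter groups" (made-up model, for illustration only):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
opt = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-3},
    {"params": model[1].parameters(), "lr": 1e-2},
])

# One step so Adam populates exp_avg / exp_avg_sq for every parameter.
model(torch.randn(2, 4)).sum().backward()
opt.step()
print(len(opt.state))  # → 4 (one state entry per weight/bias tensor)

# Rebuilding the optimizer with a single merged group throws that state
# away, which is exactly the momentum reset that makes my loss oscillate.
merged = [p for g in opt.param_groups for p in g["params"]]
opt = torch.optim.Adam(merged, lr=1e-3)
print(len(opt.state))  # → 0
```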