Changing `requires_grad` during training

(MinkyuChoi) #1

I understand that tensors with requires_grad=False do not have gradients computed for them. I wonder whether I can change the requires_grad flag during training in order to train or freeze parts of the network. I want to train some parameters when certain conditions are met and freeze them otherwise. Example code is below. When I change this flag, will the optimizer pick up the change automatically, or do I have to do something to let the optimizer know about it?

import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=lr)
for epoch in range(total_epochs):
    # ... training goes on ...

    if certain_condition:
        # 'certain_parameter' is one of the parameters in model.parameters()
        certain_parameter.requires_grad = False

    # ... training goes on ...

(Alban D) #2


This will work to freeze the gradients to 0.
Be careful, though: many optimizers will still update the weights even when the gradient is 0, for example if you use regularization (such as weight decay) or momentum.
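
To make the warning concrete, here is a minimal sketch using `torch.optim.SGD` on a single toy parameter (the parameter, learning rate, momentum and weight-decay values are made up for illustration). It shows the weight moving on a step whose gradient is exactly zero, first because of momentum and then because of weight decay, and it also shows that the optimizer skips a parameter whose `.grad` is `None`.

```python
import torch

# A single toy parameter, used only for illustration.
w = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

# One ordinary step with a non-zero gradient fills the momentum buffer.
w.grad = torch.tensor([1.0])
opt.step()
after_first = w.detach().clone()

# Now the gradient is exactly zero, as if the parameter were "frozen".
w.grad = torch.zeros_like(w)
opt.step()
after_second = w.detach().clone()

# The momentum buffer alone is enough to move the weight.
moved_by_momentum = not torch.equal(after_first, after_second)

# A gradient of None (rather than zero) makes the optimizer skip the parameter.
w.grad = None
opt.step()
after_third = w.detach().clone()
skipped_when_none = torch.equal(after_second, after_third)

# Weight decay has the same effect as momentum: a zero gradient still updates.
w2 = torch.nn.Parameter(torch.tensor([1.0]))
opt2 = torch.optim.SGD([w2], lr=0.1, weight_decay=0.5)
w2.grad = torch.zeros_like(w2)
opt2.step()
moved_by_weight_decay = not torch.equal(w2.detach(), torch.tensor([1.0]))

print(moved_by_momentum, skipped_when_none, moved_by_weight_decay)
```

So whether a "frozen" parameter really stays put depends on whether its gradient is zero or `None`, and on the state the optimizer already holds for it.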


Thanks! I had the same question.
What do you suggest if one wants to freeze the parameters completely?

(Alban D) #4

I think the best way is not to give them to the optimizer in the first place.
Or, if you want to update them only sometimes, use two optimizers: one that you step() at every iteration and one that you step() only when you want to update these special parameters.
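
The two-optimizer suggestion could look roughly like the sketch below. The model, the `backbone`/`head` split, the dummy data and the `step % 2 == 1` condition are all hypothetical stand-ins; only the pattern of stepping one optimizer always and the other conditionally comes from the advice above.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical model: 'head' plays the role of the special parameters.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
backbone, head = model[0], model[2]

# Each parameter belongs to exactly one optimizer.
opt_always = optim.Adam(backbone.parameters(), lr=1e-2)
opt_sometimes = optim.Adam(head.parameters(), lr=1e-2)

x, y = torch.randn(16, 4), torch.randn(16, 1)  # dummy data


def train_step(update_head):
    """Run one iteration; the head is updated only if update_head is True."""
    opt_always.zero_grad()
    opt_sometimes.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt_always.step()          # stepped at every iteration
    if update_head:
        opt_sometimes.step()   # stepped only when the condition holds
    return loss.item()


for step in range(4):
    certain_condition = step % 2 == 1  # placeholder for the real condition
    train_step(update_head=certain_condition)
```

Because `opt_sometimes.step()` is simply not called on the other iterations, its momentum-like state cannot move the head parameters while they are meant to be frozen.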


However, is there an easier way to handle the following situation?
The model has three parameters (parameter_1, parameter_2 and parameter_3).
At first, the model needs to update all of them.
Later in training, sometimes parameter_1 and parameter_2 are updated, and sometimes parameter_1 and parameter_3.
So we need three optimizers:
optimizer_1 -> parameter_1, parameter_2 and parameter_3
optimizer_2 -> parameter_1 and parameter_2
optimizer_3 -> parameter_1 and parameter_3
But how should this be handled when there are more parameters?
I hope my explanation is clear.

(Alban D) #6


You don’t want to share parameters between multiple optimizers, because the internal state kept by, for example, Adam or momentum will be wrong.
You can create one optimizer for each set of parameters and .step() all the ones that need to be updated at that iteration.
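
A minimal sketch of that idea for the three-parameter case from the question, assuming each parameter forms its own set: the optimizers are kept in a dict keyed by name, and each iteration steps only the ones listed as active. The tensors, the toy loss and the `train_step` helper are hypothetical.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Three hypothetical parameters standing in for parameter_1..3.
params = {
    "parameter_1": nn.Parameter(torch.randn(3)),
    "parameter_2": nn.Parameter(torch.randn(3)),
    "parameter_3": nn.Parameter(torch.randn(3)),
}

# One optimizer per parameter set, so no parameter is shared between optimizers.
optimizers = {name: optim.Adam([p], lr=1e-2) for name, p in params.items()}


def train_step(active):
    """Run one iteration and update only the parameters named in `active`."""
    for opt in optimizers.values():
        opt.zero_grad()
    # Toy loss that depends on every parameter.
    loss = sum((p ** 2).sum() for p in params.values())
    loss.backward()
    for name in active:
        optimizers[name].step()
    return loss.item()


train_step(["parameter_1", "parameter_2", "parameter_3"])  # update everything
train_step(["parameter_1", "parameter_2"])                 # parameter_3 untouched
train_step(["parameter_1", "parameter_3"])                 # parameter_2 untouched
```

Choosing which parameters to update then comes down to passing a different list of names, and each optimizer's internal state only ever reflects the steps actually applied to its own parameters.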


I see what you mean.
But the number of optimizers grows when there are many sets of parameters, which adds some complexity.