Trying to understand Optimizer and relation to requires_grad

Consider the case where I have constructed the optimizer as follows

model = load_some_pretrained_model()
optimizer = torch.optim.some_optim(model.parameters(), arg1, arg2)

# Later I set requires_grad of some params to False.
for param in model.parameters():
    if param in arbit_list:
        param.requires_grad = False

# ... Some more code for doing training
optimizer.step()

I wanted to confirm whether the changes to requires_grad are reflected in optimizer.step() as well. I do understand the graph is reconstructed every time the code is run, so it makes sense for the optimizer to check before stepping on a parameter. But just in case there was some optimization to avoid checking it every time …

Hi,

What happens is that if you set requires_grad=False, the gradient will not be computed, and thus param.grad stays None.
As you can see, for example, in the SGD optimizer here, the check is done on every step.
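To make the per-step check concrete, here is a minimal sketch (not the actual torch source) of what the SGD step loop effectively does: any parameter whose .grad is None is skipped entirely.

```python
import torch

params = [torch.nn.Parameter(torch.ones(2)) for _ in range(2)]
params[0].grad = torch.full((2,), 0.5)  # pretend backward() produced this gradient
params[1].grad = None                   # frozen param: no gradient was computed

lr = 0.1
for p in params:
    if p.grad is None:   # the check performed on every step
        continue
    with torch.no_grad():
        p -= lr * p.grad

print(params[0])  # updated to 0.95
print(params[1])  # untouched, still 1.0
```

So as long as .grad really is None, the frozen parameter is never updated.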


This does not seem to work if the requires_grad is turned off after at least one forward/backward/step since in that case p.grad is not None anymore but 0. Or am I misunderstanding something?
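The situation described above can be reproduced with a short sketch. Note that zero_grad(set_to_none=False) is used here to force the older behaviour of filling .grad with zeros; in recent PyTorch releases the default is set_to_none=True, which resets .grad back to None and avoids this issue.

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()
loss.backward()
opt.step()
opt.zero_grad(set_to_none=False)  # fills p.grad with zeros instead of None

p.requires_grad = False  # freeze *after* training has started
print(p.grad)            # a zero tensor, not None, so step() still processes it
```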


Can someone clarify this? I have the same doubt as above:

This does not seem to work if the requires_grad is turned off after at least one forward/backward/step since in that case p.grad is not None anymore but 0. Or am I misunderstanding something?

Hi,

Yes, this is a bit of a grey area for the optimizers: they treat a gradient of None differently from a Tensor full of zeros.
The first performs no update at all.
The second performs the update with a gradient of 0, which can still be a non-zero update if there is weight decay, for example.
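The weight-decay case can be demonstrated directly: with SGD, weight decay adds weight_decay * p to the gradient, so even a gradient of exactly zero moves the parameter. A small sketch:

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.SGD([p], lr=0.1, weight_decay=0.5)

# Simulate a "frozen" parameter whose .grad is a zero tensor rather than None:
p.grad = torch.zeros(3)
opt.step()

# Effective gradient: 0 + weight_decay * p = 0.5, so p = 1 - 0.1 * 0.5 = 0.95
print(p)
```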

Can you change the behaviour to check for if not p.requires_grad: ?

Consider the scenario where you are training a GAN and you have a generator and discriminator.

Now the front of the discriminator is a pretrained CNN (like resnet18(pretrained=True)) and we don't want to train it. We only want to train the final Linear layer of the discriminator.

However, setting requires_grad=False on the CNN's parameters will not work here, since the generator output (the fake images) requires gradients. The gradient has to flow through the frozen CNN back into the generator in order to train it.

Right now, for this scenario, you need to manually select parameters from the discriminator and pass them to the optimizer to avoid changing the weights of the pretrained CNN.
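Manually selecting the parameters could look like the following sketch, where a tiny two-layer Sequential stands in (hypothetically) for the pretrained-CNN-plus-head discriminator:

```python
import torch
from torch import nn

# Stand-in discriminator: a "pretrained" front followed by a trainable head.
discriminator = nn.Sequential(
    nn.Linear(8, 8),   # stands in for the frozen pretrained CNN front
    nn.Linear(8, 1),   # the final Linear layer we actually want to train
)

# Pass only the head's parameters to the optimizer. Gradients still flow
# through the front (so the generator can be trained), but the front's own
# weights are never updated because the optimizer never sees them.
opt = torch.optim.SGD(discriminator[1].parameters(), lr=0.01)
```

The front's parameters never appear in any param group, so step() cannot touch them regardless of what their .grad tensors contain.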

If you set the requires_grad field of these parameters to False before doing anything with them, then their .grad field will be None, and they won't be updated.
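This working path (freeze before any forward/backward) can be sketched as follows; gradients still flow through the frozen parameter to everything behind it, but its own .grad stays None:

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
p.requires_grad = False          # frozen before any forward/backward
q = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.SGD([p, q], lr=0.1)

loss = ((p * q) ** 2).sum()
loss.backward()
opt.step()

print(p.grad)  # None: autograd never touched it, so step() skips it
print(q)       # updated: gradient flowed through the frozen p to reach q
```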

Can you change the behaviour to check for if not p.requires_grad: ?

This would be a major breaking change. I don’t think we can do that :confused: