Freezing the updates without freezing the gradients

Shani_Gamrian · September 13, 2017, 4:07pm

Two questions:

What exactly does param.requires_grad = False mean?
I’m trying a different kind of fine-tuning in which the first layer is new and the rest of the network is already trained. I want to update only the weights of the first layer. To do that, I want the whole network to back-propagate and pass the loss down to the first layer, but without updating any of the other layers. How can I do that?

tom · September 13, 2017, 7:19pm

param.requires_grad = False will cause the parameter not to get a gradient stored in the loss.backward() call.
When you set param.requires_grad = False on some parameters, it will not affect the gradient calculation for the others. Note that while the errors need to backpropagate through the layers for your set-up, the parameters of the layers are leaf nodes: the backpropagation continues through each layer's input, not through its parameters.

Best regards

Thomas

greaber · September 13, 2017, 8:05pm

When you create an optimizer, you pass it the parameters you want it to optimize, so just give it the ones in the first layer.

Note also that when you call backward, PyTorch will need to compute grads of variables other than the ones you passed to the optimizer, but they won’t be leaf variables. When you call zero_grad on the optimizer, it zeroes the grads of the parameters you passed when you created it. It doesn’t need to zero the grads of the other variables it computed grads for, because grads only accumulate in leaf variables.
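As a minimal sketch of the optimizer point above (the toy network and the names `new_layer` and `pretrained` are made up for illustration, not taken from the original question): only the first layer's parameters are passed to the optimizer, so only they change on step(), even though backward() runs through the whole network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network: `new_layer` stands in for the new first layer,
# `pretrained` for the already-trained rest of the network.
new_layer = nn.Linear(4, 4)
pretrained = nn.Sequential(nn.Tanh(), nn.Linear(4, 2))
model = nn.Sequential(new_layer, pretrained)

# Only the first layer's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(new_layer.parameters(), lr=0.1)

before_new = new_layer.weight.detach().clone()
before_old = pretrained[1].weight.detach().clone()

x = torch.randn(8, 4)
target = torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), target)

optimizer.zero_grad()
loss.backward()   # gradients flow through every layer
optimizer.step()  # but only the first layer is updated

first_layer_changed = not torch.equal(before_new, new_layer.weight)
rest_unchanged = torch.equal(before_old, pretrained[1].weight)

# requires_grad is still True on the later layers in this sketch, so
# backward() stored a gradient on them even though no optimizer uses it.
rest_has_grad = pretrained[1].weight.grad is not None
```

In this sketch the later layers still accumulate gradients that nothing ever reads or resets, since the optimizer's zero_grad only touches the parameters it was given.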

Shani_Gamrian · September 13, 2017, 8:19pm

That’s exactly what I did. So you’re saying that even though I gave the optimizer only the params of the first layer and set param.requires_grad = False on the others, the loss still backpropagates all the way to the first layer?

greaber · September 13, 2017, 9:27pm

Yes. The parameters you set requires_grad to False on are leaf variables anyway; you don’t have to backpropagate through them to get to the variables you want grads for. It is still a good idea to set requires_grad to False though, so that they don’t needlessly accumulate grads.
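A small sketch of this, assuming a made-up two-part network (`new_layer` and `pretrained` are hypothetical names): the later layers have requires_grad set to False, yet the first layer still receives a gradient, while the frozen parameters get none.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network for illustration only.
new_layer = nn.Linear(4, 4)
pretrained = nn.Sequential(nn.Tanh(), nn.Linear(4, 2))
model = nn.Sequential(new_layer, pretrained)

# Freeze everything except the first layer.
for p in pretrained.parameters():
    p.requires_grad = False

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Frozen parameters are leaves that do not require grad: nothing is stored.
frozen_grads = [p.grad for p in pretrained.parameters()]

# The first layer sits "behind" the frozen layers but still gets its gradient,
# because backpropagation goes through the frozen layers' inputs.
first_grad = new_layer.weight.grad
```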

greaber · September 13, 2017, 9:32pm

(I’m assuming you set requires_grad to False on all the parameters except those in the first layer that you want grads for. I guess you don’t explicitly say that that’s what you did.)

Shani_Gamrian · September 14, 2017, 6:59am

Yes, I did that. But how come gradients are being calculated in layers whose parameters I didn’t give to an optimizer?

vabh · September 14, 2017, 8:16am

Hi,

The optimizer accesses the gradients through the ‘.grad’ attribute of the specific parameters that you assigned to the optimizer object. It does not compute the gradients; it only uses them when .step() is called.

The computation of these gradients is independent of the optimizer. The gradients are computed (and set as .grad) when you do the .backward() call.
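To make the split concrete, here is a tiny sketch with a single hand-made parameter `w` (a hypothetical example, not from the thread): backward() fills in .grad before any optimizer exists, and the optimizer's step() then simply reads that attribute.

```python
import torch

# A single leaf parameter; no optimizer has been created yet.
w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w ** 2).sum()
loss.backward()

# d/dw sum(w^2) = 2w, stored on the leaf by backward() alone.
grad_after_backward = w.grad.clone()

# The optimizer is created afterwards and just consumes w.grad in step().
optimizer = torch.optim.SGD([w], lr=0.1)
optimizer.step()

# Plain SGD update: w <- w - lr * grad = [1 - 0.2, 2 - 0.4]
updated_w = w.detach().clone()
```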

greaber · September 14, 2017, 8:26am

Just to add to what Anuvabh writes, setting requires_grad to False won’t set the grad to zero or None. If a variable already has a grad from a previous backward pass, setting requires_grad to False on it will just stop future backward passes from accumulating their gradients into it.
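A short sketch of that behaviour, using made-up tensors `w` and `x`: the gradient from the first backward pass stays on `w` after requires_grad is switched off, and a second backward pass leaves it untouched.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
(w ** 2).sum().backward()
grad_before = w.grad.clone()  # 2 * w

# Switching the flag off (allowed on leaf tensors) does not clear w.grad.
w.requires_grad = False
grad_after_flag = w.grad.clone()

# A second backward pass that uses w no longer accumulates into w.grad.
x = torch.tensor([3.0], requires_grad=True)
(w.sum() * x).sum().backward()
grad_after_second_pass = w.grad.clone()

# x still requires grad, so it receives d/dx (w.sum() * x) = w.sum().
x_grad = x.grad.clone()

# To actually drop the stale gradient, reset it explicitly.
w.grad = None
```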

Shani_Gamrian · September 14, 2017, 8:55am

Thank you, now I get it. One more question: do you think it’s possible to fine-tune when the layer I add is connected to the input instead of the output? I tried to train the network, and obviously when I used the same input it learned very fast, but when I changed the input (added some noise to the image) it didn’t learn at all.

greaber · September 14, 2017, 9:47am

Great, glad everything is clear now. As for what you are trying, I don’t know if it can work. Is it based on some paper? Maybe pretrain a denoiser, or gradually increase the noise from zero. It may also be that one new layer at the beginning does not have enough capacity.