Freezing part of the Layer Weights

Lok · November 3, 2017, 12:53pm

Hello Everyone,

How can I freeze only some of a layer's weights at zero, rather than the entire layer?
I tried the code below, but it doesn't freeze the specific part (the 1:10 slice along the 2nd dimension) of the layer weights.
I am new to ML and have started with PyTorch. I'd appreciate any help. Thanks.

for child in model_ft.children():
    print("Freezing Parameters(1->10) on the Convolution Layer", child)
    for param in child.parameters():
        param.data[:, 1:10, :, :].zero_()
        param.data[:, 1:10, :, :].requires_grad = False

optimizer_ft = OPTIM.SGD(filter(lambda p: p.requires_grad, model_ft.parameters()), lr=0.001, momentum=0.9)

sonsang · November 3, 2017, 1:12pm

I think freezing specific parts of a parameter is not possible in PyTorch, because the requires_grad flag is set on each Parameter (a collection of weights), not on each individual weight.

One possible approach is to zero the gradient manually before you call the optimizer's step function.

After you calculate the gradient using the backward() function, call

param.grad[:, 1:10, :, :] = 0

to achieve what you want.

Further, you can automate this method using a backward hook in PyTorch.

You can check this link if you want.
http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html
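
The manual approach described above can be sketched as follows. This is only a minimal illustration: the stand-in layer `conv`, the random input `x`, and the toy loss are assumptions made for the example, not part of the original setup. It uses plain SGD without momentum or weight decay, so a zero gradient means a zero update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(16, 8, kernel_size=3)  # hypothetical stand-in layer; weight shape (8, 16, 3, 3)
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)  # plain SGD: zero grad => no update

# Zero the slice once, outside of autograd, so it starts at zero.
with torch.no_grad():
    conv.weight[:, 1:10, :, :].zero_()
start = conv.weight.detach().clone()

x = torch.randn(4, 16, 8, 8)  # dummy input batch
for _ in range(3):
    optimizer.zero_grad()
    loss = conv(x).pow(2).mean()  # toy loss, only for illustration
    loss.backward()
    # Discard the gradient of the slice between backward() and step().
    conv.weight.grad[:, 1:10, :, :] = 0
    optimizer.step()

frozen_is_zero = bool((conv.weight[:, 1:10] == 0).all())
others_changed = not torch.equal(start[:, 10:], conv.weight.detach()[:, 10:])
```

With these assumptions the slice stays at zero while the remaining weights are trained as usual.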

Lok · November 6, 2017, 4:55pm

Thanks Sonsang,
This works: I can see that the required weights are zero after training, but the training accuracy has come down. Initializing the non-zero weights with xavier_normal didn't help.
I'm not sure if I am missing something else in my setup.

Does this approach work the same as freezing the weights to zero at the beginning?

sonsang · November 6, 2017, 5:58pm

Do you mean that the training accuracy goes down during training, or that it decreased after you zeroed out some gradients?

In the first case, I’m not sure what is happening.

Maybe the full code would help in understanding that situation.

In the second case, it is natural, because you are using fewer parameters than the original model.

Thank you.

Lok · November 6, 2017, 7:00pm

It’s the second case, I understand. Thank you.

Brando_Miranda · April 9, 2018, 11:12pm

Just to double check, how is this mathematically correct? Since the chain rule involves products (and sums), couldn't it lead to earlier layers having zero gradient even though they should not? Consider a simple case with 3 numbers a, b, c, where the derivative would be the product a’*b’*c’. Then setting b’=0 yields the wrong derivative, when it should be a’*b’, no?

sonsang · April 10, 2018, 7:34am

It does not mean that we set the gradient of b to zero.

param.grad[:, 1:10, :, :] = 0

This line just means that we discard the computed gradient of the loss w.r.t. those weights, so that they are not updated.

You do not have to worry about your example, because all gradients have already been calculated by the backward() call.

We modify the gradient after backward() has finished, so the other parameters are safe.
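
A small check of this point, as a sketch: the two-layer network `net` and the input `x` are made up for the illustration. The gradient of the first layer is stored during backward(), so overwriting part of the last layer's stored `.grad` afterwards cannot change it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 5), nn.Tanh(), nn.Linear(5, 3))  # hypothetical toy network
x = torch.randn(2, 4)

# backward() runs the full chain rule and stores a gradient for every parameter.
net(x).sum().backward()
first_layer_grad = net[0].weight.grad.clone()

# Overwrite part of the last layer's stored gradient after backward() has finished.
net[2].weight.grad[:, 1:3] = 0

# The first layer's gradient was already computed and is left untouched.
first_layer_unchanged = torch.equal(net[0].weight.grad, first_layer_grad)
first_layer_nonzero = bool(net[0].weight.grad.abs().sum() > 0)
```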

Huige_Cheng · October 18, 2018, 9:02am

This works, thanks a lot. But what if I want to set a different learning rate for the first 10 rows of an embedding?
Can I just do grad *= 0.1, like this?

sonsang · October 18, 2018, 9:38am

Sure. You can do anything you want with those gradients between backward() and optimizer.step() calls.
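
As a rough sketch of that idea (the embedding size, the indices, and the learning rate are arbitrary choices for the example): scaling part of the gradient between backward() and optimizer.step() scales the update for those rows when the optimizer is plain SGD. With adaptive optimizers such as Adam the update is normalized by gradient statistics, so scaling the gradient is not equivalent to scaling the learning rate there.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(20, 4)  # hypothetical embedding: 20 rows, 4 dimensions
optimizer = torch.optim.SGD(emb.parameters(), lr=1.0)  # plain SGD, lr chosen for easy arithmetic

before = emb.weight.detach().clone()
idx = torch.tensor([0, 15])  # one row inside the first 10, one outside

optimizer.zero_grad()
emb(idx).sum().backward()    # gradient is 1.0 for every entry of rows 0 and 15
emb.weight.grad[:10] *= 0.1  # effectively a 10x smaller learning rate for rows 0..9
optimizer.step()

delta = before - emb.weight.detach()  # size of the update applied to each entry
```

Here row 0 moves by about 0.1 per entry, row 15 by about 1.0, and rows that were not looked up do not move.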

3yanlis1bos · February 8, 2019, 8:32am

Your suggestion works, and might be really useful in a lot of cases, but it may not have the same effect as requires_grad=False.

It works because it sets the selected gradients to zero, which ensures that certain weights do not change, but the gradients will still be calculated for all the weights, including the ones that have been frozen.

I am concerned about the outcome, since I am not sure it would give the same result as setting the requires_grad flag to False.

I think (though I'm not sure) that when requires_grad is set to False, the loss calculation might omit the frozen parameters' participation in inference. That would focus the loss calculation on the parameters with requires_grad set to True, and might eventually cause those parameters to be penalized as if they were responsible for all of the network's loss.

Please correct me if I am wrong. I know I am assuming a lot of things, but maybe these thoughts can help someone.

tengerye · July 7, 2019, 1:54am

From a mathematical perspective, they are the same. Freezing means no updating; equivalently, the gradients used for the update are zero.
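
One way to check the concern raised above is to compare two copies of the same network, one of them with requires_grad set to False on its last layer. The sketch below uses a made-up toy network and loss. The forward pass still uses the frozen weights, so the loss is the same, and the gradients of the remaining parameters are the same too.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net_a = nn.Sequential(nn.Linear(4, 5), nn.Tanh(), nn.Linear(5, 1))  # hypothetical toy network
net_b = copy.deepcopy(net_a)

# Freeze the last layer of the copy with the requires_grad flag.
for p in net_b[2].parameters():
    p.requires_grad = False

x = torch.randn(3, 4)
loss_a = net_a(x).pow(2).mean()
loss_b = net_b(x).pow(2).mean()
loss_a.backward()
loss_b.backward()

same_loss = torch.allclose(loss_a, loss_b)
same_first_layer_grad = torch.allclose(net_a[0].weight.grad, net_b[0].weight.grad)
frozen_has_no_grad = net_b[2].weight.grad is None  # no gradient is stored for frozen parameters
```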

Giang_Nguyen · September 11, 2019, 9:14am

I ran it like this and got no error.
Can I use it like this? This way doesn't require manually zeroing the grad as in your method:

decoder.linear.weight[:100].require_grad = False
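
A quick experiment on a small stand-in layer (the layer `linear` and its sizes are assumptions for the example, not the poster's decoder) suggests why this runs without error but does not freeze anything. The misspelled `require_grad` is just a new Python attribute on the temporary tensor returned by the slice. With the correct spelling, PyTorch is expected to refuse, because the slice is not a leaf tensor.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(4, 6)  # hypothetical stand-in for decoder.linear

# Misspelled attribute: set on a temporary slice tensor and then thrown away.
linear.weight[:3].require_grad = False

optimizer = torch.optim.SGD(linear.parameters(), lr=0.1)
before = linear.weight.detach().clone()
linear(torch.randn(5, 4)).sum().backward()
optimizer.step()

# The "frozen" rows were updated like all the others.
rows_changed = not torch.equal(before[:3], linear.weight.detach()[:3])

# Correct spelling on the slice: the flag can only be changed on leaf tensors.
try:
    linear.weight[:3].requires_grad = False
    raised = False
except RuntimeError:
    raised = True
```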

ANKUR_GUPTA1 · March 5, 2020, 2:17pm

I have this model

import torch.nn as nn
import torch.nn.functional as F

class my_model(nn.Module):
  def __init__(self):
    super(my_model,self).__init__()
    self.conv1 = nn.Conv2d(3,16,kernel_size=3,stride=1,padding=1)
    self.conv2 = nn.Conv2d(16,32,kernel_size=3,stride=1,padding=1)
    self.conv3 = nn.Conv2d(32,64,kernel_size=3,stride=1,padding=1)
    self.pool = nn.MaxPool2d(2, 2)
    self.fc1 = nn.Linear(4*4*64,64)
    self.fc2 = nn.Linear(64,10)
  def forward(self,inp):
    ab = self.pool(F.relu(self.conv1(inp)))
    ab = self.pool(F.relu(self.conv2(ab)))
    ab = self.pool(F.relu(self.conv3(ab)))
    ab = ab.view(ab.shape[0],-1)
    ab = F.relu(self.fc1(ab))
    ab = F.relu(self.fc2(ab))
    return ab

Can you tell me how I can set the gradients of the first filter of the conv1 layer to zero
using a backward hook?
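
One possible sketch, assuming a tensor hook on the weight is acceptable here: module-level backward hooks receive gradients with respect to a module's inputs and outputs rather than its parameters, so the example registers a hook on the weight tensor with `Tensor.register_hook` instead. The standalone `conv1` below mirrors the layer in the model above, and the hook name `zero_first_filter_grad` is made up for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)  # same shape as conv1 in the model

def zero_first_filter_grad(grad):
    # Return a modified copy; autograd stores the returned tensor as the gradient.
    grad = grad.clone()
    grad[0] = 0  # filter 0 is weight[0], shape (3, 3, 3)
    return grad

handle = conv1.weight.register_hook(zero_first_filter_grad)

x = torch.randn(2, 3, 32, 32)  # dummy input batch
conv1(x).pow(2).mean().backward()

first_filter_grad_is_zero = bool((conv1.weight.grad[0] == 0).all())
other_filters_have_grad = bool(conv1.weight.grad[1:].abs().sum() > 0)

handle.remove()  # call this when the hook is no longer wanted
```

The bias of the first filter (`conv1.bias[0]`) is a separate parameter entry and would need its own hook if it should be frozen as well.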

uhmbg · March 25, 2020, 5:57pm

What does this mean, that the first 100 will be frozen? I tried this, and it certainly runs without error, but I don't really understand which weights are frozen, because they are not the first 100.

Roy_Paik · October 16, 2020, 5:39am

Hi, let me ask you a quick question:
Say I have a model with 5 layers in total and I want to freeze the 3rd layer from the start. The gradient through the 3rd layer will still affect the updates of the 2nd and 1st layers, but since the 3rd layer's effect on those gradients is consistent, there won't be a problem. Is my understanding correct?

tengerye · October 16, 2020, 6:35am

Yes, you only need to remember this: calculating the gradients and updating the weights using the computed gradients are two different operations.

galidor · November 17, 2020, 4:21pm

Hi, I see this issue is going on for a really long time, so I wrote this simple repo in PyTorch. I hope someone may find it useful. It supports only partial freezing of Conv2d, Conv3d and Linear layers, but these are the most common anyway. If you want me to extend it to other layers, feel free to send me a message.

kwea123 · May 12, 2022, 11:50am

Hi, this does not always work as intended. It depends on the optimizer. For example, if you look at an optimizer like RAdam, you can see that even if the current gradient is set to zero, the parameters will still be updated (although only by a little) because of the exponential moving average of previous gradients.

To achieve exactly what we want, the only thing I can come up with is the naive implementation where we save the parameter value before opt.step() and reset it afterwards.
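
The effect can be reproduced with a tiny experiment. This sketch uses `torch.optim.Adam` as a stand-in for RAdam, since both keep exponential moving averages of past gradients. The parameter `w` and the toy loss are assumptions for the illustration.

```python
import torch

w = torch.nn.Parameter(torch.ones(4))
opt = torch.optim.Adam([w], lr=0.1)  # Adam used here in place of RAdam

# Step 1: an ordinary update, which fills the moving averages for every entry.
(w ** 2).sum().backward()
opt.step()

# Step 2: zero the gradient of the first two entries, then step again.
opt.zero_grad()
(w ** 2).sum().backward()
w.grad[:2] = 0
before = w.detach().clone()
opt.step()

# The "frozen" entries still move, driven by the moving average from step 1.
drift = (w.detach() - before)[:2].abs()
```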

Vezen_BU · March 6, 2023, 12:20am

Hi! In your case, would it be possible to use a “mask” matrix and a temporary matrix? You record the current values of the entries you do not want to update in the temporary matrix, and use the mask matrix to indicate which entries they are.
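
A minimal sketch of the mask idea, under assumed names (`mask`, `saved`) and with a made-up parameter and loss: the values are copied before the optimizer step, and the masked entries are written back afterwards, so they stay fixed regardless of the optimizer's internal state.

```python
import torch

torch.manual_seed(0)
w = torch.nn.Parameter(torch.randn(3, 4))
mask = torch.zeros_like(w, dtype=torch.bool)
mask[:, 1:3] = True  # True marks the entries that must not change
opt = torch.optim.Adam([w], lr=0.1)

initial = w.detach().clone()
for _ in range(3):
    opt.zero_grad()
    (w ** 2).sum().backward()     # toy loss, only for illustration
    saved = w.detach().clone()    # temporary copy taken before the step
    opt.step()
    with torch.no_grad():
        # Put the saved values back wherever the mask is True.
        w.copy_(torch.where(mask, saved, w))

masked_unchanged = torch.equal(w.detach()[mask], initial[mask])
others_changed = not torch.equal(w.detach()[~mask], initial[~mask])
```

This costs one extra copy of the parameter per step, which is usually acceptable for a single layer.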