Freezing part of the Layer Weights

Hello Everyone,

How can I freeze some parts of a layer's weights to zero, rather than the entire layer?
I tried the code below, but it doesn't freeze the specific part (the 1:10 slice in the 2nd dimension) of the layer weights.
I am new to ML and started with PyTorch. I appreciate any help. Thanks.

import torch.optim as OPTIM

for child in model_ft.children():
    print("Freezing Parameters(1->10) on the Convolution Layer", child)
    for param in child.parameters():
        param.data[:, 1:10, :, :].zero_()
        param.data[:, 1:10, :, :].requires_grad = False

optimizer_ft = OPTIM.SGD(filter(lambda p: p.requires_grad, model_ft.parameters()), lr=0.001, momentum=0.9)


I think freezing specific parts of a parameter is not possible in PyTorch, because the requires_grad flag is set on each Parameter (a collection of weights), not on each individual weight.

One possible approach is to manually zero the gradient before you call the optimizer.

After you calculate the gradients with the backward() call, set

param.grad[:, 1:10, :, :] = 0

to achieve what you want.

Furthermore, you can automate this using a backward hook in PyTorch.

You can check this link if you want:
http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html
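To make this concrete, here is a minimal sketch of that approach on a standalone layer (the Conv2d shapes, input, and target below are placeholders I made up for illustration):

import torch
import torch.nn as nn
import torch.optim as optim

# A stand-in conv layer for the layer you want to partially freeze.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
optimizer = optim.SGD(conv.parameters(), lr=0.001, momentum=0.9)
criterion = nn.MSELoss()

x = torch.randn(8, 16, 28, 28)       # dummy input
target = torch.randn(8, 32, 28, 28)  # dummy target

optimizer.zero_grad()
loss = criterion(conv(x), target)
loss.backward()                      # all gradients are computed here

# Zero the gradient slice for input channels 1..9 so those weights stay put.
conv.weight.grad[:, 1:10, :, :] = 0

optimizer.step()                     # the zeroed slice is not updated

# Alternatively, automate it with a hook that fires on every backward pass:
mask = torch.ones_like(conv.weight)
mask[:, 1:10, :, :] = 0
conv.weight.register_hook(lambda grad: grad * mask)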


Thanks Sonsang,
This works. I can see the required weights are set to zero after training, but the training accuracy has come down. Initializing the non-zero weights (xavier_normal) didn't help.
I'm not sure if I am missing something else in my setup.

Does this approach work the same as freezing the weights to zero from the beginning?

Do you mean that the training accuracy is going down while training, or that the training accuracy decreased after you zeroed out some gradients?

In the first case, I'm not sure what is happening.

Maybe the full code would help in understanding that situation.

In the second case, it is natural, because you are using fewer parameters than the original model.

Thank you.

It’s the second case, I understand. Thank you.

Just to double check, how is this mathematically correct? Since the chain rule involves products (and sums), couldn't it lead to earlier layers having zero gradient even though they should not? Consider a simple chain of three functions a, b, c, where the derivative is the product a' * b' * c'. Then setting b' = 0 yields the wrong derivative when it should be a' * b', no?

It does not mean that we set the gradient of b to zero.

param.grad[:, 1:10, :, :] = 0

This line means that we just discard the calculated gradient with respect to the loss, so that those parameters are not updated.

You do not have to worry about your example, because all gradients are calculated by the backward() call.

We modify the gradient after backward(), so the other parameters are safe.
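A tiny sanity check of this point, using a throwaway two-layer net (the names here are made up): the first layer's gradient is fully computed by backward(), so editing the second layer's gradient afterwards cannot corrupt it.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
loss = net(torch.randn(2, 4)).sum()
loss.backward()                                 # full chain rule runs here

before = net[0].weight.grad.clone()
net[1].weight.grad[:, 1:3] = 0                  # edit the 2nd layer's grad afterwards
assert torch.equal(net[0].weight.grad, before)  # 1st layer's grad is untouched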


This works, thanks a lot. But what if I want to set a different learning rate for the first 10 rows of an embedding?
Can I just do grad *= 0.1, like this?

Sure. You can do anything you want with those gradients between backward() and optimizer.step() calls.
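For example, here is a sketch of scaling the gradient of the first 10 rows of an embedding (the sizes and the 0.1 factor are made up for illustration):

import torch
import torch.nn as nn
import torch.optim as optim

embedding = nn.Embedding(1000, 64)
optimizer = optim.SGD(embedding.parameters(), lr=0.01)

tokens = torch.randint(0, 1000, (32,))
loss = embedding(tokens).sum()

optimizer.zero_grad()
loss.backward()
embedding.weight.grad[:10] *= 0.1  # first 10 rows learn at a tenth of the rate
optimizer.step()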

Your suggestion works, and might really be useful in a lot of cases, but it may not have the same effect as requires_grad=False.

It works since it sets the selected gradients to zero and ensures that there will be no changes to certain weights, but the gradients will still be calculated for all the weights, including the ones that have been frozen.

I am concerned about the outcome of that, since I am not sure it would produce the same kind of result as setting the requires_grad flag to False.

Because I think (not sure, though) that the loss calculation might omit the participation of frozen parameters in the inference when we set requires_grad to False, and this would create a focus on the parameters with requires_grad set to True during the loss calculation, and may eventually let those parameters be penalized as if they were responsible for all the losses of the network.

Please correct me if I am wrong. I know I am assuming a lot of things, but maybe these thoughts can help someone.


From the perspective of mathematics, they are the same. Freezing means no updating; equivalently, the gradients used for the update are zero.
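Concretely, for vanilla SGD the update is w = w - lr * g, so forcing g = 0 for an entry leaves that entry exactly where it was, which is the same outcome as never updating it. (This exact equivalence assumes a stateless update; optimizers with momentum or running averages are discussed further down the thread.)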


I ran it like this and got no error.
Can I use it like this? This way doesn't require manually zeroing the grad like your method does:

decoder.linear.weight[:100].require_grad = False

I have this model

import torch.nn as nn
import torch.nn.functional as F

class my_model(nn.Module):
    def __init__(self):
        super(my_model, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(4 * 4 * 64, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, inp):
        ab = self.pool(F.relu(self.conv1(inp)))
        ab = self.pool(F.relu(self.conv2(ab)))
        ab = self.pool(F.relu(self.conv3(ab)))
        ab = ab.view(ab.shape[0], -1)
        ab = F.relu(self.fc1(ab))
        ab = F.relu(self.fc2(ab))
        return ab

Can you tell me how I can set the gradients of the first filter of the conv1 layer to zero using a backward hook?
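One way is Tensor.register_hook on conv1's weight; here is a minimal sketch (the hook fires on every backward() call, and the tensor it returns replaces the gradient):

model = my_model()

def zero_first_filter(grad):
    # Receives conv1.weight's gradient during backward(); the returned
    # tensor replaces it. Clone first rather than modifying grad in place.
    grad = grad.clone()
    grad[0] = 0  # the first output filter is index 0 along dim 0
    return grad

model.conv1.weight.register_hook(zero_first_filter)
# If the filter's bias should be frozen too, hook model.conv1.bias the same way.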

What does this mean, that the first 100 rows will be frozen? I tried it, and it certainly doesn't fail, but I don't really understand which weights are actually frozen, because they don't seem to be the first 100?

Hi, let me ask you a quick question:
When I have a model with 5 layers in total and I want to freeze the 3rd layer, the gradient of the 3rd layer will still affect the updates of the 2nd and 1st layers, but the 3rd layer's contribution to those gradients is consistent, so there won't be a problem. Is my understanding correct?

Yes, you only need to remember this: calculating the gradients and updating the weights using the computed gradients are two different operations.


Hi, I see this issue has been going on for a really long time, so I wrote a simple repo in PyTorch. I hope someone may find it useful. It supports partial freezing of only Conv2d, Conv3d, and Linear layers, but these are the most common anyway. If you want me to extend it to other layers, feel free to send me a message. :)


Hi, this does not always work as intended; it depends on the optimizer. For example, if you look at an optimizer like RAdam, you can see that even if the current gradient is set to zero, the exponential moving averages of previous gradients mean the parameters will still be updated (although only by a little).

To achieve exactly what we want, I can only come up with the naive implementation where we save the parameter values before opt.step() and restore them afterwards.
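Something like this sketch, for instance (Adam as a stand-in for any stateful optimizer; the frozen slice is arbitrary):

import torch
import torch.nn as nn
import torch.optim as optim

layer = nn.Linear(20, 20)
opt = optim.Adam(layer.parameters(), lr=1e-3)  # stateful optimizer

loss = layer(torch.randn(4, 20)).sum()
opt.zero_grad()
loss.backward()
layer.weight.grad[:10] = 0               # zeroed, but optimizer state can still move it

frozen = layer.weight.data[:10].clone()  # save the frozen slice before the step
opt.step()
with torch.no_grad():
    layer.weight[:10] = frozen           # restore it exactly afterwards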

Hi! In your case, would using a "mask" matrix and a temporary matrix be possible? You record the current values of the entries you do not want to update in the temporary matrix, and use the mask matrix to indicate which entries they are.
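Something like this sketch of the mask idea, perhaps, reusing the save-and-restore pattern from the previous post:

import torch
import torch.nn as nn
import torch.optim as optim

layer = nn.Linear(20, 20)
opt = optim.Adam(layer.parameters(), lr=1e-3)

mask = torch.zeros_like(layer.weight, dtype=torch.bool)
mask[:10] = True                         # True marks the frozen entries

loss = layer(torch.randn(4, 20)).sum()
opt.zero_grad()
loss.backward()

saved = layer.weight.data[mask].clone()  # temporary copy of the frozen values
opt.step()
with torch.no_grad():
    layer.weight[mask] = saved           # put the frozen values back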