Freezing part of the Layer Weights

Hello Everyone,

How can I freeze part of a layer's weights to zero, rather than the entire layer?
I tried the code below, but it doesn't freeze the specific part (indices 1:10 along the 2nd dimension) of the layer weights.
I am new to ML and started with PyTorch. I'd appreciate any help. Thanks.

for child in model_ft.children():
    print("Freezing Parameters(1->10) on the Convolution Layer", child)
    for param in child.parameters():
        param.data[:, 1:10, :, :].zero_()
        param.data[:, 1:10, :, :].requires_grad = False

optimizer_ft = OPTIM.SGD(filter(lambda p: p.requires_grad, model_ft.parameters()), lr=0.001, momentum=0.9)

I think freezing specific parts of a parameter is not possible in PyTorch, because the requires_grad flag is set on each Parameter (a collection of weights), not on each individual weight.

One possible approach is to manually zero the gradient before you call the optimizer step.

After you calculate the gradients with the backward() call, run

param.grad[:, 1:10, :, :] = 0

to achieve what you want.
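
For example, here is a minimal sketch of doing this inside a training loop (the conv layer, shapes, and data below are made-up placeholders, not your model):

import torch
import torch.nn as nn
import torch.optim as optim

# Toy conv layer standing in for the layer whose input channels 1:10 should stay frozen.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
with torch.no_grad():
    conv.weight[:, 1:10, :, :].zero_()           # start the frozen slice at zero

optimizer = optim.SGD(conv.parameters(), lr=0.001, momentum=0.9)

x = torch.randn(4, 16, 8, 8)
target = torch.randn(4, 32, 8, 8)

optimizer.zero_grad()
loss = nn.functional.mse_loss(conv(x), target)
loss.backward()                                  # all gradients are computed here

conv.weight.grad[:, 1:10, :, :] = 0              # zero the slice before the update
optimizer.step()                                 # the frozen slice stays at zero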

Further, you can automate this method using a backward hook in PyTorch.
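
Continuing the sketch above, a gradient hook on the parameter can zero the slice automatically on every backward pass (a sketch using Tensor.register_hook, which registers a backward hook on that tensor):

def freeze_slice(grad):
    # The tensor returned by the hook replaces the gradient the optimizer will see.
    grad = grad.clone()
    grad[:, 1:10, :, :] = 0
    return grad

conv.weight.register_hook(freeze_slice)

# From here on, every loss.backward() leaves conv.weight.grad[:, 1:10, :, :] at zero,
# so no manual zeroing is needed before optimizer.step().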

You can check this link if you want.
http://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html

Thanks Sonsang,
This works; I can see that the required weights are set to zero after training, but the training accuracy has come down. Initializing (xavier_normal) the non-zero weights didn't help.
Not sure if I am missing something else in my setup.

Does this approach work the same as freezing the weights to zero from the beginning?

Do you mean that the training accuracy is going down while training, or that the training accuracy decreased after you zeroed out some gradients?

In the first case, I’m not sure what is happening.

Maybe the full code would help in understanding that situation.

In the second case, it is natural, because you are using fewer parameters than the original model.

Thank you.

It’s the second case, I understand. Thank you.

Just to double check, how is this mathematically correct? Since the chain rule involves products (and sums), couldn't it lead to earlier layers having zero gradients even though they should not? Consider a simple composition of three functions a, b, c: the derivative would be the product a'*b'*c'. Then setting b' = 0 yields the wrong derivative when it should be a'*b', no?

It does not mean that we set the gradient of b to zero.

param.grad[:, 1:10, :, :] = 0

This line means that we just ignore the calculated gradient w.r.t. the loss, so that those parameters are not updated.

You do not have to worry about your example, because all gradients are calculated by the backward() call.

We modify the gradient after the backward(), so other parameters are safe.
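
A quick way to convince yourself (a toy two-layer sketch, with made-up shapes):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),   # earlier layer
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),   # layer whose slice we "freeze"
)

net(torch.randn(2, 16, 8, 8)).sum().backward()     # full chain rule runs here

# Overwriting part of one layer's gradient afterwards does not touch the
# gradients that were already computed for the earlier layer.
net[2].weight.grad[:, 1:10, :, :] = 0
print(net[0].weight.grad.abs().sum())              # still non-zero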

This works, thanks a lot. But what if I want to set a different learning rate for the first 10 rows of the embedding?
Can I just do grad *= 0.1 like this?

Sure. You can do anything you want with those gradients between the backward() and optimizer.step() calls.
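
For example, a sketch of scaling the gradient of just the first 10 embedding rows (the embedding size and the 0.1 factor are placeholders):

import torch
import torch.nn as nn
import torch.optim as optim

embedding = nn.Embedding(1000, 64)
optimizer = optim.SGD(embedding.parameters(), lr=0.01)

ids = torch.randint(0, 1000, (32,))
loss = embedding(ids).pow(2).mean()

optimizer.zero_grad()
loss.backward()

# Scale the gradient of the first 10 rows, giving them an effective lr of 0.001.
embedding.weight.grad[:10] *= 0.1
optimizer.step()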

Your suggestion works, and might really be useful in a lot of cases, but it may not have the same effect as requires_grad=False.

It works because it sets the selected gradients to zero and ensures that there will be no changes to certain weights, but the gradients will still be calculated for all the weights, including the ones that have been frozen.

I am concerned about the outcome of that, since I am not sure if it would produce the same kind of result as setting the requires_grad flag to False.

Because I think (not sure though) that when we set requires_grad to False, the loss calculation might omit the participation of the frozen parameters in the inference, which would put the focus on the parameters with requires_grad set to True during the loss calculation, and may eventually let those parameters be penalized as if they were responsible for all of the network's loss.

Please correct me if I am wrong; I know I am assuming a lot of things, but maybe these thoughts can help someone.

From the perspective of mathematics, they are the same. Freezing means no updating; equivalently, the gradients used for the update are zero.

I ran it like this and got no error.
Can I use it like this? This way doesn't require manually zeroing the grad like your method:

decoder.linear.weight[:100].require_grad = False