How to freeze or fix the specific(subset, partial) weight in convolution filter

hongjunchoi · April 9, 2020, 2:34pm

Above is my code for trying to fix, freeze the subset(temp_5.values ) of 25th convolution layer’s filter. And the result.
I try to set the subset filter’s gradient as 0 so not to update. And the weight has been changed.

So two questions

I wonder why the gradient becomes not zero?
Is there other solution to fix the subset convolution filter?

ptrblck · April 10, 2020, 3:50am

Which optimizer are you using?
If you are using an optimizer with running estimates, note that even with a zero gradient, the parameters might still be updated, if the running estimates were previously calculated.
E.g. optim.Adam would show such a behavior.

PS: It’s better to post the code directly by wrapping it into three backticks ``` as it makes debugging easier.

hongjunchoi · April 10, 2020, 4:01am

Thanks for the reply this is my code. I use the SGD optimizer

‘’’
optimizer = optim.SGD(im_net.parameters(), lr=0.1, momentum=0.9,
weight_decay=5e-4)
weight_reference = im_net.features[25].weight[temp_4].clone()
for batch_idx, (inputs, targets) in enumerate(testloader):

    inputs = inputs.float().cuda()
    inputs, targets = inputs.to(device), targets.to(device)
        
        
    outputs = im_net(inputs)
                       
    ce_loss = criterion(outputs, targets)
    
    optimizer.zero_grad()
    loss = criterion(outputs, targets)
        
    
    #im_net.features[25].weight[temp_4].register_hook(lambda grad : grad * 0)
    loss.backward()
    im_net.features[25].weight.grad[temp_4] = 0.
    
    
    optimizer.step()
    print(weight_reference == im_net.features[25].weight[temp_4])

‘’’

By the way, could you explain about ‘optimizer with running estimates’? I don’t know what it means.

ptrblck · April 10, 2020, 4:09am

momentum would be such an internal state, which uses the previous gradients to update the parameters even with a zero gradients.
Here is a small example showing this behavior:

# Setup
model = nn.Conv2d(1, 10, 3, 1, 1)
weight_reference = model.weight.clone()
optimizer = torch.optim.SGD(model.parameters(), lr=1., momentum=0.0)

# Should fail
model(torch.randn(1, 1, 3, 3)).mean().backward()
optimizer.step()
print((model.weight == weight_reference).all())

# Should work with momentum = 0.0
optimizer.zero_grad()
weight_reference = model.weight.clone()

model.weight.register_hook(lambda grad: grad * 0)
model(torch.randn(1, 1, 3, 3)).mean().backward()
optimizer.step()

print((model.weight == weight_reference).all())

The parameters should not be updated using mementum=0.0, while any other value for momentum will still update the parameters.

hongjunchoi · April 10, 2020, 4:34am

Thanks for the reply again
I change the momentum to zero but it still changes the gradient and the weight still updates.
So I made the weight decay to zero and it works! I wonder why weight decay makes this difference.
P.S If i have to set weight decay to zero, then is there other solution to give regularization like the weight decay do?

ptrblck · April 10, 2020, 4:38am

I think you would have to define the weight decay manually, e.g. as shown here, and filter out the parameters, which should not be added to weight decay.

SofiaCP · December 16, 2021, 10:44am

Hi @ptrblck,

Regarding the fact that parameters are still updated even when the grad is zero (momentum, decay, etc), will setting the grads to none solve this issue? ie, optimizer.zero_grad(set_to_none=True)?

Thanks!

ptrblck · December 16, 2021, 11:09am

Yes, this should be the case, since the optimizers should check if the gradients are even set (e.g. here for Adam). In any case, I would always double check the behavior especially if you are using custom optimizers, as they might use another workflow.

SofiaCP · December 16, 2021, 11:22am

Okay, this is the behaviour I was looking for. Thank you for your time!

SinHanYang · November 13, 2022, 10:29am

Hi, I would like to ask whether this method works for only freezing a part of the parameter.
Since the code of Adam here is probably checking whether the whole parameter is None.

for p in group['params']:
    if p.grad is not None:

Then if I only want to freeze the subset of the parameter as @hongjunchoi does, does optimizer.zero_grad(set_to_none=True) works when setting gradient of the subset to zero?

Thanks!

ptrblck · November 13, 2022, 8:58pm

I’m unsure what “subset” refers to, but note that you cannot set some gradients to None and others to valid values for the same parameter. Each parameter has a .grad attribute which is either set to None or a tensor.

SinHanYang · November 13, 2022, 9:15pm

Thanks for the reply!

My definition of the “subset” is same as @hongjunchoi .
My goal is to freeze the subset of the parameter, and the subset is given by a list of index id=[0,1,2,...]. Therefore, I set the subset’s gradient to zero, and the code is like:

param.grad[id]=0

This method doesn’t work if the optimizer has momentum part.
However, I’m not sure whether optimizer.zero_grad(set_to_none=True) would prevent Adam from updating the subset since param.grad is not None in this case.

ptrblck · November 13, 2022, 10:37pm

optimizer.zero_grad(set_to_none=True) is irrelevant, since you are setting the gradient to zero after a valid gradient was already calculated.
The optimizer.zero_grad() call could have set the .grad to None before, but since a backward() pass was executed afterwards, the .grad attribute is already populated.
As described in my previous post, it’s not possible to set an attribute to None and a valid value.

SinHanYang · November 14, 2022, 8:42pm

I see. Is it impossible to freeze the subset of the parameter if I use Adam optimizer? Or there are others ways to do so instead of setting subset’s gradient to zero.

Thanks

ptrblck · November 14, 2022, 9:07pm

I would recommend to explicitly restore the “frozen” values of the parameter as described in this post.