Set a portion of a tensor to have requires_grad=False

I want to set a portion of a tensor (such as x[1:3, 5:10, :]) to have requires_grad=False. Is this possible?

No, that’s not directly possible, since the .requires_grad attribute applies to the entire tensor.
You could either zero out the gradients of the frozen part of the tensor (or restore its values after the rest was updated), or recreate the x tensor from smaller tensors having different .requires_grad settings via torch.cat or torch.stack; both workarounds are sketched below.
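A minimal sketch of both workarounds (the tensor shape and slice are assumptions matching the example above):

import torch

# Workaround 1: zero out the gradient of the "frozen" slice after backward(),
# so a stateless optimizer such as plain SGD leaves those entries unchanged.
x = torch.randn(4, 12, 3, requires_grad=True)
(x ** 2).sum().backward()
x.grad[1:3, 5:10, :] = 0.0

# Workaround 2: build x from a frozen and a trainable piece via torch.cat;
# only `trainable` will receive gradients in the backward pass.
frozen = torch.randn(2, 12, 3)            # requires_grad=False by default
trainable = torch.randn(2, 12, 3, requires_grad=True)
x = torch.cat([frozen, trainable], dim=0)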

That’s what I thought too. I wish it were possible, though. For example, if the requires_grad flag were instead allowed to contain indices of the tensor, then optimizer.step could perform the gradient update on only those indices. This wouldn’t hurt performance, as the optimizer could fall back on the existing way of performing the update whenever requires_grad takes a boolean value. (A hook-based emulation of this idea is sketched below.)
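This proposal isn’t built into PyTorch, but it can be approximated today with a boolean mask and a gradient hook; a minimal sketch (the shape, slice, and mask are illustrative assumptions):

import torch

x = torch.randn(4, 12, 3, requires_grad=True)

# Boolean mask: False marks the entries that should stay frozen.
mask = torch.ones_like(x, dtype=torch.bool)
mask[1:3, 5:10, :] = False

# The hook zeroes the frozen entries' gradients before they reach x.grad,
# so optimizer.step() effectively updates only the unmasked indices
# (assuming a stateless optimizer such as plain SGD).
x.register_hook(lambda grad: grad * mask.to(grad.dtype))

(x ** 2).sum().backward()
print(x.grad[1:3, 5:10, :].abs().sum())  # tensor(0.)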

@ptrblck

  1. You could either zero out the gradients of the frozen part of the tensor (or restore its values after the rest was updated).

I think this solution has a problem. In principle, the frozen part of the tensor should not contribute to the rest. But if we do this, the frozen part will still contribute to the rest in terms of the weight update.

  2. recreate the x tensor from smaller tensors having different .requires_grad settings via torch.cat or torch.stack

Would it be possible to have an official function for this in PyTorch?

I’m not sure I understand the description correctly, so could you add more details or an example?
Frozen parameters generally still contribute to the dgrad (input gradient) computation, but won’t receive a wgrad (weight gradient), so their own .grad attribute remains empty. Here is a small example:

import torch
import torch.nn as nn

lin1 = nn.Linear(1, 1)
lin2 = nn.Linear(1, 1)
for param in lin2.parameters():
    param.requires_grad = False


x = torch.randn(1, 1)
intermediate = lin1(x)
intermediate.retain_grad()
out = lin2(intermediate)
out.mean().backward()

print({n: p.grad for n, p in lin1.named_parameters()})
# {'weight': tensor([[0.3544]]), 'bias': tensor([0.8607])}
print({n: p.grad for n, p in lin2.named_parameters()})
# {'weight': None, 'bias': None}
print(intermediate.grad)
# tensor([[0.8607]])

@ptrblck Thanks for your quick reply. I put my code below. My question is: if we zero out the gradients of the lin2 part, the gradients of the lin1 part stay the same as before. In principle, if lin2 is frozen, then the weight of lin2 is not updated, and the error for lin1 is passed back through the old weight of lin2. But here, even if we zero out the gradients of the lin2 part, the error for lin1 seems to be passed back through the updated weight of lin2.



import torch
import torch.nn as nn

lin1 = nn.Linear(1, 1)
lin2 = nn.Linear(1, 1)
lin3 = nn.Linear(1, 1)
# for param in lin2.parameters():
#    param.requires_grad = False


x = torch.randn(1, 1)
intermediate = lin1(x)
intermediate.retain_grad()
out = lin2(intermediate)
final = lin3(out)
final.mean().backward()

print({n: p.grad for n, p in lin1.named_parameters()})
#{'weight': tensor([[0.5468]]), 'bias': tensor([0.5331])}
print({n: p.grad for n, p in lin2.named_parameters()})
#{'weight': tensor([[-0.8191]]), 'bias': tensor([0.8629])}
print(intermediate.grad)
#tensor([[0.5331]])

# Zero out the gradients of lin2 after backward()
for param in lin2.parameters():
    param.grad = torch.zeros_like(param.grad)


print({n: p.grad for n, p in lin1.named_parameters()})
#{'weight': tensor([[0.5468]]), 'bias': tensor([0.5331])}
print({n: p.grad for n, p in lin2.named_parameters()})
#{'weight': tensor([[0.]]), 'bias': tensor([0.])}
print(intermediate.grad)
# tensor([[0.5331]])

Yes, since the dgrad computation won’t be changed as mentioned before.

That’s also correct. Zeroing out the gradients will also not update the corresponding weights, unless the optimizer has accumulated running statistics (e.g. momentum) for this parameter from previous iterations, which is why restoring the weights might be considered the safer option.
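A minimal sketch of that caveat, assuming SGD with momentum:

import torch

# A parameter whose gradient is zeroed out can still move if the optimizer
# has internal state: here the momentum buffer from an earlier step.
p = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.SGD([p], lr=0.1, momentum=0.9)

p.grad = torch.ones(1)   # first step builds up the momentum buffer
opt.step()               # p: 1.0 -> 0.9

p.grad = torch.zeros(1)  # gradient zeroed out, as in the workaround above
opt.step()               # p: 0.9 -> 0.81, driven purely by momentum
print(p)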

Where does the updated weight in lin2 come from? The gradient computation is performed before calling optimizer.step() and performing the parameter updates.

@ptrblck Thanks for your detailed and quick reply. OK, let me make my point clear.

Where does the updated weight in lin2 come from? The gradient computation is performed before calling optimizer.step() and performing the parameter updates.

In this code, as I wrote before, in principle, if lin2 has requires_grad = False, the gradient will not be passed back to lin1 via the chain rule, because lin2 is not in the computation graph. In this computation graph, the connection between lin1 and lin2 is cut off.

However, if we manually zero out the gradients of the lin2 part after backward(), the gradient has already been passed back to lin1, which means that requires_grad of lin2 is still True even though its grad data is zero. This will indeed not update the weights of lin2, but it is not exactly the same as requires_grad = False.

What do you think?

No, this shouldn’t be the case, as freezing intermediate parameters would otherwise detach the computation graph and stop the training of previous layers.
My code snippet also demonstrates that the gradient is still properly passed to previous layers.

I.e. in my previous post lin1.parameters() receive a valid gradient:

print({n: p.grad for n, p in lin1.named_parameters()})
# {'weight': tensor([[0.3544]]), 'bias': tensor([0.8607])}

as does the intermediate activation, even though lin2 is frozen.

Let me know if I misunderstood your post.

@ptrblck Hi, thanks for the explanation. You are right. Freezing the weight will not interrupt the error/gradient backpropagation.

requires_grad=False is not equal to .detach().
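A minimal sketch of that difference (module names follow the examples above):

import torch
import torch.nn as nn

lin1 = nn.Linear(1, 1)
lin2 = nn.Linear(1, 1)
x = torch.randn(1, 1)

# Case 1: freeze lin2 -- the graph stays intact, so lin1 still gets a gradient.
for param in lin2.parameters():
    param.requires_grad = False
lin2(lin1(x)).mean().backward()
print(lin1.weight.grad is not None)  # True: gradient flowed through frozen lin2

# Case 2: detach the activation -- the graph is cut before lin2, so no
# gradient can reach lin1 at all.
for param in lin2.parameters():
    param.requires_grad = True  # unfreeze so backward() has a leaf to reach
lin1.zero_grad(set_to_none=True)
lin2(lin1(x).detach()).mean().backward()
print(lin1.weight.grad is None)      # True: .detach() cut the backward path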