Parameters with requires_grad = False are updated during training

Hello. I’m trying to freeze the front layers during training.

Before starting optimization, the optimizer is constructed by

optimizer = torch.optim.SGD(net.parameters(), lr, ...)

Then, during training, I set requires_grad=False on the front layers.
Specifically,

for epoch in range(total_epoch):
  if epoch == freeze_epoch:
    net.conv1.weight.requires_grad = False
  train_one_epoch() #update network in an epoch

However, I found that the weights of the front layers are still updated.
I also double-checked that the front layer’s requires_grad=False after changing it.

  1. Why are the gradients of the front layers still computed, and why are the weights still updated?
  2. How can I get the same effect another way (without wrapping each layer as a module, because the layers to freeze are not predefined)?

p.s.

def train_one_epoch():
  for data, target in train_loader:
    ...
    optimizer.zero_grad()
    loss = ...
    loss.backward()
    optimizer.step()

Thank you.


Hi Doyup!

If you use momentum or weight_decay (both supported by SGD),
your weights can still be updated even if their gradients are zero.
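To make the point above concrete, here is a minimal sketch, assuming a single hypothetical parameter `w` that stands in for a frozen layer's weight. It takes one ordinary SGD step to fill the momentum buffer, then sets the gradient to exactly zero and steps again. It is an illustration under those assumptions, not a reproduction of the original training loop.

```python
import torch

# Hypothetical single parameter standing in for a "frozen" layer's weight.
w = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

# One ordinary step with a non-zero gradient fills the momentum buffer.
w.grad = torch.ones(3)
opt.step()

# Now the gradient is exactly zero, as it would be for a frozen layer
# whose stale .grad tensor has been zeroed rather than set to None.
w.grad = torch.zeros(3)
before = w.detach().clone()
opt.step()

# The weight still moves, because the update follows the momentum buffer.
moved = not torch.equal(before, w.detach())
print(moved)  # True
```

The second step changes `w` even though its gradient is zero, because the optimizer still holds the velocity accumulated in the first step.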

Have you checked that your requires_grad = False gradients are,
in fact, being calculated and given non-zero values? Or do you just
see that the weights are being changed by optimizer.step()?

Best.

K. Frank


Thanks for your fast reply !!!

I also checked that requires_grad=False, but there are still some values in weight.grad.

During training, the gradients are still computed, despite requires_grad=False.

Hi Doyup!

First, are you using non-zero momentum or weight_decay?

I should have asked my question about non-zero gradients more
carefully:

I’m reasonably sure that SGD implements weight_decay (and
maybe momentum) by modifying the gradients before applying
the actual update step – even if the gradients are zero, or have
requires_grad = False.
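The update described above can be sketched in plain Python for a single scalar weight. This is a simplified version of SGD with momentum and weight decay (no dampening, no Nesterov); the function name `sgd_step` and the numbers are made up for illustration and are not PyTorch's actual code.

```python
def sgd_step(weight, grad, velocity, lr=0.1, momentum=0.9, weight_decay=0.0):
    """One scalar SGD update (simplified sketch, not PyTorch's implementation)."""
    # Weight decay is folded into the gradient first.
    g = grad + weight_decay * weight
    # Momentum accumulates the (possibly modified) gradient.
    velocity = momentum * velocity + g
    # The parameter moves along the velocity, not the raw gradient.
    weight = weight - lr * velocity
    return weight, velocity

# Zero gradient, but leftover velocity from earlier steps: the weight moves.
w1, v1 = sgd_step(weight=1.0, grad=0.0, velocity=0.5)

# Zero gradient and zero velocity, but non-zero weight decay: the weight moves.
w2, v2 = sgd_step(weight=1.0, grad=0.0, velocity=0.0, weight_decay=0.01)
```

In both calls the raw gradient is zero, yet the returned weight differs from 1.0, which is why a zero gradient alone does not guarantee a frozen parameter.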

So, I should ask: Are the gradients in question zero after calling
loss.backward(), but before calling optimizer.step()?
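One way to run this check is to summarize every parameter's gradient right after loss.backward() and before optimizer.step(). The sketch below uses a hypothetical two-layer model and a helper named `grad_summary`, neither of which comes from the thread. In this fresh model the frozen weight never received a gradient, so its entry is None; in a loop where the layer was trained before being frozen, you may instead see a zero tensor.

```python
import torch

def grad_summary(module):
    """Return {name: max |grad|} per parameter, or None where no grad is stored."""
    return {
        name: None if p.grad is None else p.grad.abs().max().item()
        for name, p in module.named_parameters()
    }

# Hypothetical stand-in for the real network.
net = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.Linear(3, 2))
net[0].weight.requires_grad = False  # the "frozen" front layer weight

loss = net(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Inspect here: after backward(), before any optimizer.step().
after_backward = grad_summary(net)
print(after_backward)
```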

Best.

K. Frank


First of all, I am using momentum in the optimizer.
So I understand that the optimizer could still update the parameters after I changed requires_grad to False.

I was checking the gradient “after” calling optimizer.step().
As you mentioned, if an optimizer with momentum modifies the gradients of parameters with requires_grad=False, the phenomenon is understandable.

So I will check the gradients before optimizer.step() and post the result in a follow-up reply.

I found that

optimizer.zero_grad()
loss = torch.mean(torch.sum(-onehot*pred, dim=1)) #CE loss
loss.backward()

Up to this point, conv1.weight.grad is all zeros.
However, after I proceed with

optimizer.step()

the conv1.weight.grad was updated.

Thank you so much !!!

Then, @KFrank,
Could you let me know how we can change the gradient histories of the frozen layers to prevent updates after optimizer.step()?

Hi @LeeDoYup,

It seems I have faced a similar problem. I was trying to use a pretrained ResNet and change the last few layers, but it was not training (the loss was increasing, while the accuracy stayed nearly the same).

Here you can see how to specify which layers in the model should be optimized, by checking whether they are frozen or not.
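A common way to do this, sketched here with a hypothetical two-layer model, is to pass only the parameters that still have requires_grad=True to the optimizer. Note the assumption: the optimizer is built after the layers are frozen. If you freeze mid-training, that means building a new optimizer, which discards existing momentum buffers.

```python
import torch

# Hypothetical stand-in for the real network.
net = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.Linear(3, 2))

# Freeze the front layer.
for p in net[0].parameters():
    p.requires_grad = False

# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)

# Count how many tensors the optimizer actually manages.
n_in_opt = sum(len(g["params"]) for g in optimizer.param_groups)
print(n_in_opt)  # 2: the weight and bias of the second layer
```

Because the frozen tensors never enter the optimizer, neither momentum nor weight decay can touch them.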

Hope that will help you.
BR,
Petro

My experience with AdamW tells me that once you pass the parameters to an optimizer and train them for a step, the only flag that determines whether the optimizer keeps updating a parameter is NOT

param.requires_grad

but

param.grad is None

If you don’t want to re-instantiate the optimizer and just want to tell the existing optimizer not to update some parameters, you have to manually set param.grad to None.
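A minimal sketch of this approach, assuming a hypothetical two-layer model and SGD with momentum: after one normal step, the front weight gets requires_grad=False and its stale gradient is set to None. The built-in torch.optim optimizers skip parameters whose .grad is None, so the existing optimizer leaves that tensor alone without being rebuilt. The helper `step` is made up for illustration.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-in for the real network.
net = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.Linear(3, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

def step():
    x = torch.randn(8, 4)
    # set_to_none=True clears gradients to None instead of zero tensors.
    opt.zero_grad(set_to_none=True)
    net(x).pow(2).mean().backward()
    opt.step()

step()  # an ordinary step, which also creates momentum buffers

# Freeze the front weight without touching the optimizer.
frozen = net[0].weight
frozen.requires_grad = False  # backward() no longer writes a gradient
frozen.grad = None            # drop the stale gradient so step() skips it

before = frozen.detach().clone()
step()
unchanged = torch.equal(before, frozen.detach())
print(unchanged)  # True
```

The momentum buffer for the frozen weight is still stored in the optimizer state. It is simply not applied while the gradient stays None.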

For the same reason, if you want to resume training some parameters, setting requires_grad = True will not work. Instead, you have to set param.grad = torch.zeros_like(param).

For your reference, my PyTorch version is 1.7.1.


Interesting! Just curious, are there any plans to standardize this behavior? Or has this behavior already changed in later versions?

For instance, isn’t this property desirable: regardless of the model’s state, the requires_grad flag should dictate both the gradient computation and the update by the optimizer?
