Then, during training, I set the front layers' requires_grad to False.
Specifically,
for epoch in range(total_epoch):
    if epoch == freeze_epoch:
        net.conv1.weight.requires_grad = False
    train_one_epoch()  # update the network for one epoch
However, I found that the weights of the front layers are still being updated.
I also double-checked that the front layers' requires_grad is False after changing it.
Why are the gradients of the front layers still computed and applied?
How can I achieve the same effect in another way (without wrapping each layer in its own module, because the layers to freeze are not pre-defined)?
p.s.
def train_one_epoch():
    for data, target in train_loader:
        ...
        optimizer.zero_grad()
        loss = ...
        loss.backward()
        optimizer.step()
If you use momentum or weight_decay (both supported by SGD),
your weights can still be updated even if their gradients are zero.
Have you checked that your requires_grad = False gradients are,
in fact, being calculated and given non-zero values? Or do you just
see that the weights are being changed by optimizer.step()?
First, are you using non-zero momentum or weight_decay?
I should have asked my question about non-zero gradients more
carefully:
I’m reasonably sure that SGD implements weight_decay (and
maybe momentum) by modifying the gradients before applying
the actual update step – even if the gradients are zero, or have requires_grad = False.
So, I should ask: are the gradients in question zero after calling loss.backward(), but before calling optimizer.step()?
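To illustrate the point about momentum: here is a minimal sketch (a toy single-parameter example, not the original poster's network) showing that plain SGD with momentum keeps moving a parameter even on a step where its gradient is exactly zero, because the momentum buffer was seeded by an earlier step:

```python
import torch

# A single toy parameter and SGD with momentum (hyperparameters are illustrative).
p = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.SGD([p], lr=0.1, momentum=0.9)

# First step with a real gradient seeds the momentum buffer.
p.sum().backward()
opt.step()

# Now give the parameter a zero gradient; momentum alone still moves it.
p.grad = torch.zeros_like(p)
before = p.detach().clone()
opt.step()
moved = bool((p.detach() != before).any())  # True: momentum drove an update
```

This is exactly why a zero (or unchanged) gradient is not enough to keep a parameter fixed once the momentum buffer is non-zero.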
First of all, I am using momentum in the optimizer.
So I understand that the optimizer could still update the parameters after I set requires_grad=False.
I checked the gradients "after" calling optimizer.step().
As you mentioned, if an optimizer with momentum modifies the gradients of parameters with requires_grad=False, the phenomenon is understandable.
So I will check the gradients before optimizer.step() and attach the result in a reply.
It seems I have faced a similar problem. I was trying to use a pretrained ResNet with the last few layers changed, but it was not training (the loss was increasing while the accuracy stayed about the same).
Here you can see how you can specify which layers in the model should be optimized, by checking whether they are frozen or not.
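A sketch of that idea (the toy model and the freeze_by_name helper are illustrative, not from the original post): flip requires_grad by parameter name, then construct the optimizer only over the parameters that are still trainable, so momentum and weight_decay never touch the frozen ones:

```python
import torch
import torch.nn as nn

# Hypothetical small net; which layers to freeze is decided at runtime.
net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

def freeze_by_name(model, prefixes):
    """Set requires_grad=False on every parameter whose name starts with a prefix."""
    for name, p in model.named_parameters():
        if any(name.startswith(pre) for pre in prefixes):
            p.requires_grad_(False)

freeze_by_name(net, ["0."])  # freeze the first Linear layer

# Build the optimizer only over trainable parameters.
opt = torch.optim.SGD(
    [p for p in net.parameters() if p.requires_grad],
    lr=0.01, momentum=0.9,
)
```

Since the frozen parameters were never handed to the optimizer, optimizer.step() cannot update them regardless of momentum or weight decay.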
My experience with AdamW told me that once you pass the parameters into an optimizer and train them for a step, the only flag that determines whether to calculate the grad is NOT
param.requires_grad
but
param.grad is None
If you don't want to re-instantiate the optimizer and just want to tell the existing optimizer not to update some parameters, you have to manually set param.grad to None.
For the same reason, if you want to resume training some parameters, requires_grad = True alone will not work. Instead you have to use param.grad = torch.zeros_like(param)
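A minimal sketch of that behavior (toy parameter, illustrative hyperparameters): after a training step, a stale .grad keeps driving updates even once requires_grad is False, while setting .grad to None makes the optimizer skip the parameter entirely:

```python
import torch

p = torch.nn.Parameter(torch.ones(2))
opt = torch.optim.AdamW([p], lr=0.1)

# Normal step seeds the optimizer state and leaves a .grad on p.
p.sum().backward()
opt.step()

# "Freezing" via requires_grad alone: the stale .grad still drives an update.
p.requires_grad_(False)
before = p.detach().clone()
opt.step()
assert not torch.equal(p.detach(), before)  # parameter moved anyway

# Setting .grad to None makes the optimizer skip this parameter.
p.grad = None
before = p.detach().clone()
opt.step()
assert torch.equal(p.detach(), before)  # no update
```

PyTorch optimizers skip any parameter whose .grad is None, which is why this works even though the parameter is still registered with the optimizer.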
Interesting! Just curious, are there any plans to standardize this behavior? Or has this behavior already changed in the later versions?
For instance, isn’t this property desirable? : Regardless of the model’s state, the requires_grad flag should dictate the gradient computation and the update by the optimizer.