How to use modified gradient to update parameters


I have the following code:

h = model.classifier.weight.register_hook(lambda grad: grad * 0)

where model.classifier is a single fully-connected layer.

The purpose of this code is to investigate how to use a modified gradient to update parameters.

If I understand correctly, once the gradient is set to 0, the weight should no longer be updated at each iteration. But when I run this code, the weight still updates normally, which suggests the optimizer is still using the old gradient.

I also tried explicitly setting the gradient to a new value, like: model.classifier.weight._grad = torch.zeros(model.classifier.weight.size()).cuda()

and it doesn’t work either.

So my question is: how do I pass the modified gradient to the optimizer? Thank you.

Your first code snippet should just work. Could you post more information about how it doesn’t work for you? The second won’t work since weight has no ._grad attribute, and .grad should be a Variable.

The optimizer just uses the .grad attribute, so you can manually inspect whether you successfully changed the gradient after calling backward.
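A minimal sketch of that check, written against recent PyTorch releases (where Variable has been merged into Tensor) rather than 0.3. The `classifier` layer here is a hypothetical stand-in for model.classifier, and plain SGD without momentum or weight decay is an assumption chosen so that the update depends only on `.grad`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for model.classifier: 512 inputs, 2 outputs.
classifier = nn.Linear(512, 2)
# Plain SGD (no momentum, no weight decay), so the step depends only on .grad.
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

# The hook's return value replaces the gradient accumulated into weight.grad.
handle = classifier.weight.register_hook(lambda grad: grad * 0)

weight_before = classifier.weight.detach().clone()
bias_before = classifier.bias.detach().clone()

optimizer.zero_grad()
loss = classifier(torch.randn(4, 512)).sum()
loss.backward()

# Inspect .grad after backward: this is what optimizer.step() will read.
weight_grad_is_zero = bool((classifier.weight.grad == 0).all())
bias_grad_is_zero = bool((classifier.bias.grad == 0).all())
print("weight.grad all zero:", weight_grad_is_zero)
print("bias.grad all zero:", bias_grad_is_zero)

optimizer.step()
weight_changed = not torch.equal(weight_before, classifier.weight.detach())
bias_changed = not torch.equal(bias_before, classifier.bias.detach())
print("weight changed:", weight_changed)
print("bias changed:", bias_changed)

handle.remove()  # detach the hook once it is no longer needed
```

The bias has no hook, so it serves as a control: its gradient is nonzero and it moves, while the hooked weight stays put.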

Also, it is worth noting that if you use the hook approach, the gradient calculations of other variables that use weight.grad will also be incorrect.
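This side effect is easiest to see when the hooked tensor is not a leaf. The sketch below is an illustration under that assumption, not the setup from the question: the hook sits on a hypothetical intermediate activation `hidden`, and the modified gradient is what propagates back to everything upstream of it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

first = nn.Linear(5, 10)
second = nn.Linear(10, 5)
x = torch.randn(3, 5)

hidden = first(x)  # non-leaf tensor produced by the first layer
# Zero the gradient that flows backward through `hidden`.
hidden.register_hook(lambda grad: grad * 0)

second(hidden).sum().backward()

# Everything upstream of `hidden` receives the modified (zero) gradient.
first_grad_zero = bool((first.weight.grad == 0).all())
# The second layer's weight gradient does not pass through the hook.
second_grad_zero = bool((second.weight.grad == 0).all())
print("first.weight.grad all zero:", first_grad_zero)
print("second.weight.grad all zero:", second_grad_zero)
```

For a leaf parameter such as a layer's weight, nothing sits upstream in the graph, so the hook only changes what ends up in that parameter's own `.grad`.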

Hi Simon

Thank you for the reply.

For the first approach, I printed the gradient after backward() and it does change to zero, but the weight itself still updates. So I guess the step function still uses the old gradient.

For the second approach, I changed _grad to grad as you said, and it does not work either.

Thank you.

The optimizer can only see the .grad attribute so it can never really see the “old” gradient. Could it be the case that the optimizer has non-zero momentum?
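To illustrate the momentum question, here is a small sketch in recent PyTorch. The layer sizes, learning rate, and the `train_step` helper are all hypothetical. It takes one step with a real gradient, then adds the hook, and shows that the weight keeps moving because of the stored momentum buffer even though `.grad` is zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(5, 2)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1, momentum=0.9)
x = torch.randn(3, 5)

def train_step():
    """One forward/backward/update pass on a fixed batch (hypothetical helper)."""
    optimizer.zero_grad()
    layer(x).sum().backward()
    optimizer.step()

train_step()  # a real gradient fills the momentum buffer

# Only now zero the gradient of the weight.
layer.weight.register_hook(lambda grad: grad * 0)
before = layer.weight.detach().clone()
train_step()

grad_zero = bool((layer.weight.grad == 0).all())
moved_by = (layer.weight.detach() - before).abs().max().item()
print("weight.grad all zero:", grad_zero)
print("largest weight change:", moved_by)
```

This only applies when some steps ran before the hook was registered. Adam's running averages should behave in a similar way, but with a hook in place from the very first step they start at zero and stay there.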

It could be. I use the Adam optimizer with the default settings except for the learning rate. Could this be the problem?

Did you add the hook after having already done some iterations?

No, I added the hook from the very beginning.

Hi, I changed the optimizer to SGD and the weight does not update anymore. But I still cannot figure out what is really happening here.

I tried to reproduce it with the following script but couldn’t:

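import torch
from torch.optim import Adam
import torch.nn as nn
from torch.autograd import Variable

net = nn.Sequential(
        nn.Linear(5, 10),
        nn.Linear(10, 5),
    ).cuda()  # the input below is a CUDA tensor, so the model must be on the GPU too

optim = Adam(net.parameters(), lr = 3)
# Zero the gradient of the last layer's weight (index 1 in this two-layer net)
net[1].weight.register_hook(lambda grad: grad * 0)
x = Variable(torch.cuda.FloatTensor(3, 5).normal_())
old_weight = net[1].weight.data.clone()  # snapshot before the update
net(x).sum().backward()
optim.step()
new_weight = net[1].weight.data.clone()  # snapshot after the update
print((old_weight - new_weight).abs_().mean())  # get 0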

Do you mind sharing your script and torch version? I want to make sure that our Adam implementation doesn’t have bugs :smiley:


I use the convolutional layers from the official pre-trained ResNet-18, followed by a new fc layer. Here is my training code:

optimizer = torch.optim.Adam(model.parameters(),
                             betas=(args.momentum, 0.999))  # other arguments are not shown in the post

out = model(input_var)
loss = criterion(out, target_var)

h = model.classifier.weight.register_hook(lambda grad: grad * 0)

model.classifier is the self-defined fc layer, and args.momentum is set to 0.9. The fc layer has two output neurons, so the weight has shape 2 × 512.

I am using PyTorch version 0.3.0.post4.

Oh, there is the weight_decay option… weight_decay acts like L2 regularization, so it will change your weights if it is nonzero. By default it is zero in optim.Adam, though. It could be that your script has a nonzero default value. Could you try setting it to 0?

Yeah, I will try it later. But the SGD optimizer also has the weight_decay option.

Let’s first try Adam with no weight decay :slight_smile: . Also, could you check later what your weight decay value was when you used SGD?

Yeah, sure. The weight decay value is 1e-4 for both Adam and SGD.
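One reading consistent with these reports, offered as a sketch rather than a confirmed diagnosis: both torch.optim.Adam and torch.optim.SGD fold weight decay into the gradient as `grad + weight_decay * param` inside step(), after the hook has run. With the hooked gradient at zero, SGD then moves each weight by only `lr * weight_decay * param`, which is easy to mistake for no update. Adam rescales that small term by its running magnitude, so its steps stay close to `lr` in size. The code below compares the two in recent PyTorch. The learning rate, the number of steps, and the `max_weight_change` helper are assumptions; the thread only gives weight_decay = 1e-4 and a first beta of 0.9.

```python
import torch
import torch.nn as nn

def max_weight_change(make_optimizer, steps=10):
    """Train a hooked layer for a few steps; return the largest absolute weight change."""
    torch.manual_seed(0)
    layer = nn.Linear(512, 2)  # same weight shape as the fc layer in the thread
    layer.weight.register_hook(lambda grad: grad * 0)  # hook in place from the start
    optimizer = make_optimizer(layer.parameters())
    before = layer.weight.detach().clone()
    for _ in range(steps):
        optimizer.zero_grad()
        layer(torch.randn(4, 512)).sum().backward()
        optimizer.step()
    return (layer.weight.detach() - before).abs().max().item()

lr = 1e-3  # assumed value; the learning rate is not given in the thread
wd = 1e-4  # weight decay reported for both optimizers

adam_wd = max_weight_change(
    lambda p: torch.optim.Adam(p, lr=lr, betas=(0.9, 0.999), weight_decay=wd))
adam_no_wd = max_weight_change(
    lambda p: torch.optim.Adam(p, lr=lr, betas=(0.9, 0.999), weight_decay=0))
sgd_wd = max_weight_change(
    lambda p: torch.optim.SGD(p, lr=lr, weight_decay=wd))

print("Adam, weight_decay=1e-4:", adam_wd)
print("Adam, weight_decay=0:   ", adam_no_wd)
print("SGD,  weight_decay=1e-4:", sgd_wd)
```

If this is what happened, setting weight_decay to 0 for Adam should freeze the hooked weight, while the SGD run was still changing it by an amount too small to notice.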