How to use a modified gradient to update parameters

Hi

I have the following code:

h = model.classifier.weight.register_hook(lambda grad: grad * 0)
model.zero_grad()
loss.backward()
optimizer.step()

where model.classifier is a single fully-connected layer.

The purpose of this piece of code is to investigate how to use a modified gradient to update the parameters.

If I understand this correctly, after I set the gradient to 0, the weight should not be updated at each iteration. But after I run this code, the weight still updates normally, which suggests the optimizer is still using the old gradient.

I also tried explicitly setting the gradient to a new value, like:

model.classifier.weight._grad.data = torch.zeros(model.classifier.weight.size()).cuda()

and it doesn’t work either.

So, my question is: how do I pass the modified gradient to the optimizer? Thank you.


Your first code snippet should just work. Could you post more information on how it doesn’t work for you? The second won’t work, since weight has no ._grad attribute, and .grad should be a Variable.

The optimizer just uses the .grad attribute, so you can manually inspect whether you successfully changed the gradient after calling backward().

Also, it is worth noting that with the hook approach, the gradient computation of any other variables that depend on this weight’s gradient will also be incorrect.
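
For reference, here is a minimal sketch of inspecting and overwriting .grad directly between backward() and step(), and then checking that the optimizer really consumed the modified value. The layer fc, the sizes, and the use of plain SGD below are placeholders, not the model from the question:

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import SGD

# toy layer standing in for model.classifier
fc = nn.Linear(4, 2)
optim = SGD(fc.parameters(), lr=0.1)

x = Variable(torch.randn(3, 4))
loss = fc(x).sum()

fc.zero_grad()
loss.backward()

print(fc.weight.grad)        # gradient computed by autograd
fc.weight.grad.data.zero_()  # overwrite it in place with zeros

old_weight = fc.weight.data.clone()
optim.step()
print((fc.weight.data - old_weight).abs().max())  # 0: step() used the zeroed gradient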

Hi Simon

Thank you for the reply.

For the first approach, I print out the gradient after backward() and it does change to zero, but the weight itself still updates. So I guess the step function still uses the old gradient.

For the second approach, I changed _grad to grad as you said, and it does not work either.

Thank you.

The optimizer can only see the .grad attribute, so it can never really see the “old” gradient. Could it be the case that the optimizer has non-zero momentum?
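
To illustrate that concern, here is a minimal sketch with a toy nn.Linear and Adam (placeholder names and sizes, not the original model): after one step with a real gradient, a later step with an all-zero gradient still moves the weights, because Adam’s running averages are already nonzero.

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import Adam

fc = nn.Linear(4, 2)
optim = Adam(fc.parameters(), lr=0.1)

x = Variable(torch.randn(3, 4))

# iteration 1: normal gradient, fills Adam's running averages
fc.zero_grad()
fc(x).sum().backward()
optim.step()

# iteration 2: gradient forced to zero, yet the weight still changes
fc.zero_grad()
fc(x).sum().backward()
fc.weight.grad.data.zero_()
old_weight = fc.weight.data.clone()
optim.step()
print((fc.weight.data - old_weight).abs().max())  # nonzero: driven by accumulated state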

It could be. I use the Adam optimizer with the default settings except for the learning rate. Could this be the problem?

Did you add the hook after having already done some iterations?

No, I added the hook from the very beginning.

Hi, I changed the optimizer to SGD and the weight does not update anymore. But I still cannot figure out what is really happening here.

I tried to reproduce it with the following script, but I can’t reproduce the problem:

import torch
from torch.optim import Adam
import torch.nn as nn
from torch.autograd import Variable

net = nn.Sequential(
        nn.Linear(5, 10),
        nn.ReLU(),
        nn.Linear(10, 5),
        nn.ReLU(),
        nn.Softmax(1)
        ).cuda()

optim = Adam(net.parameters(), lr = 3)
net[2].weight.register_hook(lambda grad: grad * 0)  # zero out this layer's weight gradient
x = Variable(torch.cuda.FloatTensor(3, 5).normal_())
net(x).mean().backward()
old_weight = net[2].weight.data.clone()
optim.step()
new_weight = net[2].weight.data.clone()
print((old_weight - new_weight).abs_().mean())  # get 0

Do you mind sharing your script and torch version? I want to make sure that our Adam implementation doesn’t have bugs :smiley:

Hi

I use the convolutional layers from an officially implemented pre-trained ResNet-18, followed by a new fc layer. Here is my training code:

optimizer = torch.optim.Adam(model.parameters(), 
                             lr=args.lr,
                             betas=(args.momentum, 0.999),
                             weight_decay=args.weight_decay)

out = model(input_var)
loss = criterion(out, target_var)

h = model.classifier.weight.register_hook(lambda grad: grad * 0)
model.zero_grad()
loss.backward()
optimizer.step()

model.classifier is the self-defined fc layer, and args.momentum is set to 0.9. The fc layer has two output neurons, so the dimension of the weight is 2×512.

The PyTorch version I am using is 0.3.0.post4.

Oh, there is a weight_decay option… weight_decay is like L2 regularization, so it will change your weights if it is nonzero. By default in optim.Adam it is zero, though. It could be that your script has a nonzero default value. Could you try setting it to 0?
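
A minimal sketch of that effect, again with a toy nn.Linear standing in for the classifier: with the zero-grad hook attached and weight_decay=1e-4, Adam still moves the weights, because inside step() the term weight_decay * param is added to the (zero) gradient.

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import Adam

fc = nn.Linear(4, 2)                            # placeholder for model.classifier
fc.weight.register_hook(lambda grad: grad * 0)  # same zeroing hook as above
optim = Adam(fc.parameters(), lr=0.1, weight_decay=1e-4)

x = Variable(torch.randn(3, 4))
fc.zero_grad()
fc(x).sum().backward()
print(fc.weight.grad.abs().max())  # 0: the hook zeroed the gradient

old_weight = fc.weight.data.clone()
optim.step()
print((fc.weight.data - old_weight).abs().max())  # nonzero: the decay term moved the weight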

Yeah, I will try it later. But the SGD optimizer also has the weight_decay option.

Let’s first try Adam with no weight decay :slight_smile: . Also, could you check later what your weight decay value was when you used SGD?

Yeah, sure. The weight decay value is 1e-4 for both Adam and SGD.
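
For completeness, a quick check of the suggested fix under the same toy setup as above: with weight_decay=0, the zero-grad hook really does keep the weight fixed under Adam.

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import Adam

fc = nn.Linear(4, 2)
fc.weight.register_hook(lambda grad: grad * 0)
optim = Adam(fc.parameters(), lr=0.1, weight_decay=0)

x = Variable(torch.randn(3, 4))
fc.zero_grad()
fc(x).sum().backward()

old_weight = fc.weight.data.clone()
optim.step()
print((fc.weight.data - old_weight).abs().max())  # 0: weight untouched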