h = model.classifier.weight.register_hook(lambda grad: grad * 0)
model.zero_grad()
loss.backward()
optimizer.step()
where model.classifier is a single fully-connected layer.
The purpose of this code is to investigate how to update parameters with a modified gradient.
If I understand this correctly, after I set the gradient to 0, the weight should not be updated at each iteration. But after I run this code, the weight still updates normally, which suggests the optimizer is still using the old gradient.
I also tried explicitly setting the gradient to a new value, like:
Your first code snippet should just work. Could you post more information about how it doesn’t work for you? The second won’t work since weight has no ._grad attribute, and .grad should be a Variable.
The optimizer just uses the .grad attribute, so you can manually inspect whether you successfully changed the gradient after calling backward().
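As a concrete check (a minimal sketch using a stand-in single fully-connected layer, not your actual network), you can print the gradient right after backward() to confirm the hook zeroed it:

```python
import torch
import torch.nn as nn

# Stand-in for model.classifier: a single fully-connected layer
model = nn.Linear(512, 2)
h = model.weight.register_hook(lambda grad: grad * 0)

x = torch.randn(4, 512)
loss = model(x).sum()

model.zero_grad()
loss.backward()

# The hook multiplied the incoming gradient by 0, so .grad is all zeros
print(model.weight.grad.abs().sum().item())  # 0.0

h.remove()  # remove the hook once it is no longer needed
```

If this prints 0.0 but the weight still moves after optimizer.step(), the update is coming from somewhere other than the gradient itself.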
Also, it is worth noting that with the hook approach, the gradient computation of any other variables that depend on weight’s gradient will also be incorrect.
For the first approach, I printed out the gradient after backward() and it does change to zero, but the weight itself still updates. So I guess the step function still uses the old gradient.
For the second approach, I changed ._grad to .grad as you said, and it does not work either.
The optimizer can only see the .grad attribute, so it can never really see the “old” gradient. Could it be the case that the optimizer has non-zero momentum?
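Adam’s “momentum” works through running moment estimates rather than a velocity buffer, but plain SGD momentum shows the effect most directly. A toy sketch of why a momentum buffer built up in earlier steps keeps moving a weight even when the current gradient is exactly zero:

```python
import torch

# One scalar weight with SGD momentum
w = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

# Step 1: a real, non-zero gradient builds up the momentum buffer
w.grad = torch.tensor([1.0])
opt.step()
after_first = w.item()   # 1.0 - 0.1 * 1.0 = 0.9

# Step 2: the gradient is zero, but the buffer still moves the weight
w.grad = torch.tensor([0.0])
opt.step()
after_second = w.item()  # 0.9 - 0.1 * (0.9 * 1.0) = 0.81

print(after_first, after_second)
```

So a zeroed .grad on a single step does not guarantee a zero update once any momentum-like state exists.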
I use the convolutional layers from an officially implemented pre-trained ResNet-18, followed by a new fc layer. Here is my training code:
optimizer = torch.optim.Adam(model.parameters(),
lr=args.lr,
betas=(args.momentum, 0.999),
weight_decay=args.weight_decay)
out = model(input_var)
loss = criterion(out, target_var)
h = model.classifier.weight.register_hook(lambda grad: grad * 0)
model.zero_grad()
loss.backward()
optimizer.step()
model.classifier is the self-defined fc layer, and args.momentum is set to 0.9. The fc layer has two output neurons, so the weight has dimension 2×512.
I am using PyTorch version 0.3.0.post4.
Oh, there is a weight_decay option… weight_decay acts like L2 regularization, so it will change your weights if it is nonzero, even when the gradient is zero. By default in optim.Adam it is zero, though. It could be that your script passes a nonzero value. Could you try setting it to 0?
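A minimal sketch of this effect (a single scalar weight standing in for the fc layer): with weight_decay, Adam adds the decay term to the gradient, so the weight moves even when the gradient itself is exactly zero; with weight_decay=0, a zero gradient leaves it untouched.

```python
import torch

# Adam with weight_decay: the weight changes despite a zero gradient
w = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.Adam([w], lr=0.1, weight_decay=0.01)
w.grad = torch.tensor([0.0])  # gradient zeroed, e.g. by the hook
opt.step()
print(w.item())  # moved away from 1.0

# Adam without weight_decay: a zero gradient produces no update
w2 = torch.nn.Parameter(torch.tensor([1.0]))
opt2 = torch.optim.Adam([w2], lr=0.1, weight_decay=0.0)
w2.grad = torch.tensor([0.0])
opt2.step()
print(w2.item())  # still 1.0
```

This matches the symptom above: the printed gradient is zero, yet the weight keeps updating, because the update is driven by the decay term rather than the gradient.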