Hello, I have a model called M. When I call it and then call backward() to compute the gradient, and then zero the gradient, the model has no gradient. But once I call optimizer.step() on the zero gradient, the model updates its parameters based on something I don't know, so on the next iteration the model has new params.

output = M(input, GT)
error = L1(output, GT)
error.backward()
print(M.grad)  # -> non-zero
M.zero_grad()
print(M.grad)  # -> zero
optimizer.step()  # -> the model is updated!!!

I don’t want to use torch.no_grad() because I want to check the gradients at each iteration.
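As a side note, a module itself has no .grad attribute; gradients live on its parameters. A minimal sketch of that per-iteration check, using a hypothetical nn.Linear as a stand-in for M:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for M: a small linear model.
M = nn.Linear(3, 1)

output = M(torch.randn(4, 3))
error = output.abs().mean()
error.backward()

# Gradients live on the parameters, not on the module itself,
# so inspect each parameter's .grad instead of M.grad.
for name, p in M.named_parameters():
    print(name, p.grad)  # populated after backward()

M.zero_grad(set_to_none=False)
for name, p in M.named_parameters():
    print(name, p.grad)  # all zeros after zero_grad()
```

Note that in recent PyTorch versions zero_grad() defaults to setting .grad to None rather than zeroing it; set_to_none=False keeps the zero tensors so they can still be printed.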

I think you’re on the right track with this. If you’re doing a typical training cycle, you don’t need to call zero_grad() on your model; instead, zero out the gradients through your optimizer.

Try this and let me know how you go:

# I personally like to .zero_grad() as the first thing.
optimizer.zero_grad()
output = M(input) # Run input through. No need for the target here?
error = L1(output, GT)
# Remove the .zero_grad, backpropagate and step the optimizer.
error.backward()
optimizer.step()

This tutorial has a lot more detail on this as well.

It didn’t work; the model is still updating its params. To make the model clear: M includes two Sequential blocks, instantiated separately. If the model is called as M(input, 1), the first block is computed, while the second is computed with M(input, 2).
I’m doing this procedure to make sure the first call and the second are not interfering with each other, but apparently they are.
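For reference, a hypothetical reconstruction of a model like that (the names branch1/branch2 and the flag convention are assumptions, not the original code):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    """Hypothetical reconstruction: two separately instantiated
    Sequential branches, selected by the second forward() argument."""
    def __init__(self):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Linear(3, 3), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Linear(3, 3), nn.ReLU())

    def forward(self, x, which):
        # which == 1 uses the first branch, anything else the second.
        return self.branch1(x) if which == 1 else self.branch2(x)

m = M()
x = torch.randn(2, 3)
out1 = m(x, 1)            # only branch1 is in this graph
out1.abs().mean().backward()
# backward() through out1 populates grads in branch1 only;
# branch2's parameter grads remain None.
```

If the branches really are separate like this, autograd alone won’t mix their gradients; any cross-talk would have to come from the optimizer update itself.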

It actually depends on your update rule (optimizer). Many optimizers depend on more than just the current gradient, e.g. Nesterov-SGD, Adam, RMSProp:

# weight = weight - learning_rate * gradient; if the gradient is zero, the weight will not update
optimizer = optim.SGD(model.parameters(), lr=0.01)
# even if the gradient is zero, the weight will still decay
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=0.1)
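The same thing happens with momentum. A small sketch on a single parameter: after one step with a real gradient fills the momentum buffer, a step on a zero gradient still moves the weight.

```python
import torch

# One scalar parameter with momentum-SGD.
w = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)

# First step with a real gradient fills the momentum buffer.
w.grad = torch.ones(1)
opt.step()

before = w.detach().clone()
# Zero (not None) gradient, as after zero_grad(set_to_none=False)...
w.grad = torch.zeros(1)
opt.step()
# ...but the momentum buffer keeps pushing w.
print((w - before).abs().item())  # non-zero
```

With plain SGD (no momentum, no weight decay) the second step would leave w untouched.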

Then maybe it is better to have two optimizers, each working on its own part of the model, since once backward() has been called there is no way to reset the optimizer's internal state.
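A sketch of that idea, assuming the two branches are separate submodules (the names below are placeholders): give each branch its own optimizer and only step the one whose branch was used, so the other branch's parameters and momentum buffers cannot change.

```python
import torch
import torch.nn as nn

# Hypothetical two-branch setup, each branch with its own optimizer.
branch1 = nn.Sequential(nn.Linear(3, 3), nn.ReLU(), nn.Linear(3, 1))
branch2 = nn.Sequential(nn.Linear(3, 3), nn.ReLU(), nn.Linear(3, 1))
opt1 = torch.optim.SGD(branch1.parameters(), lr=0.01, momentum=0.9)
opt2 = torch.optim.SGD(branch2.parameters(), lr=0.01, momentum=0.9)

before = [p.detach().clone() for p in branch2.parameters()]

# Train only branch1; opt2 is never stepped, so neither branch2's
# parameters nor its optimizer state can change.
x = torch.randn(4, 3)
opt1.zero_grad()
loss = branch1(x).abs().mean()
loss.backward()
opt1.step()
```

After this step, branch2's parameters compare equal to the snapshot taken before, whatever optimizer state opt1 has accumulated.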