Model.zero_grad() or optimizer.zero_grad()?

Hi everyone,

I'm confused about when to use model.zero_grad() versus optimizer.zero_grad(). Some examples use model.zero_grad() and others use optimizer.zero_grad(). Is there a specific case that calls for one over the other?

29 Likes

I am training a network on speech data.

If you’re referring to:

optimizer = optim.SGD(net.parameters(), lr=0.1)  # SGD needs an explicit lr

They’re the same.

1 Like

I am using:
optimizer = optim.Adam(model.parameters())

In this case what should I use?

It's the same whether you use SGD, Adam, RMSprop, etc.

Typically I use optimizer.zero_grad().
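
For what it's worth, here is a minimal sketch of a typical training step (the model, loss, and data below are just illustrative assumptions) where the optimizer was built from all of model.parameters(), so either call clears the same gradients:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters())
criterion = nn.MSELoss()

x, y = torch.randn(4, 10), torch.randn(4, 1)

optimizer.zero_grad()            # or model.zero_grad(); both clear the same grads
loss = criterion(model(x), y)
loss.backward()
optimizer.step()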

model.zero_grad() and optimizer.zero_grad() are the same IF all your model parameters are in that optimizer. I find it safer to call model.zero_grad() to make sure all grads are zeroed, e.g. if you have two or more optimizers for one model.
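
A minimal sketch of that case (the two-layer model and the per-layer optimizers here are just illustrative assumptions): optimizer.zero_grad() only clears the parameters that optimizer owns, while model.zero_grad() clears everything.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Each optimizer only owns a subset of the model's parameters.
opt_body = optim.SGD(model[0].parameters(), lr=0.1)
opt_head = optim.Adam(model[2].parameters(), lr=1e-3)

model(torch.randn(4, 10)).sum().backward()

opt_body.zero_grad()                    # clears only the first layer's grads
print(model[2].weight.grad is None)     # False: the head's grads are still populated

model.zero_grad(set_to_none=True)       # clears the grads of every parameter
print(model[2].weight.grad is None)     # True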

48 Likes

According to
https://tigress-web.princeton.edu/~jdh4/PyTorchPerformanceTuningGuide_GTC2021.pdf (slide 6),
it seems preferable to call model.zero_grad() for performance reasons (PyTorch >= 1.7).

EDIT: thanks to ptrblck, I should have put model.zero_grad(set_to_none=True).

4 Likes

The performance gains mentioned in the tuning guide come from setting the .grad attributes to None, either manually or via the set_to_none=True argument, which avoids a memset plus a read/write operation in the next backward pass.
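
A minimal sketch of the two modes (the one-layer model below is just an illustrative assumption):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 10)).sum().backward()

optimizer.zero_grad(set_to_none=False)   # grads are zeroed in place (a memset)
print(model.weight.grad)                 # tensor of zeros

model(torch.randn(4, 10)).sum().backward()

optimizer.zero_grad(set_to_none=True)    # grads are dropped (set to None) instead
print(model.weight.grad)                 # None; the next backward writes fresh grads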

7 Likes