Model.zero_grad() or optimizer.zero_grad()?

Hi everyone,

I'm confused about when to use model.zero_grad() versus optimizer.zero_grad(). Some examples use model.zero_grad() and others use optimizer.zero_grad(). Is there a specific case that calls for one over the other?

29 Likes

I am training a network on speech data.

If you’re referring to:

optimizer = optim.SGD(net.parameters(), lr=0.1)  # SGD needs an explicit lr

They’re the same.

1 Like

I am using:
optimizer = optim.Adam(model.parameters())

In this case what should I use?

It's the same whether you use SGD, Adam, RMSprop, etc.

Typically I use optimizer.zero_grad().
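
For what it's worth, here is a minimal sketch of a typical training step (the model, loss, and data below are just illustrative assumptions) where the optimizer was built from all of model.parameters(), so either call clears the same gradients:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters())
criterion = nn.MSELoss()

x, y = torch.randn(4, 10), torch.randn(4, 1)

optimizer.zero_grad()            # or model.zero_grad(); both clear the same grads
loss = criterion(model(x), y)
loss.backward()
optimizer.step()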

model.zero_grad() and optimizer.zero_grad() are the same IF all your model parameters are in that optimizer. I find it safer to call model.zero_grad() to make sure all grads are zeroed, e.g. if you have two or more optimizers for one model.
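
A minimal sketch of that case (the two-layer model and the per-layer optimizers here are just illustrative assumptions): optimizer.zero_grad() only clears the parameters that optimizer owns, while model.zero_grad() clears everything.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Each optimizer only owns a subset of the model's parameters.
opt_body = optim.SGD(model[0].parameters(), lr=0.1)
opt_head = optim.Adam(model[2].parameters(), lr=1e-3)

model(torch.randn(4, 10)).sum().backward()

opt_body.zero_grad()                    # clears only the first layer's grads
print(model[2].weight.grad is None)     # False: the head's grads are still populated

model.zero_grad(set_to_none=True)       # clears the grads of every parameter
print(model[2].weight.grad is None)     # True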

48 Likes

According to
https://tigress-web.princeton.edu/~jdh4/PyTorchPerformanceTuningGuide_GTC2021.pdf (slide 6),
it seems preferable to call model.zero_grad() for performance reasons (PyTorch >= 1.7).

EDIT: thanks to ptrblck, I should have put model.zero_grad(set_to_none=True).

4 Likes

The performance gains mentioned in the tuning guide come from setting the .grad attributes to None, either manually or via the set_to_none=True argument, which avoids a memset plus a read/write operation in the next backward pass.
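
A minimal sketch of the two modes (the one-layer model below is just an illustrative assumption):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 10)).sum().backward()

optimizer.zero_grad(set_to_none=False)   # grads are zeroed in place (a memset)
print(model.weight.grad)                 # tensor of zeros

model(torch.randn(4, 10)).sum().backward()

optimizer.zero_grad(set_to_none=True)    # grads are dropped (set to None) instead
print(model.weight.grad)                 # None; the next backward writes fresh grads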

7 Likes