I am confused about when to use model.zero_grad() and when to use optimizer.zero_grad(). Some examples call model.zero_grad() and others call optimizer.zero_grad(). Is there a specific case where one should be preferred over the other?
model.zero_grad() and optimizer.zero_grad() are the same IF all your model parameters are in that optimizer. I find it safer to call model.zero_grad() so that every gradient is cleared, e.g. if you have two or more optimizers for one model, each holding only some of its parameters.
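To make the difference concrete, here is a minimal sketch, assuming a hypothetical two-layer model where the optimizer (called opt_first here, a name made up for illustration) only holds the first layer's parameters. The grad_cleared helper is also just for this example; it treats both None and all-zero gradients as "cleared", since the default behavior differs between PyTorch versions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model: two linear layers.
model = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 1))

# Assumption for illustration: this optimizer only owns the FIRST layer.
opt_first = torch.optim.SGD(model[0].parameters(), lr=0.1)


def grad_cleared(p):
    # "Cleared" means None or all zeros, depending on the PyTorch version.
    return p.grad is None or bool((p.grad == 0).all())


# Populate gradients for every parameter.
model(torch.randn(8, 4)).sum().backward()

# Clears only the parameters registered in opt_first.
opt_first.zero_grad()
first_after_opt = all(grad_cleared(p) for p in model[0].parameters())
second_after_opt = all(grad_cleared(p) for p in model[1].parameters())

# Clears every parameter of the model, regardless of optimizer.
model.zero_grad()
all_after_model = all(grad_cleared(p) for p in model.parameters())

print(first_after_opt, second_after_opt, all_after_model)
```

In this setup the second layer still carries its old gradient after opt_first.zero_grad(), and it would be accumulated into on the next backward unless model.zero_grad() (or that layer's own optimizer) clears it.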
The performance gains mentioned in the tuning guide come from setting the .grad attributes to None, either manually or via the set_to_none=True argument, which avoids a memset + read/write operation in the next backward.
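As a rough sketch of the two options (the tiny nn.Linear model and SGD optimizer are placeholders chosen for illustration, not taken from the tuning guide), the snippet below clears gradients by zero-filling, then via set_to_none=True, then manually:

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer, for illustration only.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Option A: keep the gradient tensors and fill them with zeros.
model(torch.randn(8, 4)).sum().backward()
optimizer.zero_grad(set_to_none=False)
zero_filled = all(
    p.grad is not None and bool((p.grad == 0).all())
    for p in model.parameters()
)

# Option B: drop the gradient tensors entirely.
model(torch.randn(8, 4)).sum().backward()
optimizer.zero_grad(set_to_none=True)
set_to_none = all(p.grad is None for p in model.parameters())

# Option C: the manual equivalent of option B.
model(torch.randn(8, 4)).sum().backward()
for p in model.parameters():
    p.grad = None
manual_none = all(p.grad is None for p in model.parameters())

print(zero_filled, set_to_none, manual_none)
```

One thing to keep in mind with the None variants: any code that reads p.grad between zero_grad() and the next backward (gradient clipping, logging, and so on) has to handle None instead of a zero tensor.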