Hi, I have been looking into the source code of the optimizer, zero_grad() function in particular.

def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()

and I was wondering if one could just exchange

p.grad.detach_()
p.grad.zero_()

with p.grad = None

In what cases would these two options make a difference?
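For context, here is a small runnable sketch of the two options side by side (recent PyTorch versions also expose the second one as `optimizer.zero_grad(set_to_none=True)`; the variable names below are just for illustration):

```python
import torch

p = torch.ones(3, requires_grad=True)
(p * 2).sum().backward()
assert p.grad is not None        # backward populated p.grad with 2s

# Option 1: zero the existing gradient tensor in place
p.grad.detach_()
p.grad.zero_()
assert torch.all(p.grad == 0)

# Option 2: drop the gradient tensor entirely
p.grad = None

# On the next backward pass, autograd allocates a fresh .grad tensor
(p * 2).sum().backward()
assert torch.all(p.grad == 2)
```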

There is one main difference:
With in-place zeroing, the Tensor containing the gradients is not reallocated at every backward pass. Since memory allocation is quite expensive (especially on GPU), this is much more efficient.
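One way to see this: zeroing in place keeps the very same gradient tensor across backward passes, while setting `.grad` to None forces autograd to allocate a fresh one. A minimal sketch:

```python
import torch

p = torch.ones(3, requires_grad=True)
(p * 2).sum().backward()

buf = p.grad        # keep a handle on the gradient tensor
p.grad.zero_()
(p * 2).sum().backward()
# In-place zeroing: autograd accumulated into the same tensor object
assert p.grad is buf

p.grad = None
(p * 2).sum().backward()
# After setting to None, autograd had to allocate a new tensor
assert p.grad is not buf
```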

There are other subtle differences between the two. For example, some optimizers behave differently depending on whether a gradient is 0 or None, and I am sure there are other places that behave like that.
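As one concrete illustration of such a difference (a sketch, assuming `torch.optim.SGD` semantics where parameters with a None gradient are skipped): SGD with momentum keeps moving a parameter whose gradient is zero, because the momentum buffer still contributes to the update, but it leaves a parameter with a None gradient untouched.

```python
import torch

p0 = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.SGD([p0], lr=0.1, momentum=0.9)
(p0 * 2).sum().backward()
opt.step()                        # builds the momentum buffer; p0 = 0.8

# Zero gradient: momentum alone still updates the parameter
p0.grad.zero_()
before = p0.detach().clone()
opt.step()                        # p0 = 0.8 - 0.1 * (0.9 * 2) = 0.62
assert not torch.equal(p0.detach(), before)

# None gradient: the parameter is skipped entirely
p0.grad = None
before = p0.detach().clone()
opt.step()
assert torch.equal(p0.detach(), before)
```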

Will setting the grads to None (instead of zero) prevent an optimizer's internal state (momentum, weight decay, etc.) from updating the model for those parameters?