In optimizer.zero_grad(), set p.grad = None?

marmelad · December 12, 2018, 9:44am

Hi, I have been looking into the source code of the optimizer, in particular the zero_grad() function.

    def zero_grad(self):
        r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.detach_()
                    p.grad.zero_()

and I was wondering if one could just replace

    p.grad.detach_()
    p.grad.zero_()

with

    p.grad = None

In what cases would these two options make a difference?

albanD · December 12, 2018, 10:25am

Hi,

There is one main difference: with the in-place detach_() and zero_(), the Tensor containing the gradients is not reallocated at every backward pass. Since memory allocation is quite expensive (especially on GPU), this is much more efficient than setting p.grad = None.
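To make the allocation point concrete, here is a minimal plain-Python sketch. It is a toy model, not PyTorch's actual code: `ToyParam`, `toy_backward` and `count_allocations` are hypothetical names, and a Python list stands in for the gradient Tensor. It only counts how often a new gradient buffer has to be created under each reset strategy.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter; `grad` is a list or None."""

    def __init__(self, size):
        self.size = size
        self.grad = None
        self.allocations = 0  # number of gradient buffers created so far


def toy_backward(p, incoming):
    """Accumulate `incoming` into p.grad, allocating only if no buffer exists."""
    if p.grad is None:
        p.grad = list(incoming)  # no buffer yet: a fresh one must be created
        p.allocations += 1
    else:
        for i, g in enumerate(incoming):
            p.grad[i] += g  # existing buffer is reused in place


def count_allocations(reset, steps=5):
    """Run `steps` backward passes, resetting the gradient after each one."""
    p = ToyParam(3)
    for _ in range(steps):
        toy_backward(p, [1.0, 1.0, 1.0])
        if reset == "zero":
            for i in range(p.size):
                p.grad[i] = 0.0  # like zero_(): keep the buffer, clear values
        else:
            p.grad = None  # like p.grad = None: drop the buffer
    return p.allocations


zero_allocs = count_allocations("zero")
none_allocs = count_allocations("none")
print(zero_allocs, none_allocs)
```

In this toy, zeroing in place creates the buffer once, while resetting to None creates a new one on every pass.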

There are other subtle differences between the two: for example, some optimizers behave differently depending on whether a gradient is 0 or None. I am sure there are other places that behave like that.
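As an illustration of the 0 versus None distinction, here is a small plain-Python sketch of a momentum update. It is a simplified toy and not the real torch.optim.SGD code: `toy_sgd_momentum_step` and `second_step_value` are hypothetical names, and the learning rate and momentum values are arbitrary. The assumption it encodes is that a parameter whose gradient is None is skipped entirely, while a zero gradient still goes through the update and lets the momentum buffer move the parameter.

```python
def toy_sgd_momentum_step(params, grads, bufs, lr=0.1, momentum=0.9):
    """One toy SGD-with-momentum step over lists of scalars, updated in place."""
    for i, g in enumerate(grads):
        if g is None:
            continue  # skipped: neither the buffer nor the parameter changes
        bufs[i] = momentum * bufs[i] + g  # a zero gradient still decays the buffer
        params[i] -= lr * bufs[i]         # ... and the buffer still moves the param


def second_step_value(reset_value):
    """Take one real step, then a second step with the gradient set to reset_value."""
    params, bufs = [1.0], [0.0]
    toy_sgd_momentum_step(params, [1.0], bufs)          # builds up momentum
    toy_sgd_momentum_step(params, [reset_value], bufs)  # gradient is 0.0 or None
    return params[0]


after_zero = second_step_value(0.0)   # parameter keeps moving (roughly 0.81)
after_none = second_step_value(None)  # parameter stays where it was (roughly 0.9)
print(after_zero, after_none)
```

In this sketch a zeroed gradient does not freeze the parameter, because the accumulated momentum is still applied; only None leaves it untouched.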

marmelad · December 12, 2018, 12:54pm

Thank you very much for the response!

SofiaCP · December 16, 2021, 10:52am

Hi @albanD.

Will setting the grads to None (instead of zero) prevent an optimizer’s internal state (momentum, weight decay, etc.) from updating those parameters of the model?

Thanks!

ptrblck · December 16, 2021, 11:09am

Double post from here.