In optimizer.zero_grad(), set p.grad = None?

Hi,

There is one main difference: with zero_grad(), the Tensor containing the gradients is not reallocated at every backward pass; it is zeroed in place and reused. Since memory allocation is quite expensive (especially on GPU), this is much more efficient.
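Roughly, the two options look like this (a minimal sketch; the explicit loops over parameters are just for illustration):

```python
import torch

model = torch.nn.Linear(4, 2)
out = model(torch.randn(8, 4)).sum()
out.backward()

# Option 1: zero the gradients in place.
# The same Tensors stay allocated and the next backward pass
# accumulates into them, so nothing has to be reallocated.
for p in model.parameters():
    if p.grad is not None:
        p.grad.zero_()

# Option 2: drop the gradient Tensors.
# The next backward pass must allocate fresh Tensors for p.grad.
for p in model.parameters():
    p.grad = None
```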

There are other subtle differences between the two, for example some optimizers behave differently when a gradient is 0 versus None. I am sure there are other places that behave like that as well.
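For instance, here is a small sketch of that difference with SGD (assuming the usual behavior that optimizers skip parameters whose .grad is None): a zero gradient still gets weight decay applied, while a None gradient means the parameter is not touched at all.

```python
import torch

p_zero = torch.nn.Parameter(torch.ones(3))
p_none = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.SGD([p_zero, p_none], lr=0.1, weight_decay=0.5)

p_zero.grad = torch.zeros_like(p_zero)  # gradient of 0
p_none.grad = None                      # no gradient at all

opt.step()
print(p_zero)  # changed: weight decay is still applied even though grad == 0
print(p_none)  # unchanged: parameters with grad=None are skipped by step()
```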