This question is a bundle of a few things I’ve struggled to answer from the docs.
At the top level, I am interested in approaches to maintaining “flat grads” – i.e., the model gradients live in a single contiguous buffer (of the same dtype as the model params), and the .grad attribute of each param is a view into that contiguous buffer. Afaict, this is still possible (if not exactly encouraged) by manually assigning .grad to a view tensor, like so:
import torch

hidden = 128  # example size
m = torch.nn.Linear(hidden, hidden).cuda()
buf = torch.zeros(m.weight.numel(), dtype=m.weight.dtype, device=m.weight.device)
m.weight.grad = buf[: m.weight.numel()].view_as(m.weight)
# ... and so on for all of the parameters (see the helper sketch below)
# After running fwd + bwd, the grad still aliases the flat buffer:
assert buf.untyped_storage().data_ptr() == m.weight.grad.untyped_storage().data_ptr()
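
For completeness, this is roughly how I set it up for a whole module. Just a sketch – attach_flat_grads is my own helper name, and it assumes every param shares the same dtype/device:

import torch

def attach_flat_grads(module: torch.nn.Module) -> torch.Tensor:
    params = [p for p in module.parameters() if p.requires_grad]
    total = sum(p.numel() for p in params)
    # One contiguous buffer in the params' dtype/device (assumes they all match).
    flat = torch.zeros(total, dtype=params[0].dtype, device=params[0].device)
    offset = 0
    for p in params:
        # Each .grad becomes a view into the flat buffer.
        p.grad = flat[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
    return flat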
Two partially related questions about this technique:

How does this interact with CUDA graph capture – and more generally, is it possible to reason at all about the pointers baked into a captured CUDA graph? By definition, the output .grad of each gradient-computing kernel needs to be stable across graph replays – so if I do something like the above, will capture record the pointer into buf from the first run and reuse it on subsequent replays? Or does graph capture need to allocate the .grad attributes from its own internal memory pool?
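
For what it's worth, here is a minimal sketch of how I'd probe this empirically, following the warmup-then-capture pattern from the torch.cuda graphs docs (sizes and the warmup count are placeholders); I'm not assuming this is how the allocator behaves, just checking whether the flat buffer survives a replay:

import torch

hidden = 128
m = torch.nn.Linear(hidden, hidden).cuda()
buf = torch.zeros(m.weight.numel(), dtype=m.weight.dtype, device=m.weight.device)
m.weight.grad = buf[: m.weight.numel()].view_as(m.weight)
static_x = torch.randn(hidden, hidden, device="cuda")

# Warm up on a side stream before capture, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        m(static_x).sum().backward()
torch.cuda.current_stream().wait_stream(s)

# Capture one fwd + bwd with the buf-backed .grad already in place.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    m(static_x).sum().backward()

# Replay and check whether the grad still aliases the user-allocated flat buffer.
g.replay()
print(m.weight.grad.untyped_storage().data_ptr() == buf.untyped_storage().data_ptr())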
There is (afaik) a small optimization when calling zero_grad(set_to_none=True): the computed gradient doesn't need to be accumulated into an existing .grad during backprop, but can instead just be stored as .grad directly (saving an extra memory round trip) – I've heard this referred to as “grad copy elision”. Is there any way to convince autograd to perform the same optimization in the flat-grads case above, where I want the grad pointer to remain unchanged but don't care about accumulation?
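
To make sure I'm describing the same optimization, here is my (possibly wrong) understanding of the two paths, as a small sketch:

import torch

hidden = 128
m = torch.nn.Linear(hidden, hidden).cuda()
x = torch.randn(hidden, hidden, device="cuda")

# Path 1: .grad is None -> the freshly computed gradient tensor is simply
# stored as .grad (no separate accumulate/add pass over an existing grad).
m.zero_grad(set_to_none=True)
m(x).sum().backward()
first_ptr = m.weight.grad.untyped_storage().data_ptr()

# Path 2: .grad already exists (e.g. my flat-buffer view) -> backward
# accumulates into it in place, so the pointer stays put, but we pay for
# the extra read-modify-write of the existing grad.
m(x).sum().backward()
assert m.weight.grad.untyped_storage().data_ptr() == first_ptr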
Thanks!