This question is a bundle of a few things I’ve struggled to answer from the docs.
At the top level, I am interested in approaches to maintaining “flat grads” – i.e., keeping all model gradients in a single contiguous buffer (of the same dtype as the model params), with each param's `.grad` attribute pointing into that contiguous buffer. Afaict this is still possible (if not exactly encouraged) by manually assigning `.grad` to a view tensor, like so:
```python
import torch

hidden = 16
m = torch.nn.Linear(hidden, hidden).cuda()

buf = torch.zeros(m.weight.numel(), dtype=m.weight.dtype, device=m.weight.device)
m.weight.grad = buf[0:m.weight.numel()].view_as(m.weight)
# ...and so on for all the parameters.

# After running fwd + bwd, we have:
assert buf.untyped_storage().data_ptr() == m.weight.grad.untyped_storage().data_ptr()
```
Two partially related questions about this technique:
First: how does this interact with CUDA graph capture – and more generally, is it possible to reason at all about the pointers baked into a captured CUDA graph? By definition, the output `.grad` of each gradient-computing kernel needs a stable address across graph replays – so if I do something like the above, will capture record the pointer-into-`buf` it sees on the first run and reuse it on subsequent replays? Or does graph capture need to allocate the `.grad` tensors out of its own internal memory pool?
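For concreteness, here is roughly the capture recipe I have in mind – a sketch, not something I've validated on every setup. `attach_flat_grads` is just my own helper, and the side-stream warmup follows the usual pattern from the `torch.cuda.graph` docs; the code falls back to eager on machines without CUDA:

```python
import torch

def attach_flat_grads(model):
    """Point every param's .grad at a slice of one contiguous buffer.

    (My own helper, not a torch API.)
    """
    params = list(model.parameters())
    total = sum(p.numel() for p in params)
    buf = torch.zeros(total, dtype=params[0].dtype, device=params[0].device)
    offset = 0
    for p in params:
        p.grad = buf[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
    return buf

hidden = 8
m = torch.nn.Linear(hidden, hidden)
buf = attach_flat_grads(m)

if torch.cuda.is_available():
    m.cuda()
    buf = attach_flat_grads(m)  # re-attach now that params moved to device
    static_x = torch.randn(2, hidden, device="cuda")

    # Warm up a few iterations on a side stream before capture,
    # as recommended in the CUDA graphs docs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            m.zero_grad(set_to_none=False)  # keep the flat views alive
            m(static_x).sum().backward()
    torch.cuda.current_stream().wait_stream(s)

    # Capture fwd + bwd; the hope is that accumulation into the
    # pre-existing .grad views bakes stable pointers into the graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        m(static_x).sum().backward()
    g.replay()
    # .grad views were never reassigned, so they should still alias buf.
    assert m.weight.grad.data_ptr() == buf.data_ptr()
```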
Second: there is (afaik) a small optimization when calling `zero_grad(set_to_none=True)`: since `.grad` starts out as `None`, the computed gradient doesn't need to be accumulated during backprop and can instead just be stored (saving an extra memory round trip) – I've heard this referred to as “grad copy elision”. Is there any way to convince autograd to perform this same optimization in the flat-grads case defined above, where I want the grad pointer to remain unchanged, but I don't care about accumulating into it?
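To make the tension concrete, here's a small CPU repro (sizes arbitrary) of what I observe: accumulating into a pre-assigned `.grad` view keeps the pointer inside the flat buffer, while `set_to_none=True` makes the next backward allocate a fresh gradient tensor, losing the flat layout:

```python
import torch

torch.manual_seed(0)
hidden = 4
m = torch.nn.Linear(hidden, hidden)  # CPU for illustration; same idea on CUDA

# One contiguous buffer covering weight and bias
n_w, n_b = m.weight.numel(), m.bias.numel()
buf = torch.zeros(n_w + n_b, dtype=m.weight.dtype)
m.weight.grad = buf[:n_w].view_as(m.weight)
m.bias.grad = buf[n_w:].view_as(m.bias)

x = torch.randn(2, hidden)
m(x).sum().backward()

# Accumulation happens in place: grads still alias buf
assert m.weight.grad.data_ptr() == buf.data_ptr()
assert buf.abs().sum().item() > 0

# set_to_none=True discards the views...
m.zero_grad(set_to_none=True)
m(x).sum().backward()
# ...so the next backward stores a freshly allocated grad outside buf
assert m.weight.grad.data_ptr() != buf.data_ptr()
```

So the question is whether I can get the “store, don't accumulate” behavior of the second backward while keeping the pointer stability of the first.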