Flat model grads, CUDA graph capture, and grad-copy-elision

This question is a bundle of a few things I’ve struggled to answer from the docs.

At the top level, I am interested in approaches to maintaining “flat grads” – i.e., the model gradients are a single contiguous buffer (of the same dtype as the model params), and the .grad attribute of each param is a view into that contiguous buffer. Afaict, this is still possible (if not exactly encouraged) by manually mutating .grad to be a view tensor like so:

import torch

hidden = 128  # example size
m = torch.nn.Linear(hidden, hidden).cuda()
buf = torch.zeros(m.weight.numel(), dtype=m.weight.dtype, device=m.weight.device)
m.weight.grad = buf[0:m.weight.numel()].view_as(m.weight)
# And so on for all the parameters
# After running fwd + bwd, we have:
assert buf.untyped_storage().data_ptr() == m.weight.grad.untyped_storage().data_ptr()
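For reference, the same technique extended to every parameter of the module (a CPU-friendly sketch of my own; the sizes and the loop-over-offsets layout are illustrative, not prescribed anywhere):

```python
import torch

hidden = 4  # hypothetical example size
m = torch.nn.Linear(hidden, hidden)  # .cuda() dropped so this also runs on CPU

# One contiguous buffer large enough for all grads, then carve out a view
# per parameter at increasing offsets.
total = sum(p.numel() for p in m.parameters())
buf = torch.zeros(total, dtype=next(m.parameters()).dtype)

offset = 0
for p in m.parameters():
    p.grad = buf[offset:offset + p.numel()].view_as(p)
    offset += p.numel()

# After fwd + bwd, every .grad still aliases the flat buffer.
m(torch.randn(2, hidden)).sum().backward()
assert all(
    p.grad.untyped_storage().data_ptr() == buf.untyped_storage().data_ptr()
    for p in m.parameters()
)
```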

Two partially related questions about this technique:

  1. How does this interact with CUDA graph capture – and more generally, is it possible to reason about the pointers in a captured CUDA graph at all? By definition, the output .grad of each gradient-computing kernel needs to be stable across graph executions – so if I do something like the above, will capture record the pointer into buf on the first run and reuse it in subsequent runs? Or does graph capture need to allocate the .grad attributes from its own internal memory pool?

  2. There is (afaik) a small optimization when calling zero_grad(set_to_none=True): the computed gradient doesn’t need to be accumulated into an existing .grad during backprop but can instead be stored directly (saving an extra memory round trip) – I’ve heard this referred to as “grad copy elision”. Is there any way to convince autograd to perform this same optimization in the flat-grads case defined above, where I want to ensure the grad pointer remains unchanged, but I don’t care about accumulating into it?


In the whole-network capture example the gradients are explicitly deleted via set_to_none=True with the comment:

Sets grads to None before capture, so backward() will create .grad attributes with allocations from the graph’s private pool

which points towards your last suggestion. A bit further down, the example then calls g.replay(), which might look a bit odd since no zero_grad call was captured. However, the comment there explains:

replay() includes forward, backward, and step.
You don’t even need to call optimizer.zero_grad() between iterations
because the captured backward refills static .grad tensors in place.

Based on this I would assume deleting the gradients could break the capture. However, it would still be interesting to see whether you could “bake in” the flattened gradient buffers without deleting them before capture, or whether this would directly break and raise errors.
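That experiment is easy enough to script. Below is an untested sketch of it, following the warmup-on-a-side-stream pattern from the CUDA graphs docs; torch.cuda.CUDAGraph and torch.cuda.graph are real APIs, but whether pre-assigned flat .grad views survive capture is exactly the open question, so treat the outcome as unknown:

```python
import torch

def try_flat_grads_under_capture(hidden=64):
    """Pre-assign flat .grad views, then capture fwd+bwd and replay.

    Returns True if replay mutated the flat buffer, None without a GPU.
    May raise if capture rejects the externally allocated .grad views.
    """
    if not torch.cuda.is_available():
        return None  # nothing to test without a GPU

    m = torch.nn.Linear(hidden, hidden).cuda()
    total = sum(p.numel() for p in m.parameters())
    buf = torch.zeros(total, device="cuda")
    off = 0
    for p in m.parameters():
        p.grad = buf[off:off + p.numel()].view_as(p)
        off += p.numel()

    static_in = torch.randn(8, hidden, device="cuda")

    # Warmup on a side stream, as the capture docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            m(static_in).sum().backward()
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        m(static_in).sum().backward()

    # If capture succeeded, replay should keep accumulating into buf.
    before = buf.clone()
    g.replay()
    torch.cuda.synchronize()
    return bool((buf != before).any())
```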