I use a hook to store gradient information during testing, but this causes a CUDA out-of-memory error. How do I release the information stored by the hook?
I assume storing the gradient information increases the memory usage and you are thus running OOM. You can delete tensors via
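A minimal sketch of releasing what a hook has stored, assuming the hook appends gradients to a plain Python list. The names stored and save_grad are hypothetical and not from the original question. Moving the stored copy to the CPU inside the hook keeps it from holding GPU memory in the first place.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
stored = []  # hypothetical container filled by the hook


def save_grad(grad):
    # Keep a detached CPU copy so the stored tensor holds no GPU memory.
    stored.append(grad.detach().cpu())


x = torch.randn(3, 4)
out = model(x)
handle = out.register_hook(save_grad)  # fires when the gradient of `out` is computed
out.sum().backward()
num_stored = len(stored)

# Release everything once the stored information is no longer needed.
handle.remove()                    # stop the hook from storing more tensors
stored.clear()                     # drop the references held by the list
model.zero_grad(set_to_none=True)  # drop the .grad tensors of the parameters
if torch.cuda.is_available():
    torch.cuda.empty_cache()       # return cached, unused blocks to the driver
```

Memory is only freed once no Python reference to a tensor remains, so any other variable that still points at a stored gradient has to be deleted as well.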
Hi, ptrblck! Could you please help me with a little related question:
Is it possible to deallocate CUDA memory earlier in the backward pass using hooks? It seems that PyTorch releases all gradients at once (I have heard some people say this, but I cannot confirm it myself).
Thanks in advance, and thank you for consistently helping people on this forum!
I’m unsure what exactly you want to delete.
model.zero_grad() will use set_to_none=True in recent PyTorch releases and will thus delete the .grad attributes of the corresponding parameters.
However, this is done after calling optimizer.step() to update the parameters with the calculated gradients. Deleting gradients in a backward hook won’t make sense, since you won’t be able to update the parameters at all.
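A minimal sketch of that ordering, assuming a toy nn.Linear model and plain SGD (both chosen only for illustration). set_to_none=True is passed explicitly so the behavior does not depend on the default of the installed release.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(3, 4)).sum()
loss.backward()
# After backward, every parameter holds a gradient tensor in .grad.
grads_present = all(p.grad is not None for p in model.parameters())

optimizer.step()                   # consume the gradients first
model.zero_grad(set_to_none=True)  # then drop the .grad tensors
grads_released = all(p.grad is None for p in model.parameters())
```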
If you are thinking about the intermediate forward activations, I also don’t think you would be able to delete them in backward hooks, as I guess that references are still stored (outside of the backward hook). Even if you del the saved tensors or gradient tensors, I would guess Autograd would still keep the references and eventually free them when the last reference was deleted.
However, I didn’t test this behavior so it’s a guess.
I would like to try it anyway in some personal experiments. So is there any API to remove the gradient of a given operator/backward node during the backward pass? (Or maybe you just answered me, since the API is the set_to_none option?) By "remove" I mean that I want to release the GPU memory the gradient occupies during the backward pass. It’s OK if this requires writing some C++ or changing some of the underlying code.
Thanks for reading.
I tried to delete the .grad attribute using .register_hook, but this didn’t work unfortunately.
@albanD would you have a recommendation how to run such experiments?
I know CPU-offloading uses context managers to at least move tensors to the CPU, but I assume these are forward activations (and other stored tensors needed for backward) and not the actual gradients?
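One way such an experiment might look, as a sketch on a toy nn.Linear model (the helper try_to_clear and the list seen_in_hook are hypothetical names). A hook registered with register_hook on a parameter is called before the gradient is written to .grad, so clearing the attribute inside it has no lasting effect.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
seen_in_hook = []  # records whether .grad was still empty inside each hook


def try_to_clear(param):
    def hook(grad):
        # The hook runs before accumulation, so .grad is not populated yet.
        seen_in_hook.append(param.grad is None)
        param.grad = None  # no lasting effect: .grad is written afterwards
    return hook


handles = [p.register_hook(try_to_clear(p)) for p in model.parameters()]
model(torch.randn(3, 4)).sum().backward()

# The gradients are still there once backward has finished.
still_populated = all(p.grad is not None for p in model.parameters())
```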
Oh yes, because such a hook runs before we accumulate into the
I think you can either use autograd.grad, so that you don’t have issues with the .grad fields being populated, or use another trick: register a post-Node hook and clear the .grad field there.
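A minimal sketch of the autograd.grad option on a toy nn.Linear model (chosen only for illustration): the gradients come back as ordinary tensors, the .grad fields are never written, and each returned gradient can be dropped as soon as it has been used.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
params = list(model.parameters())

loss = model(torch.randn(3, 4)).sum()
# Returns one gradient tensor per entry in `params` instead of filling .grad.
grads = torch.autograd.grad(loss, params)

shapes_match = all(g.shape == p.shape for g, p in zip(grads, params))
grad_fields_empty = all(p.grad is None for p in params)

# The caller owns the returned tensors and decides when to release them.
del grads
```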
One example I have of this makes the optimizer a hook: PyTorch optimizer as hook · GitHub
Thanks for the example!