I’m looking at the docstring for register_post_accumulate_grad_hook in pytorch/torch/_tensor.py at v2.6.0 on GitHub.
Is this a guarantee that the hook will only be called once per parameter per backward pass? Even if I’m training something like an RNN, where the same parameter’s grad is accumulated multiple times within a single backward pass (see the sketch below)? Or is that not what it’s saying? I can’t tell from the docs or the blog post.
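To make the RNN case concrete, here’s a minimal sketch (not from my code) of what I mean by a parameter whose grad gets accumulated multiple times in one backward pass:

```python
import torch
import torch.nn as nn

# A single weight reused at every "timestep", so several gradient
# contributions get accumulated into w.grad during one backward().
w = nn.Parameter(torch.randn(4, 4))

calls = []
w.register_post_accumulate_grad_hook(lambda p: calls.append(p.grad.clone()))

h = torch.randn(4)
for _ in range(5):              # unrolled-RNN-style reuse of w
    h = torch.tanh(h @ w)
h.sum().backward()

# Is this guaranteed to print 1, i.e. one hook call per backward pass,
# no matter how many times w was used in the graph?
print(len(calls))
```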
My specific use case is an optimizer that calls register_post_accumulate_grad_hook on the parameters you pass into it. I’m using FSDP2’s CPUOffloadPolicy, so the parameters are already on the CPU. Inside the hook I get the tensor data and do the optimizer step in C++.
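In rough Python terms the pattern looks like the sketch below. It is not my actual code — the real step happens in C++ and deals with the local shards FSDP2 hands back — so a plain in-place SGD update stands in for the C++ kernel here:

```python
import torch

class HookBasedOptimizer:
    """Sketch only: a plain SGD update stands in for the C++ step,
    and DTensor/shard handling under FSDP2 is left out."""

    def __init__(self, params, lr=1e-2):
        self.lr = lr
        self.handles = [
            p.register_post_accumulate_grad_hook(self._step_one) for p in params
        ]

    @torch.no_grad()
    def _step_one(self, p):
        # Fires once autograd has finished accumulating into p.grad.
        # With CPUOffloadPolicy, p and p.grad are already CPU tensors.
        p.add_(p.grad, alpha=-self.lr)   # placeholder for the C++ optimizer step
        p.grad = None                    # drop the gradient right away

# usage: no explicit opt.step() -- the updates happen inside backward()
model = torch.nn.Linear(8, 8)
opt = HookBasedOptimizer(model.parameters())
model(torch.randn(2, 8)).sum().backward()
```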
This is all implemented here:
Empirically it works, but optimizers are probably tolerant of this sort of thing, and I’d like to better understand what the autograd engine does before it fires these hooks. Is the logic “Ah, I have no more references to this tensor in my graph, let’s fire the hook,” or is it something more naive?
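For what it’s worth, this is the kind of toy probe I have in mind (again just a sketch): one backward over a graph that uses the parameter twice, versus two separate backward calls that both accumulate into the same .grad. What I can’t tell from the docs is which of these behaviors is guaranteed versus an implementation detail:

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
fired = []
p.register_post_accumulate_grad_hook(lambda t: fired.append(t.grad.sum().item()))

# Case 1: one backward over a graph that uses p twice.
((p * 2).sum() + (p * 3).sum()).backward()
print(len(fired))            # once per backward, or once per accumulation?

# Case 2: two separate backward passes, both accumulating into p.grad.
(p * 2).sum().backward()
(p * 3).sum().backward()
print(len(fired), p.grad)
```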