I am trying to record the values of the gradients as they propagate back through time in an RNN. Initially I thought this would be easily accomplished using the register_hook function by calling it on each parameter, but I now realize that, for some reason, the hook is not called at each time step, contrary to what I had understood. Take for example the following code:

import torch
x = torch.randn((1, 1))
w = torch.ones((2, 1), requires_grad=True)
z = w * x
z = z * w
z = z * w
z = z * w
def print_grad(grad):
    print(grad)
h = z.register_hook(print_grad)
z.sum().backward()

This produces as output:

tensor([[1.], [1.]])

Instead of the same tensor repeated four times, once per multiplication.

What am I missing here? If nothing, is there a way to record the gradients at each point in time?

The hook will give you the gradient at the point where the hook is registered.
If you want to get it after every computation, you need to register one after every computation.
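For example, something along these lines (a sketch; the hook and list names are just illustrative): since each multiplication produces a new tensor, you register a hook on each intermediate result rather than only on the last one.

```python
import torch

x = torch.randn((1, 1))
w = torch.ones((2, 1), requires_grad=True)

calls = []

def print_grad(grad):
    calls.append(grad)  # keep a record of every gradient the hook sees
    print(grad)

z = w * x
z.register_hook(print_grad)      # hook on the first intermediate tensor
for _ in range(3):
    z = z * w                    # each step creates a new tensor...
    z.register_hook(print_grad)  # ...which needs its own hook

z.sum().backward()               # the hook now fires four times, once per step
```

With w set to all ones, each printed gradient is the same tensor of ones, but the hook does fire once per intermediate tensor.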

That would certainly work, but then I still have to modify the code in all the RNN models to save the gradients, which is not what I want: the recording and the model itself should stay separate. It's also not what the documentation implies. Taken directly from there:

def register_hook(self, hook):
    r"""Registers a backward hook.

    The hook will be called every time a gradient with respect to the
    Tensor is computed. The hook should have the following signature::

        hook(grad) -> Tensor or None
    ...

Which implies that it should be called at each time step, since a gradient w.r.t. the parameters is computed at each of them. Yet when I do this, I only get one gradient.

Maybe someone wants the gradient for the whole sequence, but that can be obtained using the register_backward_hook method of the Module class.

Every time you do z = foo(z) in your RNN, you most likely create a new Tensor that is then assigned to the Python variable named z. But the original Tensor that z contained and the new one have nothing in common. So if you add a hook to the second one, you won't get the gradient for the first one.

This comment refers to the case where you call .backward() multiple times.

Yes, but I am trying to compute the gradient w.r.t. the parameters, which are tensors that do not change and are reused across time steps. Also, I only call backward once. From what I read in the docs, this should give me a gradient for each time step and each parameter (i.e. input and recurrent weights and biases), unless I have a bug or misunderstood something…

If you use a Tensor multiple times, the gradient associated to it is the sum of the gradients corresponding to each use.
If you want to get the gradient for a single use, you need to make a temporary tensor corresponding to that single use.
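Concretely, a per-use temporary can be made with an identity op like `w * 1.0` (the name `w_step` is illustrative), so each use of w gets its own node in the graph and its own hook:

```python
import torch

x = torch.randn((1, 1))
w = torch.ones((2, 1), requires_grad=True)

step_grads = []

def save_grad(grad):
    step_grads.append(grad)

z = w * x
for _ in range(3):
    w_step = w * 1.0             # fresh tensor for this single use of w
    w_step.register_hook(save_grad)
    z = z * w_step

z.sum().backward()
print(len(step_grads))           # one gradient per use, not their sum
```

Each entry in step_grads is the gradient for that single use of w, while w.grad still accumulates across all uses as usual.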

So interestingly enough, the gradient returned at every time step by this hook is the same. I'm guessing it's the sum of the gradients of z over all time steps. Can someone confirm this?
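That can be checked directly (a sketch using a `w * 1.0` temporary per use; variable names are illustrative): w.grad should equal the contribution from the first use, z = w * x, plus the sum of the per-use gradients captured by the hooks.

```python
import torch

x = torch.randn((1, 1))
w = torch.ones((2, 1), requires_grad=True)

per_use = []
z = w * x                        # first use of w (not wrapped in a temporary)
for _ in range(3):
    u = w * 1.0                  # one temporary per subsequent use of w
    u.register_hook(per_use.append)
    z = z * u

z.sum().backward()

# contribution from the first use (z = w * x) is x, broadcast to w's shape;
# w.grad should be that plus the sum of the per-use gradients
total = x.expand(2, 1) + sum(per_use)
print(torch.allclose(w.grad, total))  # True
```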