Recording gradients in RNNs at each point in time

Hi,

I am trying to record the values of the gradients as they propagate backward in time through an RNN. Initially I thought this would be easily accomplished with the register_hook function, by calling it on each parameter, yet I now realize that, for some reason, the hook is not called at each step in time, contrary to what I had understood. Take for example the following code:

import torch

x = torch.randn((1, 1))
w = torch.ones((2, 1), requires_grad=True)

z = w * x
z *= w
z *= w
z *= w

def print_grad(grad):
    print(grad)

h = z.register_hook(print_grad)
z.sum().backward()

This produces as output:

tensor([[1.],  [1.]])

Instead of the same gradient repeated four times, which is what I expected.

What am I missing here? If nothing, is there a way to record the gradients at each point in time?

Thanks in advance!

The hook will give you the gradient at the point where the hook is registered.
If you want to get it after every computation, you need to register one after every computation.


Oh, I see. I guess for feed-forward models that’s good enough. But for recurrent networks it makes tracing gradients backward in time a bit hard.

Thanks for the answer!

Well, if you have a single call inside a for-loop, you will still get one call to the hook per time it was applied, so that works for RNNs as well 🙂

def print_grad(grad):
    print(grad)

for i in range(4):
    z = z * w  # out-of-place: creates a new tensor to hook at each step
    z.register_hook(print_grad)

That will register one hook per iteration of the loop.
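To make this concrete, here is a self-contained sketch of that pattern (`x`, `w`, and `z` follow the original post; the `seen` list is just for illustration):

```python
import torch

x = torch.randn(1, 1)
w = torch.ones(2, 1, requires_grad=True)
z = w * x

seen = []  # gradients collected by the hook, one per step

def print_grad(grad):
    seen.append(grad)
    print(grad)

# z = z * w creates a new tensor each iteration, so each
# register_hook call attaches to that step's tensor only
for i in range(4):
    z = z * w
    z.register_hook(print_grad)

z.sum().backward()  # the hook fires once per registered hook, i.e. 4 times
```

Because every iteration hooks a distinct intermediate tensor, backward triggers the hook once per time step rather than once at the end.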

That would certainly work, but I still have to modify the code in all the RNN models to save the gradients, which is not what I want. The recording and the model itself should be separate. It’s also not what the documentation implies. Taken directly from there:

def register_hook(self, hook):
    r"""Registers a backward hook.

    The hook will be called every time a gradient with respect to the
    Tensor is computed. The hook should have the following signature::

        hook(grad) -> Tensor or None
...

Which implies that it should be called at each time step, since we compute a gradient w.r.t. the parameters for each of them. Yet when I do this, I only get one gradient.

Maybe someone wants the gradient for the whole sequence, but that can be obtained using the register_backward_hook method of the Module class.

Every time you do z = foo(z) in your RNN, you most likely create a new Tensor that is then assigned to the Python variable named z. But the original Tensor in z and the new one have nothing in common. So if you add a hook to the second one, you won’t get the gradient for the first one.
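A minimal sketch of that distinction (the `seen` dict and its keys are just illustrative names):

```python
import torch

w = torch.ones(2, 1, requires_grad=True)
seen = {}

z = w * 2                                      # the "old" z
z.register_hook(lambda g: seen.update(old=g))

z = z * 3                                      # z = foo(z): a brand-new tensor
z.register_hook(lambda g: seen.update(new=g))

z.sum().backward()
# seen["new"] is the gradient w.r.t. the new tensor;
# seen["old"] is that gradient chained through the extra * 3
```

The two hooks receive different gradients, because they are attached to two different tensors even though both were, at some point, called `z` in Python.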

This comment refers to the case where you call .backward() multiple times.

Yes, but I am trying to compute the gradient w.r.t. the parameters, which are tensors that do not change and are reused across time steps. Also, I only call backward once. From what I read in the docs, this should give me a gradient for each time step and parameter (i.e. the input and recurrent weights and biases), unless I have a bug or misunderstood something…

If you use a Tensor multiple times, the gradient associated to it is the sum of the gradients corresponding to each use.
If you want to get the gradient for a single use, you need to make a temporary tensor corresponding to that single use.
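For example, cloning `w` once per step gives each use its own tensor to hook (a sketch; `w_t` and `grads` are illustrative names):

```python
import torch

x = torch.randn(1, 1)
w = torch.ones(2, 1, requires_grad=True)

grads = {}  # step index -> gradient of w for that single use

z = w * x
for t in range(4):
    w_t = w.clone()  # a distinct, differentiable "copy" of w for step t
    w_t.register_hook(lambda g, t=t: grads.update({t: g}))
    z = z * w_t

z.sum().backward()
# grads[t] holds the gradient w.r.t. w's use at step t alone,
# while w.grad accumulates the sum over all uses (including w * x)
```

Since `clone` is differentiable, `w.grad` is unchanged by this trick; the per-step hooks simply let you observe each contribution before it is summed into `w.grad`.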

Interestingly enough, the gradient returned at every time step by this hook is the same. I’m guessing it’s the sum of the gradients of z over all time steps. Can someone confirm this?

If the Tensor is the same, then yes.


And just out of curiosity: is there no way to extract the gradients for every time step separately (rather than dealing with the sum)?

As mentioned above, you can do that by having a different Tensor (with same content) for each step.
