Implement models that include gradients

As a toy example, suppose we want to implement an RNN that maps an input sequence $x_t$ to an output sequence $y_t$ via hidden states $h_t$.

We assume the outputs to be scalars. They are computed as $y_t=F(x_t,h_{t-1})$ with a scalar-valued feedforward neural network $F$.

Let us further assume that the hidden states should be computed as
$$ h_t = \frac{\partial F}{\partial h}(x_t,h_{t-1}). $$

How can this derivative be computed in the forward pass?
The problem is that the derivative is w.r.t. the non-leaf node $h_{t-1}$.
So we cannot simply do:

def RNN_step(x, h_prev):
  h_prev.requires_grad = True # this does not work
  y = F(x, h_prev)
  y.sum().backward(create_graph=True) # summing over the batch dimension
  h = h_prev.grad
  h_prev.requires_grad = False
  return y, h

Does anyone have a hint for this?

Create a nested autodiff graph, with cloned input tensors (using .detach().clone().requires_grad_()). Evaluate the function of interest, compute gradients with autograd.grad. Do all this inside autograd.Function’s forward(), plug output & gradients into the outer graph. There are snippets of this somewhere in this forum…

Thank you for these hints! I tried to fill in the blanks and came up with this:

class RNNstep(torch.autograd.Function):
  @staticmethod
  def forward(ctx, x_in, h_in):
    with torch.enable_grad():
      x_in.requires_grad = True
      h_in_copy = h_in.detach().clone().requires_grad_()
      y_out = F(x_in, h_in_copy)
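      # The arguments of this call were cut off in the post; they are filled in
      # here as an assumption, following the first attempt above.
      h_out, = torch.autograd.grad(y_out.sum(), h_in_copy, create_graph=True)
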
    ctx.save_for_backward(x_in, h_in_copy, y_out, h_out)

    return y_out.detach(), h_out.detach()

  @staticmethod
  def backward(ctx, grad_y_out, grad_h_out):
    x_in, h_in_copy, y_out, h_out = ctx.saved_tensors

    grad_x_in1, = torch.autograd.grad(y_out, x_in, grad_y_out, retain_graph=True)
    grad_x_in2, = torch.autograd.grad(h_out, x_in, grad_h_out, retain_graph=True)

    grad_h_in1, = torch.autograd.grad(y_out, h_in_copy, grad_y_out, retain_graph=True)
    grad_h_in2, = torch.autograd.grad(h_out, h_in_copy, grad_h_out, retain_graph=True)

    return grad_x_in1 + grad_x_in2, grad_h_in1 + grad_h_in2

Although this code runs, I am quite sure that it is not correct.
For example: Do the gradients w.r.t. the parameters of F get accumulated at all?

You spoke of snippets that I could find somewhere in this forum. Can you be a bit more specific about what I should search for? The closest thing I found was this.

I am not quite sure what you mean by “plug output & gradients into the outer graph”.

I tried to learn a lot about torch.autograd today, but apparently I have not really figured it out yet…

I have trouble understanding your intentions in this code. In particular, h_out is a gradient, but it is handled the same way as the regular output y_out.

In general, I’d suggest comparing empirical gradients that your custom function writes out with expected / externally precalculated values.
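One way to do such a comparison is torch.autograd.gradcheck, which compares the backward() of a function against finite differences. The following is only a minimal sketch on a toy function; Square is a hypothetical stand-in and not the RNN step from this thread, and double precision is used because gradcheck is sensitive to rounding.

```python
import torch

class Square(torch.autograd.Function):
    """Toy custom function y = x**2 with a hand-written backward (hypothetical example)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Analytic derivative of x**2, multiplied by the incoming gradient.
        return 2.0 * x * grad_out

# gradcheck needs double-precision inputs that require grad.
x = torch.randn(5, dtype=torch.double, requires_grad=True)

# Returns True if analytic and numerical gradients agree, raises otherwise.
ok = torch.autograd.gradcheck(Square.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

If the hand-written backward were wrong (for example returning x * grad_out), gradcheck would raise an error describing the mismatch.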

I’ve used this technique a bit differently; for elementwise functions the Jacobian is diagonal, so an extra grad() call in backward is unnecessary. It looked like:

#in forward
dY_dX = torch.autograd.grad(y, x_autograd, y.new_ones(1).expand_as(y))
#in backward
known_grads = ctx.saved_tensors
grad_inputs[idx] = grad_output * known_grads[idx]
return (*grad_inputs,)

Hence, the dY_dX tensors are plugged into backward(). For other functions, this approach would require torch.autograd.functional.jacobian() (likely prohibitive) and matrix multiplications.
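To make the fragment above concrete, here is a self-contained sketch for a single elementwise function. The class name ElementwiseWithSavedGrad and the choice of tanh are assumptions for illustration only; the point is that the derivative is computed once in forward() on a detached copy and simply multiplied in backward().

```python
import torch

class ElementwiseWithSavedGrad(torch.autograd.Function):
    """Hypothetical example: elementwise function whose derivative is precomputed in forward()."""

    @staticmethod
    def forward(ctx, x):
        # Grad mode is off inside forward(), so re-enable it for the nested graph.
        with torch.enable_grad():
            x_local = x.detach().clone().requires_grad_()
            y = torch.tanh(x_local)
            # Elementwise function: the Jacobian is diagonal, so one grad() call
            # with a vector of ones yields all diagonal entries dy_i/dx_i.
            (dy_dx,) = torch.autograd.grad(y, x_local, torch.ones_like(y))
        ctx.save_for_backward(dy_dx)
        return y.detach()

    @staticmethod
    def backward(ctx, grad_output):
        (dy_dx,) = ctx.saved_tensors
        # Plug the known derivative into the outer graph.
        return grad_output * dy_dx

x = torch.randn(4, dtype=torch.double, requires_grad=True)
y = ElementwiseWithSavedGrad.apply(x)
y.sum().backward()

# Closed-form derivative of tanh for comparison.
expected = 1 - torch.tanh(x.detach()) ** 2
print(torch.allclose(x.grad, expected))
```

Note that this sketch does not support double backward, because dy_dx is saved as a plain tensor without a graph.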

Yes! That is exactly my intention. I want to implement a neural net which outputs not only the value, but also the gradient of another neural net. Thus the gradient is computed in the forward pass.

That is what I stated in my initial post. Apparently, this forum does not render LaTeX, so I will try again; maybe this is more readable.

Given an input x, and a hidden state h, the output of the RNN should be computed as
y = F(x,h)
The next hidden state should be computed as
h_next = dF/dh(x, h)

Here F is a multilayer perceptron that I want to train.
The training data consists of sequences (x_t, y_t).

You still have the option to treat h_out as a by-product of F (as if an additional detach() were applied afterwards) instead of computing second-order derivatives. I have no idea whether that would work with your code. Otherwise, i.e. for the y_out → (x, h_in) gradient flow, it should be OK.
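The two options can be sketched without a custom autograd.Function, since torch.autograd.grad accepts non-leaf inputs. This is only a sketch under assumptions: ScalarNet, rnn_step, the second_order flag and all tensor sizes are hypothetical names and values, not taken from the code in this thread. With second_order=True the hidden state keeps its graph (so training differentiates through dF/dh); with second_order=False it behaves like a detached by-product.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ScalarNet(nn.Module):
    """Hypothetical scalar-valued MLP standing in for F(x, h)."""

    def __init__(self, x_dim=3, h_dim=2, width=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + h_dim, width), nn.Tanh(), nn.Linear(width, 1)
        )

    def forward(self, x, h):
        return self.net(torch.cat([x, h], dim=-1))

def rnn_step(F_net, x, h_prev, second_order=True):
    # A hidden state without a graph (e.g. the initial one) must track gradients.
    if not h_prev.requires_grad:
        h_prev = h_prev.detach().requires_grad_()
    y = F_net(x, h_prev)
    # Summing over the batch gives per-sample dF/dh, since samples are independent.
    # create_graph=True makes h differentiable w.r.t. the parameters of F_net;
    # retain_graph=True keeps the graph of y alive for the later loss.backward().
    (h,) = torch.autograd.grad(
        y.sum(), h_prev, create_graph=second_order, retain_graph=True
    )
    return y, h

x_seq = torch.randn(4, 5, 3)  # (time, batch, x_dim), made-up data
y_seq = torch.randn(4, 5, 1)  # (time, batch, 1), made-up targets
model = ScalarNet()

h = torch.zeros(5, 2)
loss = 0.0
for x_t, y_t in zip(x_seq, y_seq):
    y_pred, h = rnn_step(model, x_t, h)
    loss = loss + ((y_pred - y_t) ** 2).mean()
loss.backward()

grads_ok = all(p.grad is not None for p in model.parameters())
print(grads_ok)
```

Calling rnn_step(..., second_order=False) returns a hidden state with no graph attached, which corresponds to the by-product variant.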

You may also be interested in forward-mode AD, which is in the works at the moment. Or maybe the existing torch.autograd.functional.jvp() would work better for your case.
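For reference, a small sketch of how torch.autograd.functional.jvp() is called. The function f below is a made-up elementwise example, not the F from this thread. Keep in mind that jvp() returns a Jacobian-vector product, i.e. a directional derivative; it only coincides with dF/dh when the Jacobian is diagonal (as here) or the hidden state is a scalar.

```python
import torch
from torch.autograd.functional import jvp

def f(h):
    # Hypothetical elementwise function, so its Jacobian is diagonal.
    return 3.0 * torch.sin(h)

h = torch.tensor([0.5, 1.0])
v = torch.ones_like(h)  # direction; with a diagonal Jacobian this picks out df_i/dh_i

# Returns the function value and the Jacobian-vector product J(h) @ v.
y, dy = jvp(f, h, v)

print(torch.allclose(dy, 3.0 * torch.cos(h)))
```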