Hello.

I’m currently trying to understand Backpropagation Through Time (BPTT), and am trying test my understanding by re-creating the backpropagation calculation that `torch.nn.RNN`

performs.

After having read many articles and performed the mathematical proofs, I am 90% certain that not only my theoretical understanding, but my code implementation is also correct – however, the gradients that I am manually calculating do not match the gradients produced by `torch`

, and I have no idea why.

I’ll show you what I mean. Below is a simple `torch_rnn`

object, which is a simple `RNN`

layer:

```
import torch.nn as nn
torch_rnn = nn.RNN(input_size=1, hidden_size=2, batch_first=True)
```

I’ll also be using the following sequence as the input to our `RNN`

:

```
import torch
x = torch.tensor([[1.0],
[2.0],
[3.0]])
```

I’ll now perform a forward pass using `torch_rnn`

, and will store the final hidden state (or the output of the layer) as `torch_last_hidden`

:

```
_, torch_last_hidden = torch_rnn(x)
```

*Now*, I’ll use my own hard-coded implementation to do the same thing, but will copy the weights and biases *directly* from the `torch_rnn`

object. I’ll then walk-forward through the RNN calculations step by step, until I reach `h_3`

, the final hidden state. I’ll also be initialising `h_0`

to be all-zero; as is the default behaviour for `nn.RNN`

:

```
W_ih = torch_rnn.weight_ih_l0.detach()
b_ih = torch_rnn.bias_ih_l0.detach()
W_hh = torch_rnn.weight_hh_l0.detach()
b_hh = torch_rnn.bias_hh_l0.detach()
h_0 = torch.zeros(1, 64)
tanh = torch.nn.Tanh()
z_1 = (x[0] @ W_ih.T) + b_ih + (h_0 @ W_hh.T) + b_hh
h_1 = tanh(z_1)
z_2 = (x[1] @ W_ih.T) + b_ih + (h_1 @ W_hh.T) + b_hh
h_2 = tanh(z_2)
z_3 = (x[2] @ W_ih.T) + b_ih + (h_2 @ W_hh.T) + b_hh
h_3 = tanh(z_3)
```

Let’s make sure that our manual calculation has produced a very close approximation of the `torch`

implementation:

```
assert torch.allclose(h_3, torch_last_hidden, atol=1e-6)
```

^^ This assert statement doesn’t throw an error, so the forward pass has been successfully recreated.

Next, is the backward pass. For this example, I will try and calculate the gradients of `W_ih`

with respect to `h_3`

. The theory behind how to do this (as I understand it), is to calculate the following:

I’ll now try applying this to our above problem and manually perform the backpropagation for `W_ih`

. I’ll also allow `torch`

to perform its own calculation of the `W_ih`

gradients for comparison, and save the result to `torch_W_ih_gradients`

:

```
import torch.optim as optim
optimiser = optim.SGD(torch_rnn.parameters(), lr=0.001)
optimiser.zero_grad()
torch_last_hidden.sum().backward()
torch_W_ih_gradients = copy.deepcopy(torch_rnn.weight_ih_l0.grad)
dh_3_dW_ih_3 = (1 - tanh(z_3)**2)*x[2]
dh_3_dW_ih_2 = (((1 - tanh(z_3)**2) @ W_hh) *
((1 - tanh(z_2)**2) * x[1]))
dh_3_dW_ih_1 = (((1 - tanh(z_3)**2) @ W_hh) *
((1 - tanh(z_2)**2) @ W_hh) *
((1 - tanh(z_1)**2) * x[0]))
dh_3_dW_ih = dh_3_dW_ih_1 + dh_3_dW_ih_2 + dh_3_dW_ih_3
assert torch.allclose(dh_3_dW_ih.T, torch_W_ih_gradients, atol=1e-6)
```

The above code will throw an `AssertionError`

, and a manual inspection of the gradients reveals that they are in fact different:

```
dh_3_dW_ih.T -------------> tensor([[0.5162],
[0.5511]])
torch_W_ih_gradients -----> tensor([[0.5053],
[0.5356]]))
```

The difference is small, but significant enough to cause concern.

Have I made a mistake with my theoretical understanding of the backpropagation in the RNN?

Is there a mistake with my code implementation?

Why are my calculations not yielding the same gradients as the `torch`

implementation?