I want to train an autoregressive RNN as follows:

```
# pred and hidden are initialized beforehand (e.g. zeros); pred.requires_grad = True
optim.zero_grad()
tot_timestep_loss = 0.0
for i in range(max_seq_len):
    pred, hidden = GRU(pred, hidden)  # feed the previous prediction back in
    tot_timestep_loss += loss_fn(pred, target[:, i])
tot_timestep_loss.backward()  # backprop through the whole unrolled sequence
optim.step()
```

If I want to backpropagate through the output-to-input autoregressive connection via `pred`, would the above be sufficient? Specifically, I want to ensure the autoregressive mapping is accounted for in the gradient computation, so that in theory the model parameters are influenced by the free-running autoregressive decoding errors during training.

When I print the gradient of the non-leaf variable `pred` using `pred.register_hook()`, I see a tensor of size `BxD`, where `B` is the batch size and `D` is the output dimension; I believe this is the gradient at the final timestep, `max_seq_len`. Does this method accurately account for the autoregressive term in the gradient calculation? Is there an easy way to visualize the backprop graph to check that these connections exist?
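For concreteness, here is a minimal self-contained sketch of what I mean by inspecting the gradient with a hook. The `GRUCell`/`Linear` pair is a stand-in for my actual model, and the dimensions are placeholders, not my real sizes; I register a hook on each timestep's `pred` so every intermediate gradient gets recorded:

```python
import torch
import torch.nn as nn

B, D, H = 4, 8, 16          # placeholder batch size, output dim, hidden dim
gru = nn.GRUCell(D, H)      # stand-in for my GRU step
readout = nn.Linear(H, D)   # maps the hidden state back to output space

pred = torch.zeros(B, D)    # initial autoregressive input
hidden = torch.zeros(B, H)
grads = []                  # shapes of gradients arriving at each pred

loss = 0.0
for i in range(3):
    hidden = gru(pred, hidden)
    pred = readout(hidden)
    # hook fires during backward with the accumulated grad w.r.t. this pred,
    # which includes the path through the next timestep's GRU input
    pred.register_hook(lambda g: grads.append(tuple(g.shape)))
    loss = loss + (pred ** 2).mean()

loss.backward()
print(grads)  # one B x D gradient per hooked timestep, in reverse time order
```

With hooks on every timestep I would expect one `BxD` gradient per step rather than just one tensor, which is why I suspect my single `BxD` result corresponds only to the final timestep.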