Training an autoregressive RNN

I want to train an autoregressive RNN as follows:

tot_timestep_loss = 0.0
for i in range(max_seq_len):
    pred, hidden = GRU(pred, hidden)  # pred.requires_grad = True
    tot_timestep_loss += loss_fn(pred, target[:, i])

If I want to backpropagate through the output-to-input autoregressive connection via pred, would the above be sufficient? Specifically, I want to ensure the autoregressive mapping is part of the gradient computation during backprop, so that in theory the model parameters are influenced by the free-running autoregressive decoding errors during training.

When I print the gradient of the non-leaf variable pred using pred.register_hook(), it returns a single tensor of size BxD, where B is the batch size and D is the output dimension; I believe this corresponds to the gradient at the final timestep, max_seq_len. Is the model accurately accounting for the autoregressive term in the gradient calculation using this method? Is there an easy way to visualize the backprop graph to see if these connections exist?
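For reference, here is a minimal self-contained sketch of the setup I have in mind (shapes, the use of nn.GRUCell, and the variable names are my own assumptions, not my real model). It registers a hook on each timestep's pred inside the loop, so after backward() I can check whether a gradient actually arrives at every timestep, not just the last one:

```python
import torch
import torch.nn as nn

# Hypothetical shapes for illustration: batch size, feature dim, sequence length.
torch.manual_seed(0)
B, D, T = 4, 8, 5
cell = nn.GRUCell(D, D)          # a GRUCell keeps the per-timestep loop explicit
loss_fn = nn.MSELoss()
target = torch.randn(B, T, D)

pred = torch.zeros(B, D)         # free-running: initial input is a start token
hidden = torch.zeros(B, D)
grads = {}                       # timestep -> gradient arriving at that pred

tot_timestep_loss = 0.0
for t in range(T):
    hidden = cell(pred, hidden)
    pred = hidden                # output fed back as the next input
    # Hook each timestep's pred tensor (t=t freezes the loop variable).
    pred.register_hook(lambda g, t=t: grads.__setitem__(t, g))
    tot_timestep_loss = tot_timestep_loss + loss_fn(pred, target[:, t])

tot_timestep_loss.backward()

# Every timestep received a BxD gradient: the grad at step t sums the loss
# term at t plus what flows back through the feedback connection at t+1.
assert sorted(grads) == list(range(T))
assert all(g.shape == (B, D) for g in grads.values())
```

Registering the hook once, outside the loop, only attaches it to the final timestep's tensor, which would explain seeing a single BxD gradient. For visualization, I understand the third-party torchviz package can render the autograd graph via make_dot(tot_timestep_loss), though I haven't tried it here.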