Proper way to train an autoregressive RNN
I implemented an RNN that steps through a sequence and, at each timestep, decides with some probability whether to feed in the ground-truth target or the model's previous prediction as the input for predicting the next timestep. To let the model learn from its own prediction errors when its output is fed back as input, I would like to backpropagate through the output-to-input connection induced by that choice by keeping it in the computation graph.
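For concreteness, here is a minimal sketch of the loop I am describing (names like `ARModel` and `detach_feedback` are my own, and I use a `GRUCell` just as a stand-in recurrent step); the `detach_feedback` flag switches between keeping the feedback connection in the graph and cutting it:

```python
import torch
import torch.nn as nn

class ARModel(nn.Module):
    def __init__(self, dim=8, hidden=16):
        super().__init__()
        self.cell = nn.GRUCell(dim, hidden)
        self.head = nn.Linear(hidden, dim)

    def forward(self, seq, p_truth=0.5, detach_feedback=True):
        # seq: (T, B, dim); at step t we predict seq[t + 1]
        T, B, _ = seq.shape
        h = seq.new_zeros(B, self.cell.hidden_size)
        prev_pred = seq[0]
        outputs = []
        for t in range(T - 1):
            # With probability p_truth, feed the ground-truth target;
            # otherwise feed back the model's previous prediction.
            use_truth = torch.rand(()) < p_truth
            if use_truth:
                x = seq[t]
            else:
                # Detaching treats the fed-back prediction as a constant,
                # so no gradient flows through the output-to-input link.
                x = prev_pred.detach() if detach_feedback else prev_pred
            h = self.cell(x, h)
            prev_pred = self.head(h)
            outputs.append(prev_pred)
        return torch.stack(outputs)  # (T - 1, B, dim)

model = ARModel()
seq = torch.randn(5, 3, 8)
# Force pure autoregressive feedback, kept in the graph:
pred = model(seq, p_truth=0.0, detach_feedback=False)
loss = nn.functional.mse_loss(pred, seq[1:])
loss.backward()
```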

Is this the proper way to account for long-term prediction errors in an autoregressive model? When I apply this strategy, performance is worse than when I do not backpropagate through the output-to-input connection and instead treat the previous model output as a constant by calling .detach() on it at each timestep. Should I also be optimizing the fed-back output tensor by including it in the optimizer instantiation?

Any insights are appreciated.