Should we .detach() predicted model outputs used as input in seq2seq model training?

Thanks @ptrblck and @srishti-git1110 for the clarification! It makes sense that we should detach() those variables, since we usually do not want to associate gradients with inputs. I seem to have confused the output of the GRU/LSTM with the hidden state, which in fact is backpropagated through (and, in the case of per-timestep training, is actually the same as the output of a GRU/LSTM at each timestep).

I would like to ask a few other questions if you don’t mind:

  1. Is BPTT executed on an RNN model when we call loss.backward(), where loss is, say, the loss accumulated across each timestep of a training sequence, and we then call optim.step(), which updates the model parameters using gradients computed through all prior outputs (hidden states) of the sequence? (See the first sketch after this list for what I mean.)

  2. What if we call loss.backward(retain_graph=True) and optim.step() at each timestep, but detach() the hidden state (in order to avoid the error you showed)? There is no BPTT happening in this case, since we break the computation graph at each timestep (second sketch below). Is there then a way to update the model parameters at each timestep of a sequence without needing to detach() the hidden state? It seems this issue was raised before but was left unanswered.

  3. When training an LSTM/GRU, two inputs are normally provided to the model at each timestep: the current input features and the previous hidden state. Seq2seq training often uses ‘teacher forcing’, where the ground-truth target is provided as the current input at each timestep instead of the model’s previous prediction. How does BPTT differ between training with teacher forcing and training on the model’s own predictions (i.e., autoregressive training), and what specific autograd considerations, if any, apply in the latter case? (Third sketch below.)
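
To make question 1 concrete, here is a minimal sketch of the single-backward scheme I have in mind, using a toy nn.GRUCell regression setup (all names and dimensions are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Toy setup, purely illustrative
seq_len, batch, input_size, hidden_size = 10, 4, 8, 16
cell = nn.GRUCell(input_size, hidden_size)
head = nn.Linear(hidden_size, 1)
optim = torch.optim.SGD(list(cell.parameters()) + list(head.parameters()), lr=1e-2)

x = torch.randn(seq_len, batch, input_size)   # inputs for one sequence
y = torch.randn(seq_len, batch, 1)            # per-timestep targets

h = torch.zeros(batch, hidden_size)
loss = 0.0
for t in range(seq_len):
    h = cell(x[t], h)                         # hidden state stays attached across timesteps
    loss = loss + nn.functional.mse_loss(head(h), y[t])

optim.zero_grad()
loss.backward()   # my understanding: this backpropagates through every timestep (full BPTT)
optim.step()      # one parameter update using gradients from the whole sequence
```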
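
And this is the per-timestep variant from question 2, where the hidden state is detached after each update, so each backward() only sees a single timestep (effectively truncated BPTT with a window of 1); again, the names are hypothetical:

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 10, 4, 8, 16
cell = nn.GRUCell(input_size, hidden_size)
head = nn.Linear(hidden_size, 1)
optim = torch.optim.SGD(list(cell.parameters()) + list(head.parameters()), lr=1e-2)

x = torch.randn(seq_len, batch, input_size)
y = torch.randn(seq_len, batch, 1)

h = torch.zeros(batch, hidden_size)
for t in range(seq_len):
    h = cell(x[t], h)
    loss = nn.functional.mse_loss(head(h), y[t])
    optim.zero_grad()
    loss.backward()       # graph only covers this single timestep, so no retain_graph needed
    optim.step()
    h = h.detach()        # break the graph before the next timestep
```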
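
Finally, for question 3, this is roughly how I picture the two decoding regimes, assuming a decoder cell whose input and output live in the same feature space (whether the fed-back prediction should be detached is exactly the autograd consideration I am asking about):

```python
import torch
import torch.nn as nn

seq_len, batch, feat, hidden_size = 10, 4, 8, 16
decoder = nn.GRUCell(feat, hidden_size)
proj = nn.Linear(hidden_size, feat)
targets = torch.randn(seq_len, batch, feat)

teacher_forcing = True                 # flip to False for autoregressive training
h = torch.zeros(batch, hidden_size)
inp = torch.zeros(batch, feat)         # e.g. a start-of-sequence input
loss = 0.0
for t in range(seq_len):
    h = decoder(inp, h)
    pred = proj(h)
    loss = loss + nn.functional.mse_loss(pred, targets[t])
    if teacher_forcing:
        inp = targets[t]               # feed the ground truth as the next input
    else:
        inp = pred.detach()            # feed back the prediction; detached here,
                                       # so no gradient flows through this input
loss.backward()
```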