I’m a bit lost trying to understand what PyTorch does internally in these two ways of passing a sequence through an RNN model:
Version 1 (whole sequence in one call):

    out, hidden = model(sequence)

Version 2 (manual unrolling over time):

    hidden = None
    for element in sequence:
        out, hidden = model(element, hidden)
In my understanding, versions 1 and 2 should be equivalent: version 2 just unrolls the same computation manually over time. Version 2 also makes it possible to modify the hidden state before passing it to the next timestep (which is what I’d like to do for my use case), and to train without teacher forcing (by feeding in the last “out” instead of “element”). However, I noticed that version 1 trains much faster (roughly 8x on my machine).
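For reference, this is roughly how I’m comparing the two. It’s a minimal sketch; the nn.RNN layer and the sizes are just placeholders for my actual model:

    import time
    import torch
    import torch.nn as nn

    model = nn.RNN(input_size=32, hidden_size=64)   # placeholder for my actual model
    sequence = torch.randn(100, 16, 32)             # (seq_len, batch, input_size)

    # Version 1: one forward call over the whole sequence
    t0 = time.perf_counter()
    out, hidden = model(sequence)
    print("version 1:", time.perf_counter() - t0)

    # Version 2: one forward call per timestep
    t0 = time.perf_counter()
    hidden = None
    for element in sequence:                        # element has shape (batch, input_size)
        out, hidden = model(element.unsqueeze(0), hidden)  # add back the seq_len=1 dim
    print("version 2:", time.perf_counter() - t0)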
Question 1: Is version 1 faster simply because it makes fewer forward calls (i.e. less per-call overhead), or does PyTorch internally somehow parallelise something across timesteps? I don’t see how it could, given that each timestep’s computation depends on the previous timestep’s output.
Question 2: More generally, is version 2 indeed the way to go if I need to train without teacher forcing or if I need to modify “hidden” between timesteps?
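To make Question 2 concrete, this is the kind of loop I have in mind. It’s only a sketch: modify_hidden is a hypothetical stand-in for my own transformation, and I’ve made input_size equal to hidden_size purely so that “out” can be fed straight back in:

    import torch
    import torch.nn as nn

    def modify_hidden(h):
        # hypothetical placeholder for the transformation I want between timesteps
        return h

    size = 32
    model = nn.RNN(input_size=size, hidden_size=size)  # equal sizes so out can be fed back
    sequence = torch.randn(100, 16, size)              # (seq_len, batch, input_size)

    hidden = None
    inp = sequence[0].unsqueeze(0)          # seed with the first timestep: (1, batch, size)
    for t in range(sequence.size(0)):
        out, hidden = model(inp, hidden)
        hidden = modify_hidden(hidden)      # adjust the state before the next timestep
        inp = out                           # no teacher forcing: feed the output back in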
Thank you for your help!