Schema for feeding RNN training data

For a data set with 600 time steps, this Stack Overflow answer proposes the following training schema, where each line represents one training sample with sequence_length=5 that is fed to an RNN model:

             t=0  t=1  t=2  t=3  t=4  t=5  ...  t=598  t=599
sample       |---------------------|
sample            |---------------------|
sample                 |-----------------
sample                                          ----|
sample                                           ----------|

I had naively assumed that this would be excessive, since the overlap means the model sees each data point about sequence_length times, and thought that the following would be sufficient (taking the BPTT sequence_length to be 3 for convenience):

             t=0  t=1  t=2  t=3  t=4  t=5  t=6  t=7  ...  t=598  t=599
sample       |-----------|
sample                      |-----------|
sample                                     |-------
sample                                                    -----------|
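To make the two schemas concrete, here is a minimal sketch that lists the (start, end) index ranges each one produces. The helper names overlapping_windows and disjoint_windows are hypothetical, chosen only for illustration, and the ranges are half-open (end is exclusive).

```python
def overlapping_windows(n_steps, seq_len):
    """First schema: stride 1, one window starting at every possible time step."""
    return [(s, s + seq_len) for s in range(n_steps - seq_len + 1)]


def disjoint_windows(n_steps, seq_len):
    """Second schema: stride seq_len, so consecutive windows do not overlap."""
    return [(s, min(s + seq_len, n_steps)) for s in range(0, n_steps, seq_len)]


first = overlapping_windows(600, 5)
second = disjoint_windows(600, 3)

print(len(first), first[0], first[-1])     # 596 (0, 5) (595, 600)
print(len(second), second[0], second[-1])  # 200 (0, 3) (597, 600)
```

With 600 time steps, the first schema yields 596 samples against 200 for the second, which is where the extra training cost comes from.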

The first schema now makes sense to me, as it is the only way the model will be able to see each transition between time steps at least once. If I read correctly, it also looks like get_batch does this in the word_language_model example. I just wanted to verify that this is the way we should be training on sequential data.


No, the second schema is more common for language model training (and is used in the word_language_model example). The first schema is needed in certain unusual cases, such as Pointer Sentinel Language Models, and in principle can provide marginally more information, but it’s much slower.

Thanks for the answer. I’m using time series data rather than a language model, but it sounds like the second schema is still preferred.

Is it important that the sequences are trained in order so that the relevant hidden state is reused, or is it common to shuffle them? It seems the hidden state may not be used anyway if they’re fed in parallel batches.

OK, I see. I was only looking at the way get_batch indexes the data, but the training/evaluation loop calls it over a range with a step size of bptt, so the windows do not overlap.
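A simplified, list-based sketch of that indexing pattern is below. It is an analogue of the example's get_batch rather than a copy of it: the real code works on tensors, and the names series and bptt here are illustrative assumptions.

```python
def get_batch(source, i, bptt):
    # Inputs are source[i : i + seq_len]; targets are the same span shifted by one step.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len]
    return data, target


series = list(range(10))
bptt = 3

# Step through the series in strides of bptt, so the input windows are disjoint.
batches = [get_batch(series, i, bptt) for i in range(0, len(series) - 1, bptt)]
for data, target in batches:
    print(data, target)
# [0, 1, 2] [1, 2, 3]
# [3, 4, 5] [4, 5, 6]
# [6, 7, 8] [7, 8, 9]
```

Because each target is the input shifted by one step, every transition (including 2 to 3 and 5 to 6, which cross window boundaries) still appears once as an input/target pair even though the input windows do not overlap.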

For truncated BPTT training, it’s important that batches be processed in order so that hidden states are preserved.
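On the question of parallel batches: one way to keep hidden states meaningful is to cut the series into batch_size contiguous streams, so that each batch column continues where it left off in the previous chunk. The sketch below is a list-based illustration of that layout, assumed to mirror what the word_language_model example's batchify does with tensors; the sizes are arbitrary.

```python
def batchify(series, batch_size):
    # Drop any tail that does not fill a whole column, then cut the series
    # into batch_size contiguous streams of equal length.
    n = len(series) // batch_size
    streams = [series[b * n:(b + 1) * n] for b in range(batch_size)]
    # Row t holds time step t of every stream (shape: n rows x batch_size columns).
    return [[stream[t] for stream in streams] for t in range(n)]


rows = batchify(list(range(12)), batch_size=3)
bptt = 2

# Consecutive chunks of bptt rows; these must be fed in order, not shuffled.
chunks = [rows[i:i + bptt] for i in range(0, len(rows), bptt)]
for chunk in chunks:
    print(chunk)
# [[0, 4, 8], [1, 5, 9]]
# [[2, 6, 10], [3, 7, 11]]
```

Column 0 reads 0, 1 in the first chunk and 2, 3 in the second, so the hidden state carried over for that column really does correspond to the preceding time steps. Shuffling the chunks would break that correspondence.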

@jekbradbury, so truncated BPTT training is supported in PyTorch? I haven’t found any documentation on it yet.

Yes, it’s used in the word language model example. All you need to do is call .detach_() on a variable, which breaks the computation graph at that point and so truncates backpropagation.
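As a rough illustration, here is a minimal truncated BPTT loop on a toy time series. Everything about the setup (the sine-wave data, the model sizes, the optimizer and learning rate) is an assumption made for the sketch, not taken from the example. It uses the out-of-place hidden.detach(), which has the same truncating effect as the in-place .detach_() mentioned above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy series: 601 points, giving 600 input/target transitions.
series = torch.sin(torch.linspace(0, 20, 601)).unsqueeze(-1)  # shape (601, 1)
bptt = 3

rnn = nn.RNN(input_size=1, hidden_size=8)  # expects (seq_len, batch, input_size)
head = nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

hidden = torch.zeros(1, 1, 8)  # (num_layers, batch, hidden_size)
losses = []
for i in range(0, series.size(0) - 1, bptt):  # windows are processed in order
    seq_len = min(bptt, series.size(0) - 1 - i)
    x = series[i:i + seq_len].unsqueeze(1)          # (seq_len, 1, 1)
    y = series[i + 1:i + 1 + seq_len].unsqueeze(1)  # next-step targets

    # Keep the hidden state's values but cut its history, so backward()
    # stops at the start of this window instead of reaching older windows.
    hidden = hidden.detach()

    out, hidden = rnn(x, hidden)
    loss = nn.functional.mse_loss(head(out), y)

    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(len(losses))  # 200
```

Without the detach call, the second call to backward() would try to backpropagate through the first window's graph, which has already been freed, and PyTorch would raise an error.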