Schema for feeding RNN training data

For a data set with 600 time steps, this Stack Overflow answer proposes the following training schema, where each line represents a batch with sequence_length=5 that will be used to train an RNN model:

             t=0  t=1  t=2  t=3  t=4  t=5  ...  t=598  t=599
sample       |---------------------|
sample            |---------------------|
sample                 |-----------------
...
sample                                          ----|
sample                                           ----------|

I had naively assumed that this would be excessive (since the overlap means the model sees each data point around sequence_length times), and thought that the following would be sufficient (say the bptt sequence_length is 3 for convenience):

             t=0  t=1  t=2  t=3  t=4  t=5  t=6  t=7  ...  t=598  t=599
sample       |-----------|
sample                      |-----------|
sample                                     |-------
...
sample                                                    -----------|

The first schema now makes sense to me, as it is the only way the model will be able to see each transition between time steps at least once. If I read correctly, it also looks like get_batch does this in the word_language_model example. I just wanted to verify that this is the way we should be training on sequential data.

Thanks

No, the second schema is more common for language model training (and is used in the word_language_model example). The first schema is needed in certain unusual cases, such as Pointer Sentinel Language Models, and in principle can provide marginally more information, but it’s much slower.


Thanks for the answer. I'm using time series data rather than an LM, but it sounds like the 2nd schema is still preferred.

Is it important that the sequences are trained in order so that the relevant hidden state is reused, or is it common to shuffle them? It seems the hidden state may not be reused anyway if they're fed in parallel batches.

OK, I see. I was just looking at the way get_batch indexes, but it looks like in training/evaluation the loop uses range with a step size of bptt.
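
Something like this, if I'm reading it right (a rough paraphrase with a made-up series standing in for the batchified data, not the example's exact code):

    import torch

    series = torch.randn(600, 1)   # made-up stand-in: 600 time steps, 1 feature
    bptt = 3                       # window length, as in my second diagram

    def get_batch(source, i, bptt):
        # Input window starting at i; the target is the same window shifted
        # by one step, so every t -> t+1 transition still appears once.
        seq_len = min(bptt, len(source) - 1 - i)
        return source[i:i + seq_len], source[i + 1:i + 1 + seq_len]

    # Stepping range() by bptt (not by 1) gives the non-overlapping windows
    # of the second schema: [0..2], [3..5], [6..8], ...
    for i in range(0, series.size(0) - 1, bptt):
        data, target = get_batch(series, i, bptt)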

For truncated BPTT training, it’s important that batches be processed in order so that hidden states are preserved.
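
Roughly like this (just a sketch with made-up names; shapes are (seq_len, batch, features)): the hidden state returned for one window seeds the next one, which only works if the windows are visited in chronological order.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=1, hidden_size=16)
    series = torch.randn(600, 1, 1)   # 600 time steps, batch of 1, 1 feature
    bptt = 3

    hidden = None                      # the very first window starts from zeros
    for i in range(0, series.size(0), bptt):
        window = series[i:i + bptt]
        # Reuse the state from the chronologically previous window; shuffling
        # the windows would break this carry-over.
        output, hidden = rnn(window, hidden)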

@jekbradbury, so truncated BPTT training is supported in PyTorch? I can't find any documentation on it yet.

Yes, it’s used in the word language model example – all you need to do is call .detach_() on a variable and it will break the computation graph in a way that truncates backpropagation.
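
As a tiny illustration (a sketch, not code from the example): after detaching, the hidden state keeps its values but loses its history, so a later backward() stops there instead of unrolling through everything that came before.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=1, hidden_size=4)
    window = torch.randn(3, 1, 1)        # one bptt window: (seq_len, batch, features)

    _, hidden = rnn(window)
    print(hidden.grad_fn is not None)    # True: still attached to this window's graph

    hidden = hidden.detach()             # or hidden.detach_() to cut it in place
    print(hidden.grad_fn)                # None: the graph ends here

    # Anything computed from `hidden` in the next window will only backpropagate
    # up to this point, which is exactly the truncation.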
