In the PyTorch examples repository, the word language model is fed batches of size bptt x batch_size,
but the training loop iterates over the dataset with a step of length bptt.
As I understand it, this means the dataset is sliced as follows:
Given the sequence of characters: “a” “b” “c” “d” … “z” and bptt equal to 3 and ignoring batching for simplicity:
- first sequence: src=“a”,“b”,“c”; trg=“d”
- second sequence: src=“d”, “e”, “f”; trg=“g”
Perhaps I am wrong, but doesn’t this mean that an amount of data proportional to bptt is never used during training? In the example above, the sequences src=“b”, “c”, “d”; trg=“e” and src=“c”, “d”, “e”; trg=“f” aren’t in the training set.
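To make the slicing I describe concrete, here is a minimal pure-Python sketch of my reading of it, not the repository's actual code. The helper names `chunk_pairs` and `sliding_pairs` are hypothetical and only serve the illustration.

```python
import string

def chunk_pairs(seq, bptt):
    """Non-overlapping windows of length bptt, each with a single
    next-token target (the slicing described in the question)."""
    return [(seq[i:i + bptt], seq[i + bptt])
            for i in range(0, len(seq) - bptt, bptt)]

def sliding_pairs(seq, bptt):
    """Every window of length bptt (step 1), for comparison."""
    return [(seq[i:i + bptt], seq[i + bptt])
            for i in range(len(seq) - bptt)]

letters = list(string.ascii_lowercase)
chunked = chunk_pairs(letters, 3)
sliding = sliding_pairs(letters, 3)

# Windows that exist with step 1 but never appear with step bptt.
skipped = [pair for pair in sliding if pair not in chunked]

print(chunked[:2])
print(skipped[:2])
```

Under this reading, `skipped` holds exactly the windows I am worried about, starting with src=“b”, “c”, “d”; trg=“e”.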
The case src=“b”, “c”, “d”; trg=“e” is covered by carrying the hidden state forward between sequences.
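For reference, here is a rough sketch modeled on the example's `get_batch`, which as far as I can tell builds the target as the input shifted by one token. It is a simplification under stated assumptions: plain Python lists instead of tensors, no batch dimension, and `bptt` passed as an argument rather than read from the script's arguments.

```python
def get_batch(source, i, bptt):
    """Return an input chunk and its targets: the same chunk shifted by one.

    Simplified stand-in for the example's get_batch (lists, no batching).
    """
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len]
    return data, target

letters = list("abcdefghij")
bptt = 3

# Same stepping pattern as the training loop: start indices 0, bptt, 2*bptt, ...
batches = [get_batch(letters, i, bptt)
           for i in range(0, len(letters) - 1, bptt)]

for data, target in batches:
    print(data, "->", target)

# Every token after the first appears as a target exactly once.
all_targets = [tok for _, target in batches for tok in target]
print(all_targets == letters[1:])
```

In this sketch “e” is the target at the first position of the second chunk, whose input there is “d”. The context “b”, “c” reaches that prediction only through the hidden state carried over from the previous chunk. The example detaches that state between chunks, so its value is carried forward while gradients stop at the chunk boundary.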
Thanks for your reply.
I am aware that those characters are also taken into account for the prediction. My concern is rather with the error signal during training and with evaluation during testing. In the current approach there is only one backpropagation every bptt characters, and similarly at test time perplexity is only estimated on a fraction of the characters in the test set.
In other words, the parameter bptt seems to be tweaking two things:
- how many steps back to include in the rnn computational graph.
- how many examples to take into account when estimating perplexity.
But perhaps I am misunderstanding something…
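To illustrate the second point, here is a small sketch of how the perplexity estimate could change if only one position per bptt-long chunk were scored. Perplexity is the exponential of the mean negative log-likelihood per token. The per-token values below are made up for illustration, and scoring only the last position of each chunk is my assumption about the behaviour, not something I have verified in the code.

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per scored token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Hypothetical per-token negative log-likelihoods (not real results).
nll = [2.0, 1.0, 3.0, 2.0, 1.0, 3.0]
bptt = 3

# Score every position.
full = perplexity(nll)

# Score only the last position of each bptt-long chunk (assumed behaviour).
subsampled = perplexity(nll[bptt - 1::bptt])

print(full, subsampled)
```

With these numbers the two estimates differ, which is the kind of dependence on bptt I would like to rule out.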