In the PyTorch examples repository, the word language model is fed batches of size bptt x batch_size; however, in the training loop the code iterates over the dataset with a step of length bptt.
In my understanding, this means the dataset is being sliced as follows:
Given the sequence of characters: “a” “b” “c” “d” … “z” and bptt equal to 3 and ignoring batching for simplicity:
first sequence: src=“a”,“b”,“c”; trg=“d”
second sequence: src=“d”, “e”, “f”; trg=“g”
Perhaps I am wrong, but doesn’t this mean that an amount of data proportional to the value of bptt isn’t being used during training? In the example above, the sequences src=“b”,“c”,“d”; trg=“e” and src=“c”,“d”,“e”; trg=“f” aren’t in the training set.
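Ignoring batching, the slicing I describe can be sketched like this (`windows` is a hypothetical helper for illustration, not the example’s actual `get_batch`):

```python
def windows(seq, bptt, step):
    """Yield (src, trg) pairs: src is a window of length bptt,
    trg is the token that immediately follows it."""
    return [(seq[i:i + bptt], seq[i + bptt])
            for i in range(0, len(seq) - bptt, step)]

chars = list("abcdefg")

# Stepping by bptt (as the training loop does) gives non-overlapping windows:
#   ("a","b","c") -> "d", then ("d","e","f") -> "g"
print(windows(chars, bptt=3, step=3))

# Stepping by 1 would also produce the in-between windows
# ("b","c","d") -> "e" and ("c","d","e") -> "f":
print(windows(chars, bptt=3, step=1))
```

With step=bptt, the window count drops by roughly a factor of bptt, which is the data I am worried about.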
Thanks for your reply.
I am aware that those characters are still taken into account for the prediction; however, my concern is more about the error signal during training and the evaluation during testing. In the current approach there is only one backpropagation every bptt characters, and similarly for testing, perplexity is only estimated on a fraction of the characters in the test set.