As I understand it, in order to ‘mask’ the input to an RNN (i.e. to have it ignore the zero values that merely pad unequal-length text to a uniform length), one has to sort the sequences from longest to shortest, and this sorted, stacked sequence is then fed into the RNN in mini-batches.
Doesn’t this preclude you from shuffling your mini-batches between epochs?
I am presuming that the full training set is packed prior to training. I tried to shuffle this packed data during training, and an exception was raised saying that the data needed to be sorted.
Perhaps a workaround is to pack each minibatch separately in the training loop? I’m not sure what impact this would have on efficiency.
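Something like the following is what I have in mind (just a rough sketch with toy data and made-up sizes). On PyTorch 1.1+ pack_padded_sequence accepts enforce_sorted=False, so each shuffled minibatch can be packed on the fly without sorting the whole dataset; on older versions you would sort within the minibatch instead.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# toy dataset: 64 variable-length sequences of token ids (placeholder for real data)
sequences = [torch.randint(1, 100, (torch.randint(5, 20, (1,)).item(),)) for _ in range(64)]

def collate_fn(batch):
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True)  # zero-pad only up to the longest sequence in this batch
    return padded, lengths

# shuffle=True is fine because packing happens per minibatch below
loader = DataLoader(sequences, batch_size=8, shuffle=True, collate_fn=collate_fn)

embedding = nn.Embedding(100, 16, padding_idx=0)
rnn = nn.LSTM(16, 32, batch_first=True)

for padded, lengths in loader:
    # pack each minibatch on the fly instead of packing the whole training set up front
    packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True, enforce_sorted=False)
    output, (h_n, c_n) = rnn(packed)

Since each minibatch is padded only to its own longest sequence, the amount of padding stays small as well.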
Slightly off-topic: for everything except machine translation (that is, for classifiers, autoencoders, and sequence taggers) I always use a custom batch sampler that organizes the dataset into batches of equal-length sequences. This avoids the hassle with PackedSequence as well as padding in general.
from torch.utils.data import DataLoader
from pytorch.utils.data.text.dataset import BucketBatchSampler, BucketDataset  # custom classes, not part of core PyTorch

# X_train: list of variable-length training sequences; batch_size: your chosen batch size
bucket_batch_sampler = BucketBatchSampler(X_train, batch_size)  # groups X_train indices into equal-length batches
bucket_dataset = BucketDataset(X_train, None)  # None because the example is for an autoencoder (otherwise, e.g., y_train for a classifier)
X_train_iter = DataLoader(bucket_dataset, batch_sampler=bucket_batch_sampler, num_workers=4)
If you check the code, there’s a lot of shuffling going on before each iteration.
Yes, if your chosen batch size is, say, 32, some batches might not have 32 samples. However, if your dataset is large, these cases are negligible. I did a test with 300k sentences or so: 99.7% of the batches were full, and the rest were almost full, maybe with an occasional outlier. There’s no performance loss.
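In case it helps to see the general idea without digging through my code, the principle behind such a sampler can be sketched in a few lines (this is just an illustration, not the actual BucketBatchSampler implementation): group the indices by sequence length, cut each group into batches, and re-shuffle within and across the groups every epoch. Only the leftover of each length group produces a smaller batch.

import random
from torch.utils.data import Sampler

class ToyBucketBatchSampler(Sampler):
    """Sketch only: yields batches of indices whose sequences all have the same length."""

    def __init__(self, sequences, batch_size):
        self.batch_size = batch_size
        self.buckets = {}  # sequence length -> list of dataset indices
        for idx, seq in enumerate(sequences):
            self.buckets.setdefault(len(seq), []).append(idx)

    def __iter__(self):
        batches = []
        for indices in self.buckets.values():
            random.shuffle(indices)  # shuffle within each length bucket every epoch
            for i in range(0, len(indices), self.batch_size):
                batches.append(indices[i:i + self.batch_size])  # the last slice may be smaller
        random.shuffle(batches)  # shuffle the order of the batches themselves
        return iter(batches)

    def __len__(self):
        return sum(-(-len(v) // self.batch_size) for v in self.buckets.values())  # ceil division per bucket

Passed as batch_sampler= to a DataLoader, this yields equal-length batches in a different order every epoch, so neither padding nor PackedSequence is needed.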