Impact of Sorting on Packed Padded LSTM

As I understand it, in order to ‘mask’ the input to an RNN (e.g. zeros that merely pad texts of unequal length to a uniform length), one has to sort the sequences from longest to shortest, and the sorted, stacked sequences are then sent into the RNN in mini-batches.


Doesn’t this preclude you from shuffling your mini batches between epochs?

I am presuming that the full training set is packed prior to training. I tried to shuffle this packed data during training and an exception was raised that the data needed to be sorted.

Perhaps a workaround is to pack each mini-batch separately inside the training loop? I’m not sure what impact this has on efficiency.
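For what it’s worth, here is a minimal sketch of that per-batch idea, assuming a reasonably recent PyTorch where pack_padded_sequence accepts enforce_sorted=False. The toy data, the layer sizes and the name collate_pad are made up for illustration. Padding happens in the collate function and packing happens inside the loop, so the DataLoader is free to shuffle every epoch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader

def collate_pad(batch):
    # batch: list of 1-D LongTensors (token ids) of varying length
    lengths = torch.tensor([len(seq) for seq in batch])  # must stay on the CPU
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths

# Toy data: six sequences of token ids in 1..19 (0 is reserved for padding)
sequences = [torch.randint(1, 20, (n,)) for n in (5, 3, 7, 2, 6, 4)]
loader = DataLoader(sequences, batch_size=3, shuffle=True, collate_fn=collate_pad)

embedding = torch.nn.Embedding(20, 8, padding_idx=0)
lstm = torch.nn.LSTM(8, 16, batch_first=True)

for padded, lengths in loader:
    # Pack this mini-batch only; enforce_sorted=False lets PyTorch sort internally
    packed = pack_padded_sequence(embedding(padded), lengths,
                                  batch_first=True, enforce_sorted=False)
    packed_out, (h_n, c_n) = lstm(packed)
    # Unpack back to a padded tensor, in the original batch order
    output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```

With enforce_sorted=False the outputs come back in the same order as the inputs, so targets don’t need to be reordered. I haven’t measured the overhead of packing per batch.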

Slightly off-topic: for everything except machine translation (that is, for classifiers, autoencoders and sequence taggers), I always use a custom batch sampler that organizes the dataset into batches of sequences of equal length. This avoids the hassle with PackedSequence, as well as padding in general.

Here’s the code. And that’s what the usage looks like:

from import BucketBatchSampler, BucketDataset
from torch.utils.data import DataLoader

bucket_batch_sampler = BucketBatchSampler(X_train, batch_size)
bucket_dataset = BucketDataset(X_train, None)  # targets are None because this example is an autoencoder; pass e.g. y_train for a classifier

X_train_iter = DataLoader(bucket_dataset, batch_sampler=bucket_batch_sampler, num_workers=4)

Maybe useful for you.


I am curious, though: doesn’t this mean you end up with uneven batch sizes, and that the same observations still always appear together in the same order?

If you check the code, there’s a lot of shuffling going on before each iteration.
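To give a rough idea of the kind of shuffling meant here, below is a hypothetical sketch in plain Python. It is not the linked code; the class name LengthBucketBatchSampler and its seed argument are assumptions for illustration only. It groups indices by sequence length, shuffles inside each length bucket, and then shuffles the order of the resulting batches on every pass.

```python
import random
from collections import defaultdict

class LengthBucketBatchSampler:
    """Hypothetical sketch: yields batches of indices whose sequences share one length."""

    def __init__(self, sequences, batch_size, seed=None):
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        # Map each sequence length to the indices of the sequences with that length
        self.buckets = defaultdict(list)
        for idx, seq in enumerate(sequences):
            self.buckets[len(seq)].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.buckets.values():
            indices = indices[:]        # copy so the stored buckets stay intact
            self.rng.shuffle(indices)   # new grouping within each length bucket
            for start in range(0, len(indices), self.batch_size):
                batches.append(indices[start:start + self.batch_size])
        self.rng.shuffle(batches)       # new order of batches on every pass
        return iter(batches)

    def __len__(self):
        # Ceiling division per bucket: the last batch of a bucket may be partial
        return sum(-(-len(v) // self.batch_size) for v in self.buckets.values())
```

Since __iter__ reshuffles each time it is called, every epoch sees different batch compositions and a different batch order. An object like this can be passed as batch_sampler to a DataLoader, which accepts any iterable that yields lists of indices.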

Yes. Say your chosen batch size is 32: some batches might have fewer than 32 samples. However, if your dataset is large, these cases are negligible. I ran a test on 300k sentences or so: 99.7% of batches were full, and the rest were almost full, maybe with an occasional outlier. There’s no performance loss.
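If you want to check this on your own data, a small sketch like the following estimates the share of full batches, assuming batches are formed per length bucket as described above. The function name full_batch_fraction is made up, and the input at the bottom is toy data, not the 300k-sentence set.

```python
from collections import Counter

def full_batch_fraction(lengths, batch_size):
    """Fraction of batches that are full when batching within each length bucket.

    Assumes `lengths` is non-empty.
    """
    counts = Counter(lengths)  # number of sequences per length
    full = sum(n // batch_size for n in counts.values())
    partial = sum(1 for n in counts.values() if n % batch_size)
    return full / (full + partial)

# Toy check: 64 sequences of length 3 and 33 of length 4, batch size 32
print(full_batch_fraction([3] * 64 + [4] * 33, 32))  # → 0.75
```

Each distinct length contributes at most one partial batch, so the fraction of full batches approaches 1 as the dataset grows relative to the number of distinct lengths.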