Using batches for Seq2Seq models

vdw · September 5, 2019, 3:51am

I got a couple of questions about this previous thread, which orders sequence data into batches so that all input sequences in a batch have the same length. This avoids the need for padding and optional packing.

The original solution worked only for sequence classification, sequence tagging, and autoencoder models, since the ordering only considered the input sequences. For Seq2Seq models such as machine translation, where the output sequences also differ in length, it didn’t work.

I’ve now made the small changes to support this as well. Here is the code for the Sampler and my own Vectorizer I use, as well as a Jupyter Notebook showing the usage. Now, each batch contains sequence pairs where all input sequences have the same length and all target sequences have the same length. For example, a batch might contain 32 pairs with all input sequences having length 12 and all target sequences having length 15.
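To illustrate the idea (this is a minimal sketch, not the linked Sampler code), pairs can be grouped by their (input length, target length) combination and each group then cut into chunks of at most the batch size. The function name `make_equal_length_batches` and its parameters are made up for this example, and it assumes the dataset is an in-memory list of (input sequence, target sequence) pairs.

```python
import random
from collections import defaultdict


def make_equal_length_batches(pairs, batch_size, shuffle=True, seed=None):
    """Return a list of batches, each a list of indices into `pairs`.

    Within a batch, all input sequences share one length and all target
    sequences share one length, so no padding is needed.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)

    # Bucket the indices by the (input length, target length) combination.
    buckets = defaultdict(list)
    for idx, (src, tgt) in enumerate(pairs):
        buckets[(len(src), len(tgt))].append(idx)

    batches = []
    for indices in buckets.values():
        if shuffle:
            rng.shuffle(indices)  # vary batch composition between epochs
        # Cut each bucket into chunks; the last chunk may be smaller.
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    if shuffle:
        rng.shuffle(batches)  # avoid always training on one length first
    return batches
```

Since each batch is just a list of indices, a list like this could presumably be handed to a `torch.utils.data.DataLoader` through its `batch_sampler` argument, which accepts an iterable of index lists.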

To see how the combinations of input and target lengths are distributed, I’ve run a test with a real-world dataset for machine translation. I’ve set the batch size to 32. The figure below shows the distribution of the sizes of all batches. As one can see, the vast majority of batches are indeed full (i.e., 32 sequence pairs). This shouldn’t really be surprising since:

Batch sizes (e.g., 32 or 64) are tiny compared to large datasets of millions of sequence pairs or more. Thus, the chance that enough pairs share the same input and target lengths is high.
The input and target lengths are typically not independent. For example, an input of length 5 generally does not have a target of length 20 but one of similar length.

[Figure: distribution of batch sizes produced by the bucket batch sampler]
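For anyone who wants to check this on their own data, here is a rough sketch of how such a batch size distribution could be computed from the length combinations alone. The function name `batch_size_distribution` is hypothetical, and it assumes every bucket of equal-length pairs is cut into as many full batches as possible plus at most one smaller remainder batch.

```python
from collections import Counter


def batch_size_distribution(length_pairs, batch_size):
    """Map each occurring batch size to the number of batches of that size.

    `length_pairs` is an iterable of (input_length, target_length) tuples,
    one per sequence pair. (Hypothetical helper for illustration only.)
    """
    # How many sequence pairs fall into each length combination.
    bucket_counts = Counter(length_pairs)

    sizes = Counter()
    for n in bucket_counts.values():
        full, rest = divmod(n, batch_size)
        if full:
            sizes[batch_size] += full  # completely filled batches
        if rest:
            sizes[rest] += 1           # one leftover, smaller batch
    return sizes
```

With a list of pairs, the input would be something like `[(len(src), len(tgt)) for src, tgt in pairs]`; the share of full batches is then `sizes[batch_size] / sum(sizes.values())`.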

In short, I think it’s a convenient solution for working with batches in Seq2Seq models, again with no need for padding and packing, and way faster than training with a batch size of 1. Maybe it’s useful for some of you.

ptrblck · September 5, 2019, 11:32am

Thanks for sharing this, Chris!

TrentBrick · September 5, 2019, 4:09pm

Nice additions! @ptrblck this should be added to PyTorch as a DataLoader object!

ptrblck · September 5, 2019, 4:11pm

This might be a good idea and worth discussing.
@vdw would you mind creating an issue in the GitHub repo and linking your post here?

vdw · September 6, 2019, 12:17am

As suggested, I’ve submitted a feature request. However, I feel the latest version of the BucketIterator of torchtext might already be sufficient. At least, I saw some examples online where it was used for machine translation tasks. While it “only” minimizes the padding, one can argue that this hardly affects the accuracy of the model anymore. I haven’t checked the inner workings of the BucketIterator, though.

EDIT: Link to the feature request in case someone wants to comment on it.