Tensorflow-esque bucket by sequence length

Say you’re building an RNN-based sentiment classifier for tweets/sentences. RNNs go over a sentence word by word, but sentences generally have different lengths. If your batch_size=1 this doesn’t matter, since each sentence is processed/classified independently. However, with batch_size=1 you sacrifice a lot of performance. You want to process multiple sentences at once, e.g., batch_size=64. To deal with the different sentence lengths you have several options:

  • Padding: All shorter sentences get extended with a default word, e.g., <pad>, to the length of the longest sentence. Now all sentences in the batch have the same length. The problem is that it’s not obvious if and how padding affects the training and accuracy of the classifier (particularly when the differences in length are large). See the first sketch after this list.
  • PackedSequence: That’s a concept provided by PyTorch that ensures the RNN stops at the last real word of each sentence in the batch, even if the sentences have different lengths. This works fine if you only need the output of the last step. If you need the output at every step (e.g., for pooling over all steps), I don’t think you can use that approach. It also adds extra logic that reduces performance (I’ve noticed about 10% in one of my cases, but that’s anecdotal). See the first sketch after this list.
  • Bucketing (the idea proposed in this thread): You simply avoid all these issues by ensuring that all sentences in a batch have the same length. It needs no padding, needs no PackedSequence, and the output of each time step can be used directly. In my opinion, that’s the most hassle-free approach for classification and sequence labeling tasks. Note that it’s not applicable as described above for sequence-to-sequence tasks (e.g., machine translation). See the second sketch after this list.
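
To make the first two options concrete, here is a minimal sketch of padding plus PackedSequence, assuming sentences are already mapped to word indices; the vocabulary size, embedding size, and hidden size are made up for illustration:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

vocab_size, embed_dim, hidden_dim = 1000, 50, 64  # hypothetical sizes

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)  # index 0 = <pad>
rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# Three sentences of different lengths, already encoded as word indices
sentences = [torch.tensor([5, 8, 2, 9]), torch.tensor([3, 7]), torch.tensor([4, 1, 6])]
lengths = torch.tensor([len(s) for s in sentences])

# Option 1: pad every sentence to the length of the longest one in the batch
padded = pad_sequence(sentences, batch_first=True, padding_value=0)   # shape (3, 4)

# Option 2: additionally pack, so the RNN stops at each sentence's true last word
packed = pack_padded_sequence(embedding(padded), lengths,
                              batch_first=True, enforce_sorted=False)
_, (h_n, _) = rnn(packed)   # h_n[-1] holds the last *real* step of every sentence
print(h_n[-1].shape)        # torch.Size([3, 64])
```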
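
And here is a minimal sketch of the bucketing idea, assuming you know the length of every sentence in your dataset up front; the class name BucketBatchSampler is just something I made up, not a PyTorch built-in:

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class BucketBatchSampler(Sampler):
    """Yields batches of dataset indices whose sentences all have the same length."""
    def __init__(self, lengths, batch_size):
        buckets = defaultdict(list)
        for idx, length in enumerate(lengths):
            buckets[length].append(idx)              # group indices by sentence length
        self.batches = []
        for indices in buckets.values():
            random.shuffle(indices)
            for i in range(0, len(indices), batch_size):
                self.batches.append(indices[i:i + batch_size])

    def __iter__(self):
        random.shuffle(self.batches)                 # shuffle batch order each epoch
        yield from self.batches

    def __len__(self):
        return len(self.batches)

# Usage: every batch the DataLoader produces now contains equal-length sentences,
# so no padding and no PackedSequence is needed:
# loader = DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, batch_size=64))
```

The only trade-off is that batches whose bucket has fewer remaining sentences than batch_size come out smaller, which is usually harmless for classification.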