Is padding and packing of sequences really needed?

Particularly in the NLP category of this forum, questions are regularly posted asking how to handle batches of variable-length sequences. When it comes to sentence classification, sentence labeling or machine translation, sequences of variable length are the default case to deal with.

I like to think I understand what padding and packing do. However, I still prefer the alternative: creating batches in such a way that all sequences (or sequence pairs) within a batch have the same length; see this older post of mine for details. But since padding and packing (and their consequences) are regularly covered in this forum, I wonder if my approach is fundamentally flawed.

I would therefore like to outline the pros and (potential) cons of using custom batches (i.e., all sequences or sequence pairs within a batch have equal lengths), and I hope others will comment, particularly on any downsides of this approach that I have missed.

PROs:

  • No padding and packing/unpacking needed (duh!). Some very anecdotal observations suggest that this saves me a 10% performance loss
  • No need to worry if padding might have any effect on the accuracy (in case only padding but not packing is used)
  • No masking needed, e.g., in case one needs to calculate the loss over the RNN outputs and not the hidden state
  • Things like attention or pooling over the RNN outputs require no special consideration (I assume this can also be achieved using masking; I just never needed to)
  • The code in the forward() methods stays much cleaner without the need for packing/unpacking, masking, or other steps to properly handle variable sequence lengths. That also makes it much less prone to accidental errors. There’s essentially no difference between batch_size=1 (what many introductory tutorials assume) and batch_size>1.
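To make the masking and pooling points concrete, here is a minimal, framework-free sketch of mean pooling over RNN outputs. The helper `mean_pool` and the toy numbers are made up for illustration, and the sketch assumes the padded time steps hold zero vectors (which is what unpacking a packed sequence would typically give you). With equal-length batches the `length` argument is simply never needed.

```python
def mean_pool(outputs, length=None):
    """Average per-step output vectors; if length is given, ignore padded steps.

    Hypothetical helper for illustration only, not a library function.
    """
    steps = outputs if length is None else outputs[:length]
    dim = len(steps[0])
    return [sum(step[d] for step in steps) / len(steps) for d in range(dim)]


# One sequence of true length 2, padded to length 4 with zero vectors (assumed).
padded = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]

naive = mean_pool(padded)              # padding dilutes the mean
masked = mean_pool(padded, length=2)   # only the real time steps count

print(naive)   # [1.0, 1.5]
print(masked)  # [2.0, 3.0]
```

The same kind of bookkeeping would apply to a loss computed over all RNN outputs: padded positions have to be excluded, or they silently change the result.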

CONs(?):

  • No arbitrary shuffling of batches between epochs (one can still shuffle among all sequences or sequence pairs of equal length)
  • Not all batches are full. For example, if there are 100 sequence pairs of lengths (15,18) and the batch size is 64, one batch will only have 36 pairs. However, for large datasets the number of “non-full” batches is negligible; see linked post for a crude evaluation.
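For reference, the kind of custom batching described above could look roughly like the following sketch. It is not the code from the linked post; the function name `equal_length_batches` and the toy data are assumptions for illustration. Items are bucketed by their length key (a tuple of lengths for sequence pairs), shuffled within each bucket, cut into batches, and then the batch order is shuffled.

```python
import random
from collections import defaultdict


def equal_length_batches(lengths, batch_size, seed=None):
    """Return lists of dataset indices; each batch shares a single length key.

    `lengths` holds one hashable key per item, e.g. an int, or a tuple
    (source_len, target_len) for sequence pairs. Illustrative sketch only.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, key in enumerate(lengths):
        buckets[key].append(idx)

    batches = []
    for indices in buckets.values():
        rng.shuffle(indices)  # shuffle among items of equal length
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    rng.shuffle(batches)  # shuffle the order of the batches
    return batches


# 100 pairs of lengths (15, 18) plus 30 pairs of lengths (7, 9), batch size 64
lengths = [(15, 18)] * 100 + [(7, 9)] * 30
batches = equal_length_batches(lengths, batch_size=64, seed=0)
sizes = sorted(len(b) for b in batches)
print(sizes)  # [30, 36, 64]
```

The 100 pairs of lengths (15, 18) end up as one full batch of 64 and one of 36, as in the example above. A list of index lists like this should be usable as the `batch_sampler` argument of a PyTorch `DataLoader`, though I have not shown that here.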

In short, at least right now, I would go with custom batches over padding/packing/masking/etc. any day of the week. Any counter-arguments would be more than welcome. It would really help my understanding. Thanks a lot!


I don’t think the approach is fundamentally flawed; people sort sequences into batches of similar size whenever they can (see the existing stuff in torchvision and torchtext).

Not all batches are full. For example, if there are 100 sequence pairs of lengths (15,18) and the batch size is 64, one batch will only have 36 pairs. However, for large datasets the number of “non-full” batches is negligible; see linked post for a crude evaluation.

However,

  • this is problematic when you have batch norm (not that common in NLP, but an important caveat),
  • it also distorts other things: gradient estimates at batch size 1 are 8 times noisier than at batch size 64. People have found all sorts of effects from the last batch being smaller when the batch size does not divide the dataset size (and asked about it here, and there is an option to drop some samples instead).
  • If you pool data for inference, you’re likely to have differing sequence sizes in your batch, anyway.
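The factor of 8 follows from the usual scaling of the standard error of a mean. As a rough sketch, assume the per-sample gradients g_i are i.i.d. with variance σ² (an idealization, and the symbols are introduced here only for illustration):

```latex
\operatorname{Var}\!\left[\frac{1}{B}\sum_{i=1}^{B} g_i\right] = \frac{\sigma^2}{B}
\quad\Longrightarrow\quad
\frac{\operatorname{std}_{B=1}}{\operatorname{std}_{B=64}} = \sqrt{\frac{64}{1}} = 8 .
```

Under the same assumption, a batch of 36 instead of 64 would be noisier by a factor of sqrt(64/36) = 4/3.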

Personally, I would expect the 10% performance loss from padding/packing to be exaggerated for nontrivial models and a reasonable setup.
Nor have I had problems with the (lack of) cleanliness from unpacking / packing.

It would seem that clever hacks like implicitly assuming all inputs are of the same length will fall on one’s foot (possibly that of the next person working with the code) sooner or later. If one thinks of it as a trade-off between performance/code complexity vs. generality, requiring same-size inputs for a probably very modest gain seems dubious at best.

Best regards

Thomas


Thanks a lot for your reply, @tom! This is the kind of feedback I was hoping for. All your points make a lot of sense to me. Admittedly, I don’t quite understand the issue with batch norm, and I’m quite surprised that even one incomplete batch would cause problems.

Again, very useful feedback, thanks a lot!