Particularly in the NLP category of this forum, questions regularly come up asking how to handle batches of sequences with variable length. For sentence classification, sentence labeling, or machine translation, variable-length sequences are the default case to deal with.
I like to think I know what padding and packing do. However, I still prefer the alternative of creating batches in such a way that all sequences (or sequence pairs) within a batch have the same length; see this older post of mine for details. But since padding and packing (and their consequences) are covered so regularly in this forum, I wonder whether my approach is fundamentally flawed.
I therefore would like to outline the pros and (potential) cons of using custom batches (i.e., all sequences or sequence pairs within a batch have equal lengths), and I hope for others to comment on it, particularly regarding any downsides of this approach I have missed.
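To make the setup concrete, here is a rough, untested sketch of the kind of batch sampler I have in mind. The class name `EqualLengthBatchSampler` and the `lengths` list (holding the length, or the `(src_len, tgt_len)` tuple, of each sample) are just my own illustration, not anything provided by PyTorch:

```python
import random
from collections import defaultdict

from torch.utils.data import Sampler


class EqualLengthBatchSampler(Sampler):
    """Yields batches of dataset indices such that all sequences (or
    sequence pairs) within one batch share the same length."""

    def __init__(self, lengths, batch_size, shuffle=True):
        # lengths[i] is the length of sample i (or a (src_len, tgt_len) tuple)
        self.lengths = lengths
        self.batch_size = batch_size
        self.shuffle = shuffle

    def _make_batches(self):
        # Group dataset indices by their (pair of) sequence length(s)
        groups = defaultdict(list)
        for idx, length in enumerate(self.lengths):
            groups[length].append(idx)
        batches = []
        for indices in groups.values():
            if self.shuffle:
                random.shuffle(indices)      # shuffle within each length group
            # Chunk each group into batches; the last chunk may be smaller
            for i in range(0, len(indices), self.batch_size):
                batches.append(indices[i:i + self.batch_size])
        if self.shuffle:
            random.shuffle(batches)          # shuffle the order of the batches
        return batches

    def __iter__(self):
        return iter(self._make_batches())

    def __len__(self):
        return len(self._make_batches())
```

Passed to a `DataLoader` via `batch_sampler=EqualLengthBatchSampler(lengths, batch_size=64)`, the default collate function can stack each batch into a regular tensor directly, since all sequences in a batch already have the same length.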
PROs:
- No padding and packing/unpacking needed (duh!). Some very anecdotal observations suggest this saves me around 10% in runtime
- No need to worry whether padding might have any effect on accuracy (in the case where only padding but not packing is used)
- No masking needed, e.g., in case one needs to calculate the loss over the RNN outputs and not the hidden state
- Things like attention or pooling over the RNN outputs require no special consideration (I assume this can also be achieved using masking; I just never needed to)
- The code in the `forward()` method stays much cleaner without the need for packing/unpacking, masking, or other steps to properly handle variable sequence lengths. That also makes it far less prone to accidental errors. There's essentially no difference between `batch_size=1` (what many introductory tutorials assume) and `batch_size>1`; see the sketch after this list.
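To illustrate that last point, here is roughly what a `forward()` looks like in this setup. The toy GRU classifier below is purely illustrative, not code from the linked post:

```python
import torch.nn as nn


class Classifier(nn.Module):
    """Toy GRU classifier whose forward() needs no packing, masking,
    or length bookkeeping, because every batch it receives contains
    sequences of one single length."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, inputs):                # inputs: (batch, seq_len), no padding
        embedded = self.embed(inputs)         # (batch, seq_len, embed_dim)
        outputs, hidden = self.rnn(embedded)  # outputs cover real tokens only
        # Pooling over the outputs works directly; no mask is required
        # because there are no padded positions to exclude.
        pooled = outputs.mean(dim=1)          # (batch, hidden_dim)
        return self.out(pooled)
```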
CONs(?):
- No arbitrary shuffling of batches between epochs (one can still shuffle among all sequences or sequence pairs of equal length)
- Not all batches are full. For example, if there are 100 sequence pairs of lengths (15,18) and the batch size is 64, one batch will have only 36 pairs. However, for large datasets the number of “non-full” batches is negligible; see the linked post for a crude evaluation.
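A trivial sanity check of the numbers in that example, using the same chunking as in the sampler sketch above:

```python
# 100 pairs of length (15, 18), chunked into batches of at most 64
group = list(range(100))
batch_size = 64
batches = [group[i:i + batch_size] for i in range(0, len(group), batch_size)]
print([len(b) for b in batches])  # [64, 36] -> one "non-full" batch of 36 pairs
```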
In short, at least right now, I would go with custom batches over padding/packing/masking/etc. any day of the week. Any counter-arguments would be more than welcome. It would really help my understanding. Thanks a lot!