Using `pack_padded_sequence` for batches with sequences of different lengths is not absolutely mandatory. Just try it with and without. In many cases – I am talking mainly about classification here! – I don't think you will see major differences in the results.
Ideally, the network will learn to more or less ignore the special padding word (e.g., `<pad>`). Some effects will probably always remain in practice, though.
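For reference, here is a minimal sketch of what packing looks like in PyTorch. The batch values and layer sizes are made up for illustration; the point is that the LSTM never processes the padded time steps:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy batch: 3 sequences of lengths 4, 2, 1, zero-padded to max_len=4.
# Shape: (batch_size=3, max_len=4, embedding_dim=5); values are arbitrary.
batch = torch.randn(3, 4, 5)
lengths = torch.tensor([4, 2, 1])  # must be sorted descending for enforce_sorted=True
batch[1, 2:] = 0.0  # padded positions
batch[2, 1:] = 0.0

lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)

# Pack so the LSTM skips the padded steps entirely.
packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor for downstream use.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)    # torch.Size([3, 4, 8])
print(out_lengths)  # tensor([4, 2, 1])
```

Without the pack/unpack calls you could feed `batch` to the LSTM directly; the outputs at padded positions would then be "polluted" by the padding inputs, which is exactly the effect discussed above.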
How important using `pack_padded_sequence` is will most likely depend on the training data. Most notably, batches where the lengths of the sequences vary greatly and/or are very skewed (say, 99 sequences with around 10 items and 1 sequence with 100 items, so the other 99 sequences have to be padded a lot) will arguably cause the most issues. That's why existing solutions try to ensure that batches are relatively homogeneous:
- torchtext defines an iterator that batches sequences of similar lengths together. This minimizes the amount of padding needed while still producing freshly shuffled batches for each new epoch.
- See this older post where we came up with an iterator that ensures that each batch contains only sequences of the same length. While this might yield batches that are not full, for large datasets this issue is absolutely negligible.
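The second idea can be sketched in a few lines of plain Python. `same_length_batches` is a hypothetical helper (not the exact code from that post): it groups sequences by length and batches within each group, so no padding is ever needed:

```python
import random
from collections import defaultdict

def same_length_batches(sequences, batch_size, seed=0):
    """Group sequences by length, then batch within each group.

    Hypothetical sketch of a homogeneous-batch iterator: every
    returned batch contains sequences of exactly one length, so
    zero padding is required (at the cost of some non-full batches).
    """
    by_len = defaultdict(list)
    for seq in sequences:
        by_len[len(seq)].append(seq)
    rng = random.Random(seed)
    batches = []
    for seqs in by_len.values():
        rng.shuffle(seqs)  # shuffle within each length bucket
        for i in range(0, len(seqs), batch_size):
            batches.append(seqs[i:i + batch_size])
    rng.shuffle(batches)  # fresh batch order for each epoch
    return batches

# Toy data: dummy sequences of lengths 3, 3, 3, 5, 5, 2.
data = [[0] * n for n in [3, 3, 3, 5, 5, 2]]
batches = same_length_batches(data, batch_size=2)
# Every batch is length-homogeneous -> no padding needed.
assert all(len({len(s) for s in b}) == 1 for b in batches)
```

Grouping by *similar* (rather than identical) lengths, as torchtext does, is the same idea with a little padding allowed in exchange for fuller batches.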
- Padding is not necessarily bad (i.e., packing is not necessarily needed).
- Simply try with and without packing to see the effects.
- Try approaches to automatically minimize the required padding.
I hope that helps.