As @zhangguanheng66 rightly said, `<unk>` and `<pad>` serve different purposes. `<unk>` represents tokens that are not in your vocabulary, while `<pad>` is a kind of non-token used to create batches with sequences of equal length. Some very simple consequences:
- `<pad>` can only be at the end or start of a sequence (depending on which "side" you pad), while `<unk>` can be anywhere in the non-padded part of the sequence (see the sketch after this list)
- If, say, you pad at the end, there can never be an `<unk>` (or any other token) after the first `<pad>`
- Since `<pad>` tokens are essentially filler tokens, they are potentially not great to have; it's not obvious (to me) how `<pad>` affects the learning. You generally cannot avoid `<unk>`.
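To make these consequences concrete, here is a minimal sketch with a toy vocabulary; the words, indices, and the `encode` helper are made up purely for illustration:

```python
# Toy vocabulary with distinct <pad> and <unk> indices, batch padded at the end.
import torch

vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4}

def encode(tokens, max_len):
    # Out-of-vocabulary tokens map to <unk>; shorter sequences are
    # filled with <pad> at the end up to max_len.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

batch = torch.tensor([
    encode(["the", "cat", "sat"], max_len=5),          # no <unk>, two <pad>
    encode(["the", "dog", "sat", "down"], max_len=5),  # "dog", "down" -> <unk>
])
print(batch)
# tensor([[2, 3, 4, 0, 0],
#         [2, 1, 4, 1, 0]])
# <unk> (1) shows up inside the non-padded part; <pad> (0) only at the end.
```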
In short, there are arguably differences between `<unk>` and `<pad>`. Sure, you can use the same index value to represent both tokens – and maybe it doesn't cause any problems – but I wouldn't rob the network of the knowledge that these two tokens serve different purposes.
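One practical consequence of keeping distinct indices: PyTorch's `nn.Embedding` takes a `padding_idx` argument that keeps the `<pad>` vector at zero and excludes it from gradient updates, which you can't do if `<pad>` and `<unk>` share an index. A short sketch, reusing the toy indices from above:

```python
import torch
import torch.nn as nn

PAD_IDX, UNK_IDX = 0, 1  # distinct indices, as in the toy vocabulary above

# padding_idx tells the embedding that index 0 is a filler token:
# its vector stays at zero and receives no gradient updates.
emb = nn.Embedding(num_embeddings=5, embedding_dim=4, padding_idx=PAD_IDX)

batch = torch.tensor([[2, 3, 4, PAD_IDX, PAD_IDX],
                      [2, UNK_IDX, 4, UNK_IDX, PAD_IDX]])
out = emb(batch)
print(out[0, 3])  # all zeros: the <pad> positions carry no signal
print(out[1, 1])  # a learned vector: <unk> is a real (if generic) token
```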
Personally, I never use padding (and packing), but organize my batches in such a way that all sequences have the same length. This gives better performance, and I don't have to worry about whether `<pad>` affects the learning. You also might want to look at torchtext's `BucketIterator`, which minimizes padding out of the box.
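For reference, a minimal sketch of what "organize batches so all sequences have the same length" could look like; `length_bucketed_batches` is a made-up helper, not library code:

```python
import random
from collections import defaultdict

def length_bucketed_batches(sequences, batch_size):
    # Group sequences by their exact length, then cut each group into
    # batches; every batch then contains sequences of a single length,
    # so no <pad> tokens are needed at all.
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    batches = []
    for same_length in buckets.values():
        random.shuffle(same_length)
        for i in range(0, len(same_length), batch_size):
            batches.append(same_length[i:i + batch_size])
    random.shuffle(batches)  # mix batch order across different lengths
    return batches
```

The last batch of each length group may be smaller than `batch_size`, which is usually an acceptable trade-off for avoiding padding entirely.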