As @zhangguanheng66 rightly said,
<unk> and <pad> serve different purposes: <unk> represents tokens that are not in your vocabulary, while <pad> is some kind of non-token used to create batches with sequences of equal length. Some very simple consequences:
- <pad> can only be at the end or start of a sequence (depending on which “side” you pad), while <unk> can be anywhere in the non-padded part of the sequence
- If, say, you pad at the end, there can never be an <unk> (or any other token) after the first <pad>
- <pad> are kind of filler tokens, so they are potentially not great to have; it’s not obvious (to me) how <pad> affects the learning. You generally cannot avoid <unk>, though.
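To make these consequences concrete, here is a minimal sketch (the toy vocabulary, sentences, and index values are made up) of padding at the end with distinct <unk> and <pad> indices:

```python
# Toy vocabulary; the index values are arbitrary, but <unk> and <pad> get distinct ids
vocab = {"<unk>": 0, "<pad>": 1, "the": 2, "cat": 3, "sat": 4}

def encode(tokens):
    # Out-of-vocabulary words map to <unk>; this can happen anywhere in the sequence
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

batch = [encode(s.split()) for s in ["the cat sat", "the zebra sat on the mat"]]

# Pad at the end, so <pad> only ever appears after the real tokens
max_len = max(len(seq) for seq in batch)
padded = [seq + [vocab["<pad>"]] * (max_len - len(seq)) for seq in batch]

print(padded)  # [[2, 3, 4, 1, 1, 1], [2, 0, 4, 0, 2, 0]]
```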
In short, there are arguably differences between <unk> and <pad>. Sure, you can use the same index value to represent both tokens – and maybe it doesn’t cause any problems – but I wouldn’t rob the network of the knowledge that these two tokens serve different purposes.
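One way to keep the network aware of that difference (a small sketch, assuming you use nn.Embedding with the toy indices above) is to pass the <pad> index as padding_idx, so the padding vector stays all-zero and receives no gradient updates:

```python
import torch
import torch.nn as nn

pad_idx = 1  # the <pad> id from the toy vocabulary above
emb = nn.Embedding(num_embeddings=5, embedding_dim=8, padding_idx=pad_idx)

inputs = torch.tensor([[2, 3, 4, 1, 1, 1]])  # the padded sequence from above
out = emb(inputs)
print(out[0, -1])  # the <pad> positions come out as all-zero vectors
```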
Personally, I never use padding (and packing), but organize my batches in such a way that all sequences have the same length. This gives better performance, and I don’t have to worry about whether <pad> affects the learning. You also might want to look at torchtext’s BucketIterator, which minimizes padding out of the box.
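If you want to do that kind of batching by hand (just a sketch; the helper function and toy data are made up), you can group sequences by length so that no batch ever needs a <pad>:

```python
from collections import defaultdict

def batches_of_equal_length(sequences, batch_size):
    """Group sequences by length so each batch needs no padding (illustrative helper)."""
    by_length = defaultdict(list)
    for seq in sequences:
        by_length[len(seq)].append(seq)
    for seqs in by_length.values():
        for i in range(0, len(seqs), batch_size):
            yield seqs[i:i + batch_size]

# Toy data with three different lengths; batches never mix lengths, so no <pad> is needed
data = [[2, 3], [2, 0, 4], [2, 3, 4], [2, 0], [2, 4, 0, 2]]
for batch in batches_of_equal_length(data, batch_size=2):
    print(batch)
```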