As @zhangguanheng66 rightly said, `<unk>` and `<pad>` serve different purposes. `<unk>` represents tokens that are not in your vocabulary, while `<pad>` is a kind of non-token used to create batches with sequences of equal length. Some very simple consequences:
- `<pad>` can only be at the end or start of a sequence (depending on which "side" you pad), while `<unk>` can be anywhere in the non-padded part of the sequence (see the sketch after this list)
- If, say, you pad at the end, there can never be an `<unk>` (or any other token) after the first `<pad>`
- Since `<pad>` tokens are essentially filler tokens, they are potentially not great to have; it's not obvious (to me) how `<pad>` affects the learning. You generally cannot avoid `<unk>`.
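To make these consequences concrete, here is a minimal sketch with a toy vocabulary; the words, indices, and the `encode` helper are made up purely for illustration:

```python
# Toy vocabulary with distinct <pad> and <unk> indices, batch padded at the end.
import torch

vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4}

def encode(tokens, max_len):
    # Out-of-vocabulary tokens map to <unk>; shorter sequences are
    # filled with <pad> at the end up to max_len.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

batch = torch.tensor([
    encode(["the", "cat", "sat"], max_len=5),          # no <unk>, two <pad>
    encode(["the", "dog", "sat", "down"], max_len=5),  # "dog", "down" -> <unk>
])
print(batch)
# tensor([[2, 3, 4, 0, 0],
#         [2, 1, 4, 1, 0]])
# <unk> (1) shows up inside the non-padded part; <pad> (0) only at the end.
```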
In short, there are arguably differences between `<unk>` and `<pad>`. Sure, you can use the same index value to represent both tokens – and maybe it doesn't cause any problems – but I wouldn't rob the network of the knowledge that these two tokens serve different purposes.
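One practical consequence of keeping distinct indices: PyTorch's `nn.Embedding` takes a `padding_idx` argument that keeps the `<pad>` vector at zero and excludes it from gradient updates, which you can't do if `<pad>` and `<unk>` share an index. A short sketch, reusing the toy indices from above:

```python
import torch
import torch.nn as nn

PAD_IDX, UNK_IDX = 0, 1  # distinct indices, as in the toy vocabulary above

# padding_idx tells the embedding that index 0 is a filler token:
# its vector stays at zero and receives no gradient updates.
emb = nn.Embedding(num_embeddings=5, embedding_dim=4, padding_idx=PAD_IDX)

batch = torch.tensor([[2, 3, 4, PAD_IDX, PAD_IDX],
                      [2, UNK_IDX, 4, UNK_IDX, PAD_IDX]])
out = emb(batch)
print(out[0, 3])  # all zeros: the <pad> positions carry no signal
print(out[1, 1])  # a learned vector: <unk> is a real (if generic) token
```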
Personally, I never use padding (and packing), but organize my batches in such a way that all sequences have the same length. This gives better performance, and I don't have to worry about whether `<pad>` affects the learning. You also might want to look at torchtext's `BucketIterator`, which minimizes padding out of the box.
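For reference, a minimal sketch of what "organize batches so all sequences have the same length" could look like; `length_bucketed_batches` is a made-up helper, not library code:

```python
import random
from collections import defaultdict

def length_bucketed_batches(sequences, batch_size):
    # Group sequences by their exact length, then cut each group into
    # batches; every batch then contains sequences of a single length,
    # so no <pad> tokens are needed at all.
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    batches = []
    for same_length in buckets.values():
        random.shuffle(same_length)
        for i in range(0, len(same_length), batch_size):
            batches.append(same_length[i:i + batch_size])
    random.shuffle(batches)  # mix batch order across different lengths
    return batches
```

The last batch of each length group may be smaller than `batch_size`, which is usually an acceptable trade-off for avoiding padding entirely.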