I have a few doubts regarding padding sequences in an LSTM/GRU:
If the input data is padded with zeros, and 0 also happens to be a valid index in my vocabulary, does that hamper training?
After doing a pack_padded_sequence, does PyTorch take care of ensuring that the padded positions are ignored during backprop?
Is it fine to compute the loss on the entire padded sequence?
While evaluating, I use the value of the final hidden state h_n for my prediction tasks. Is this value already corrupted by the padding of my input sequence?
If you use 0 for padding, then it shouldn't also be used for any word in your vocabulary. Otherwise the model cannot distinguish between padding and the respective word.
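For what it's worth, a common way to enforce this is to reserve a dedicated index for <pad> and pass it as padding_idx to nn.Embedding; the padding vector is then kept at zero and receives no gradient. A minimal sketch with a made-up vocabulary:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: index 0 is reserved for <pad>, real words start at 1.
word2idx = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}

# padding_idx tells the embedding layer that index 0 is padding:
# its vector is initialized to zeros and receives no gradient updates.
embedding = nn.Embedding(num_embeddings=len(word2idx), embedding_dim=5, padding_idx=0)

batch = torch.tensor([[1, 2, 3],
                      [2, 3, 0]])  # second sequence padded with index 0
print(embedding(batch).shape)      # torch.Size([2, 3, 5])
print(embedding.weight[0])         # all zeros, and it stays that way
```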
pack_padded_sequence should take care of backpropagating correctly; that's its whole purpose.
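To illustrate with a minimal sketch (shapes and lengths are made up): when the input is packed, the LSTM stops at each sequence's true length, so h_n corresponds to the last real time step of every sequence rather than to a padded position.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Made-up batch: 3 padded sequences (batch_first) together with their true lengths.
padded = torch.randn(3, 6, 10)      # (batch, max_len, input_size)
lengths = torch.tensor([6, 4, 2])   # real lengths before padding

lstm = nn.LSTM(input_size=10, hidden_size=8, batch_first=True)

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# h_n is taken at each sequence's true last step, not at a padded position.
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(h_n.shape)   # (1, 3, 8)
print(out.shape)   # (3, 6, 8); positions past each sequence's length are zeros
```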
Regarding the last 3 points in general: You might want to consider creating batches where all sequences within one batch have the same length. No need for padding or packing; see this older post. It's simply convenient, and I use it all the time.
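In case it helps, a rough sketch of that idea: group a (made-up) list of tokenized sequences by length and yield batches in which every sequence has the same length, so no <pad> is needed at all.

```python
import random
from collections import defaultdict

def equal_length_batches(sequences, batch_size):
    """Group sequences by length and yield batches of equal-length sequences."""
    by_length = defaultdict(list)
    for seq in sequences:
        by_length[len(seq)].append(seq)
    batches = []
    for seqs in by_length.values():
        for i in range(0, len(seqs), batch_size):
            batches.append(seqs[i:i + batch_size])
    random.shuffle(batches)   # shuffle the order of batches, not the lengths within them
    return batches

# Hypothetical token-index sequences of varying lengths.
data = [[1, 2, 3], [4, 5], [6, 7, 8], [9, 10], [11, 12, 13, 14]]
for batch in equal_length_batches(data, batch_size=2):
    print(batch)   # every sequence inside a batch has the same length
```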
@vdw thank you so much for the reply. I am using pretrained GloVe embeddings. Do you think it would be a better idea to start using the 'unk' token for padding? For the time being, I have started adding an 'eos' token at the end of each sentence. My idea is that the network would learn to ignore all padded 'unk' tokens beyond this point. Since I am still new to using LSTMs, your help would be highly appreciated.
I think unk and pad have different purposes. For example, you might have some unknown words in your test/evaluation data while the vocabulary is built on the training data. In that case, you want to use unk for the unknown words. Padding, on the other hand, is for the scenario where the sequences in a batch have different lengths.
Anyway, it would be an interesting experiment to use the unk token for padding.
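To make the distinction concrete, a small sketch with a toy vocabulary: unseen words at test time map to <unk>, while <pad> is only appended to even out sequence lengths within a batch.

```python
# Toy vocabulary built from a training set; <pad> and <unk> get their own indices.
word2idx = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4}

def encode(tokens, max_len):
    # Unknown words fall back to <unk>; <pad> fills the sequence up to max_len.
    idxs = [word2idx.get(tok, word2idx["<unk>"]) for tok in tokens]
    return idxs + [word2idx["<pad>"]] * (max_len - len(idxs))

print(encode(["the", "dog", "sat"], max_len=5))  # [2, 1, 4, 0, 0]
```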
As @zhangguanheng66 rightly said, <unk> and <pad> serve different purposes. <unk> represents tokens that are not in your vocabulary, while <pad> is a kind of non-token used to create batches with sequences of equal length. Some very simple consequences:
<pad> can only be at the end or start of a sequence (depending on which 'side' you pad), while <unk> can be anywhere in the non-padded part of the sequence.
If, say, you pad at the end, there can never be an <unk> (or any other token) after the first <pad>.
Since <pad> tokens are a kind of filler, they are potentially not great to have; it's not obvious (to me) how <pad> affects the learning. You generally cannot avoid <unk>.
In short, there are arguably differences between <unk> and <pad>. Sure, you can use the same index value to represent both tokens, and maybe it doesn't cause any problems, but I wouldn't rob the network of the knowledge that these two tokens serve different purposes.
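On a related note, and touching on the earlier question about computing the loss on the entire padded sequence: if the targets are padded as well (e.g. in tagging or language modeling), you can keep <pad> from influencing the loss at all via ignore_index. A minimal sketch with made-up shapes and indices:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed index of <pad> in the (hypothetical) vocabulary

# Targets equal to ignore_index contribute neither to the loss nor to the gradient.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.randn(2, 6, 5)                 # (batch, seq_len, vocab_size)
targets = torch.tensor([[2, 3, 4, 1, 2, 3],
                        [4, 2, 0, 0, 0, 0]])  # second sequence is padded

loss = criterion(logits.view(-1, 5), targets.view(-1))
print(loss)
```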
Personally, I never use padding (and packing), but organize my batches in such a way that all sequences have the same length. This performs better, and I don't have to worry about whether <pad> affects the learning. You also might want to look at torchtext's BucketIterator, which minimizes padding out of the box.
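For reference, BucketIterator comes from (legacy) torchtext rather than core PyTorch, and the API has moved around between versions; roughly, it groups examples of similar length so that each batch needs little or no padding. A rough sketch against the legacy API, with placeholder file and field names:

```python
# Legacy torchtext API (torchtext.data in <=0.8, torchtext.legacy.data in 0.9-0.11);
# removed in newer releases, so treat this only as a rough sketch.
from torchtext.legacy import data

TEXT = data.Field(sequential=True, include_lengths=True)
LABEL = data.LabelField()

# Placeholder dataset: a CSV with 'text' and 'label' columns.
train_data = data.TabularDataset(
    path="train.csv", format="csv",
    fields=[("text", TEXT), ("label", LABEL)])

TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

# Buckets examples of similar length together, minimizing padding per batch.
train_iter = data.BucketIterator(
    train_data, batch_size=32,
    sort_key=lambda ex: len(ex.text), sort_within_batch=True)
```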
I was wondering, how have your practices for padding changed since the rise of the transformer?
It seems to me that the best thing to do in that case is to pad with a zero vector (of size equal to the embedding dimension). That way, with self-attention, when we compare a real token and a pad token (which is just zeros), the resulting attention is zero. Is this not what is done?
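For what it's worth, a common alternative to relying on zero embedding vectors is to mask the padded positions explicitly; nn.MultiheadAttention, for example, accepts a key_padding_mask so that padded keys end up with zero attention weight. A minimal sketch with made-up shapes and lengths:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
mha = nn.MultiheadAttention(embed_dim, num_heads)  # default layout: (seq_len, batch, embed_dim)

seq_len, batch_size = 5, 2
x = torch.randn(seq_len, batch_size, embed_dim)

# True marks padded positions that attention should ignore.
# Here the second (made-up) sequence has its last two positions padded.
key_padding_mask = torch.tensor([[False, False, False, False, False],
                                 [False, False, False, True,  True]])  # (batch, seq_len)

out, attn_weights = mha(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)           # (5, 2, 16)
print(attn_weights.shape)  # (2, 5, 5); columns for padded keys are zero
```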