Padding sequence in LSTM

chinmay5 · December 10, 2019, 2:41pm

I have a few questions about padding sequences for an LSTM/GRU:

1. If the input data is padded with zeros and 0 is also a valid index in my vocabulary, does that hamper training?
2. After calling pack_padded_sequence, does PyTorch ensure that the padded positions are ignored during backprop?
3. Is it fine to compute the loss on the entire padded sequence?
4. While evaluating, I use the final hidden state h_n for my prediction task. Is this value already corrupted by the padding in my input sequence?

vdw · December 31, 2019, 4:45am

If you use 0 for padding, then it shouldn’t also be used for any word in your vocabulary. Otherwise the model cannot distinguish between padding and the respective word.
pack_padded_sequence should take care of backpropagating correctly; that is its overall purpose.

Regarding the last 3 points in general: You might want to consider creating batches where all sequences within one batch have the same length. No need for padding or packing; see this older post. It’s simply convenient, and I use it all the time.
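
As a minimal sketch of the packing point above (the toy sizes and random inputs are assumptions for illustration, not from this thread): with a packed batch, h_n for a short sequence should match what the LSTM produces on that sequence alone, whereas feeding the padded tensor directly lets the padded steps change h_n.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
embed_dim, hidden_dim = 4, 3
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# Two sequences with true lengths 5 and 2, padded to length 5.
lengths = torch.tensor([5, 2])
batch = torch.randn(2, 5, embed_dim)
batch[1, 2:] = 0.0  # positions past the true length are padding

# Packed run: the LSTM stops at each sequence's true length.
packed = pack_padded_sequence(
    batch, lengths, batch_first=True, enforce_sorted=False
)
_, (h_packed, _) = lstm(packed)

# Reference: the short sequence on its own, with no padding at all.
_, (h_alone, _) = lstm(batch[1:2, :2])

# Unpacked run: the LSTM keeps stepping over the padded positions.
_, (h_padded, _) = lstm(batch)

matches_unpadded = torch.allclose(h_packed[0, 1], h_alone[0, 0], atol=1e-5)
padding_changes_state = not torch.allclose(
    h_padded[0, 1], h_alone[0, 0], atol=1e-5
)
print(matches_unpadded, padding_changes_state)
```

So if h_n is taken from a packed run, it should reflect only the real tokens; if the padded tensor is passed in unpacked, h_n for the shorter sequences also includes the padded steps.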

chinmay5 · January 10, 2020, 1:39pm

@vdw thank you so much for the reply. I am using GloVe pretrained embeddings. Do you think it would be a better idea to start using the ‘unk’ token for padding? For the time being, I have started adding an ‘eos’ token at the end of each sentence. My idea is that the network would learn to ignore all padded tokens beyond this ‘unk’ token. Since I am still new to using LSTMs, your help would be highly appreciated.

zhangguanheng66 · January 10, 2020, 4:04pm

I think unk and pad have different purposes. For example, you might have some unknown words in your test/evaluation datasets, because the vocabulary is built on the training dataset. In that case, you want to use unk for the unknown words. Padding, on the other hand, is for the scenario where the sequences have different lengths.
Anyway, it would be an interesting experiment to use the unk token for padding.

vdw · January 11, 2020, 2:10am

As @zhangguanheng66 rightly said, <unk> and <pad> serve different purposes. <unk> represents tokens that are not in your vocabulary, while <pad> is a kind of non-token used to create batches with sequences of equal length. Some very simple consequences:

<pad> can only be at the end or start of a sequence (depending on which “side” you pad), while <unk> can be anywhere in the non-padded part of the sequence.
If, say, you pad at the end, there can never be an <unk> (or any other token) after the first <pad>.
Since <pad> tokens are a kind of filler, they are potentially not great to have; it’s not obvious (to me) how <pad> affects the learning. You generally cannot avoid <unk>.

In short, there are arguably differences between <unk> and <pad>. Sure, you can use the same index value to represent both tokens, and maybe it doesn’t cause any problems, but I wouldn’t rob the network of the knowledge that these two tokens serve different purposes.
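
A minimal sketch of keeping the two tokens apart, assuming a hypothetical four-entry vocabulary and index assignment (neither is from this thread). It relies on the padding_idx argument of nn.Embedding, which keeps the <pad> row at zero and excludes it from gradient updates, while <unk> stays an ordinary trainable row:

```python
import torch
from torch import nn

# Hypothetical toy vocabulary; the index assignments are assumptions.
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3}
emb = nn.Embedding(len(vocab), 5, padding_idx=vocab["<pad>"])


def encode(tokens, max_len):
    """Map out-of-vocabulary words to <unk>, then right-pad with <pad>."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# "dog" is not in the vocabulary, so it becomes <unk>.
ids = torch.tensor([encode(["the", "dog"], max_len=4)])
out = emb(ids)
out.sum().backward()

pad_grad = emb.weight.grad[vocab["<pad>"]]  # stays zero
unk_grad = emb.weight.grad[vocab["<unk>"]]  # receives a gradient
```

When starting from pretrained vectors, nn.Embedding.from_pretrained accepts the same padding_idx argument.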

Personally, I never use padding (and packing), but organize my batches in such a way that all sequences within a batch have the same length. This gives better performance, and I don’t have to worry about whether <pad> affects the learning. You also might want to look at torchtext’s BucketIterator, which minimizes padding out of the box.
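
One way the equal-length batching could look, as a rough sketch in plain Python. The helper name equal_length_batches and the toy data are hypothetical, and this is not necessarily how the batches described above are built:

```python
import random
from collections import defaultdict


def equal_length_batches(sequences, batch_size, seed=0):
    """Group sequences by length so no batch ever needs padding."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)

    rng = random.Random(seed)
    batches = []
    for same_len in buckets.values():
        rng.shuffle(same_len)  # shuffle within one length bucket
        for i in range(0, len(same_len), batch_size):
            batches.append(same_len[i:i + batch_size])

    rng.shuffle(batches)  # mix lengths across training steps
    return batches


# Toy token-id sequences of lengths 1, 2 and 3.
data = [[1, 2], [3, 4, 5], [6, 7], [8, 9, 10], [11, 12], [13]]
batches = equal_length_batches(data, batch_size=2)
for b in batches:
    print(len(b[0]), b)
```

The trade-off is that some batches come out smaller than batch_size when a length is rare, and each batch contains only one length, which is why the batch order is shuffled afterwards.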

chethanjjj · March 27, 2021, 7:21pm

For padding, in your experience, do you recommend padding at the start or end? I’ve seen discussions for both.

Brando_Miranda · July 14, 2021, 8:14pm

Hi Chris @vdw,

I was wondering, how have your practices for padding changed since the rise of the transformer?

It seems to me that the best thing to do in that case is to pad with a zero vector (of size equal to the embedding dimension). That way, in self-attention, when we compare a real token with a pad token (which is just zero), the resulting attention is zero. Is this not what is done?

Brando_Miranda · July 14, 2021, 8:21pm

Never mind: the padding is taken care of by masking with -inf, because of how the softmax works in MHA; see: How to add padding mask to nn.TransformerEncoder module?
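
A small plain-Python sketch of why the -inf mask works where a zero vector does not. The function masked_softmax and the scores are illustrative assumptions, not the actual nn.MultiheadAttention internals. A zero key only yields a raw score of 0, which the softmax still turns into a positive weight; setting the score to -inf is what drives the weight to exactly zero.

```python
import math


def masked_softmax(scores, is_pad):
    """Softmax over attention scores, with padded positions set to -inf.

    Assumes at least one position is not padding.
    """
    masked = [float("-inf") if pad else s for s, pad in zip(scores, is_pad)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Third position is padding: its weight becomes exactly zero.
weights = masked_softmax([2.0, 1.0, 0.5], [False, False, True])

# A zero-vector key would give a raw score of 0.0, which still gets weight.
unmasked = masked_softmax([2.0, 1.0, 0.0], [False, False, False])

print(weights)
print(unmasked)
```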