Padding sequence in LSTM

I have a few doubts regarding padding sequences in an LSTM/GRU:

  • If the input data is padded with zeros and 0 is also a valid index in my vocabulary, does that hamper the training?

  • After calling pack_padded_sequence, does PyTorch take care of ensuring that the padded positions are ignored during backprop?

  • Is it fine to compute the loss over the entire padded sequence?

  • While evaluating, I use the value of the final hidden state h_n for my prediction tasks. Is this value already corrupted by the padding of my input sequence?

  • If you use 0 for padding, then it shouldn't also be used for any word in your vocabulary. Otherwise the model cannot distinguish between padding and the respective word.
  • pack_padded_sequence should take care of backpropagating correctly; that's its whole purpose (see the sketch after this list).
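
A minimal sketch of how those two pieces fit together (the vocabulary size, dimensions, and token ids below are made up for illustration): index 0 is reserved for padding via `padding_idx`, and `pack_padded_sequence` makes the LSTM stop at each sequence's true length, so the padded steps feed neither into `h_n` nor into the gradients.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

vocab_size, embed_dim, hidden_dim, pad_idx = 100, 32, 64, 0  # made-up sizes

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)  # the pad row stays zero
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# Two sequences of true lengths 5 and 3, padded with pad_idx up to length 5
seqs = torch.tensor([[ 4,  8, 15, 16, 23],
                     [42,  7,  9,  0,  0]])
lengths = torch.tensor([5, 3])

packed = pack_padded_sequence(embedding(seqs), lengths,
                              batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# h_n holds the hidden state at each sequence's *true* last step,
# so the padded positions never enter the recurrence or the backward pass.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(h_n.shape, out.shape)  # torch.Size([1, 2, 64]) torch.Size([2, 5, 64])
```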

Regarding the last three points in general: you might want to consider creating batches where all sequences within one batch have the same length. Then there is no need for padding or packing; see this older post. It's simply convenient, and I use it all the time.
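
If it helps, here is one rough way to build such equal-length batches; the helper below is just an illustrative sketch in plain Python, not the code from the linked post:

```python
import random
from collections import defaultdict

def equal_length_batches(sequences, batch_size):
    """Group already-tokenized sequences into batches in which every
    sequence has the same length, so no padding or packing is needed."""
    by_len = defaultdict(list)
    for seq in sequences:
        by_len[len(seq)].append(seq)
    batches = []
    for seqs in by_len.values():
        random.shuffle(seqs)
        for i in range(0, len(seqs), batch_size):
            batches.append(seqs[i:i + batch_size])
    random.shuffle(batches)  # avoid presenting lengths in a fixed order
    return batches
```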


@vdw thank you so much for the reply. I am using GloVe pretrained embeddings. Do you think it would be a better idea to start using the 'unk' token for padding? For the time being, I started adding an 'eos' token at the end of each sentence. My idea is that the network would learn to ignore all padded tokens beyond this 'unk' token. Since I am still new to using LSTMs, your help would be highly appreciated.

I think unk and pad have different purposes. For example, you might have some unknown words in your test/evaluation dataset while the vocabulary is built on the training dataset. In that case, you want to use unk for the unknown words. Padding, on the other hand, is obviously for the scenario where the sequences have different lengths.
Anyway, it would be an interesting experiment to use the unk token for padding.
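
To make the difference concrete, here is a tiny numericalization sketch (the vocabulary and helper function are hypothetical): out-of-vocabulary words map to <unk>, while <pad> only fills the tail up to a fixed length.

```python
# Hypothetical vocabulary; indices 0 and 1 are reserved for the special tokens
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4}

def numericalize(tokens, max_len):
    # Unknown words map to <unk>; the tail is filled with <pad> up to max_len
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

print(numericalize(["the", "aardvark", "sat"], max_len=5))
# [2, 1, 4, 0, 0]  ->  "aardvark" became <unk>, the padding became <pad>
```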

As @zhangguanheng66 rightly said, <unk> and <pad> serve different purposes. <unk> represents tokens that are not in your vocabulary, while <pad> is a kind of non-token used to create batches of sequences with equal length. Some very simple consequences:

  • <pad> can only be at the end or the start of a sequence (depending on which "side" you pad), while <unk> can be anywhere in the non-padded part of the sequence
  • If, say, you pad at the end, there can never be an <unk> (or any other token) after the first <pad>
  • Since <pad> tokens are essentially filler, they are potentially not great to have; it's not obvious (to me) how <pad> affects the learning. You generally cannot avoid <unk>.

In short, there are arguably differences between <unk> and <pad>. Sure, you can use the same index value to represent both tokens, and maybe it doesn't cause any problems, but I wouldn't rob the network of the knowledge that these two tokens serve different purposes.
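
One common way to keep the two tokens distinct in PyTorch is to give them separate indices and hand the padding index to nn.Embedding via padding_idx; the <pad> row is then initialized to zeros and receives no gradient updates, while <unk> remains an ordinary learned vector. The index values below are just an assumed convention:

```python
import torch.nn as nn

PAD_IDX, UNK_IDX = 0, 1  # assumed index assignments, kept distinct on purpose
emb = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=PAD_IDX)

print(emb.weight[PAD_IDX])  # the <pad> row starts as zeros and its gradient is always zeroed
print(emb.weight[UNK_IDX])  # the <unk> row is an ordinary learned vector
```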

Personally, I never use padding (and packing), but organize my batches in such a way that all sequences have the same length. This performs better, and I don't have to worry about whether <pad> affects the learning. You also might want to look at torchtext's BucketIterator, which minimizes padding out of the box.
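
Conceptually, BucketIterator sorts examples by length before batching so that each batch needs little or no padding. A rough hand-rolled sketch of that idea (not torchtext's actual implementation):

```python
import random
import torch
from torch.nn.utils.rnn import pad_sequence

def bucket_batches(sequences, batch_size, pad_idx=0):
    """Sort sequences by length, slice them into batches, and pad within
    each batch, so very little padding is needed overall."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = [torch.tensor(sequences[i]) for i in idx]
        lengths = torch.tensor([len(sequences[i]) for i in idx])
        batches.append((pad_sequence(batch, batch_first=True, padding_value=pad_idx), lengths))
    random.shuffle(batches)  # shuffle batch order, keep lengths homogeneous within a batch
    return batches
```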


For padding, in your experience, do you recommend padding at the start or the end? I've seen discussions for both.

Hi Chris @vdw,

I was wondering, how have your practices for padding changed since the rise of the transformer?

It seems to me that the best thing to do in that case is to pad with a zero vector (equal in size to the embedding dim). That way, with self-attention, when we compare a real token and a pad token (which is just zeros), the resulting attention is zero. Is this not what is done?

Never mind, the padding I had in mind is handled with -inf because of how the softmax works in MHA; see: How to add padding mask to nn.TransformerEncoder module?
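
For reference, a minimal sketch of passing that padding mask to nn.TransformerEncoder (sizes and token ids are made up): positions marked True in src_key_padding_mask are ignored by the attention, which internally amounts to setting their scores to -inf before the softmax.

```python
import torch
import torch.nn as nn

embed_dim, n_heads, pad_idx = 16, 4, 0  # made-up sizes

embedding = nn.Embedding(100, embed_dim, padding_idx=pad_idx)
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Batch of 2 sequences padded to length 5 with pad_idx
tokens = torch.tensor([[ 4,  8, 15, 16, 23],
                       [42,  7,  9,  0,  0]])

padding_mask = tokens.eq(pad_idx)  # True marks positions the attention should ignore
out = encoder(embedding(tokens), src_key_padding_mask=padding_mask)
print(out.shape)  # torch.Size([2, 5, 16])
```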