I understand how padding and pack_padded_sequence work, but I have a question about how this applies to a bidirectional LSTM.
Does the BiLSTM (nn.LSTM with bidirectional=True) automatically reverse the sequence (also when using pad_packed_sequence)?
If so, does the padding affect the first/last timesteps?
For example: seq1=[a, b, c, d, e], seq2=[x, y, z]
and after padding: seq1=[a, b, c, d, e], seq2=[x, y, z, 0, 0]
If we input seq2, does this mean that the BiLSTM
- takes inputs of [x, y, z, 0, 0] and [0, 0, z, y, x],
- or takes inputs of [x, y, z, 0, 0] and [z, y, x, 0, 0]?
Would you please help me clarify these points? Thank you very much
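To make the question concrete, here is a minimal sketch of the setup I mean (the dimensions and variable names are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

emb = 4  # arbitrary feature size
seq1 = torch.randn(5, emb)  # 5 valid timesteps
seq2 = torch.randn(3, emb)  # 3 valid timesteps, padded with zeros below

# Pad the batch to the longest sequence (length 5)
padded = torch.zeros(2, 5, emb)
padded[0] = seq1
padded[1, :3] = seq2

lstm = nn.LSTM(emb, hidden_size=3, bidirectional=True, batch_first=True)

# Pack with the true lengths (must be sorted in descending order here)
packed = nn.utils.rnn.pack_padded_sequence(
    padded, torch.tensor([5, 3]), batch_first=True)
out_packed, _ = lstm(packed)
out, lens = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)

print(out.shape)  # torch.Size([2, 5, 6]) -- 2 * hidden_size for the two directions
```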
I think the length information is not even required.
Note that we pad the short sequences with zeros. An RNN given hidden=None (at the first frame) actually creates hidden=torch.zeros(...). In this case, [z, y, x, 0, 0] and [0, 0, z, y, x] produce the same results.
But theoretically, both an inverse ([x, y, z, 0, 0] -> [0, 0, z, y, x]) and a shift ([0, 0, z, y, x] -> [z, y, x, 0, 0]) would be required if the padding values were not zeros.
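This can also be checked empirically. The sketch below (toy sizes, a fixed seed, and my own variable names, all chosen for illustration) compares a packed, padded batch against running the unpadded sequence alone; if the outputs match, the backward direction must have started at the last *valid* step (z), never touching the padding:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, bidirectional=True, batch_first=True)

seq1 = torch.randn(5, 4)  # 5 valid steps
seq2 = torch.randn(3, 4)  # 3 valid steps, zero-padded to length 5 below
batch = torch.zeros(2, 5, 4)
batch[0] = seq1
batch[1, :3] = seq2

packed = nn.utils.rnn.pack_padded_sequence(
    batch, torch.tensor([5, 3]), batch_first=True)
out_packed, _ = lstm(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)

# Reference: run the unpadded seq2 on its own.
out_ref, _ = lstm(seq2.unsqueeze(0))

# The packed output for seq2 matches the unpadded run, so with packing the
# backward direction reads [z, y, x] only -- the padded steps are skipped.
print(torch.allclose(out[1, :3], out_ref[0], atol=1e-6))
```

Without packing (feeding the padded batch directly to the LSTM), the backward direction would instead start from the padding steps, which is exactly why pack_padded_sequence matters for bidirectional models.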