Batch processing of sequences with different lengths


I am just getting started with PyTorch and I am stuck on the problem of processing variable-length sequences.
My task is to take two sequences and infer how similar they are.

I have come up with something like this. q1 and q2 are just lists of integer word indices.

    import torch
    import torch.nn as nn

    class SimilarityModule(nn.Module):
        def __init__(self):
            super().__init__()
            self.embedding = nn.Embedding(...)

        def forward(self, q1, q2):
            q1_embeds = self.embedding(q1)
            q2_embeds = self.embedding(q2)
            # mean over the whole sequence
            q1_repre = torch.mean(q1_embeds, 0)
            q2_repre = torch.mean(q2_embeds, 0)

            # cosine similarity: dot product divided by the L2 norms of both vectors
            dot_product = torch.dot(q1_repre, q2_repre) / torch.norm(q1_repre, 2) / torch.norm(q2_repre, 2)

            return dot_product

I can run the model on one sample:

    similarity = model(q1, q2)

My question is: how do I properly create a batch of variable-length sequences and feed it to the module? Thank you in advance for your help!

nn.Embedding has a padding_idx option (check the docs).
padding_idx sets the index used for padding. When q1 or q2 contains this index, the embedding returns zeros for it, and that entry is ignored when gradients are computed.

So you can send q1 and q2 as mini-batches, with every sequence padded with the value padding_idx up to the maximum sequence length.
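One way to build such a padded mini-batch is torch.nn.utils.rnn.pad_sequence. This is only a minimal sketch: PAD_IDX = 0 and the word indices are made-up values for illustration, not something from the original model.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed padding index; real word indices are assumed to start at 1

# Three sequences of word indices with different lengths (made-up values).
seqs = [
    torch.tensor([4, 9, 2]),
    torch.tensor([7, 3]),
    torch.tensor([5, 1, 8, 6]),
]

# Keep the true lengths; they are needed later for masking or packing.
lengths = torch.tensor([len(s) for s in seqs])

# Pad every sequence on the right up to the longest one: shape (batch, max_len).
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_IDX)
```

The resulting batch can be passed to an nn.Embedding created with padding_idx=PAD_IDX, and lengths tells you which positions are real.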

After getting q1_embeds and q2_embeds and before the mean operation, you might have to do a torch.gather of the specific (non-padding) indices before taking the mean, or do the mean operation in a for loop over the sequences.
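An alternative to torch.gather or a for loop is to multiply by a mask and divide by the number of real tokens. The sketch below assumes a padding index of 0, a toy vocabulary of 10 words and an embedding size of 4; masked_mean is a hypothetical helper name, not a PyTorch function.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=PAD_IDX)

def masked_mean(batch):
    """Mean of the embeddings over the real (non-padding) positions only."""
    embeds = embedding(batch)                                 # (B, T, D)
    mask = (batch != PAD_IDX).unsqueeze(-1).to(embeds.dtype)  # (B, T, 1)
    summed = (embeds * mask).sum(dim=1)                       # (B, D)
    counts = mask.sum(dim=1).clamp(min=1)                     # avoid division by zero
    return summed / counts

# Two padded sequences with true lengths 3 and 2 (made-up indices).
batch = torch.tensor([[4, 9, 2, 0],
                      [7, 3, 0, 0]])
repre = masked_mean(batch)  # (2, 4), one vector per sequence
```

With q1 and q2 batched this way, torch.nn.functional.cosine_similarity(q1_repre, q2_repre, dim=1) should give one similarity score per pair.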


Thank you,

Is there any way to work with “truly” variable-length sequences? For instance, if I wanted to have a CNN with an RNN on top, would I also need masking?

You would use padding in the CNN, then take the output and pass it to pack_padded_sequence for the RNN.
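Roughly, that could look like the sketch below. The layer sizes, the random input and the choice of nn.GRU are assumptions for illustration. The convolution uses kernel_size=3 with padding=1 so the time dimension keeps its length and the original lengths can be reused for packing.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
B, T, D, C, H = 2, 5, 4, 6, 3    # batch, max length, input, conv and hidden sizes (assumed)

x = torch.randn(B, T, D)         # stand-in for an already embedded, padded batch
lengths = torch.tensor([5, 3])   # true lengths (must live on the CPU)
x[1, 3:] = 0.0                   # padded time steps of the second sequence

conv = nn.Conv1d(D, C, kernel_size=3, padding=1)  # padding=1 keeps T unchanged
rnn = nn.GRU(C, H, batch_first=True)

# Conv1d expects (B, channels, T), so transpose in and out.
feats = conv(x.transpose(1, 2)).transpose(1, 2)   # (B, T, C)

# Packing drops the padded steps, so the RNN never sees them.
packed = pack_padded_sequence(feats, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = rnn(packed)

# Back to a padded tensor; positions past each length are filled with zeros.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```

Here h_n holds the hidden state at the last real step of each sequence, not at the last padded step.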

I just wanted to clarify - so pytorch does not support variable length sequences in the batch and masking is mandatory? I hoped it would be possible to avoid it :frowning:

Masking is mandatory but the PyTorch RNNs natively support variable length sequences (created by pack_padded_sequence) and will correctly avoid processing the padding tokens, even for the reverse direction of a bidirectional RNN.
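A small check of that behaviour, as a sketch with made-up sizes and random data: a short sequence run inside a packed batch through a bidirectional LSTM should give the same outputs as the same sequence run on its own.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
rnn = nn.LSTM(input_size=4, hidden_size=3, batch_first=True, bidirectional=True)

long_seq = torch.randn(5, 4)    # length 5
short_seq = torch.randn(2, 4)   # length 2

# Pad the short sequence with zeros up to length 5.
padded = torch.zeros(2, 5, 4)
padded[0] = long_seq
padded[1, :2] = short_seq
lengths = torch.tensor([5, 2])  # already sorted by decreasing length

with torch.no_grad():
    packed = pack_padded_sequence(padded, lengths, batch_first=True)
    packed_out, _ = rnn(packed)
    out, _ = pad_packed_sequence(packed_out, batch_first=True)  # (2, 5, 6)

    # The same short sequence on its own, without any padding.
    alone, _ = rnn(short_seq.unsqueeze(0))                      # (1, 2, 6)

# True if the padding had no influence, including in the reverse direction.
same = torch.allclose(out[1, :2], alone[0], atol=1e-5)
```

If the padded tensor were fed to the LSTM without packing, the reverse direction would start from the padding steps and the outputs for the short sequence would generally differ.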


Could you please point to an example where variable-length sequences are passed in a batch?


I do not understand what is really meant by “masking” - could anyone explain please?

However, what about when the sequences are very sparse, i.e. the maximum length is very large while each sequence might be very short? In this case, padding will waste a lot of memory.
Is there any other way to resolve this?