I am just getting started with PyTorch and I am stuck on the problem of processing variable-length sequences.
My task is to take two sequences and infer how similar they are.
I have come up with something like this, where q1 and q2 are just lists of integer word indices.
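Something along these lines (a minimal sketch, since the original snippet isn't shown here; the class name, sizes, and the cosine-similarity head are placeholder assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityNet(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, q1, q2):
        # q1, q2: LongTensors of word indices, shape (batch, seq_len)
        q1_embeds = self.embed(q1).mean(dim=-2)  # average the word vectors
        q2_embeds = self.embed(q2).mean(dim=-2)
        # similarity score per pair, in [-1, 1]
        return F.cosine_similarity(q1_embeds, q2_embeds, dim=-1)
```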
nn.Embedding has a padding_idx option (check the docs).
padding_idx marks the index used for padding: when q1 or q2 contains this index, the embedding returns zeros at those positions, and they are ignored for the gradient.
So you can send q1 and q2 as mini-batches padded to the maximum sequence length in the batch, using padding_idx as the fill value.
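For example, a minimal sketch of that setup (the vocabulary size, dimensions, and the choice of padding_idx=0 are assumptions):

```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)

# Two sequences of different lengths, padded to the batch maximum with 0.
q1 = torch.tensor([[5, 12, 7, 0, 0],
                   [3, 9, 14, 21, 2]])
q1_embeds = embed(q1)            # shape (2, 5, 8)

print(q1_embeds[0, 3])           # all zeros: padded positions embed to the zero vector
# The embedding row for padding_idx also receives no gradient updates.
```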
After getting q1_embeds and q2_embeds, and before the mean operation, you may have to torch.gather the real (non-padding) positions before taking the mean, or do the mean in a for loop, so that the padding doesn't dilute the average.
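For instance, a vectorized masked mean avoids both the gather and the loop (a sketch; padding_idx=0 and the tensor names are assumptions):

```python
import torch

def masked_mean(embeds, tokens, padding_idx=0):
    # embeds: (batch, max_len, dim); tokens: (batch, max_len)
    mask = (tokens != padding_idx).unsqueeze(-1).float()  # 1.0 at real tokens
    summed = (embeds * mask).sum(dim=1)                   # zero out padding, then sum
    lengths = mask.sum(dim=1).clamp(min=1)                # true lengths, avoid div-by-zero
    return summed / lengths                               # (batch, dim)
```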
Is there any way to work with “truly” variable-length sequences? For instance, if I wanted a CNN with an RNN on top, would I also need masking?
I just wanted to clarify: so PyTorch does not support variable-length sequences in a batch, and masking is mandatory? I had hoped it would be possible to avoid it.
Masking is mandatory, but PyTorch RNNs natively support variable-length sequences (created by pack_padded_sequence) and will correctly avoid processing the padding tokens, even for the reverse direction of a bidirectional RNN.
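For example, a sketch with made-up sizes (sequences sorted by decreasing length, as pack_padded_sequence expects by default):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

batch = torch.randn(2, 5, 8)   # (batch, max_len, features), zero-padded
lengths = [5, 3]               # true lengths, in decreasing order

packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, (h_n, c_n) = rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
# The padded steps are never processed, so the reverse direction starts from
# each sequence's real last token rather than from the padding.
```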
However, when the length distribution is very skewed, i.e. the maximum length is very large while most individual sequences are very short, padding wastes a lot of memory.
Is there any other way to resolve this?