Write DataLoader's collate_fn with pad_packed_sequence for LSTMs

I’m very new to PyTorch and my problem involves LSTMs with inputs of variable sizes.

Because each training example has a different size, what I’m trying to do is write a custom collate_fn to use with DataLoader to create mini-batches of my data. To my understanding, I’d need to implement my own collate_fn and use pad_packed_sequence somehow…

I have tried looking at examples online, but nobody seems to be doing it like me :frowning:
The problem arises from the fact that collate_fn gets passed the input batch as a list of (training_tensor, label) tuples, and I can’t figure out how to properly convert it into a suitable datatype and pass the tensors to pad_packed_sequence.

What is the correct way of doing this?
A more general question: is there something better than DataLoader that works nicely with LSTMs?


I figured this out myself a while back. I posted a detailed solution here:

https://www.codefull.org/2018/11/use-pytorchs-dataloader-with-variable-length-sequences-for-lstm-gru/
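In case the link goes stale, here is a minimal sketch of the general idea (pad the variable-length sequences inside collate_fn and also return their lengths). The names pad_collate and my_dataset are just placeholders, and I'm assuming each dataset item is a (sequence_tensor, label) pair as described above:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # batch is a list of (sequence_tensor, label) tuples produced by the Dataset
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Pad all sequences in the batch up to the length of the longest one:
    # result has shape (batch_size, max_len, feature_dim)
    padded = pad_sequence(sequences, batch_first=True)
    labels = torch.tensor(labels)  # assumes integer class labels
    return padded, lengths, labels

# Usage (my_dataset is whatever Dataset you already have):
# loader = torch.utils.data.DataLoader(my_dataset, batch_size=32, collate_fn=pad_collate)
```

The lengths tensor is what you later feed to pack_padded_sequence inside the model before the LSTM.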


@Maghoumi thanks for your post! Someone also posted a similar solution here later in 2019, in case it’s useful to anyone:

Also note that a boolean parameter enforce_sorted was added to torch.nn.utils.rnn.pack_padded_sequence() in December 2018 (see this issue), so there is no longer any need to provide sequences sorted by length; they are sorted internally when enforce_sorted=False.
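To illustrate with a self-contained toy example (the tensor sizes here are arbitrary, just to show the call with enforce_sorted=False on an unsorted batch):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Toy batch of unsorted, variable-length sequences with feature size 8
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(7, 8)]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)  # shape (3, 7, 8)

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# No need to sort the batch by length first; enforce_sorted=False handles it internally
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
# Convert the packed output back to a padded tensor plus the per-sequence lengths
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```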