PyTorch DataLoader with variable sequence length inputs

Imran_Rashid1 · October 6, 2020, 10:58am

Hi, I am trying to create a DataLoader that takes inputs of variable length, but I get an error whenever batch_size is greater than 1.

I have heard that there are ways around this using torch.nn.utils.rnn.pack_sequence and a custom collate function (the collate_fn argument of DataLoader), but it is unclear to me how this works. I have created a very simple example below: a PyTorch Dataset that produces random sequences of variable length.

I am unsure how to combine a collate_fn with torch.nn.utils.rnn.pack_sequence to create a DataLoader that accepts variable input lengths.

For clarity, I intend to use this with an LSTM, which is why the rnn.pack_sequence function looks relevant.

import numpy as np
from numpy.random import rand
from random import randint

import torch
from torch.utils.data import DataLoader, Dataset

class SequenceFactory(Dataset):
    """
    A Dataset that spits out arrays with a random size
    between 1 and 8.
    """

    def __init__(self):

        max_len = 8
        no_of_sequences = 100

        #create list of arrays of variable lengths between 1 and 8
        self.data = [rand(randint(1,max_len)) for seq in range(no_of_sequences)] 

    def __getitem__(self, index):

        return self.data[index]

    def __len__(self):
        return len(self.data) #100

data = SequenceFactory()
dataloader = DataLoader(data, batch_size=2,
                        shuffle=True)

next(iter(dataloader))

RuntimeError: stack expects each tensor to be equal size, but got [3] at entry 0 and [7] at entry 1
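
The error above comes from the DataLoader's default collate step, which tries to stack the samples of a batch into a single tensor and fails when their lengths differ. A minimal sketch of one workaround is a custom collate_fn that pads each batch to its longest sequence with torch.nn.utils.rnn.pad_sequence. The names ToySequences and pad_collate are hypothetical, and the dataset here stores tensors rather than NumPy arrays purely to keep the example short.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToySequences(Dataset):
    """Hypothetical stand-in for SequenceFactory: 1-D tensors of length 1 to 8."""

    def __init__(self, n=100, max_len=8):
        # torch.randint's upper bound is exclusive, hence max_len + 1
        lengths = torch.randint(1, max_len + 1, (n,))
        self.data = [torch.rand(int(length)) for length in lengths]

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


def pad_collate(batch):
    """Pad a list of 1-D tensors to a common length and keep the true lengths."""
    lengths = torch.tensor([len(seq) for seq in batch])
    # Result has shape (batch_size, longest_sequence_in_this_batch)
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths


loader = DataLoader(ToySequences(), batch_size=2, shuffle=True,
                    collate_fn=pad_collate)
padded, lengths = next(iter(loader))
```

Returning the lengths alongside the padded batch lets downstream code ignore the padding, for example by masking or by packing the batch before feeding it to a recurrent layer.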

ptrblck · October 8, 2020, 11:32pm

This post provides some implementations to deal with variable input shapes.
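
For the LSTM use case from the question, one possible approach (a sketch under stated assumptions, not taken from the linked post) is a collate function that returns a PackedSequence built with pack_sequence. The name pack_collate, the feature size of 1, and the hidden size of 4 are assumptions chosen for illustration.

```python
import numpy as np
import torch
from torch import nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence


def pack_collate(batch):
    """Turn a list of 1-D arrays into a PackedSequence (hypothetical helper)."""
    # Each sample becomes a float tensor of shape (seq_len, 1): one feature per step
    seqs = [torch.as_tensor(seq, dtype=torch.float32).unsqueeze(-1) for seq in batch]
    # enforce_sorted=False lets pack_sequence sort by length internally
    return pack_sequence(seqs, enforce_sorted=False)


# Two samples with the lengths from the error message in the question
batch = [np.random.rand(3), np.random.rand(7)]
packed = pack_collate(batch)

lstm = nn.LSTM(input_size=1, hidden_size=4, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack to a padded tensor of shape (batch, max_len, hidden_size) plus lengths
padded_out, lengths = pad_packed_sequence(packed_out, batch_first=True)
```

The same function can be handed to the loader as DataLoader(data, batch_size=2, shuffle=True, collate_fn=pack_collate), so that each batch arrives already packed for the LSTM.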