Hi, I am trying to create a DataLoader that handles variable-length inputs, but I get an error whenever batch_size is greater than 1.
I have heard that there are ways around this using torch.nn.utils.rnn.pack_sequence and a custom collate_fn,
but it is unclear to me how this works, so I have created a very simple example of a PyTorch Dataset
that creates random sequences of variable lengths below.
I am unsure how to combine collate_fn with torch.nn.utils.rnn.pack_sequence to create a DataLoader that accepts variable input lengths.
For clarity, I intend to use this with an LSTM, which is why the rnn.pack_sequence
function looks relevant.
import numpy as np
from numpy.random import rand
from random import randint
import torch
from torch.utils.data import DataLoader, Dataset

class SequenceFactory(Dataset):
    """
    A Dataset that spits out arrays with a random size
    between 1 and 8.
    """
    def __init__(self):
        max_len = 8
        no_of_sequences = 100
        # create list of arrays of variable lengths between 1 and 8
        self.data = [rand(randint(1, max_len)) for seq in range(no_of_sequences)]

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)  # 100

data = SequenceFactory()
dataloader = DataLoader(data, batch_size=2, shuffle=True)
next(iter(dataloader))
Running this fails with:
RuntimeError: stack expects each tensor to be equal size, but got [3] at entry 0 and [7] at entry 1
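
The error comes from the default collate step, which tries to stack the samples of a batch into one tensor. A minimal sketch of one possible way around it is below, under some assumptions: the name pack_collate is hypothetical, each sample is treated as a sequence with a single feature per time step (so the LSTM uses input_size=1), and a small fixed list of tensors stands in for the dataset so the shapes are predictable. It is an illustration, not a verified fix for the code above.

```python
import torch
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence
from torch.utils.data import DataLoader


def pack_collate(batch):
    """Hypothetical collate function: pack a list of 1-D sequences.

    Each sample (NumPy array or tensor) becomes a float tensor of shape
    (seq_len, 1), i.e. one feature per time step (an assumption).
    """
    seqs = [torch.as_tensor(s, dtype=torch.float32).unsqueeze(-1) for s in batch]
    # enforce_sorted=False lets pack_sequence handle unsorted lengths itself.
    return pack_sequence(seqs, enforce_sorted=False)


# Stand-in dataset: a plain list of tensors with lengths 3, 7, 1 and 5.
toy_data = [torch.rand(n) for n in (3, 7, 1, 5)]
loader = DataLoader(toy_data, batch_size=2, shuffle=False, collate_fn=pack_collate)

# The first batch holds the sequences of length 3 and 7.
packed = next(iter(loader))

# An LSTM accepts the PackedSequence directly.
lstm = torch.nn.LSTM(input_size=1, hidden_size=4)
output, (h_n, c_n) = lstm(packed)

# Unpack to a padded tensor of shape (max_len, batch, hidden) plus the lengths.
padded, lengths = pad_packed_sequence(output)
```

With the original dataset, the same function would be passed as DataLoader(data, batch_size=2, shuffle=True, collate_fn=pack_collate). If padded tensors are preferred over a PackedSequence, torch.nn.utils.rnn.pad_sequence could be used inside the collate function instead.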