Custom DataLoader (not Dataset)

I’m sampling minibatches of sequences and I want to store them in a tensor of dimension:

(num_sequences, max_sequence_length)

where max_sequence_length is the length of the longest sequence. Sequences shorter than max_sequence_length will be padded with an appropriate padding value.

Now, ideally I’d like to do the padding already in the DataLoader (so I can parallelize that step across the worker processes on the CPU) instead of doing it in the network’s forward function. However, the Dataset class only provides the __getitem__() function, and at that point I don’t yet know the length of the longest sequence in the minibatch.

Do you know of a tutorial on writing a custom DataLoader that could do these dimension adaptations on a batch?

I’ve found the solution: all you need to do is implement a custom function called:

collate_fn

It takes a batch (a list) of examples as returned by the Dataset’s __getitem__() function and should return the batch in whatever form the network (or the next consumer) expects, which is then passed to the DataLoader via its collate_fn argument.
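For anyone landing here later, a minimal sketch of what that could look like, assuming each Dataset item is a variable-length 1-D tensor and using torch.nn.utils.rnn.pad_sequence to do the actual padding (the dataset class, the pad_collate name, and the padding value 0 are just placeholders for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToySequenceDataset(Dataset):
    """Hypothetical dataset that returns variable-length 1-D tensors."""

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]


def pad_collate(batch, padding_value=0):
    # `batch` is a list of 1-D tensors of differing lengths;
    # pad_sequence stacks them into (num_sequences, max_sequence_length),
    # filling shorter sequences with `padding_value`.
    return pad_sequence(batch, batch_first=True, padding_value=padding_value)


if __name__ == "__main__":
    data = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
    loader = DataLoader(ToySequenceDataset(data), batch_size=3, collate_fn=pad_collate)
    for padded_batch in loader:
        print(padded_batch.shape)  # torch.Size([3, 3])
```

Since collate_fn runs inside the DataLoader workers when num_workers > 0, the padding happens on the CPU in parallel, which is exactly the behavior asked for above.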
