Building batches faster than with __getitem__?

In a case where my data fits in memory as a NumPy array, I noticed that batching the data through __getitem__ from the Dataset interface is much slower than indexing it manually with NumPy.

I am fairly sure this is because the DataLoader builds batches sample by sample, calling __getitem__ once per sample.

Is there any workaround to build batches faster while still using the standard Dataset / DataLoader interface?

Hi @veda101, could you clarify a small detail?

You mentioned that the data fits in memory, so do you read the entire dataset in __init__, or do you read it lazily in __getitem__?

I can, for instance, load all the data into memory in __init__ and then access each row in __getitem__, but because __getitem__ fetches rows one by one, it is definitely slower than fetching a slice in NumPy like data[0:batch_size].
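To illustrate the difference being described, here is a minimal sketch (array shapes and sizes are made up for the example): fetching rows one at a time and stacking them, as a per-sample __getitem__ plus the default collate would do, versus taking a single vectorized slice. Both produce the same batch, but the slice is one NumPy call instead of batch_size calls.

```python
import numpy as np

# Toy in-memory dataset: 10,000 samples of 64 features each
data = np.random.rand(10_000, 64).astype(np.float32)
batch_size = 256

# Per-sample access: one fetch per index, then a copy to stack them,
# mimicking what __getitem__ + default collation does
batch_loop = np.stack([data[i] for i in range(batch_size)])

# Vectorized access: one slice for the whole batch
batch_slice = data[0:batch_size]

# Both paths yield the identical batch
assert np.array_equal(batch_loop, batch_slice)
```

The slice avoids the per-row Python-level overhead, which is where the slowdown you describe comes from.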

You could use BatchSampler to pass a whole batch of indices to __getitem__ and build all of the batch's samples in a single call, if that fits your use case.
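A minimal sketch of that approach (the InMemoryDataset class and the toy array are my own illustration): passing a BatchSampler as the sampler makes the DataLoader hand a list of indices to __getitem__, and setting batch_size=None disables the default per-sample collation, so each __getitem__ call can return a whole batch built with one vectorized NumPy index.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class InMemoryDataset(Dataset):
    def __init__(self, array):
        self.data = array  # entire NumPy array kept in memory

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # With a BatchSampler as sampler, idx is a list of indices,
        # so this is a single fancy-indexing call for the whole batch
        return torch.from_numpy(self.data[idx])

data = np.arange(1000, dtype=np.float32).reshape(100, 10)
ds = InMemoryDataset(data)

loader = DataLoader(
    ds,
    sampler=BatchSampler(SequentialSampler(ds), batch_size=32, drop_last=False),
    batch_size=None,  # turn off automatic batching; __getitem__ already returns a batch
)

shapes = [tuple(batch.shape) for batch in loader]
```

With 100 samples and batch_size=32 this yields three batches of 32 and a final batch of 4, each produced by one __getitem__ call instead of 32.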
