Building batches faster than with __getitem__?

In a case where my data fits in memory as a NumPy array, I noticed that batching the data through __getitem__ from the Dataset interface is much slower than indexing it manually with NumPy.

I am fairly sure this is because the DataLoader builds batches sample by sample, calling __getitem__ once per sample.

Is there any workaround to build batches faster while still using the standard Dataset / DataLoader interface?

Hi @veda101, could you clarify a small detail?

You mentioned that the data fits in memory, so do you read the entire dataset in __init__, or do you read it lazily in __getitem__?

I can, for instance, load all the data into memory in __init__ and then access each row in __getitem__, but because __getitem__ fetches rows one by one, it is definitely slower than fetching a slice in NumPy like data[0:batch_size].
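To illustrate the difference being described, here is a minimal sketch (array shapes and sizes are made up for the example): fetching rows one at a time and stacking them, as a per-sample __getitem__ plus the default collate would do, versus taking a single vectorized slice. Both produce the same batch, but the slice is one NumPy call instead of batch_size calls.

```python
import numpy as np

# Toy in-memory dataset: 10,000 samples of 64 features each
data = np.random.rand(10_000, 64).astype(np.float32)
batch_size = 256

# Per-sample access: one fetch per index, then a copy to stack them,
# mimicking what __getitem__ + default collation does
batch_loop = np.stack([data[i] for i in range(batch_size)])

# Vectorized access: one slice for the whole batch
batch_slice = data[0:batch_size]

# Both paths yield the identical batch
assert np.array_equal(batch_loop, batch_slice)
```

The slice avoids the per-row Python-level overhead, which is where the slowdown you describe comes from.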

You could use BatchSampler to pass a whole batch of indices to __getitem__ and build all of the batch's samples in a single call, if that fits your use case.
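A minimal sketch of that approach (the InMemoryDataset class and the toy array are my own illustration): passing a BatchSampler as the sampler makes the DataLoader hand a list of indices to __getitem__, and setting batch_size=None disables the default per-sample collation, so each __getitem__ call can return a whole batch built with one vectorized NumPy index.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class InMemoryDataset(Dataset):
    def __init__(self, array):
        self.data = array  # entire NumPy array kept in memory

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # With a BatchSampler as sampler, idx is a list of indices,
        # so this is a single fancy-indexing call for the whole batch
        return torch.from_numpy(self.data[idx])

data = np.arange(1000, dtype=np.float32).reshape(100, 10)
ds = InMemoryDataset(data)

loader = DataLoader(
    ds,
    sampler=BatchSampler(SequentialSampler(ds), batch_size=32, drop_last=False),
    batch_size=None,  # turn off automatic batching; __getitem__ already returns a batch
)

shapes = [tuple(batch.shape) for batch in loader]
```

With 100 samples and batch_size=32 this yields three batches of 32 and a final batch of 4, each produced by one __getitem__ call instead of 32.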
