In a case where my data fits in memory as a numpy array, I noticed that batching the data through `__getitem__` from the Dataset interface is much slower than indexing it manually with numpy.
I am fairly sure it's because the DataLoader builds batches sample by sample, calling `__getitem__` once per sample to fetch it.
Is there any workaround to build batches faster while still using the standard Dataset / DataLoader interface?
Hi @veda101, could you clarify a small detail?
You mentioned that the data fits in memory, so: do you read the entire data in `__init__`, or do you read it lazily in `__getitem__`?
I can, for instance, load all the data in memory in `__init__` and then access each row of the dataset in `__getitem__`, but because `__getitem__` fetches rows one by one, it is definitely slower than fetching with a slice in numpy like `data[0:batch_size]`.
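To make the comparison concrete, here is a small numpy-only sketch of the two access patterns being contrasted (array names and sizes are illustrative, not from the original post). Both produce the same batch, but the first mimics the per-sample fetching a default DataLoader does, while the second is a single vectorized slice:

```python
import numpy as np

# Illustrative in-memory dataset: 100k rows of 64 float32 features.
data = np.random.rand(100_000, 64).astype(np.float32)
batch_size = 256

# Pattern 1: per-sample access, as the DataLoader's default collation
# does — one Python-level indexing call per row, then a stack/copy.
rows = [data[i] for i in range(batch_size)]
batch_slow = np.stack(rows)

# Pattern 2: one vectorized slice — a single C-level copy.
batch_fast = data[0:batch_size]

# Both yield the identical batch; only the cost differs.
assert np.array_equal(batch_slow, batch_fast)
print(batch_fast.shape)
```

Timing the two loops (e.g. with `timeit`) shows the per-row version paying Python-interpreter overhead per sample, which is what makes the default `__getitem__` path slow for in-memory arrays.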
You could use a `BatchSampler` to pass a batch of indices to `__getitem__` and create multiple samples in a single call, if that fits your use case.
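A minimal sketch of this pattern, assuming the data is a numpy array loaded in `__init__` (the dataset class name and array contents are illustrative): passing a `BatchSampler` as the `sampler` and setting `batch_size=None` disables the DataLoader's automatic batching, so `__getitem__` receives a whole list of indices and can build the batch with one numpy fancy-indexing call.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class InMemoryDataset(Dataset):
    """Illustrative dataset holding everything as one numpy array."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        # With a BatchSampler as `sampler`, `indices` is a list of ints,
        # so a single fancy-indexing call fetches the whole batch.
        return torch.from_numpy(self.data[indices])

data = np.arange(20, dtype=np.float32).reshape(10, 2)
dataset = InMemoryDataset(data)

# batch_size=None disables automatic batching: the DataLoader forwards
# each list of indices yielded by the BatchSampler straight to __getitem__.
loader = DataLoader(
    dataset,
    sampler=BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False),
    batch_size=None,
)

for batch in loader:
    print(batch.shape)
```

Note that `__getitem__` here no longer handles single integer indices, so this dataset only works with a batch-of-indices sampler; alternatively, the `BatchSampler` can be passed via the `batch_sampler` argument with a custom `collate_fn` if per-sample access must be preserved.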