Fast dataloader for numpy arrays in memory?

Let’s say I have a dataset which corresponds to some numpy array “data” already in memory.

If I use a DataLoader on top of this dataset to generate batches of size 1000, it seems the DataLoader will call the method __getitem__ 1000 times and concatenate the individual items to create the batch.

However, if I do it manually, I can directly access data[k * 1000 : (k + 1) * 1000] for the k-th batch, which is faster than getting the items individually.

Is there a way to customize the batch generation of the data loader so that I can optimize the batch generation in the case of a numpy array in memory?


Do you have any transformations in __getitem__ or do you just index a single sample of your data?
In the latter case, you could apply your slicing logic manually so that each call to __getitem__ returns a complete batch. The batch_size of your DataLoader would have to be set to 1 in that case.
However, this would always produce batches of consecutive samples unless you shuffle the data beforehand.
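
A minimal sketch of that idea, assuming the data is a float32 array of shape (N, features). The class name BatchedArrayDataset and the array sizes are made up for illustration. Each index maps to one slice of the array, and the DataLoader only wraps it with batch_size=1:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class BatchedArrayDataset(Dataset):
    """Hypothetical helper: each index returns one whole batch, not one sample."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, rounded up so the last partial batch is kept.
        return (len(self.data) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        # One slice replaces batch_size separate indexing calls.
        return torch.from_numpy(self.data[start:start + self.batch_size])


# Stand-in for the array that is already in memory (assumed shape and dtype).
data = np.random.rand(2500, 16).astype(np.float32)
np.random.shuffle(data)  # shuffle the rows once, in place, before slicing

loader = DataLoader(BatchedArrayDataset(data, batch_size=1000), batch_size=1)

# batch_size=1 adds a leading dimension of size 1, so squeeze it away.
shapes = [tuple(batch.squeeze(0).shape) for batch in loader]
print(shapes)  # [(1000, 16), (1000, 16), (500, 16)]
```

Passing batch_size=None to the DataLoader instead disables automatic batching, so the extra dimension (and the squeeze) should not be needed. Note that shuffling once up front fixes the composition of each batch; reshuffle between epochs if that matters for your use case.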

Another obvious way would be to just work without a DataLoader and create the batches manually.
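
Something along these lines could work for the manual approach. This is only a sketch: iterate_batches is a hypothetical helper name, and the array shape is an assumption. Without shuffling, each batch is a plain slice (a view, no copy); with shuffling, a permutation of the row indices is drawn and each batch is gathered by index:

```python
import numpy as np


def iterate_batches(data, batch_size, shuffle=False, rng=None):
    """Hypothetical helper: yield batches from an array that is already in memory."""
    n = len(data)
    order = None
    if shuffle:
        rng = rng if rng is not None else np.random.default_rng()
        order = rng.permutation(n)
    for start in range(0, n, batch_size):
        if order is None:
            # A plain slice is a view on the original array (no copy).
            yield data[start:start + batch_size]
        else:
            # Indexing with an index array copies the selected rows.
            yield data[order[start:start + batch_size]]


# Stand-in for the in-memory array (assumed shape).
data = np.arange(2500 * 4, dtype=np.float32).reshape(2500, 4)

batches = list(iterate_batches(data, batch_size=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]

shuffled = list(iterate_batches(data, batch_size=1000, shuffle=True))
```

Each batch can then be converted with torch.from_numpy before it is passed to the model.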


Hi @ptrblck, sorry for replying here more than a year later, but this is related to what I am looking for. Please tell me if I should create another topic.
My question is:
Do you have an example of how I can manually generate a batch from multiple images (numpy arrays) already in memory and then apply transformations like ToTensor, Normalize…? This would avoid saving all the images to a folder and then loading them back using ImageFolder.


You could use torchvision.transforms.ToPILImage() to transform the numpy arrays to PIL.Images, apply all image transformations, and convert them to tensors via ToTensor().
The same approach is used in, e.g., the MNIST dataset, since the data is loaded directly into memory.


Thank you @ptrblck. I will try this and get back to you if I have some problem. I was not thinking of using a custom dataset because it is for inference, not training, but indeed it could work.