How to speed up the data loader

I’m not sure if your problem is the same as mine, but I had trouble reading very big images (3k by 3k).
If this is your case, here is my advice:

  1. Can you divide the images into smaller sub-regions, e.g. 1k by 1k? If so, just crop the images into smaller pieces; reading should then be much faster.
  2. You can use jpeg4py, a library dedicated to decoding big JPEG files much faster than PIL. Just read the image with this library, then convert it to a PIL image.
  3. The fastest option I have found is using jpeg4py together with OpenCV data augmentation (so no PIL images at all). I used the OpenCV technique from this pull request.
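
Step 1 above amounts to computing crop boxes that tile the big image. A minimal sketch (the `tile_boxes` helper and the sizes are illustrative, not from the original post; with PIL you would then call `img.crop(box)` for each box):

```python
def tile_boxes(width, height, tile=1000):
    """Return (left, upper, right, lower) crop boxes covering the image.

    Edge tiles are clipped so they never extend past the image border.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# A 3000x3000 image split into 1000x1000 tiles gives 9 crops.
print(len(tile_boxes(3000, 3000)))  # 9
```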

After reading some of the code, I found that it does not work like Caffe, which prefetches the next batch of data while the GPUs are working. I found a blog post that tries to do this. I will try it.


After reading the blog post, I realized I was mistaken: the dataloader does try to prefetch the next batch. But I find that it can NOT make full use of the CPUs (it shows that it only uses about 60% of CPU capacity). The blog shows that data preprocessing takes less than 18% of the time. Actually, if it fully achieved its goal, made full use of the CPUs, and the disk were fast enough, it should be near 0%, not 18%.
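
The prefetching idea being discussed, overlapping data preparation with the GPU's compute, can be sketched with a background thread and a bounded queue. This is a generic stand-in, not the blog's code; wrapping any iterable (e.g. a DataLoader) this way lets the next batch be prepared while the consumer works on the current one:

```python
import queue
import threading

class Prefetcher:
    """Wrap an iterable so the next item is produced in a background
    thread while the consumer (e.g. the GPU step) uses the current one."""
    _DONE = object()  # sentinel marking the end of the stream

    def __init__(self, iterable, buffer_size=2):
        self._q = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._produce, args=(iterable,), daemon=True)
        self._thread.start()

    def _produce(self, iterable):
        for item in iterable:
            self._q.put(item)  # blocks when the buffer is full
        self._q.put(self._DONE)

    def __iter__(self):
        while True:
            item = self._q.get()
            if item is self._DONE:
                return
            yield item

# Usage sketch: for batch in Prefetcher(my_data_loader): ...
print(list(Prefetcher(range(5))))  # [0, 1, 2, 3, 4]
```

Note that a real loading pipeline is usually CPU-bound, so multiple worker processes (as the DataLoader's num_workers already provides) matter more than the thread itself; the queue only hides the latency of the hand-off.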


Have you finally fixed this annoying problem of “KeyError: ‘Unable to open object (bad object header version number)’”?

I have the same issue as you when I set num_workers greater than 2.

Sorry, I haven’t fixed this bug. For now I have written another process that prepares data on the fly while the GPUs are running. Hope that helps you.

Check out a potential solution to HDF5 dataloader concurrency issues:


For anyone reading this, Nvidia DALI is a great solution:

It’s got a simple-to-use PyTorch integration.

I was running into the same problems with the PyTorch dataloader. On ImageNet, I couldn’t seem to get above about 250 images/sec. On a Google Cloud instance with 12 cores & a V100, I could get just over 2000 images/sec with DALI. However, in cases where the dataloader isn’t the bottleneck, I found that using DALI would hurt performance by 5-10%. This makes sense, I think, as you’re using the GPU for some of the decoding & preprocessing.

Edit: DALI also has a CPU-only mode, meaning no GPU performance hit.


@Hou_Qiqi Were you able to speed up your dataloader? Did you try preparing the data while GPUs are running?

Just found this thread a few days ago and implemented NVIDIA DALI to load my data while doing some transfer learning with AlexNet.

On an AWS p2.x8large instance (8 Tesla K80 GPUs), using DALI speeds up my epoch time from 480 s / epoch to 10 s / epoch. No need for any code that explicitly prepares data while the GPUs are running. One important thing to note is that if you’re using a DALI dataloader from an external source (i.e. if you have your image classes grouped by folder), you have to manually reset the dataloader using loader_name.reset() at the end of every training epoch. That’s not how the pytorch dataloader works, so it took me a while to realize that was what was going on here.
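
The reset-per-epoch behavior described above can be illustrated with a hypothetical stand-in iterator (this mimics the pattern, it is not DALI's implementation):

```python
class ResettableLoader:
    """Toy loader that, like a DALI iterator over an external source,
    stays exhausted after one pass and must be reset() before the next
    epoch -- unlike a PyTorch DataLoader, which restarts on each iter()."""
    def __init__(self, data):
        self._data = list(data)
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._data):
            raise StopIteration  # stays exhausted until reset()
        item = self._data[self._pos]
        self._pos += 1
        return item

    def reset(self):
        self._pos = 0

loader = ResettableLoader([1, 2, 3])
epoch1 = list(loader)   # [1, 2, 3]
epoch2 = list(loader)   # [] -- forgot to reset, epoch silently empty
loader.reset()
epoch3 = list(loader)   # [1, 2, 3] again
```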

The only irritating thing I’ve found about DALI is that there is no immediately obvious way (to me, anyway) to convert pixel values from uint8 with a 0-255 range to float with a 0-1 range, which is needed for transfer learning with PyTorch’s pretrained models. Dividing by 255.0 within a pipeline runs into a data type mismatch, the ops.Cast() function only converts to float but doesn’t rescale, and the various flavors of normalize functions don’t allow for it either. The only way I was able to do it was by manually scaling the mean and std values given by PyTorch.

Other than that, agree that the pytorch integration is simple and fairly clean.


Hi, I got the same problem. You can try clearing your cache to solve it, and retry a few times.

Hi, I also ran into this problem. Have you solved it?

I want to train resnet50 on ImageNet, but the data loading is a bottleneck.

Have you tried NVIDIA DALI, e.g. resnet50-by-dali?


I am facing a similar problem. However, I am more interested in knowing how to make my dataloader efficient.
Ideally, I want to read multiple temporal blocks of a video, perform transformations like random crop and rescale, extract I3D features from them, concatenate them into a tensor, and return them with other information like text embeddings for the question and the multiple answers.

For now, I am preprocessing my videos and extracting features beforehand, which does not allow me to use multiple random crops and is not a good approach. The video frames are 60 GB in total. Any tips?
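
The first part of that pipeline — picking several temporal blocks plus a random crop per block — is just index arithmetic, so it can run cheaply inside __getitem__ instead of being precomputed. A minimal sketch (the `sample_blocks` name and all sizes are illustrative):

```python
import random

def sample_blocks(num_frames, block_len, num_blocks, frame_hw, crop, rng=random):
    """Pick `num_blocks` evenly spaced temporal blocks of `block_len`
    frames, each with its own random crop box, from one video."""
    h, w = frame_hw
    ch, cw = crop
    span = num_frames - block_len  # last valid start index
    blocks = []
    for i in range(num_blocks):
        start = span * i // max(num_blocks - 1, 1)  # evenly spaced starts
        top = rng.randrange(h - ch + 1)             # fresh crop per block
        left = rng.randrange(w - cw + 1)
        blocks.append({"frames": range(start, start + block_len),
                       "crop": (top, left, top + ch, left + cw)})
    return blocks

# e.g. 3 blocks of 16 frames from a 300-frame video,
# with 224x224 crops out of 360x480 frames
blocks = sample_blocks(300, 16, 3, frame_hw=(360, 480), crop=(224, 224))
```

Doing the sampling on the fly like this is what makes fresh random crops possible on every epoch, which the precomputed-features approach rules out.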

I am using Ubuntu 16.04, 24 GB RAM, 245 GB of 959 GB disk space free right now, 1 Titan Xp GPU, Python 3.6.4, PyTorch 0.4.1, CUDA 8.0.61, cuDNN 7102.
Dataset is on local disk.


This worked for me. Thanks. HDF5 “file opening has to happen inside of the __getitem__ function of the Dataset wrapper.”
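
The pattern quoted above — defer opening the file until the first access inside __getitem__, so each DataLoader worker process opens its own handle instead of sharing one across fork — looks roughly like this. A generic sketch: `open_fn` stands in for e.g. `lambda: h5py.File(path, "r")`, and a plain list plays the role of the HDF5 file so the example stays self-contained:

```python
class LazyFileDataset:
    """Dataset that opens its backing file lazily in __getitem__,
    NOT in __init__ -- the fix for HDF5 + num_workers > 0 crashes."""
    def __init__(self, open_fn, length):
        self._open_fn = open_fn
        self._length = length
        self._file = None  # deliberately not opened here

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        if self._file is None:  # first access in this worker/process
            self._file = self._open_fn()
        return self._file[idx]

ds = LazyFileDataset(open_fn=lambda: list(range(10, 20)), length=10)
print(ds[0], ds[9])  # 10 19
```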


I wrote prefetching code and confirmed that it improves the performance of the data loader.
My code is based on the implementation here:

However, if you run the program on your local machine, I highly recommend buying an NVMe drive. This investment completely solves the problem of slow image loading.


So, the solution is to employ DALI and then change:
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
to:
normalize = transforms.Normalize(mean=[0.485*255, 0.456*255, 0.406*255], std=[0.229*255, 0.224*255, 0.225*255])
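
The scaled-stats trick works because normalizing raw 0-255 values with mean*255 and std*255 is algebraically identical to first dividing by 255 and normalizing with the original stats: (x - 255m) / (255s) = (x/255 - m) / s. A quick numeric check:

```python
mean, std = 0.485, 0.229  # ImageNet stats for the red channel
x = 200                   # a raw uint8 pixel value

a = (x / 255.0 - mean) / std        # divide-by-255 path
b = (x - mean * 255) / (std * 255)  # scaled-stats path (the DALI workaround)
assert abs(a - b) < 1e-12           # same result up to float rounding
```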


Is DALI helpful in such cases?

If you have any questions or requests, feel free to drop them directly in
Sorry, but we are not able to track all the other forum threads about DALI, while we are doing our best to be responsive on GitHub.

A noticeable speedup with h5py would be seen only when the h5 file is written without the chunked option.


Hi @Hou_Qiqi, I saw you had a similar problem: you want the dataloader to prefetch data while training is ongoing, basically letting GPU training and the CPU dataloader run in parallel.

Here is our code

for fi, batch in enumerate(my_data_loader):

and in our dataloader, we have defined a collate_fn to cook_data


We observed that the GPU seems to block waiting for the dataloader to process. Is there a way to prefetch as you mentioned? If we use a map-style dataset, not an iterable one, does it work?
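
For context on where the work happens: a collate_fn only turns a list of per-sample results into one batch, and with num_workers > 0 it runs in the worker processes, so heavy "cook" work there already overlaps with GPU compute. A minimal pure-Python sketch (no torch; `cook_data` here is a hypothetical per-sample hook, not the poster's actual function):

```python
def cook_data(sample):
    # Hypothetical per-sample preparation done inside a DataLoader worker.
    return {"x": sample * 2, "y": sample % 2}

def my_collate(samples):
    """Turn a list of per-sample dicts into one dict of lists (a batch).
    With torch you would torch.stack the per-key lists here instead."""
    cooked = [cook_data(s) for s in samples]
    return {key: [c[key] for c in cooked] for key in cooked[0]}

batch = my_collate([1, 2, 3])
print(batch)  # {'x': [2, 4, 6], 'y': [1, 0, 1]}
```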

I don’t recommend solution 1, because .bmp is dramatically storage-consuming (80x the original image in my case). And can you explain more about how to use solution 2?

@Hou_Qiqi, can you share a snippet of how you logged the runtime details of the dataloader?