How to load images from a large dataset?

Assume I have a dataset of ~1.5 billion JPEG images on an HDD. If I just use the naive solution (`__getitem__` loads each image with torchvision's `default_loader`), one iteration over the dataset takes around 13 hours. What is the best approach for working with a large dataset in PyTorch? Using h5py or a database? Or should I move the data to an SSD, and is that the only good option?



The ImageNet example should give you some ideas.
In your case I would use the built-in DataLoader with enough CPU worker processes to load images fast enough to keep your GPU fed.
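A minimal sketch of that setup, assuming the usual `torch.utils.data` API. The dummy dataset below stands in for a real `torchvision.datasets.ImageFolder` over your JPEGs; the shapes, worker count, and batch size are illustrative, not from the thread:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FakeJpegDataset(Dataset):
    """Stand-in for an ImageFolder over real JPEGs: __getitem__
    simulates decoding one image into a CHW float tensor."""
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # In a real setup this would open and decode a JPEG from disk.
        return torch.rand(3, 224, 224), idx % 10

if __name__ == "__main__":
    loader = DataLoader(
        FakeJpegDataset(64),
        batch_size=16,
        shuffle=True,
        num_workers=2,    # CPU processes decoding images in parallel
        pin_memory=True,  # speeds up host-to-GPU copies
    )
    for images, labels in loader:
        pass  # each `images` batch has shape (16, 3, 224, 224)
```

Increasing `num_workers` until the GPU stops waiting on data is usually the first knob to turn; `pin_memory=True` helps if you then copy batches to the GPU.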

Note that if your CPU is not fast enough, you can take a look at NVIDIA's DALI library to do the JPEG decoding on the GPU, which frees up the CPU quite a lot.


Okay, thanks! But what about LMDB? Could it help me?

If you have a decent CPU, an SSD, and multiple worker processes loading images, reading the files directly should be fast enough.
You can use LMDB as well, but I don't know how much improvement it would give you.