How to load images from a large dataset?

Assume I have a dataset of ~1.5 billion JPEG images on an HDD. If I just use the naive solution (`__getitem__` loads each image with `default_loader` from https://github.com/pytorch/vision/blob/master/torchvision/datasets/folder.py), one iteration over the dataset takes around 13 hours. What is the best approach for working with a large dataset in PyTorch? Using h5py or a database? Or should I move the data to an SSD? Is that the only good option?
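
For context, this is roughly the naive approach I mean (a minimal sketch; the directory layout, transform, and batch size are just for illustration):

```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# ImageFolder uses default_loader (PIL) internally, reading each JPEG from disk.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("/data/images", transform=transform)

# Single-process loading: every image is read and decoded in the main process,
# so iteration speed is bound by HDD seek time and CPU JPEG decoding.
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=0)
```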


Hi,

The ImageNet example should give you some ideas.
In your case I would use the built-in DataLoader with enough CPU worker processes to load images fast enough to keep your GPU fed.
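
Something like this (a sketch; the worker count is an assumption to tune against your CPU core count and disk throughput):

```python
from torch.utils.data import DataLoader

# Spread file reads and JPEG decoding across CPU worker processes so the
# GPU is not left waiting on I/O.
loader = DataLoader(
    dataset,            # e.g. the ImageFolder dataset from above
    batch_size=256,
    shuffle=True,
    num_workers=16,     # assumption: adjust to your available CPU cores
    pin_memory=True,    # speeds up host-to-GPU copies
)
```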

Note that if your CPU is not fast enough, you can take a look at NVIDIA's DALI library to do the JPEG decoding on the GPU, which will free up the CPU quite a lot.
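
For illustration, a rough sketch of a DALI pipeline that decodes on the GPU (API names follow newer DALI releases; the paths, batch size, and resize values are assumptions):

```python
from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def jpeg_pipeline(data_dir):
    # Read raw JPEG bytes from disk on the CPU...
    jpegs, labels = fn.readers.file(file_root=data_dir,
                                    random_shuffle=True, name="reader")
    # ...and decode them on the GPU ("mixed" = CPU input, GPU output).
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

# Illustrative values; tune batch_size and num_threads for your hardware.
pipe = jpeg_pipeline("/data/images", batch_size=256, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator(pipe, ["images", "labels"], reader_name="reader")
```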


Okay, thanks! But what about LMDB? Can it help me?

If you have a decent CPU and an SSD, and you use multiple worker processes to load images, reading the files directly should be fast enough.
You can use LMDB as well, but I don't know how much improvement it would give you.
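
If you do want to try it, a minimal sketch of an LMDB-backed dataset (the key scheme, the explicit `length` argument, and the class name are assumptions; you would write the database once up front):

```python
import io
import lmdb
from PIL import Image
from torch.utils.data import Dataset

class LMDBImageDataset(Dataset):
    """Reads raw JPEG bytes stored under keys b"0", b"1", ... (hypothetical scheme)."""

    def __init__(self, lmdb_path, length, transform=None):
        self.lmdb_path = lmdb_path
        self.length = length
        self.transform = transform
        self.env = None  # opened lazily so each DataLoader worker holds its own handle

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if self.env is None:
            self.env = lmdb.open(self.lmdb_path, readonly=True,
                                 lock=False, readahead=False)
        with self.env.begin(write=False) as txn:
            buf = txn.get(str(index).encode())
        img = Image.open(io.BytesIO(buf)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img
```

The main win over plain files on HDD would be replacing many small random reads with lookups in one memory-mapped file, but whether that beats a decent SSD is something you would have to benchmark.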