Handling a large dataset without loading it into memory

I have a few TB of tiny images and I would like to scan/classify them with a model I have trained. Currently, I load a batch of files into memory, create a DataLoader, run them through the model, and move on to the next batch. Using the DataLoader lets me reuse the same image transforms I used for the model. However, this is a little painful. Is there a better way to run a model over millions of files?

You could use a Dataset to lazily load the images.
Have a look at this tutorial.
Basically, you can pass the image paths to your Dataset and load and transform each sample in __getitem__.
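A minimal sketch of that idea (the class name `ImagePathDataset` and the batch size are just illustrative choices, not from the original thread): the Dataset only stores the file paths, and each image is opened and transformed on demand in `__getitem__`, so only the current batch ever sits in memory.

```python
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ImagePathDataset(Dataset):
    """Lazily loads one image per __getitem__ call."""
    def __init__(self, image_paths, transform=None):
        self.image_paths = image_paths  # list of file paths, not image data
        self.transform = transform      # reuse the transform from training

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Open the file only when this sample is requested
        img = Image.open(self.image_paths[idx]).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img

# Usage (illustrative values): the DataLoader batches samples and its
# worker processes overlap disk I/O with the GPU forward pass.
# dataset = ImagePathDataset(paths, transform=my_transform)
# loader = DataLoader(dataset, batch_size=256, num_workers=8)
```

With `num_workers > 0`, the DataLoader prefetches batches in background processes, so the model rarely waits on disk even for millions of small files.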


That was easy, I got it to work. I keep forgetting how flexible PyTorch is compared to other frameworks. Thanks @ptrblck.