Small sub-datasets using Subset from large ImageFolder dataset

I am trying to build customized subsets for the different phases (training, validation and testing) using the ImageFolder template. Let's say there are 500k images occupying 100GB of space. I build the initial dataset using ImageFolder, and by parsing the dataset.imgs object I can create lists holding all the indices I want in each phase. My questions are the following:
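For reference, a minimal sketch of the setup described above (the root folder and the modulo-based split rule are placeholders, not my actual code):

```python
import torchvision.datasets as datasets

# Build the full dataset once; only file paths and class labels are stored at this point.
full_dataset = datasets.ImageFolder(root="data/images")  # hypothetical root folder

# dataset.imgs is a list of (path, class_index) tuples, so the per-phase index lists
# can be built with any rule on the path or class; a simple modulo split is shown here.
train_indices = [i for i, (path, cls) in enumerate(full_dataset.imgs) if i % 10 < 8]
val_indices = [i for i, (path, cls) in enumerate(full_dataset.imgs) if i % 10 == 8]
test_indices = [i for i, (path, cls) in enumerate(full_dataset.imgs) if i % 10 == 9]
```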

  1. From my understanding, a dataset built this way contains only the file paths and their classes according to the folder structure, so if I make new copies of this dataset object to apply different transformations for each phase and then Subset them, it shouldn't create memory problems, right?
    What about the data loading process in each of those phases: if my subset holds only 20k of the total 500k indices, will a dataloader based on this subset consume more memory than an alternative where I have a custom dataset containing only those 20k addresses of interest?

  2. Is there a simple way of writing a custom Dataset class that, on top of taking a root folder (similar to the way ImageFolder works), also takes the list of indices I generated by parsing the dataset.imgs object? I'm very new to writing custom classes, so if it requires too much manipulation of the existing classes I will probably opt for the Subset solution described above for now, provided it doesn't create memory issues.
    Is there also a simple way of deleting elements from the main dataset object, so I could just remove the 480k samples not needed in a specific phase?

  3. This is somewhat secondary to my problem, but I noticed that Subset accepts any index, even one outside the range of the dataset, and only raises an error once data loading happens. Is this the expected behaviour? The fact that Subset is completely blind to the indices passed to it makes testing the sub-datasets hard. Is there a way to check what's contained in a subset without iterating over it with a dataloader?

  1. ImageFolder stores the image paths, lazily loads the images, transforms them if wanted, and returns the data as well as the target tensors. Subset additionally stores the passed indices, while the underlying dataset still contains all image paths. Since these are just paths, the memory they use is usually negligible compared to the actually loaded images, so copies of the dataset object are cheap (see the first sketch after this list).

  2. You could create a custom Dataset as described here and, for example, use an ImageFolder internally as a class attribute. This way you can pass indices or other arguments to your custom class while still calling the ImageFolder to load the data (see the second sketch after this list).

  3. Yes, I think the indices are deliberately not restricted, in order to allow flexible use cases such as iterating the Dataset multiple times, or filtering out and duplicating indices. You can still inspect and validate a Subset directly without a dataloader (see the third sketch below).
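To illustrate point 1, here is a rough sketch (root folder, transforms, and index lists are placeholders): several ImageFolder instances with different transforms wrapped in Subset only duplicate the path lists, never the image data, and each dataloader only loads the images addressed by its subset's indices.

```python
from torch.utils.data import Subset, DataLoader
import torchvision.datasets as datasets
import torchvision.transforms as transforms

root = "data/images"                      # hypothetical root folder
train_indices = list(range(0, 20_000))    # placeholder: phase index lists built from dataset.imgs
val_indices = list(range(20_000, 25_000))

train_tf = transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()])
val_tf = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()])

# Each ImageFolder instance only stores file paths and labels, so the copies are cheap;
# Subset adds a list of indices on top of that, nothing more.
train_set = Subset(datasets.ImageFolder(root, transform=train_tf), train_indices)
val_set = Subset(datasets.ImageFolder(root, transform=val_tf), val_indices)

# The DataLoader only ever loads the images addressed by the subset indices.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False, num_workers=4)
```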
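For point 2, one possible shape for such a custom Dataset (the class name and arguments are made up for illustration): it keeps an ImageFolder internally and remaps the incoming index through the provided index list.

```python
from torch.utils.data import Dataset
import torchvision.datasets as datasets


class IndexedImageFolder(Dataset):
    """Hypothetical wrapper exposing only the samples at the given indices of an ImageFolder."""

    def __init__(self, root, indices, transform=None):
        self.dataset = datasets.ImageFolder(root, transform=transform)
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        # Remap the subset index to the index in the full ImageFolder,
        # which then lazily loads and transforms the image.
        return self.dataset[self.indices[idx]]


# Usage: pass the root folder and the index list created from dataset.imgs, e.g.
# train_set = IndexedImageFolder("data/images", train_indices, transform=train_tf)
```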
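Regarding point 3, a Subset is indexable itself and exposes the stored indices, so you can validate and inspect it eagerly without a dataloader (a small sketch, reusing placeholder names from the snippets above):

```python
from torch.utils.data import Subset
import torchvision.datasets as datasets

full_dataset = datasets.ImageFolder(root="data/images")  # hypothetical root folder
train_indices = list(range(0, 20_000))                   # placeholder index list

subset = Subset(full_dataset, train_indices)

# Validate the indices up front instead of waiting for a DataLoader worker to fail.
assert all(0 <= i < len(full_dataset) for i in subset.indices), "index out of range"

# Inspect the subset directly: Subset exposes .indices and .dataset, and the
# underlying ImageFolder keeps the (path, class) pairs in .imgs.
print(len(subset))                                         # number of samples in the subset
print(subset.indices[:5])                                  # first few raw indices
print([full_dataset.imgs[i] for i in subset.indices[:5]])  # corresponding (path, class) pairs
img, target = subset[0]                                    # loads a single sample lazily
```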
