I am trying to build customized subsets for the different phases (training, validation and testing) using the ImageFolder template. Let's say there are 500k images occupying 100GB of space. I build the initial dataset with ImageFolder, and by parsing the dataset.imgs object I can create lists of all the indices I want in each phase. My questions are the following:
- From my understanding, a dataset built this way contains only the paths of the files and their classes according to the folder structure, so if I make new copies of this dataset object to apply different transformations for each phase and then Subset them, it shouldn't create memory problems? What about the dataloading process in each of those phases: if my subset is only 20k indices out of the total 500k, will a dataloader based on this subset consume more memory than a custom dataset containing only those 20k paths of interest? (The first sketch after this list shows the setup I have in mind.)
- Is there a simple way of writing a custom Dataset class that, on top of taking a root folder (the way ImageFolder does), also takes the list of indices I generated by parsing the dataset.imgs object? I'm very new to writing custom classes, so if it requires too much manipulation of the existing classes I will probably opt for the Subset solution described above for now, provided it doesn't create memory issues. Relatedly, is there a simple way of deleting elements from the main dataset object, so I can just drop the 480k samples not needed in a given phase? (See the second sketch below.)
- This is somewhat secondary to my problem, but I noticed that Subset accepts any index, even one outside the range of the dataset, and only raises an error once dataloading happens. Is this the expected behaviour? The fact that Subset is completely blind to the indices given to it makes testing the sub-datasets hard. Is there a way to check what a subset contains without iterating over it with a dataloader? (See the last sketch below.)
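
For reference, this is roughly the copy-and-Subset setup I have in mind for the first question (the root path, transforms and index lists are placeholders):

```python
import copy

from torch.utils.data import Subset
from torchvision import transforms
from torchvision.datasets import ImageFolder

base = ImageFolder("/data/images")  # placeholder root folder

# The dataset object only holds (path, class) pairs, so copying it is
# cheap; each copy gets its own transform for its phase.
train_ds = copy.deepcopy(base)
train_ds.transform = transforms.Compose(
    [transforms.RandomHorizontalFlip(), transforms.ToTensor()]
)

val_ds = copy.deepcopy(base)
val_ds.transform = transforms.ToTensor()

train_idx = [0, 1, 2]  # placeholders for the lists parsed from base.imgs
val_idx = [3, 4]

train_set = Subset(train_ds, train_idx)
val_set = Subset(val_ds, val_idx)
```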
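
For the custom class, something like the following is what I imagine; I'm assuming here that it's enough to override self.samples (plus the targets/imgs attributes derived from it), since indexing seems to go through self.samples, but I'm not sure this is the idiomatic way:

```python
from torchvision.datasets import ImageFolder


class FilteredImageFolder(ImageFolder):
    """ImageFolder restricted to the samples at the given indices."""

    def __init__(self, root, indices, transform=None):
        super().__init__(root, transform=transform)
        # Keep only the wanted (path, class) pairs; the other ~480k
        # entries are effectively deleted from this dataset object.
        self.samples = [self.samples[i] for i in indices]
        self.targets = [s[1] for s in self.samples]
        self.imgs = self.samples  # keep the imgs alias consistent


train_set = FilteredImageFolder("/data/images", indices=[0, 1, 2])
print(len(train_set))  # 3
```

Mutating samples/targets/imgs the same way on an already-built ImageFolder would presumably also answer the deletion question.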
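
And for testing a subset, the best I've found so far is to poke at the Subset attributes directly and index it by hand, something like:

```python
from torch.utils.data import Subset
from torchvision.datasets import ImageFolder

base = ImageFolder("/data/images")  # placeholder root folder
indices = [0, 1, 2]                 # placeholder index list

# Subset itself does no bounds checking, so validate up front:
assert all(0 <= i < len(base) for i in indices)

subset = Subset(base, indices)

# The parent dataset and the index list are plain attributes, so the
# contents can be inspected without a DataLoader:
print(len(subset))                                # number of indices
print(subset.indices)                             # the raw index list
print(subset.dataset.samples[subset.indices[0]])  # (path, class) pair

# Indexing directly does the same lookup a DataLoader would, so a bad
# index fails here rather than in the middle of training:
img, target = subset[0]
```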