Dataloader inside a for loop

Hi,

I was wondering if there is a significant disadvantage to creating a dataloader and dataloader dict inside a for loop.

I am trying something where a random sample of the training data is used for each epoch, as I have millions of examples but not enough resources to train on all of them effectively. Therefore, I want to use a bash script that randomly copies a certain number of images from several directories and use those for training. The same would also apply to validation.

The pseudocode would look something like this:


for epoch in epochs:

    run bash script to copy a random selection of train and validation data for this epoch

    create dataloader dicts and dataloaders

    do training
    do validation

    delete the copied train and validation data

Would this be ok?

Your workflow should generally be alright, but I would be concerned about the performance hit caused by copying the data around. If possible, leave all of the data in its current structure and use a Subset or SubsetRandomSampler to load only the desired samples instead.
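For example, here is a minimal sketch of drawing a fresh random subset each epoch with SubsetRandomSampler; the dataset path, subset size, and loader settings are placeholders:

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler
from torchvision import datasets, transforms

# ImageFolder only indexes file paths here; images are loaded lazily in __getitem__.
dataset = datasets.ImageFolder("path/to/train", transform=transforms.ToTensor())

num_epochs = 10              # placeholder
samples_per_epoch = 10_000   # placeholder: how many samples to use each epoch

for epoch in range(num_epochs):
    # Draw a fresh random set of indices for this epoch.
    indices = torch.randperm(len(dataset))[:samples_per_epoch]
    sampler = SubsetRandomSampler(indices)

    # Recreating the DataLoader is cheap; the underlying dataset is untouched.
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4)

    for images, targets in loader:
        ...  # training step

This way nothing has to be copied to or deleted from disk between epochs.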

When you mention performance, do you mean the time it takes to run the experiment or the accuracy?

My worry is whether this means loading the entire dataset into a dataloader or dict before the random sampling happens. Is this something I should be concerned about, or does the random sampler select random samples without ever loading the entire dataset?

My dataset is also incredibly imbalanced. I want to copy all of the images from one class, but from the other class I would like to select random samples. There is a further level of structure: small patch images are extracted from each whole image, and I want to select a certain number of these patches from every whole image. The directory looks something like this (a rough sketch of the sampling I have in mind follows the tree):


class 1 --- image 1 --- patch 1
                    --- patch 2
                    --- patch 3
                    --- patch ...
        --- image 2 --- patch 1
                    --- patch 2
                    --- patch 3
                    --- patch ...
        --- image ... --- patch 1
                      --- patch 2
                      --- patch 3
                      --- patch ...

class 2 --- image 1 --- patch 1
                    --- patch 2
                    --- patch 3
                    --- patch ...
        --- image 2 --- patch 1
                    --- patch 2
                    --- patch 3
                    --- patch ...
        --- image ... --- patch 1
                      --- patch 2
                      --- patch 3
                      --- patch ...
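A minimal sketch of how this per-class, per-image patch sampling could be expressed with Subset, assuming an ImageFolder-style layout where each patch file lives under class/image/ (the folder names, the class kept in full, and patches_per_image are placeholders):

import os
import random
from collections import defaultdict

from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

dataset = datasets.ImageFolder("path/to/data", transform=transforms.ToTensor())

keep_all_class = dataset.class_to_idx["class_1"]   # placeholder: class kept in full
patches_per_image = 5                              # placeholder: patches per whole image
num_epochs = 10                                    # placeholder

# Keep every patch of one class; group the other class's patch indices by parent image.
kept_indices = []
patches_by_image = defaultdict(list)
for idx, (path, class_idx) in enumerate(dataset.samples):
    if class_idx == keep_all_class:
        kept_indices.append(idx)
    else:
        patches_by_image[os.path.dirname(path)].append(idx)

for epoch in range(num_epochs):
    # Draw a fresh set of patches per whole image of the sampled class every epoch.
    sampled_indices = []
    for patch_indices in patches_by_image.values():
        k = min(patches_per_image, len(patch_indices))
        sampled_indices.extend(random.sample(patch_indices, k))

    epoch_set = Subset(dataset, kept_indices + sampled_indices)
    loader = DataLoader(epoch_set, batch_size=32, shuffle=True, num_workers=4)

    for images, targets in loader:
        ...  # training step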

I meant the speed, as moving a large dataset around might be quite expensive compared to the actual training.

If your Dataset lazily loads each sample in its __getitem__, recreating DataLoaders with different Subsets or samplers should be cheap, as only Dataset.__init__ will be called (which should not load any large amount of data if you are using lazy loading).
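As a minimal sketch of such a lazily loading Dataset (the file listing and label handling are assumptions about your setup):

from PIL import Image
from torch.utils.data import Dataset

class LazyPatchDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        # Only file paths and labels are stored here; no image is read,
        # so recreating this Dataset (or wrapping it in Subsets) stays cheap.
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # The image is read from disk only when the sample is actually requested.
        image = Image.open(self.image_paths[index]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]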