Dataloader and workers

Hello guys!

Today I am preparing the Birds and Cars datasets from Stanford University to finetune some pretrained ImageNet models. I am organizing the data into different folders and using torchvision.datasets.ImageFolder, since keeping the entire dataset in RAM would need more than 20 GB.
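Roughly what I have in mind (the folder paths below are just placeholders):

import torchvision
from torchvision import transforms

# ImageFolder expects one subfolder per class, e.g.
# data/cars_train/<class_name>/<image>.jpg (placeholder paths).
dataset = torchvision.datasets.ImageFolder(
    root="data/cars_train",
    transform=transforms.ToTensor(),   # images are read lazily from disk
)
print(len(dataset), len(dataset.classes))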

I want to use several workers to load the data, since reading from disk is slow and I do not want data loading to become a bottleneck (I am pretty sure it will be one anyway, because using several workers will not help much when reading from a single hard disk).

I have been reading tutorials and other posts. I want to know two things:

-first: Does each worker load a whole batch, or does each worker load individual samples from the next batch? I think each worker loads a batch, but I am not very familiar with the multiprocessing package, and after reading the code I cannot be certain about that.

-second: I read this in the FAQ https://pytorch.org/docs/stable/notes/faq.html#dataloader-workers-random-seed but I am not really sure what I should do to avoid getting replicated data.

Thanks.

Currently, each worker loads a whole batch of data.

As long as you don’t get any random numbers in your Dataset, everything should work as expected.
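You can see this with a small toy example (the dataset below is just for illustration): all samples of a given batch are printed by the same worker id.

import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info

class IndexDataset(Dataset):
    # Toy dataset that returns only the index, so we can watch which worker fetches it.
    def __len__(self):
        return 16

    def __getitem__(self, index):
        info = get_worker_info()  # None in the main process, set inside a worker
        worker_id = info.id if info is not None else -1
        print("worker {} loads sample {}".format(worker_id, index))
        return index

loader = DataLoader(IndexDataset(), batch_size=4, num_workers=2)
for batch in loader:
    # All four samples of each batch were fetched by the same worker,
    # i.e. workers assemble whole batches rather than single samples.
    print("batch: ", batch.tolist())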


Thanks.

I intend to use random transformations from the torchvision package. How should I proceed?

You can just create your transformations using torchvision.transforms and pass them to your Dataset as transform.
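For example, something like this (the path is just a placeholder):

from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),   # random augmentation, applied per sample
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder(root="data/cars_train", transform=train_transform)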

Yes, I know that. I mean that I would be using random transformations such as RandomHorizontalFlip. In that case, would I get replicated data?

Your first answer stated: "As long as you don't get any random numbers in your Dataset, everything should work as expected."

Since I would be using random transformations when loading the data, based on your first answer I could have problems with replicated data. How should I fix this?

Sorry for the misleading statement.
Each worker will get its own base seed, so it will be alright.
If you try to sample from another library like np.random, however, you might encounter problems.
Have a look at this code sample:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 1)

    def __getitem__(self, index):
        # With multiple workers, numpy will typically print the same numbers in
        # every worker (its RNG state is copied on fork), while PyTorch reseeds
        # each worker, so the torch numbers differ.
        print("numpy random: ", np.random.randint(0, 10, size=1))
        print("pytorch random: ", torch.randint(0, 10, (1,)))
        x = self.data[index]
        return x

    def __len__(self):
        return len(self.data)


dataset = MyDataset()
loader = DataLoader(dataset, num_workers=2)

for batch_idx, data in enumerate(loader):
    print("batch idx {}".format(batch_idx))

Ok, thanks!

One more thing. I have checked the torchvision code on GitHub and see that the transforms use Python's random module. I understand that, when using multiprocessing, the decision of which samples are drawn by each worker is based on torch.random, and random is only used (for example) to decide whether we rotate a particular image or not.

Moreover, how can one really ensure that we will not be replicating a sample? In (potentially) large datasets, different seeds can end up producing the same number after several calls to random. As an example, consider:

import numpy
numpy.random.seed(1)
numpy.random.randint(1, 100, 10)
array([10, 8, 64, 62, 23, 58, 2, 1, 61, 82])
numpy.random.seed(2)
numpy.random.randint(1, 100, 10)
array([41, 16, 73, 23, 44, 83, 76, 8, 35, 50])

The sampler is responsible for creating the indices.
As long as you don't use a sampler which samples with replacement, the indices won't be repeated.
You can add print("index ", index) to the __getitem__ method and see that each index is unique in the current example.

So the problems we might encounter are related to random modifications applied (inside __getitem__) to the sample at the given index when using a different random library, right?

Thanks for quick replies.

Yes. If you use something like if np.random(...) > 0 as a condition, I would make sure the random number is not always the same across the workers.
