Simultaneous reads of the same torchvision.datasets object

Consider the following piece of code, which fetches a training dataset from torchvision.datasets and creates a DataLoader for it.

import torch
from torchvision import datasets, transforms

# ToTensor() so the DataLoader can collate batches (the default collate
# function cannot batch raw PIL images).
training_set_mnist = datasets.MNIST('./mnist_data', train=True, download=True,
                                    transform=transforms.ToTensor())
train_loader_mnist = torch.utils.data.DataLoader(training_set_mnist, batch_size=128,
                                                 shuffle=True)

Assume that several Python processes have access to the folder ./mnist_data and execute the above piece of code simultaneously; in my case, each process runs on a different machine in a cluster, and the dataset is stored in an NFS location accessible to all of them. You may also assume that the data has already been downloaded to this folder, so download=True should have no effect. Moreover, each process may use a different seed, as set by torch.manual_seed().
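
For concreteness, the per-process seeding could look something like the sketch below; the RANK environment variable is just a placeholder (an assumption on my part) standing in for whatever per-process identifier the cluster's job scheduler provides.

import os
import torch

# Hypothetical: derive a distinct seed per process from a scheduler-provided
# rank, so that every machine seeds its RNG differently.
rank = int(os.environ.get("RANK", "0"))
torch.manual_seed(1234 + rank)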

I would like to know whether this scenario is allowed in PyTorch. My main concern is whether the above code can change the data folders or files in ./mnist_data such that, if run by multiple processes, it could lead to unexpected behavior or other issues. Also, given that shuffle=True, I would expect that if two or more processes create the DataLoader, each of them will get a different shuffling of the data, assuming that the seeds are different. Is this true?

Since you are only reading the files, there shouldn't be any interaction between the processes; when the data is already present, download=True just verifies that the files exist and skips the download.
Yes, different seeds should result in a different shuffling.
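
As a quick sanity check, here is a minimal sketch that compares the shuffled index order the loader's sampler would produce under two different seeds; it inspects the sampler directly instead of loading images, so it works even without a transform.

import torch
from torchvision import datasets

dataset = datasets.MNIST('./mnist_data', train=True, download=True)

def first_indices(seed, n=10):
    # Seed the global RNG, then record the order in which the shuffling
    # sampler would visit the dataset.
    torch.manual_seed(seed)
    loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
    return list(loader.sampler)[:n]

print(first_indices(seed=0))
print(first_indices(seed=1))  # almost certainly a different permutation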
