Simultaneous reads of the same torchvision.datasets object

Consider the following piece of code, which fetches a training dataset from torchvision.datasets and creates a DataLoader for it.

import torch
from torchvision import datasets, transforms

# ToTensor() so the DataLoader can collate batches (the default collate
# function cannot batch raw PIL images).
training_set_mnist = datasets.MNIST('./mnist_data', train=True, download=True,
                                    transform=transforms.ToTensor())
train_loader_mnist = torch.utils.data.DataLoader(training_set_mnist, batch_size=128,
                                                 shuffle=True)

Assume that several Python processes have access to the folder ./mnist_data and execute the above piece of code simultaneously; in my case, each process runs on a different machine in a cluster, and the dataset is stored in an NFS location accessible to all of them. You may also assume that the data has already been downloaded to this folder, so download=True should have no effect. Moreover, each process may use a different seed, as set by torch.manual_seed().
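
For concreteness, the per-process seeding could look something like the sketch below; the RANK environment variable is just a placeholder (an assumption on my part) standing in for whatever per-process identifier the cluster's job scheduler provides.

import os
import torch

# Hypothetical: derive a distinct seed per process from a scheduler-provided
# rank, so that every machine seeds its RNG differently.
rank = int(os.environ.get("RANK", "0"))
torch.manual_seed(1234 + rank)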

I would like to know whether this scenario is allowed in PyTorch. My main concern is whether the above code can change the data folders or files in ./mnist_data such that, if run by multiple processes, it could lead to unexpected behavior or other issues. Also, given that shuffle=True, I would expect that if two or more processes create the DataLoader, each of them will get a different shuffling of the data, assuming that the seeds are different. Is this true?

Since you are only reading the files, there shouldn't be any interaction between the processes; when the data is already present, download=True just verifies that the files exist and skips the download.
Yes, different seeds should result in a different shuffling.
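
As a quick sanity check, here is a minimal sketch that compares the shuffled index order the loader's sampler would produce under two different seeds; it inspects the sampler directly instead of loading images, so it works even without a transform.

import torch
from torchvision import datasets

dataset = datasets.MNIST('./mnist_data', train=True, download=True)

def first_indices(seed, n=10):
    # Seed the global RNG, then record the order in which the shuffling
    # sampler would visit the dataset.
    torch.manual_seed(seed)
    loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
    return list(loader.sampler)[:n]

print(first_indices(seed=0))
print(first_indices(seed=1))  # almost certainly a different permutation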
