Opening same file in dataloader with different num_workers in parallel

In my dataloader I want to return images sampled from a folder with random.sample(list, sample_size). However, if the number of workers is greater than 1, multiple workers might access the same file at the same time. How do I deal with this?
e.g.:


import glob
import random

import torch
from PIL import Image
from torchvision import transforms


class Loader(torch.utils.data.Dataset):
    def __init__(self, root_dir, length_sample):
        self.root_dir = root_dir
        self.length_sample = length_sample

    def __getitem__(self, index):
        # Sample a random subset of the files in root_dir
        images = random.sample(glob.glob(self.root_dir + "/*"), self.length_sample)
        result_images = []
        for p in images:
            im = Image.open(p)
            result_images.append(transforms.ToTensor()(im))
        return torch.cat(result_images)

Are you concerned about the potential performance issue or about deadlocking the process?
I would assume PIL takes care of the latter problem.
Do you have any particular idea in mind for handling the first case?
You could try to use the current worker id in your __getitem__ method, e.g. to reset the random seed based on the worker id and the current index.


import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 1)

    def __getitem__(self, index):
        # get_worker_info() returns None in the main process,
        # so .id is only available when num_workers > 0
        print(torch.utils.data.get_worker_info().id)
        x = self.data[index]
        return x

    def __len__(self):
        return len(self.data)


dataset = MyDataset()
loader = DataLoader(
    dataset,
    num_workers=2
)

next(iter(loader))
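
Putting the two pieces together, here is a minimal sketch of how the worker id could be mixed into the sampling seed inside __getitem__. The class name SeededLoader, the num_items length, the seeding scheme, and the example path are illustrative assumptions, not the only way to do this:

import glob
import random

import torch
from PIL import Image
from torch.utils.data import DataLoader
from torchvision import transforms


class SeededLoader(torch.utils.data.Dataset):
    def __init__(self, root_dir, length_sample, num_items=100):
        self.root_dir = root_dir
        self.length_sample = length_sample
        self.num_items = num_items  # hypothetical fixed dataset length

    def __getitem__(self, index):
        worker_info = torch.utils.data.get_worker_info()
        worker_id = worker_info.id if worker_info is not None else 0
        # Derive a per-worker, per-index seed so different workers
        # (and different indices) draw different file subsets.
        rng = random.Random(worker_id * self.num_items + index)
        paths = rng.sample(glob.glob(self.root_dir + "/*"), self.length_sample)
        result_images = [transforms.ToTensor()(Image.open(p)) for p in paths]
        return torch.cat(result_images)

    def __len__(self):
        return self.num_items


# Usage: the two workers now sample with different seeds
loader = DataLoader(SeededLoader("path/to/images", length_sample=4), num_workers=2)

Using a local random.Random instance keeps the per-item seeding from disturbing the global random state shared by the rest of the worker process.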