I’m struggling with a bug where the tensors returned by my DataLoader contain garbage when num_workers > 1. I assume this is caused by the underlying dataset using a single file pointer (the whole dataset is one file) that gets copied to the workers when they are forked.
What’s the best way to distribute such a dataset to multiple workers? Is there an easy way of instantiating the dataset separately / cloning the file pointer for each worker, or is there a better approach altogether?
Are you currently passing a worker_init_fn to the DataLoader? It is a function that gets called in each worker process after it starts, so you can use it for per-worker initialization.
import torch

def worker_init_fn(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    # worker_info.dataset is the copy of the dataset object living in this
    # worker process; it is a different object from the one in the main process.
    dataset = worker_info.dataset
    # Re-instantiate the dataset here, or open a fresh file handle on this copy.
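For context, here is a minimal sketch of how that can fit together. It assumes the dataset is backed by a single binary file of fixed-size float32 records; the names SingleFileDataset, RECORD_SIZE, and data.bin are made up for illustration and would need to be adapted to your actual file format:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

RECORD_SIZE = 128  # hypothetical: number of float32 values per sample


class SingleFileDataset(Dataset):
    def __init__(self, path, num_samples):
        self.path = path
        self.num_samples = num_samples
        self.fp = None  # opened lazily, so each worker gets its own handle

    def _ensure_open(self):
        if self.fp is None:
            self.fp = open(self.path, "rb")

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        self._ensure_open()
        self.fp.seek(idx * RECORD_SIZE * 4)  # 4 bytes per float32
        buf = self.fp.read(RECORD_SIZE * 4)
        return torch.from_numpy(np.frombuffer(buf, dtype=np.float32).copy())


def worker_init_fn(worker_id):
    # get_worker_info().dataset is this worker's own copy of the dataset,
    # so clearing the handle here forces each worker to reopen the file itself.
    dataset = torch.utils.data.get_worker_info().dataset
    dataset.fp = None


loader = DataLoader(
    SingleFileDataset("data.bin", num_samples=1000),  # hypothetical file
    batch_size=32,
    num_workers=4,
    worker_init_fn=worker_init_fn,
)
```

Because the handle is only opened on first use, each worker ends up with its own file object even when the workers are forked; the worker_init_fn just makes sure a handle that was already opened in the main process (e.g. by reading a sample before training) isn't reused in the workers.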
These sections in the documentation may be helpful: