Sharing file access in DataLoader workers

I’m struggling with a bug where the tensors returned by my DataLoader contain garbage whenever num_workers > 1. I assume the cause is that the underlying dataset reads from a single file pointer (the whole dataset is one file), which gets duplicated when the workers are forked, so the workers end up seeking and reading through the same file handle concurrently.

What’s the best way to distribute such a dataset across multiple workers? Is there some easy way of instantiating the dataset separately, or cloning the file pointer, for each worker? Or is there a better approach altogether?

Thanks in advance.


Are you currently passing a worker_init_fn to DataLoader? It is a function that, if set, is called in each worker process before any data is loaded, so you can use it to set up per-worker state:

import torch

def worker_init_fn(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    # worker_info.dataset is this process's copy of the dataset object;
    # it is a different object from the dataset in the main process.
    dataset = worker_info.dataset
    # Either re-instantiate the dataset here or open a fresh file pointer on it.
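To make that concrete, here is a minimal sketch of the pattern: the file is opened lazily in each process rather than in __init__, so the handle created in the parent is never shared across forked workers. The file name data.bin, the fixed record size, and the SingleFileDataset class are all hypothetical, just for illustration.

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

RECORD_SIZE = 1024  # hypothetical: each sample is 1024 float32 values

class SingleFileDataset(Dataset):
    """All samples live in one binary file of fixed-size records."""

    def __init__(self, path, num_samples):
        self.path = path
        self.num_samples = num_samples
        self.file = None  # opened lazily so each worker gets its own handle

    def open(self):
        # (Re)open the file in the current process.
        self.file = open(self.path, "rb")

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if self.file is None:  # also covers num_workers=0, where no init fn runs
            self.open()
        self.file.seek(idx * RECORD_SIZE * 4)  # 4 bytes per float32
        buf = self.file.read(RECORD_SIZE * 4)
        return torch.from_numpy(np.frombuffer(buf, dtype=np.float32).copy())

def worker_init_fn(worker_id):
    # get_worker_info().dataset is this worker's own copy of the dataset,
    # so opening a file here does not touch the other workers' handles.
    torch.utils.data.get_worker_info().dataset.open()

loader = DataLoader(SingleFileDataset("data.bin", 10_000), batch_size=32,
                    num_workers=4, worker_init_fn=worker_init_fn)

The lazy check in __getitem__ also keeps the dataset usable with num_workers=0; the worker_init_fn just moves the open out of the first __getitem__ call in each worker.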

These sections in the documentation may be helpful:

  1. Multi-process data loading
  2. Example 2 of this section

Thank you, that was exactly what I needed!
