Does setting num_workers > 0 implement sharding for map-style datasets?


I have a question regarding PyTorch DataLoaders. I know that when we set num_workers > 0 in the DataLoader, it creates multiple processes, not threads, so there is no shared memory among them. Each process is passed the dataset object, collate_fn, and worker_init_fn (according to the docs here).

Now, suppose my dataset object has an attribute that holds all the data and takes n amount of RAM. Will my program take n * num_workers amount of memory, or will the DataLoader somehow shard the data so that each worker has access to only a subset and the total memory across all workers stays at n?

No sharding happens for map-style datasets: your DataLoader would create a copy of the dataset in each worker process and use n * num_workers amount of memory.
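You can see this per-process copying with plain multiprocessing (the same mechanism DataLoader workers use). This is a minimal sketch, not DataLoader code: the module-level `data` list and the `worker` function stand in for your dataset attribute and a worker process. Mutating the data inside the child does not affect the parent, showing each process holds its own copy:

```python
import multiprocessing as mp

# Stands in for a dataset attribute that holds all samples in RAM.
data = list(range(1000))

def worker(conn):
    # The child process has its own copy of `data`; mutating it
    # here is invisible to the parent process.
    data[0] = -1
    conn.send(data[0])
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # the child saw its mutation: -1
    p.join()
    print(data[0])             # the parent's copy is untouched: 0
```

One caveat: on Linux, workers are forked, so pages are shared copy-on-write at first; memory only balloons toward n * num_workers as Python touches objects (e.g., refcount updates). Loading samples lazily in `__getitem__` instead of holding everything in an attribute avoids the multiplication entirely.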