In DataLoader, is it possible to control which CPU workers are assigned which indices?

Let’s say I have 5 CPU workers, a dataset of length 100, and a batch size of 8, and that the DataLoader samples 8 indices in [0, 99] for a batch. For example: {1, 2, 33, 55, 61, 62, 77, 78}.

Now I want to be able to control which CPU workers are assigned to fetch the corresponding data. I want CPU 1 to fetch indices in [0, 19], CPU 2 to fetch indices in [20, 39], CPU 3 to fetch indices in [40, 59], CPU 4 to fetch indices in [60, 79], and CPU 5 to fetch indices in [80, 99].

Consequently, CPU 1 should be assigned to fetch {1, 2}, CPU 2 should be assigned to fetch {33}, CPU 3 should be assigned to fetch {55}, CPU 4 should be assigned to fetch {61, 62, 77, 78}, and CPU 5 should not be assigned to fetch any indices.
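The range-based assignment above can be sketched in a few lines (a made-up helper, not PyTorch API; workers are numbered 0–4 here instead of 1–5):

```python
# Hypothetical helper: maps a dataset index to the worker that owns its
# contiguous index range (20 indices per worker for 100 items, 5 workers).
def worker_for_index(idx, num_workers=5, dataset_len=100):
    range_size = dataset_len // num_workers  # 20
    return idx // range_size

batch = [1, 2, 33, 55, 61, 62, 77, 78]
assignment = {}
for idx in batch:
    assignment.setdefault(worker_for_index(idx), []).append(idx)

print(assignment)  # {0: [1, 2], 1: [33], 2: [55], 3: [61, 62, 77, 78]}
```

Worker 4 (CPU 5) does not appear in the dict because no index in the batch falls in [80, 99].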

You can use [get_worker_info()](https://pytorch.org/docs/stable/data.html#torch.utils.data.get_worker_info) together with the worker_init_fn that is passed to the DataLoader to configure each worker to only read a certain fraction of the Dataset. For example, you can modify the Dataset object in each worker, or you can make __getitem__ ignore certain indices based on the worker_id.
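A minimal sketch of the second idea, under my own naming (RangeFilteredDataset is not a PyTorch class). Each worker’s copy of the dataset gets a range assigned in worker_init_fn and returns None for indices outside it. Note that with a map-style dataset the default sampler still hands whole batches to workers, so out-of-range items are dropped rather than rerouted:

```python
from torch.utils.data import Dataset, get_worker_info

class RangeFilteredDataset(Dataset):
    """Sketch: each worker's copy only returns items inside that
    worker's contiguous index range, and None for everything else."""
    def __init__(self, data):
        self.data = data
        self.worker_range = None  # (lo, hi); set per worker in worker_init_fn

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if self.worker_range is not None:
            lo, hi = self.worker_range
            if not (lo <= idx < hi):
                return None  # this index belongs to another worker's range
        return self.data[idx]

def worker_init_fn(worker_id):
    # Runs inside each worker process; info.dataset is this worker's copy.
    info = get_worker_info()
    per_worker = len(info.dataset) // info.num_workers
    info.dataset.worker_range = (worker_id * per_worker,
                                 (worker_id + 1) * per_worker)

# Usage sketch (None entries would need a custom collate_fn to drop them):
# loader = DataLoader(RangeFilteredDataset(list(range(100))),
#                     batch_size=8, num_workers=5,
#                     worker_init_fn=worker_init_fn,
#                     collate_fn=lambda b: [x for x in b if x is not None])
```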

Would those discarded indices still be retrieved by other workers, or not?

Each worker should have an independent sampler that generates all the indices, and you can then decide at each worker what to do with them.
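That pattern fits an IterableDataset: every worker runs the same full index stream and keeps only its own shard. A minimal sketch, with made-up names (ShardedIterable, _iter_shard):

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    """Sketch: every worker generates the same full index stream, but
    yields only the indices that fall in its own contiguous shard."""
    def __init__(self, data, num_shards):
        self.data = data
        self.num_shards = num_shards

    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info is not None else None
        return self._iter_shard(worker_id)

    def _iter_shard(self, worker_id):
        shard_size = len(self.data) // self.num_shards
        for idx in range(len(self.data)):  # same index stream in every worker
            # worker_id is None in single-process loading: keep everything.
            if worker_id is None or idx // shard_size == worker_id:
                yield self.data[idx]
```

With 100 items and 5 shards, the worker with id 1 would yield only indices 20–39; in single-process loading nothing is filtered.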

Okay, that was the solution I thought of.

Hi, it’s a relief that I am not the only one facing this issue. Thank you so much for posting this! However, I have some trouble understanding how you solved it. Can you really have multiple samplers, one per worker? And how do you achieve every worker seeing all the indices of a batch, and not just a subset?

Maybe you could even share some minimal code? I’ve detailed my current problem and thoughts in this new thread.