In DataLoader, is it possible to control which CPU workers are assigned which indices?

Let’s say I have 5 CPU workers, a dataset of length 100, and a batch size of 8, and that the DataLoader samples 8 indices in [0, 99] for a batch. For example: {1, 2, 33, 55, 61, 62, 77, 78}.

Now I want to be able to control which CPU workers are assigned to fetch the corresponding data. I want CPU 1 to fetch indices in [0, 19], CPU 2 to fetch indices in [20, 39], CPU 3 to fetch indices in [40, 59], CPU 4 to fetch indices in [60, 79], and CPU 5 to fetch indices in [80, 99].

Consequently, CPU 1 should be assigned to fetch {1, 2}, CPU 2 should be assigned to fetch {33}, CPU 3 should be assigned to fetch {55}, CPU 4 should be assigned to fetch {61, 62, 77, 78}, and CPU 5 should not be assigned to fetch any indices.
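The range-based assignment above can be sketched in a few lines (a made-up helper, not PyTorch API; workers are numbered 0–4 here instead of 1–5):

```python
# Hypothetical helper: maps a dataset index to the worker that owns its
# contiguous index range (20 indices per worker for 100 items, 5 workers).
def worker_for_index(idx, num_workers=5, dataset_len=100):
    range_size = dataset_len // num_workers  # 20
    return idx // range_size

batch = [1, 2, 33, 55, 61, 62, 77, 78]
assignment = {}
for idx in batch:
    assignment.setdefault(worker_for_index(idx), []).append(idx)

print(assignment)  # {0: [1, 2], 1: [33], 2: [55], 3: [61, 62, 77, 78]}
```

Worker 4 (CPU 5) does not appear in the dict because no index in the batch falls in [80, 99].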

You can use [get_worker_info()](https://pytorch.org/docs/stable/data.html#torch.utils.data.get_worker_info) together with the worker_init_fn that is passed to the DataLoader to configure each worker to only read a certain fraction of the Dataset. For example, you can modify the Dataset object in each worker, or you can make __getitem__ ignore certain indices based on the worker_id.
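A minimal sketch of the second idea, under my own naming (RangeFilteredDataset is not a PyTorch class). Each worker’s copy of the dataset gets a range assigned in worker_init_fn and returns None for indices outside it. Note that with a map-style dataset the default sampler still hands whole batches to workers, so out-of-range items are dropped rather than rerouted:

```python
from torch.utils.data import Dataset, get_worker_info

class RangeFilteredDataset(Dataset):
    """Sketch: each worker's copy only returns items inside that
    worker's contiguous index range, and None for everything else."""
    def __init__(self, data):
        self.data = data
        self.worker_range = None  # (lo, hi); set per worker in worker_init_fn

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if self.worker_range is not None:
            lo, hi = self.worker_range
            if not (lo <= idx < hi):
                return None  # this index belongs to another worker's range
        return self.data[idx]

def worker_init_fn(worker_id):
    # Runs inside each worker process; info.dataset is this worker's copy.
    info = get_worker_info()
    per_worker = len(info.dataset) // info.num_workers
    info.dataset.worker_range = (worker_id * per_worker,
                                 (worker_id + 1) * per_worker)

# Usage sketch (None entries would need a custom collate_fn to drop them):
# loader = DataLoader(RangeFilteredDataset(list(range(100))),
#                     batch_size=8, num_workers=5,
#                     worker_init_fn=worker_init_fn,
#                     collate_fn=lambda b: [x for x in b if x is not None])
```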

Would those discarded indices still be retrieved by other workers, or not?

Each worker should have an independent sampler that generates all the indices, and you can then decide at each worker what to do with them.
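That pattern fits an IterableDataset: every worker runs the same full index stream and keeps only its own shard. A minimal sketch, with made-up names (ShardedIterable, _iter_shard):

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    """Sketch: every worker generates the same full index stream, but
    yields only the indices that fall in its own contiguous shard."""
    def __init__(self, data, num_shards):
        self.data = data
        self.num_shards = num_shards

    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info is not None else None
        return self._iter_shard(worker_id)

    def _iter_shard(self, worker_id):
        shard_size = len(self.data) // self.num_shards
        for idx in range(len(self.data)):  # same index stream in every worker
            # worker_id is None in single-process loading: keep everything.
            if worker_id is None or idx // shard_size == worker_id:
                yield self.data[idx]
```

With 100 items and 5 shards, the worker with id 1 would yield only indices 20–39; in single-process loading nothing is filtered.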

Okay, that was the solution I thought of.

Hi, it’s a relief that I am not the only one facing this issue. Thank you so much for posting this! However, I have some trouble understanding how you solved it. Can you really have multiple samplers, one per worker? And how do you achieve every worker seeing all the indices of a batch, and not just a subset?

Maybe you could even share some minimal code? I’ve detailed my current problem and thoughts in this new thread.