Distributed datasets on multiple machines


I have a big dataset stored across different machines, and I am trying to use DistributedDataParallel to train models on it, so each machine trains on its own data. Each machine has 3 GPUs, with one process per GPU. I am trying to use DistributedSampler to distribute batches among the GPUs within each machine, so I set world_size to the number of GPUs in each machine and set the rank in each process.
For 2 machines, my real world size is 6 and the global ranks run from 0-5. But for DistributedSampler, I set num_replicas to 3 and the rank from 0-3.
I am not sure if I am doing this the correct way.
Anyway, by doing this I get an error on this assertion:

assert len(indices) == self.num_samples
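To see why that assertion can fail, here is a minimal sketch of how DistributedSampler partitions indices (simplified from the actual torch.utils.data.distributed.DistributedSampler; details may vary between PyTorch versions). Each rank takes every num_replicas-th index offset by its rank, so a rank equal to or larger than num_replicas ends up with a short shard:

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    # each replica should see this many samples
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # pad so the index list divides evenly across replicas
    indices += indices[: total_size - len(indices)]
    # take every num_replicas-th index, offset by this rank
    shard = indices[rank:total_size:num_replicas]
    return shard, num_samples

# valid ranks 0..2 each get exactly num_samples indices
shard, num_samples = shard_indices(10, num_replicas=3, rank=2)
print(len(shard) == num_samples)  # True

# rank == num_replicas starts past the last valid offset, so the
# shard comes up short and the sampler's assertion fails
shard, num_samples = shard_indices(10, num_replicas=3, rank=3)
print(len(shard) == num_samples)  # False
```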

Thank you

I think this might be the issue: for num_replicas=3, the ranks should be 0-2, not 0-3. I was able to reproduce the error you're seeing by doing something like this (which is invalid):

sampler = DistributedSampler(dataset, num_replicas=3, rank=3)

That was it: I was converting the global rank to the local rank incorrectly. Now I just pass the number of replicas and the local rank to the sampler, and it works.
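For reference, the conversion can be sketched as below. This is a hypothetical helper (to_local and gpus_per_machine are my names, not from the thread), assuming an equal number of GPUs on every machine:

```python
def to_local(global_rank, gpus_per_machine=3):
    # which GPU/process on this machine: 0 .. gpus_per_machine-1
    local_rank = global_rank % gpus_per_machine
    # which machine this process lives on
    machine_id = global_rank // gpus_per_machine
    return machine_id, local_rank

# With 2 machines x 3 GPUs, global ranks 0-5 map to local ranks 0-2:
print([to_local(r) for r in range(6)])
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Each process would then construct its sampler from its machine-local dataset with num_replicas=3 and rank=local_rank.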

Thank you