Hey,
I have a big dataset stored across different machines, and I am trying to train models on it with DistributedDataParallel, so each machine trains on its own data. Each machine has 3 GPUs, with one process per GPU. I am trying to use DistributedSampler to distribute batches between the GPUs within each machine, so I set the sampler's world size to the number of GPUs per machine and set the rank in each process accordingly.
With 2 machines, my real world size is 6 and the ranks run from 0 to 5. But for DistributedSampler, I pass 3 and ranks from 0 to 2.
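To make the setup concrete, here is a small sketch of the rank bookkeeping I described above (the names `NUM_MACHINES`, `GPUS_PER_MACHINE`, and `global_rank` are mine, just for illustration, not from my actual code):

```python
# Two machines, three GPUs (one process) each.
NUM_MACHINES = 2
GPUS_PER_MACHINE = 3

# The "real" world size used for init_process_group.
world_size = NUM_MACHINES * GPUS_PER_MACHINE  # 6

def global_rank(machine_id, local_rank):
    """Process rank across all machines: 0..5 in this setup."""
    return machine_id * GPUS_PER_MACHINE + local_rank

# For the DistributedSampler, however, I instead pass
# num_replicas=GPUS_PER_MACHINE (3) and the local rank 0..2,
# since each machine holds its own shard of the data.
print(world_size)                # 6
print(global_rank(1, 2))         # 5, the last process overall
```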
I am not sure whether this is the correct approach. In any case, with this setup I hit an error on this assertion:
assert len(indices) == self.num_samples
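For context, my understanding is that this assertion sits in the sampler's index-splitting logic. Here is a pure-Python sketch of that logic as I understand it (simplified, no shuffling; `partition_indices` is my own name for illustration):

```python
import math

def partition_indices(dataset_len, num_replicas, rank):
    """Sketch of how a distributed sampler splits indices among replicas."""
    # Each replica gets the same number of samples, rounding up.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    indices = list(range(dataset_len))
    # Pad with leading indices so the list divides evenly.
    indices += indices[: total_size - dataset_len]

    # Every rank takes a strided slice of the padded list.
    subset = indices[rank:total_size:num_replicas]
    # This mirrors the assertion that is failing for me.
    assert len(subset) == num_samples
    return subset

# Example: 10 samples split across 3 replicas (one per local GPU).
for r in range(3):
    print(r, partition_indices(10, 3, r))
```

In this sketch the assertion only fails if the rank or replica count doesn't match how the indices were padded, which is why I suspect my mixed global/local world-size settings.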
Thank you