DistributedSampler

How does the DistributedSampler (together with DDP) split the dataset across different GPUs? I know it splits the dataset into num_gpus chunks and each chunk goes to one of the GPUs. Is the split random or sequential?


First, it checks whether the dataset size is divisible by num_replicas. If not, extra samples (duplicates of existing ones) are appended so that every replica receives the same number of samples.

If shuffle is turned on, it performs a random permutation of the indices before subsampling.
You should call set_epoch at the start of each epoch so the random seed, and therefore the shuffle order, changes across epochs.
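
As a minimal sketch (assuming `dataset` is any map-style Dataset you already have and the default process group is initialized), the typical pattern looks like this:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# num_replicas and rank default to the values of the current process group.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    # Without this call, every epoch reuses the same permutation,
    # because the sampler seeds its generator with (seed + epoch).
    sampler.set_epoch(epoch)
    for batch in loader:
        ...  # forward / backward / optimizer step
```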

Then the DistributedSampler simply subsamples from the whole (shuffled, padded) index list.
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py#L68

# subsample
indices = indices[self.rank:self.total_size:self.num_replicas]
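
To make the mechanics concrete, here is a small self-contained illustration of the same padding and interleaved subsampling, using a hypothetical dataset of 10 samples split across 3 replicas (the variable names just mirror the snippet above):

```python
import math
import torch

dataset_size = 10
num_replicas = 3
epoch = 0

# Pad so the length is divisible by num_replicas (10 -> 12 here).
num_samples = math.ceil(dataset_size / num_replicas)
total_size = num_samples * num_replicas

g = torch.Generator()
g.manual_seed(epoch)  # this is the seed that set_epoch controls
indices = torch.randperm(dataset_size, generator=g).tolist()
indices += indices[: total_size - len(indices)]  # extra samples are duplicates

for rank in range(num_replicas):
    # Same interleaved subsampling as in the PyTorch source.
    print(rank, indices[rank:total_size:num_replicas])
```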

Note that adding extra data could cause problems at evaluation time, because the duplicated samples are counted twice in your metrics.
I personally use a custom sampler (DistributedEvalSampler) when testing my models.
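
If you just need something quick for evaluation, a sampler that skips the padding step can be sketched roughly like this (this is my own simplification, not the actual DistributedEvalSampler code); some ranks simply receive one fewer sample:

```python
import torch.distributed as dist
from torch.utils.data import Sampler

class NoPadDistributedSampler(Sampler):
    """Shards a dataset across ranks without duplicating any samples."""

    def __init__(self, dataset, num_replicas=None, rank=None):
        self.dataset = dataset
        self.num_replicas = num_replicas or dist.get_world_size()
        self.rank = rank if rank is not None else dist.get_rank()

    def __iter__(self):
        indices = list(range(len(self.dataset)))
        # Interleaved split, but without padding -> no duplicated indices.
        return iter(indices[self.rank:len(indices):self.num_replicas])

    def __len__(self):
        # Lower ranks may get one extra sample when the size is not divisible.
        return len(range(self.rank, len(self.dataset), self.num_replicas))
```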


Thank you so much for your answer. I'm now clear about it.
