DistributedSampler

First, it checks whether the dataset size is divisible by `num_replicas` (the number of processes). If not, extra samples, duplicated from the beginning of the index list, are appended so that every replica receives the same number of samples.
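The padding step can be sketched in plain Python. This mirrors the idea rather than PyTorch's exact code; the function name `pad_indices` is my own:

```python
import math

def pad_indices(indices, num_replicas):
    # Illustration of DistributedSampler-style padding (names are mine):
    # make the index list evenly divisible by num_replicas by appending
    # duplicates taken from the front of the list.
    num_samples = math.ceil(len(indices) / num_replicas)
    total_size = num_samples * num_replicas
    padded = indices + indices[: total_size - len(indices)]
    return padded, total_size

# 10 samples across 3 replicas -> total_size 12, two duplicates appended
padded, total = pad_indices(list(range(10)), 3)
```

If the dataset size is already divisible, the slice is empty and nothing is appended.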

If `shuffle` is enabled, the sampler randomly permutes the indices before subsampling.
Call `set_epoch(epoch)` at the start of each epoch to change the seed of that permutation; otherwise every epoch uses the same ordering.
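A minimal sketch of why `set_epoch` matters: the sampler derives its shuffling seed from the seed plus the epoch number, so every rank sees the same permutation within an epoch, and a new epoch yields a new one. This is a pure-Python imitation (`epoch_permutation` is my own name, not PyTorch API):

```python
import random

def epoch_permutation(n, epoch, base_seed=0):
    # Imitates DistributedSampler's seeding: generator seeded with
    # base_seed + epoch, then a full permutation of the indices.
    g = random.Random(base_seed + epoch)
    indices = list(range(n))
    g.shuffle(indices)
    return indices
```

In a real training loop you would call `sampler.set_epoch(epoch)` once per epoch, before iterating over the DataLoader.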

Then DistributedSampler subsamples: each process keeps only its own strided slice of the full index list.
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py#L68

```python
# subsample
indices = indices[self.rank:self.total_size:self.num_replicas]
```
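Putting it together: each rank takes every `num_replicas`-th index starting at its own rank, so the shards are disjoint and together cover the padded index list exactly. A small demo with made-up numbers:

```python
# Strided subsampling with total_size=12 and 3 replicas.
total_size = 12
num_replicas = 3
indices = list(range(total_size))

shards = [indices[rank:total_size:num_replicas] for rank in range(num_replicas)]
# rank 0 -> [0, 3, 6, 9], rank 1 -> [1, 4, 7, 10], rank 2 -> [2, 5, 8, 11]
```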

Note that the added samples can cause problems at evaluation time, since metrics end up being computed over duplicated data.
I personally use a custom sampler (DistributedEvalSampler) when testing my models.
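The core idea of such an eval sampler is to skip the padding and split the dataset unevenly, so no sample is ever duplicated. A sketch under my own naming (`eval_shard`), not the exact DistributedEvalSampler code:

```python
def eval_shard(dataset_len, rank, num_replicas):
    # No padding: shard sizes may differ by at most one, but the union
    # of all shards is exactly the dataset, with no duplicates.
    return list(range(dataset_len))[rank::num_replicas]

# With 10 samples and 3 replicas: rank 0 gets 4 samples, ranks 1-2 get 3.
```

Because ranks may process different numbers of samples, the per-rank results then need to be gathered and reduced across processes before computing final metrics.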
