DistributedSampler

How does DistributedSampler (together with DDP) split the dataset across different GPUs? I know it splits the dataset into num_gpus chunks and each chunk goes to one of the GPUs. Is each chunk sampled randomly or sequentially?

4 Likes

First, it checks whether the dataset size is divisible by num_replicas. If not, extra samples are added so that every replica ends up with the same number of samples.

If shuffle is turned on, it performs a random permutation of the indices before subsampling.
You should call the set_epoch function at the beginning of each epoch to change the random seed, so that the shuffling order differs across epochs.
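
For example, a typical loop looks roughly like this (a minimal sketch; the tiny TensorDataset and the hard-coded num_replicas/rank are just placeholders for a real distributed setup):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# dummy dataset just for illustration
dataset = TensorDataset(torch.arange(100).float())

# num_replicas/rank are normally taken from the process group; hard-coded here
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # otherwise every epoch reuses the same shuffled order
    for batch in loader:
        pass  # training step would go here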

Then DistributedSampler simply subsamples from the padded (and possibly shuffled) index list:
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py#L68

# subsample
indices = indices[self.rank:self.total_size:self.num_replicas]
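
Putting the three steps together, the index selection is roughly equivalent to the following sketch (a simplified illustration of the logic, not the actual implementation; distributed_indices is a made-up helper name):

import math
import torch

def distributed_indices(dataset_len, num_replicas, rank, epoch, shuffle=True):
    # pad so that the total length is evenly divisible by num_replicas
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    if shuffle:
        g = torch.Generator()
        g.manual_seed(epoch)  # this is what set_epoch feeds into
        indices = torch.randperm(dataset_len, generator=g).tolist()
    else:
        indices = list(range(dataset_len))

    # extra samples wrap around to the beginning of the index list
    indices += indices[: total_size - len(indices)]

    # subsample: every num_replicas-th index, starting at this rank
    return indices[rank:total_size:num_replicas]

For example, with dataset_len=10, num_replicas=4 and shuffle=False, the padded list is [0..9, 0, 1]; rank 0 gets [0, 4, 8], rank 1 gets [1, 5, 9], rank 2 gets [2, 6, 0] and rank 3 gets [3, 7, 1], so ranks 2 and 3 see samples 0 and 1 a second time.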

Note that the extra (duplicated) samples can distort results at evaluation time.
I personally use a custom sampler (DistributedEvalSampler) when testing my models.
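
The main idea of such an evaluation sampler is to skip the padding step, so nothing is duplicated even if the ranks end up with slightly different numbers of samples. A minimal sketch of that idea (not the actual DistributedEvalSampler code; NoPadDistributedSampler is a made-up name):

from torch.utils.data import Sampler

class NoPadDistributedSampler(Sampler):
    """Splits indices across ranks without adding any extra samples."""

    def __init__(self, dataset, num_replicas, rank):
        # every num_replicas-th index, starting at this rank, no padding
        self.indices = list(range(len(dataset)))[rank::num_replicas]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

Because ranks can end up with a different number of batches, this is only safe when no collective operation expects every rank to step the same number of times, which is exactly why the default sampler pads in the first place.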

19 Likes

Thank you so much for your answer. It's all clear to me now.

1 Like

I understand that DistributedSampler assigns a chunk of the dataset to each GPU. However, when using DDP, is the entire Dataset loaded on each of the N GPUs, i.e. N times in total? Is that how it works?

1 Like

It depends on how you write your dataset.
In most cases, you keep a list of file paths and the __getitem__ function loads the actual data into memory on demand.
But yes, each process holds the full list of paths.
If you want each process to load only its part of the list, you should write your own custom dataset, sampler, etc., as in the sketch below.
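
For instance, the dataset itself could keep only its own shard of the file list. This is just a rough sketch under that assumption; ShardedFileDataset and load_sample are made-up names:

from torch.utils.data import Dataset

class ShardedFileDataset(Dataset):
    """Keeps only this rank's slice of the file list in memory."""

    def __init__(self, all_paths, num_replicas, rank):
        # each process stores only every num_replicas-th path
        self.paths = all_paths[rank::num_replicas]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # load_sample is a placeholder for whatever reads one file from disk
        return load_sample(self.paths[idx])

In that case you would use a plain sequential or random sampler per process, since the split across processes has already happened inside the dataset.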

1 Like