After reading the code for DistributedDataParallel, my understanding is that when each process is assigned only one GPU, i.e. self.device_ids=[i], it will not automatically pick a subset of the batch but will instead use the whole input given to it. Am I supposed to also use DistributedSampler here? This is not made clear in the documentation of DistributedDataParallel, and one would assume that just plugging in DistributedDataParallel would be enough.
It is okay not to use DistributedSampler. However, without it, you have no guarantee that every example is sampled exactly once per epoch: some examples may be sampled multiple times and others never.
But I’ve checked and the batch size is not reduced. If I set the batch size to 256, I would expect each GPU to see 64, but instead I see 256. And from reading the code, it seems that all processes will get exactly the same input, especially if you fix the seed at the beginning with torch.manual_seed (which you should do, otherwise the models don’t start from the same weights).
As far as I understand, you need to give the reduced batch size to DataLoader. For example, if the total batch size is 256 and there are 4 GPUs, then the batch size to pass is 64. Here is the example of code: link
With DistributedSampler, each process has a different rank, so the sampled inputs will be different. In detail, the starting point of sampling is shifted according to rank, so the subsampled indices won’t overlap. (here is the line of the code)
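The rank-shifted sampling described above can be sketched in plain Python. This is only an illustration of the idea, not the library's code: the helper name shard_indices and its parameters are hypothetical, and the shuffling here uses Python's random module rather than PyTorch's generator.

```python
import random


def shard_indices(dataset_len, num_replicas, rank, seed=0, epoch=0):
    """Return the indices one rank would read in one epoch (hypothetical helper)."""
    # Every rank builds the same permutation because the seed is shared.
    indices = list(range(dataset_len))
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so the list divides evenly among ranks.
    per_rank = -(-dataset_len // num_replicas)  # ceiling division
    total = per_rank * num_replicas
    indices += indices[: total - dataset_len]
    # Each rank starts at its own offset and strides by the number of ranks.
    return indices[rank:total:num_replicas]


# 16 examples split across 4 ranks: 4 disjoint indices per rank.
shards = [shard_indices(16, num_replicas=4, rank=r) for r in range(4)]
for r, shard in enumerate(shards):
    print(r, shard)
```

Because the permutation is identical on every rank and only the starting offset differs, the shards cannot overlap when the dataset size is divisible by the number of ranks. When it is not, the padding step repeats a few indices so that every rank gets the same number of examples.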
In the example, why divide by ngpus_per_node though? Shouldn’t we divide by the world size?
And if we make the batch size smaller, then how are we consuming the data faster? If we iterate over the DataLoader, it will only use 64 items at a time.
Sorry, the link shown above was irrelevant. As the explanation of batch_size says in link, the batch size of all GPUs on the current node should be specified, so the batch_size given here should be 64 in a 4-GPU environment for a total batch size of 256 per iteration.
I may not have fully understood your second question.
Assuming 1 GPU per node and a 4-GPU environment, each process consumes len(dataset) / 4 inputs per epoch. Then, if the total batch size is 256, each process handles a batch of 64 per iteration (this 64 is what should be specified as the argument). Even though you specify batch_size as 64, the number of inputs processed in each iteration is 256, and the total number of iterations per epoch is len(dataset) / 256.
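The arithmetic above can be checked with a few lines of Python. The dataset size of 10,240 is an assumed value chosen only to make the numbers divide evenly; the batch size and GPU count are the ones from this thread.

```python
dataset_len = 10_240      # hypothetical dataset size, chosen for even division
world_size = 4            # one process per GPU
total_batch_size = 256    # examples processed per iteration across all GPUs

# Value to pass as batch_size to each process's DataLoader.
per_process_batch = total_batch_size // world_size

# Examples each process reads per epoch when the dataset is sharded by rank.
samples_per_process = dataset_len // world_size

# Iterations each process runs per epoch.
iterations_per_epoch = samples_per_process // per_process_batch

print(per_process_batch)     # 64
print(samples_per_process)   # 2560
print(iterations_per_epoch)  # 40, the same as dataset_len // total_batch_size
```

Each process therefore runs 40 iterations per epoch instead of the 160 a single GPU with batch size 64 would need, which is where the speedup comes from.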
But let’s say we measure the time it takes for an epoch to finish. That means it would take the same time as a single GPU with a batch size of 64, right? Because we are iterating over the dataset 64 samples at a time. It’s just that it would converge faster (in fewer epochs) because the total number of samples seen in a single epoch is actually 4 times the number seen when we are using 1 GPU. Am I correct?
If you use DistributedSampler, the total number of samples seen in a single epoch is len(dataset).
If you don’t use DistributedSampler, the total number of samples seen in a single epoch is actually 4 times that, i.e. 4*len(dataset). However, we should then count this single epoch as 4 epochs.
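A small plain-Python count makes the two cases concrete. This is a sketch under assumptions: the helper samples_seen is hypothetical, the dataset size of 1000 is arbitrary, and shuffling is ignored because it does not change how many times each index is read.

```python
from collections import Counter


def samples_seen(dataset_len, world_size, use_sharding):
    """Count how often each index is read in one pass (hypothetical helper)."""
    seen = Counter()
    for rank in range(world_size):
        if use_sharding:
            # Rank-strided shard: each process reads a disjoint slice.
            indices = range(rank, dataset_len, world_size)
        else:
            # No sharding: every process reads the full dataset.
            indices = range(dataset_len)
        seen.update(indices)
    return seen


with_sampler = samples_seen(1000, world_size=4, use_sharding=True)
without_sampler = samples_seen(1000, world_size=4, use_sharding=False)

print(sum(with_sampler.values()))      # 1000
print(sum(without_sampler.values()))   # 4000
print(set(without_sampler.values()))   # {4}
```

Without sharding, every index is read once by each of the 4 processes, so one pass over the DataLoader amounts to 4 epochs of data.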