Does DistributedDataParallel split data at the batch dimension?

Let’s consider a batch size of 64, and we want to run on a cluster of two nodes, where each node contains 4 GPUs. How will the batch be split? Will the batch first be split into two, so that each node gets a batch of size 32, and then each node splits its data among the four GPUs, so that each GPU gets a batch of size 8? Is this the way the data will be split in DistributedDataParallel mode?
Thanks in advance for the clarification.


If using the recommended DistributedDataParallel (DDP) mode, where there is a dedicated process for each GPU, DDP does not split the input data. Each process will have its own data loader and its own DDP instance. DDP only helps to automatically compute the globally averaged gradient in the backward pass. DDP runs in this mode when device_ids contains only a single device, or when there is only one visible device. See this for more details.
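To make this concrete, here is a minimal sketch (not from this thread) of the one-process-per-GPU setup described above, assuming it is launched with torchrun so that LOCAL_RANK and the rendezvous variables are set; the toy dataset and model are placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun starts one process per GPU and sets LOCAL_RANK, RANK,
    # WORLD_SIZE, MASTER_ADDR and MASTER_PORT for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Each process builds its *own* data loader; DistributedSampler makes
    # every rank read a different shard of the dataset, so DDP itself never
    # has to split the batch. (Toy random data as a placeholder.)
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    # One device per process: DDP only synchronizes (averages) gradients
    # across all processes during backward().
    model = DDP(torch.nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x.cuda(local_rank)), y.cuda(local_rank))
        loss.backward()  # gradients are all-reduced and averaged here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with something like torchrun --nnodes=2 --nproc_per_node=4 script.py, this gives 8 processes and therefore 8 independent data loaders.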

Thanks for the clarification. Just to confirm my understanding: that means each node will work with a separate data loader, and thus on each node the batch size will be 64, am I right?

In this case, as each node has 4 GPUs, each node will launch 4 processes, with each process creating its own data loader and DDP instance. So each node will actually have 4 data loaders.

If you would like to run a global batch size of 64 across 2 nodes (8 GPUs), then each data loader should load batches of size 64 / 8 = 8.
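A small hedged sketch of that bookkeeping (the helper make_loader and its arguments are illustrative, not an API from this thread):

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler


def make_loader(dataset, global_batch=64):
    # Hypothetical helper: divide the desired *global* batch size evenly
    # across all DDP processes (2 nodes x 4 GPUs -> world_size = 8).
    world_size = dist.get_world_size()               # 8 in this thread
    per_process_batch = global_batch // world_size   # 64 // 8 = 8
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=per_process_batch, sampler=sampler)
```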


Thanks a lot for the explanation 🙂

The above information is right, @akashs @mrshenli. More details below:

Actually, if your batch size is 64 and you use two nodes to process the data, two data loaders of batch size 32 feed the data in parallel. So when you configure your training hyperparameters, you should set the batch size to 32 if you know you want to use 2 nodes for training.

Here is the explanation. If you print len(data_loader), you will see it clearly.

I use a configured batch size of 256 to train ImageNet. ImageNet has 1,281,167 images in total. Using 256 on one node, that would be 1,281,167 / 256 = 5,004.55859375 mini-batches. Here is the log info; it shows 5,025 mini-batches (I use 3 GPUs on this node, so the number is not 5,005):

[2020-09-29 15:03:12,875]-[SpawnProcess-1]-[MainThread]-[INFO]: ['Epoch: [42][  50/5025]', 'Time  8.593 ( 4.986)', 'Data  0.000 ( 1.278)', 'Loss 9.9792e-01 (5.8119e-01)', 'Acc@1  78.82 ( 83.85)', 'Acc@5  90.59 ( 97.07)']

Now, if we use 2 nodes to process the data (each node with the same number of GPUs) and I use the same config (batch size 256) to train ImageNet, you will see the total number of mini-batches is divided by 2, which means the effective batch size is 2 × 256. Here is the log info; it shows 2,513 mini-batches:

[2020-09-30 16:51:24,623]-[SpawnProcess-2]-[MainThread]-[INFO]: ['Epoch: [45][1150/2513]', 'Time  4.003 ( 4.173)', 'Data  0.078 ( 0.088)', 'Loss 2.8992e-01 (3.7627e-01)', 'Acc@1  92.94 ( 90.02)', 'Acc@5  98.82 ( 98.62)']
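For what it’s worth, both mini-batch counts can be reproduced with a short back-of-the-envelope script, under the assumption (not stated above) that the configured 256 is split across the 3 GPUs of a node, i.e. each process uses a per-loader batch size of 256 // 3 = 85, and that both DistributedSampler and the DataLoader keep the last partial batch (drop_last=False):

```python
import math


def batches_per_epoch(dataset_size, world_size, per_loader_batch):
    # DistributedSampler (drop_last=False) gives each replica
    # ceil(dataset_size / world_size) samples ...
    samples_per_replica = math.ceil(dataset_size / world_size)
    # ... and the DataLoader (drop_last=False) turns them into
    # ceil(samples / batch_size) mini-batches.
    return math.ceil(samples_per_replica / per_loader_batch)


IMAGENET = 1_281_167
print(batches_per_epoch(IMAGENET, world_size=3, per_loader_batch=256 // 3))  # 5025 (1-node log)
print(batches_per_epoch(IMAGENET, world_size=6, per_loader_batch=256 // 3))  # 2513 (2-node log)
```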

Note that you cannot treat the example above as a batch size of 256 (even though the code config says 256); the model is actually trained with a batch size of 512, because two data loaders of size 256 feed the data in parallel. This is very important because it can affect your model training.
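In other words (a one-line sanity check of that bookkeeping; the variable names are just illustrative):

```python
# The optimizer effectively sees per-loader batch x number of parallel
# loaders, because every loader contributes a mini-batch to each
# synchronized gradient step.
per_loader_batch = 256
num_parallel_loaders = 2                        # one per node in this example
effective_batch = per_loader_batch * num_parallel_loaders
print(effective_batch)                          # 512
```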

Hope this answers your question!
