Does DistributedDataParallel split data at the batch dimension?

The above information is right. @akashs @mrshenli More details below:

Actually, if you want an overall batch size of 64 and use two nodes to process the data, the data are fed in parallel by two data loaders of 32 samples each. So when you configure your training hyperparameters, you should set the batch size to 32 if you know you want to use 2 nodes for training.
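Here is a minimal sketch of the per-process setup I mean (the dummy dataset and numbers are placeholders, not my actual config; it assumes one process per GPU, launched e.g. with torchrun so the process group env vars are set):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# One process per GPU; every process runs this same code.
dist.init_process_group(backend="nccl")

# Dummy dataset standing in for your real training set.
train_dataset = TensorDataset(torch.randn(6400, 8), torch.zeros(6400, dtype=torch.long))

# DistributedSampler gives every process a disjoint ~1/world_size shard of the dataset.
sampler = DistributedSampler(train_dataset)

# batch_size here is PER PROCESS: with 2 processes and batch_size=32,
# the effective global batch per optimizer step is 2 * 32 = 64.
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler,
                    num_workers=4, pin_memory=True)
```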

Here is the explanation. If we print len(data_loader), you will see it clearly.
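Continuing the sketch above, each process can print its own view; with a DistributedSampler, len(loader) is roughly ceil(len(dataset) / world_size / batch_size), not ceil(len(dataset) / batch_size):

```python
rank = dist.get_rank()
world_size = dist.get_world_size()
print(f"rank {rank}/{world_size}: "
      f"samples for this process = {len(sampler)}, "   # ceil(len(dataset) / world_size)
      f"mini-batches per epoch = {len(loader)}")        # ceil(len(sampler) / batch_size)
```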

I use a configured batch size of 256 to train ImageNet. ImageNet has 1,281,167 images in total. Using 256 on one node, that would be 1,281,167 / 256 = 5,004.56, i.e. about 5,005 mini-batches. Here is the log info; it shows 5,025 mini-batches (I use 3 GPUs on this node, so the number is not exactly 5,005):

[2020-09-29 15:03:12,875]-[SpawnProcess-1]-[MainThread]-[INFO]: ['Epoch: [42][  50/5025]', 'Time  8.593 ( 4.986)', 'Data  0.000 ( 1.278)', 'Loss 9.9792e-01 (5.8119e-01)', 'Acc@1  78.82 ( 83.85)', 'Acc@5  90.59 ( 97.07)']

Now, if we use 2 nodes to process the data (each node with the same number of GPUs) and I use the same config (batch size 256) to train ImageNet, you will see that the total number of mini-batches is divided by 2, which means the effective batch size is 2 × 256. Here is the log info; it shows 2,513 mini-batches:

[2020-09-30 16:51:24,623]-[SpawnProcess-2]-[MainThread]-[INFO]: ['Epoch: [45][1150/2513]', 'Time  4.003 ( 4.173)', 'Data  0.078 ( 0.088)', 'Loss 2.8992e-01 (3.7627e-01)', 'Acc@1  92.94 ( 90.02)', 'Acc@5  98.82 ( 98.62)']
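If you want to reproduce these counts yourself, the arithmetic below matches my logs, assuming the per-process batch size is the configured batch size divided by the GPUs per node (as in the official PyTorch ImageNet example) and drop_last=False; treat it as a sketch of my setup, not a general rule:

```python
import math

def batches_per_epoch(num_images, config_batch_size, nodes, gpus_per_node):
    world_size = nodes * gpus_per_node                        # total DDP processes
    per_process_batch = config_batch_size // gpus_per_node    # 256 // 3 = 85
    samples_per_process = math.ceil(num_images / world_size)  # DistributedSampler shard
    return math.ceil(samples_per_process / per_process_batch) # len(data_loader)

print(batches_per_epoch(1_281_167, 256, nodes=1, gpus_per_node=3))  # 5025
print(batches_per_epoch(1_281_167, 256, nodes=2, gpus_per_node=3))  # 2513
```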

Note that you cannot treat the example above as having a batch size of 256 (even though your code is configured with a batch size of 256); the model is actually trained with an effective batch size of 512, because two data loaders of 256 each feed the data in parallel. This is very important because it can affect your model training.
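Put as a one-liner with the numbers from this example:

```python
per_node_batch_size = 256   # the batch size written in your config
num_nodes = 2               # data loaders feeding in parallel
print(per_node_batch_size * num_nodes)   # 512 -> what the model effectively trains with per step
```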

Hope this answers your question!
