Should we split batch_size according to ngpu_per_node when using DistributedDataParallel?

From my understanding of how DDP works:
There are a few mechanisms that make training faster with DDP. One of them is that in a DDP run, your model and optimizer are replicated on each GPU before training starts, and this happens only once. After that, each GPU optimizes your model independently and only exchanges gradients with the other GPUs, which keeps the optimization step identical on every GPU.
DataParallel, on the other hand, has to scatter the input data from the main GPU, replicate your model on each GPU, run the forward pass, gather all outputs back to the main GPU to compute the loss, send the results back to each GPU so they can compute gradients, and finally collect all those gradients on the main GPU to take the optimization step. All of this happens for every batch of input data. In contrast to DDP, that is a lot of extra communication, copying, and synchronizing.
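
To make the difference concrete, here is a minimal sketch of how the two wrappers are set up. The `MyModel` class, the `torchrun` launch, and the `LOCAL_RANK` environment variable are illustrative assumptions, not something from the original question:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical toy model, just for illustration.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

# DataParallel: a single process drives all GPUs; inputs are scattered, the model
# is re-replicated, and outputs/gradients are gathered on the main GPU every batch.
model_dp = nn.DataParallel(MyModel().cuda())

# DDP: launched with e.g. `torchrun --nproc_per_node=8 train.py`, one process per GPU.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# The model is replicated once here; afterwards only gradients are all-reduced.
model_ddp = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])
```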

Regarding the LR question:
If you use batch_size / num_GPUs = 32 / 8 = 4 as the per-GPU batch size in DDP, then you don’t have to change the LR. It should be the same as the one you used in DataParallel with batch_size = 32, because DDP averages the gradients across processes, so the effective batch size your model is trained with is still 32. It is just handled in a different way with DDP.
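
A minimal sketch of that setup, assuming the process group is already initialized as in the snippet above and that 8 processes (one per GPU) were launched; the dataset here is a dummy one for illustration:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical numbers matching the example: 8 GPUs, effective batch size of 32.
global_batch_size = 32
per_gpu_batch_size = global_batch_size // dist.get_world_size()  # 32 // 8 = 4

# Dummy dataset just for illustration.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# DistributedSampler gives each rank a disjoint shard of the data.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

# Each rank sees batches of 4, but gradients are averaged across the 8 ranks
# during backward, so the update matches DataParallel with batch_size = 32
# and the same learning rate.
```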
