Should we split batch_size according to ngpu_per_node when using DistributedDataParallel?

According to the discussion in "Is average the correct way for the gradient in DistributedDataParallel", I think we should set the learning rate to 8×lr. I will lay out my reasoning for the 1-node, 8-GPU, local-batch=64 (images processed by one GPU per iteration) scenario:
(1) First, consider a batch of images (batch size = 512). In the DataParallel scenario, a complete forward-backward pipeline is:

  1. the input data are split into 8 slices (each containing 64 images), and each slice is fed to the net to compute outputs

  2. the outputs are concatenated on the master GPU (usually GPU 0) to form a [512, C] output tensor

  3. compute the loss against the ground truth (same dimension: [512, C]): loss = \frac{1}{512} \sum_{i=1}^{512} mse(output[i], groundtruth[i]) (using MSE loss as an illustration)

  4. call loss.backward() to compute the gradients.

So the final [512, C] outputs are the same as if they had been computed on a single GPU, and the learning rate here should be set to 8×lr to match a batch size of 512 in the one-GPU, one-node scenario.
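For concreteness, here is a minimal sketch of that DataParallel pipeline. The ToyNet module, the dimensions, and the random data are placeholders I made up for illustration, and it assumes 8 visible GPUs:

```python
# Minimal DataParallel sketch: one process drives 8 GPUs, the 512-sample batch is
# scattered in slices of 64, the outputs are gathered back as [512, C] on GPU 0,
# and a single mean-reduced loss is backpropagated.
import torch
import torch.nn as nn

class ToyNet(nn.Module):                       # hypothetical toy model
    def __init__(self, in_dim=128, out_dim=10):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.fc(x)

model = nn.DataParallel(ToyNet().cuda(), device_ids=list(range(8)))
criterion = nn.MSELoss()                       # reduction='mean': averages over the whole batch
optimizer = torch.optim.SGD(model.parameters(), lr=8 * 0.01)  # 8x the single-GPU, batch-64 lr

inputs = torch.randn(512, 128).cuda()          # global batch of 512
targets = torch.randn(512, 10).cuda()

outputs = model(inputs)                        # scattered to 8 GPUs, gathered as [512, 10]
loss = criterion(outputs, targets)             # loss = (1/512) * sum_i mse(output[i], target[i])
loss.backward()                                # same gradients as a single-GPU batch of 512
optimizer.step()
```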

(2) Second, when DistributedDataParallel is used, the pipeline is:

  1. the input data are also split into 8 slices (64 images per GPU)

  2. the outputs are computed on each GPU, forming a [64, C] output tensor per GPU

  3. on each GPU, compute the loss loss_k = \frac{1}{64} \sum_{i=1}^{64} mse(output[i], groundtruth[i]) and compute the gradients grad_k (k is the GPU index, k = 0, 1, ..., 7); this differs from DataParallel, which needs to collect all outputs on the master GPU

  4. average the gradients across all GPUs: avg\_grad = \frac{1}{8} \sum_{k=0}^{7} grad_k

In this way, the averaged gradients are also the same as the gradients computed in the one-GPU, one-node scenario, so I think the learning rate here also needs to be set to 8×lr to match a batch size of 512 on one GPU and one node.
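Here is a rough sketch of that DistributedDataParallel pipeline, one process per GPU on a single node. The model, dimensions, and random data are again just placeholders, and it assumes the process group is set up via environment variables (e.g. launched with something like torchrun --nproc_per_node=8):

```python
# Minimal DDP sketch: each rank owns one GPU and a local batch of 64, computes its
# own mean-reduced loss (grad_k), and DDP all-reduces the gradients with an average,
# matching a single-GPU batch of 512.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")    # expects torchrun / env:// variables
    rank = dist.get_rank()
    torch.cuda.set_device(rank)                # single node, so rank == local GPU index

    model = DDP(nn.Linear(128, 10).cuda(rank), device_ids=[rank])
    criterion = nn.MSELoss()                   # mean over the 64 local samples
    optimizer = torch.optim.SGD(model.parameters(), lr=8 * 0.01)  # 8x the batch-64 lr

    inputs = torch.randn(64, 128).cuda(rank)   # local batch: 64 images per GPU
    targets = torch.randn(64, 10).cuda(rank)

    loss = criterion(model(inputs), targets)   # loss_k on this rank
    loss.backward()                            # DDP averages grad_k across the 8 ranks
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```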

The main difference between your view and mine is this: when the local batch is set to 64, I think the local gradients are averaged over the local samples, resulting in torch.ones_like(param) * 64 / 64, whereas you think the local gradients are summed over the local samples, resulting in torch.ones_like(param) * 64. I believe the local gradients are averaged mainly because the loss functions in PyTorch, such as mse(), compute the average loss over all input samples, so the gradients computed from such a loss should also be averaged over all input samples.
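A small sanity check of this point (shapes are arbitrary): with reduction='mean' (the default), the gradient equals the reduction='sum' gradient divided by the total number of output elements (64 × 10 here, since MSELoss averages over elements, not just samples), so per-sample contributions are averaged rather than accumulated:

```python
# Compare gradients from mean-reduced vs sum-reduced MSE on the same local batch of 64.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(128, 10)
x = torch.randn(64, 128)
target = torch.randn(64, 10)

def grad(reduction):
    model.zero_grad()
    loss = nn.MSELoss(reduction=reduction)(model(x), target)
    loss.backward()
    return model.weight.grad.clone()

g_mean, g_sum = grad("mean"), grad("sum")
# 'mean' divides by the element count (64 * 10), 'sum' does not:
print(torch.allclose(g_mean, g_sum / (64 * 10)))   # True
```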

I am not sure whether I understand DistributedDataParallel correctly. Please let me know if anything is wrong.
