Should we split batch_size according to ngpu_per_node when using DistributedDataParallel?

Assume we have two nodes, node-A and node-B, each with 4 GPUs (i.e. ngpu_per_node=4). We set args.batch_size = 256 on each node, meaning that we want each node to process 256 images in each forward pass.

(1) If we use DistributedDataParallel in 1-GPU-per-process mode, should we manually divide the batch size by ngpu_per_node in torch.utils.data.DataLoader, i.e. torch.utils.data.DataLoader(batch_size = args.batch_size / 4) (the way used in the pytorch-imagenet-official-example)? My original thought was that DistributedSampler could handle this, because we have already passed world_size and rank to DistributedSampler. If I am wrong, please point it out, thanks!
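For concreteness, here is a minimal sketch of the per-process data-loading setup in question (the dataset, worker count, and launcher are assumptions, not details from the thread). Note that DistributedSampler only shards the dataset indices across ranks; it does not change the batch size, so the division still has to be done by hand:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# One process per GPU; rank/world_size are provided by the launcher (e.g. torchrun).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()

# Stand-in dataset; in practice this would be the real (ImageNet-style) dataset.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# The imagenet-example convention: args.batch_size (256) is the per-node budget,
# divided by ngpu_per_node so that each process loads a per-GPU batch of 64.
ngpu_per_node = 4
per_gpu_batch_size = 256 // ngpu_per_node   # 64

# DistributedSampler shards the *indices* over world_size ranks; it does not
# touch the batch size, which is why the manual division above is still needed.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=per_gpu_batch_size,
                    sampler=sampler, num_workers=4, pin_memory=True)
```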

(2) If dividing the batch size by ngpu_per_node is the correct way, I wonder what will happen if we do not do that.

  • Does it mean that, on each node, 4 * batch_size images are processed per forward pass?

  • Will 4 * len(dataset) images be processed in one epoch, or will the forward passes happen four times less often than usual (i.e. the total number of images processed per epoch stays the same)?


You are correct. Each DataLoader instance pairs with one DDP instance. If you do not divide batch_size = 256 by 4, then each DDP instance will process 256 images. As your environment has 8 GPUs in total, there will be 8 DDP instances, so one iteration will process 256 * 8 images in total.

However, DDP does divide the gradients by the world_size by default (see the code). So, when configuring the learning rate, you only need to consider the batch_size of a single DDP instance.
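As an illustration of that point (a minimal sketch with an assumed model, data, and launcher, not code from the thread): after loss.backward(), each param.grad already holds the gradient averaged over all world_size processes, so the lr is chosen as if training locally with the per-GPU batch of 64.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun, which sets RANK / LOCAL_RANK / WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)  # lr picked for the local batch size

inputs = torch.randn(64, 128, device=f"cuda:{local_rank}")   # local batch of 64 (flattened toy inputs)
labels = torch.randint(0, 10, (64,), device=f"cuda:{local_rank}")

loss = criterion(ddp_model(inputs), labels)
loss.backward()    # DDP all-reduces the gradients and divides them by world_size
optimizer.step()
```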


Another question: if we do not divide the batch size by 8, will the total number of images processed in one epoch be the same as usual, or eight times larger?

As for the learning rate: if we have 8 GPUs in total, there will be 8 DDP instances. If the batch size in each DDP instance is 64 (divided manually), then one iteration will process 64×4 = 256 images per node. Taking all GPUs into account (2 nodes, 4 GPUs per node), one iteration will process 64×8 = 512 images. Assume that in the one-GPU-one-node scenario we set 1×lr when batch-size=64, 4×lr when batch-size=256, and 8×lr when batch-size=512 (the common strategy of increasing the learning rate linearly with the batch size). Going back to the DDP scenario (2 nodes, 4 GPUs per node), which learning rate should we use: 1×lr, 4×lr or 8×lr?

The total number of images processed will be 8 times larger, because each DDP instance/process will process batch_size images.

Going back to the DDP scenario (2 nodes, 4 GPUs per node), which learning rate should we use: 1×lr, 4×lr or 8×lr?

It should be 1×lr, because DDP calculates the average rather than the sum of all local gradients. Let's use some numbers to explain this. Assume every image leads to a torch.ones_like(param) gradient for each parameter.

  • For local training without DDP, if you set batch_size = 64, the gradient for each parameter will then be torch.ones_like(param) * 64.
  • For 8-process DDP training, if you set batch_size = 64, the local gradient for each parameter will also be torch.ones_like(param) * 64. DDP then uses collective communication to compute the sum of the gradients across all DDP instances, which is torch.ones_like(param) * 64 * 8, and divides that value by 8. So the final gradient in the param.grad field will still be torch.ones_like(param) * 64 (the code actually divides first and then does the global sum; see the toy simulation below). So, when setting the lr, you only need to consider the local batch_size.
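A toy, single-process simulation of that arithmetic (no real process group is created; the per-image torch.ones_like(param) gradient is the assumption from the example above):

```python
import torch

world_size = 8     # 2 nodes x 4 GPUs
local_batch = 64

param = torch.zeros(3)  # stand-in parameter

# What each DDP process computes locally (sum-style, ones-per-image gradients).
local_grads = [torch.ones_like(param) * local_batch for _ in range(world_size)]

# DDP's all-reduce: sum the local gradients across processes...
summed = torch.stack(local_grads).sum(dim=0)   # ones * 64 * 8
# ...and divide by world_size (the actual code divides before the global sum).
final_grad = summed / world_size               # back to ones * 64

print(final_grad)  # tensor([64., 64., 64.]) -- same magnitude as single-process training
```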

According to the discussion in Is average the correct way for the gradient in DistributedDataParallel, I think we should set 8×lr. I will state my reasoning for the 1-node, 8-GPU, local-batch=64 (images processed by one GPU per iteration) scenario:
(1) First, consider a batch of images (batch-size=512). In the DataParallel scenario, a complete forward-backward pipeline is:

  1. the input data are split into 8 slices (each containing 64 images), and each slice is fed to the net to compute its output

  2. the outputs are concatenated on the master GPU (usually GPU 0) to form a [512, C] output

  3. the loss is computed against the ground truth (same dimension, [512, C]): loss = \frac{1}{512} \sum_{i=1}^{512} \mathrm{mse}(output[i], groundtruth[i]) (using MSE loss as an illustration)

  4. loss.backward() is called to compute the gradients.

So the final [512, C] outputs are the same as those computed on one GPU, and the learning rate here should therefore be set to 8×lr to match the batch-size-512 one-GPU-one-node scenario.

(2) Second, when DistributedDataParallel is used, the pipeline is:

  1. the input data are also split into 8 slices

  2. the outputs are computed on each GPU, forming a [64, C] output per GPU

  3. on each GPU, the loss loss = \frac{1}{64} \sum_{i=1}^{64} \mathrm{mse}(output[i], groundtruth[i]) and the gradients grad_k (k is the GPU index, k=0,1,...,7) are computed (this differs from DataParallel, which needs to collect all outputs on the master GPU)

  4. the gradients are averaged across all GPUs: avg\_grad = \frac{1}{8} \sum_{k=0}^{7} grad_k

In this way, the averaged gradients are the same as the gradients computed in the one-GPU-one-node scenario, so I think the learning rate here also needs to be set to 8×lr to match the batch-size-512 one-GPU-one-node scenario.
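A quick single-process check of that claim (a sketch with a made-up linear model and random data, not code from the thread): averaging the per-chunk mean-MSE gradients reproduces the full-batch mean-MSE gradient exactly, because all chunks have the same size.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(512, 8)                     # 512 inputs
target = torch.randn(512, 4)                # matching ground truth, C = 4
w = torch.randn(8, 4, requires_grad=True)   # toy "network": a single linear map

# One-GPU / DataParallel view: mean MSE over the full [512, C] output.
full_grad, = torch.autograd.grad(F.mse_loss(x @ w, target), w)

# DDP view: mean MSE per 64-image chunk, then average the 8 per-chunk gradients.
chunk_grads = [torch.autograd.grad(F.mse_loss(xc @ w, tc), w)[0]
               for xc, tc in zip(x.chunk(8), target.chunk(8))]
avg_grad = torch.stack(chunk_grads).mean(dim=0)

print(torch.allclose(full_grad, avg_grad, atol=1e-6))  # True
```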

The main difference between you and me is that when the local batch is set to 64, I think the local gradients will be averaged over the local samples, resulting in torch.ones_like(param) * 64 / 64, whereas you think the local gradients will be summed over the local samples, resulting in torch.ones_like(param) * 64. I think the local gradients will be averaged mainly because the loss functions in PyTorch, like mse_loss(), compute the average loss over all input samples by default, so the gradients computed from such a loss should also be averaged over all input samples.

I do not know whether I understand DistributedDataParallel correctly. Please let me know if anything is wrong.


I agree with all your analysis of the magnitude of the gradients, and I agree that it depends on the loss function. But even with the MSE loss fn, it can lead to different conclusions:

  1. If the fw-bw pass has processed 8X the data, we should set the lr to 8X, meaning that the model should take a larger step when it has processed more data, as the gradient is more accurate. (IIUC, this is what you advocate for.)
  2. If the gradient has the same magnitude, we should use 1X lr, especially when approaching convergence. Otherwise, with 8X lr the model is more likely to overshoot and hurt converged model accuracy.

After reading your analysis, I realized that, with the MSE loss fn, the discussion is mostly independent of DDP. The question then becomes: if I increase the batch size by k, how should I adjust the learning rate? That is an open question. :)

Is it correct that when the local batch size is 64 (i.e. torch.utils.data.DataLoader(batch_size=64) and torch.utils.data.distributed.DistributedSampler() are used), and there are N processes in total in DDP (the N processes distributed over one node or several nodes), the forward-backward process is similar to the forward-backward process on 1 GPU on 1 node with a 64×N batch-size input?

For SGD, the paper https://arxiv.org/abs/1706.02677 suggests: when the minibatch size is multiplied by k, multiply the learning rate by k.
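A minimal sketch of that rule (the base values and warmup length here are illustrative assumptions, not from the thread; the paper pairs the scaling with a gradual warmup):

```python
def scaled_lr(base_lr, base_batch, global_batch, epoch, warmup_epochs=5):
    """Linear scaling rule: multiply the lr by k = global_batch / base_batch,
    reached gradually over the first few epochs (Goyal et al., 2017)."""
    target_lr = base_lr * global_batch / base_batch
    if epoch < warmup_epochs:
        # ramp linearly from base_lr up to target_lr over the warmup epochs
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

# 2 nodes x 4 GPUs x 64 images = 512 global batch -> k = 8 relative to a 64-image baseline.
print(scaled_lr(base_lr=0.1, base_batch=64, global_batch=512, epoch=10))  # 0.8
```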