Should we split batch_size according to ngpu_per_node when using DistributedDataParallel?

Assume we have two nodes, node-A and node-B, each with 4 GPUs (i.e. ngpu_per_node=4). We set args.batch_size = 256 on each node, meaning that we want each node to process 256 images per forward pass.

(1) If we use DistributedDataParallel in 1-GPU-per-process mode, should we manually divide the batch size by ngpu_per_node in torch.utils.data.DataLoader, i.e. torch.utils.data.DataLoader(batch_size=args.batch_size // 4) (the way it is done in the pytorch-imagenet-official-example)? My original understanding was that DistributedSampler handles this, because we pass world_size and rank to DistributedSampler. If I am wrong, please point it out, thanks!
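To make my assumption concrete, here is a toy sketch of the strided index partitioning that DistributedSampler performs (partition_indices is an illustrative helper, not the real implementation); note the DataLoader batch_size is then applied per process:

```python
# Illustrative sketch of DistributedSampler-style index partitioning:
# each rank takes a strided slice of the (padded) index list.
def partition_indices(dataset_len, world_size, rank):
    indices = list(range(dataset_len))
    # pad with indices from the beginning so every rank gets the same count
    total_size = ((dataset_len + world_size - 1) // world_size) * world_size
    indices += indices[: total_size - dataset_len]
    # rank r takes every world_size-th index starting at r
    return indices[rank:total_size:world_size]

# 8 processes over a 16-sample dataset: each rank sees 2 samples.
shards = [partition_indices(16, 8, r) for r in range(8)]
print(shards[0])  # [0, 8]
```

The sampler only decides which samples each process sees; it does not divide the batch size for you.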

(2) If dividing the batch size by ngpu_per_node is the correct way, I wonder what happens if we do not do that.

  • Does it mean that on each node, 4*batch_size images are processed per forward pass?

  • Will 4*len(dataset) images be processed in one epoch, or will there be four times fewer forward passes than usual (i.e. the total number of images processed per epoch stays the same)?


You are correct. Each DataLoader instance pairs with a DDP instance. If you do not divide batch_size=256 by 4, then each DDP instance will process 256 images. As your environment has 8 GPUs in total, there will be 8 DDP instances, so one iteration will process 256 * 8 images in total.

However, DDP does divide the gradients by the world_size by default (code). So, when configuring the learning rate, you only need to consider the batch size of a single DDP instance.


Another question: if we do not divide the batch size by 8, will the total number of images processed in one epoch be the same as usual, or eight times as many?

As for the learning rate: if we have 8 GPUs in total, there will be 8 DDP instances. If the batch size in each DDP instance is 64 (divided manually), then one iteration will process 64×4=256 images per node. Taking all GPUs into account (2 nodes, 4 GPUs per node), one iteration will process 64×8=512 images. Assume that in the one-gpu-one-node scenario we set 1×lr when batch size is 64, 4×lr when it is 256, and 8×lr when it is 512 (the common strategy of increasing the learning rate linearly with batch size). Returning to the DDP scenario (2 nodes, 4 GPUs per node): what learning rate shall we use, 1×lr, 4×lr or 8×lr?

The total number of images processed will be 8 times as many, because each DDP instance/process will process batch_size images.

Returning to the DDP scenario (2 nodes, 4 GPUs per node): what learning rate shall we use, 1×lr, 4×lr or 8×lr?

It should be 1× lr, because DDP calculates the average instead of the sum of all local gradients. Let's use some numbers to explain this. Assume every image leads to a torch.ones_like(param) gradient for each parameter.

  • For local training without DDP, if you set batch_size = 64, the gradient for each parameter will then be torch.ones_like(param) * 64.
  • For 8-process DDP training, if you set batch_size = 64, the local gradient for each parameter will also be torch.ones_like(param) * 64. DDP then uses collective communication to calculate the sum of the gradients across all DDP instances, which is torch.ones_like(param) * 64 * 8, and divides that value by 8. So the final gradient in the param.grad field will still be torch.ones_like(param) * 64 (the code actually divides first and then does the global sum). So, when setting lr, you only need to consider the local batch_size.
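The arithmetic above can be sketched without any real distributed setup (a toy with plain Python numbers; no actual torch.distributed calls):

```python
world_size = 8
local_batch = 64

# every image contributes a gradient of 1.0 per parameter, so each local
# gradient (summed over the local batch) is 64.0
local_grads = [64.0 for _ in range(world_size)]

# DDP's all-reduce-then-average: sum across processes, divide by world_size
final_grad = sum(local_grads) / world_size
print(final_grad)  # 64.0
```

The averaged gradient matches the single-process gradient for the same local batch size, which is why only the local batch size matters for the lr.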

According to the discussion in Is average the correct way for the gradient in DistributedDataParallel, I think we should set 8×lr. I will state my reason for a 1-node, 8-GPU, local-batch=64 (images processed by one GPU per iteration) scenario:
(1) Consider a batch of images (batch-size=512). In the DataParallel scenario, a complete forward-backward pipeline is:

  1. the input data is split into 8 slices (each containing 64 images), and each slice is fed to the net to compute outputs

  2. the outputs are concatenated on the master GPU (usually GPU 0) to form a [512, C] output

  3. compute the loss against the ground truth (same dimension: [512, C]): loss = \frac{1}{512} \sum_{i=1}^{512} mse(output[i], groundtruth[i]) (using MSE loss as illustration)

  4. use loss.backward() to compute gradients.

So the final [512, C] outputs are the same as those computed on one GPU, and the learning rate here should be set to 8×lr to match a batch size of 512 in the one-gpu-one-node scenario.

(2) Secondly, when DistributedDataParallel is used, the pipeline is:

  1. the input data is also split into 8 slices

  2. the outputs are computed on each GPU to form a [64, C] output

  3. on each GPU, compute the loss loss = \frac{1}{64} \sum_{i=1}^{64} mse(output[i], groundtruth[i]) and compute the gradients grad_k (k is the GPU index, k = 0, 1, ..., 7); this differs from DataParallel, which needs to collect all outputs on the master GPU

  4. average the gradients across all GPUs: avg_grad = \frac{1}{8} \sum_{k=0}^{7} grad_k

In this way, the averaged gradients are the same as the gradients computed in the one-gpu-one-node scenario, so I think the learning rate here needs to be set to 8×lr to match a batch size of 512 in the one-gpu-one-node scenario.
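The claim that both pipelines reproduce the full-batch gradient can be checked numerically. A toy sketch (scalar model y = w·x with mean-reduced MSE; all names illustrative): averaging the 8 per-slice gradients equals the gradient of one 512-sample batch.

```python
import random

random.seed(0)
xs = [random.random() for _ in range(512)]
ts = [random.random() for _ in range(512)]
w = 0.5

def grad_mse(xs, ts):
    # d/dw of (1/N) * sum((w*x - t)^2) = (2/N) * sum(x * (w*x - t))
    n = len(xs)
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts)) / n

full = grad_mse(xs, ts)  # one GPU, batch 512
slices = [grad_mse(xs[i*64:(i+1)*64], ts[i*64:(i+1)*64]) for i in range(8)]
ddp = sum(slices) / 8    # DDP-style average of 8 local gradients
assert abs(full - ddp) < 1e-12
```

The equality holds because the mean of 8 equal-sized per-slice means is the overall mean; the open question is only which lr to pair with that gradient.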

The main difference between you and me is that when the local batch is set to 64, I think the local gradients will be averaged over the local samples, giving torch.ones_like(param) * 64 / 64, but you think the local gradients will be summed over the local samples, giving torch.ones_like(param) * 64. I think the local gradients will be averaged mainly because PyTorch loss functions such as mse() compute the average loss over all input samples by default, so the gradients computed from such a loss should also be averaged over all input samples.

I do not know if I understand DistributedDataParallel correctly. Please let me know if anything is wrong.


I agree with all your analysis of the gradient magnitude, and I agree that it depends on the loss function. But even with the MSE loss fn, it can lead to different conclusions:

  1. If the fw-bw pass has processed 8X data, we should set the lr to 8X, meaning that the model should take a larger step if it has processed more data, as the gradient is more accurate. (IIUC, this is what you advocate for.)
  2. If the gradient has the same magnitude, we should use 1X lr, especially when approaching convergence. Otherwise, with 8X lr, the model is more likely to overshoot and hurt converged accuracy.

After reading your analysis, I realized that, with the MSE loss fn, the discussion is mostly irrelevant to DDP. The question then becomes: if I increase the batch size by k, how should I adjust the learning rate? That is an open question.


Is it correct that when the local batch size is 64 (i.e. torch.utils.data.DataLoader(batch_size=64) with torch.utils.data.distributed.DistributedSampler()), and there are N processes in DDP in total (distributed over one or more nodes), the forward-backward process is similar to the forward-backward process on 1 GPU / 1 node with a 64×N batch-size input?

For SGD, this paper (https://arxiv.org/abs/1706.02677) suggests: when the minibatch size is multiplied by k, multiply the learning rate by k.

To improve my understanding, @mrshenli, can you please answer?

Suppose we have 32*8 images in the dataset and the batch size is 32. We want to train the model for 1 epoch only. Now consider the following 3 scenarios. Note that we use the same LR and optimizer in all three cases below.

(1) Single Node - Single GPU: In this case, one epoch will require 8 steps to execute i.e. in each step, 32 images will be processed. The gradient will be calculated and applied 8 times. The model parameters will get updated 8 times.

(2) Single Node - Multiple GPU using DataParallel: Suppose we use 8 GPUs. In this case, one epoch will still require 8 steps to execute, i.e. in each step, 32 images will be processed (albeit each GPU will process only 4 images and the gradients will then be summed). The gradient will be calculated and applied 8 times. The model parameters will get updated 8 times.

(3) Single Node - Multiple GPU using DistributedDataParallel: Suppose we use 8 GPUs. In this case, one epoch will require just 1 step to execute, as each step processes 32*8 images. However, this also means that the gradient will be calculated and applied only 1 time, and consequently the model parameters will get updated just 1 time. So, in order to get results similar to the previous two points (#1 and #2 above), we will have to execute 8 epochs instead of 1, as gradients are averaged in the backward function, and hence the effective weight updates in scenario 3 are almost the same as those in scenarios 1 and 2 above.

Is this understanding correct?

I think that points (1) and (2) are correct. And if you want to get the same behaviour in the DDP experiment (3), you shouldn't do 8 epochs with batch size 32.
You just have to pass batch_size/num_GPUs as your per-process batch size, i.e. 32/8 = 4. Then each of the 8 GPUs will process 4 images, i.e. 4 images * 8 GPUs = 32 images per step, giving 8 steps per epoch (just like (1) and (2)).
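A back-of-the-envelope check of the step counts in the three scenarios (steps_per_epoch is an illustrative helper, not an API; dataset of 32*8 = 256 images as above):

```python
dataset = 32 * 8  # 256 images

def steps_per_epoch(per_process_batch, num_processes):
    # each step consumes one batch per process
    images_per_step = per_process_batch * num_processes
    return dataset // images_per_step

print(steps_per_epoch(32, 1))   # (1) single GPU: 8 steps
print(steps_per_epoch(32, 1))   # (2) DataParallel: the DataLoader batch is global, so also 8 steps
print(steps_per_epoch(4, 8))    # (3) DDP with batch_size/num_GPUs = 4: 8 steps
print(steps_per_epoch(32, 8))   # (3) DDP without dividing: only 1 step
```

Dividing the per-process batch size restores the same number of optimizer steps per epoch as the single-GPU run.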

@ekurtic Thanks for the quick response.

I am trying to understand the gain in training time (without affecting accuracy much) from using DistributedDataParallel. As you say, if I use a batch size of 4, then yes, the training will be faster. However, in that case I will be using the LR of batch size 32 for a batch size of 4. Is this okay? And if I divide the LR by 8, then I am effectively slowing down training, and thus not gaining on training time. Am I missing something here?

From my understanding of how DDP works:
There are a few gains that result in better performance (faster training) with DDP. One of them is that in a DDP experiment, your model and optimizer are "replicated" on each GPU before training starts (notice this happens only once). After that, each GPU optimizes your model independently and exchanges gradients with the other GPUs to ensure that the optimization step is the same on every GPU.
DataParallel, on the other hand, has to scatter input data from the main GPU, replicate your model on each GPU, do a forward pass, gather all outputs back on the main GPU to compute the loss, then send the results back to each GPU so they can compute gradients of the model, and in the end collect all those gradients on the main GPU to compute the optimization step. And all of this happens for each batch of input data. As you can see, in contrast to DDP, there is a lot of communication, copying, synchronizing, and so on.

Regarding the LR question:
If you use batch_size/num_GPUs = 32/8 = 4 as your batch size in DDP, then you don’t have to change the LR. It should be the same as the one in DataParallel with batch_size = 32, because the effective batch size that your model is working with is the same: 32. It’s just handled in a different way with DDP.


@ekurtic I agree with your description of the advantages of DDP in terms of communication, copying, syncing, etc.

On the LR part, however, I am not sure. As per the discussion in this thread and a few others, each GPU will individually process 4 images, the loss will be calculated, and during the backward pass the gradients will first be computed locally. Then DDP calculates the average (instead of the sum) of all local gradients. This means that the gradient magnitude (the averaged value) is effectively in the range of that for a batch size of 4, not 32. So I am not sure whether the LR for batch size 4 or that for batch size 32 should be applied.

Loosely speaking, this is why I think we don't need to update the LR if we ensure that the effective batch sizes are the same:


Thank you. This is a very good explanation.


Thanks for this explanation. I wonder about the following sentence in the context of DDP:

We would have to multiply loss value with the number of GPUs only if the loss function had “reduction=sum” and not “reduction=mean” (to cancel out the DDP gradient averaging)

If reduction=sum is used for computing the loss, it shouldn't matter whether:

  • the loss is multiplied by the number of GPUs prior to calling backward(),
  • the gradient is multiplied by the number of GPUs after calling backward().

Do you agree?

Yes, I think I agree with that. As long as your "main loss" is a sum of some terms (either via reduction=sum or because it is composed of a few different losses), you have to multiply it somewhere to cancel out the 1/num_GPUs averaging that comes with DDP. From what I have seen so far, the most common approach is to multiply the loss prior to calling backward() (your first suggestion). However, multiplying afterwards is fine too, as long as you are careful to multiply before optimizer.step() is called. This second approach can be achieved in two ways: either by multiplying the gradients (which could be slow) or by multiplying your learning_rate so that the final weight update has the proper scale.
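A toy sketch of that equivalence (illustrative scalar model with a sum-reduced MSE; not real DDP code): scaling the loss before backward() and scaling the gradient, or the LR, afterwards yield the same weight update.

```python
num_gpus = 8
w, lr = 0.5, 0.01
xs, ts = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0]

def grad_sum_mse(w):
    # d/dw of sum((w*x - t)^2) = sum(2 * x * (w*x - t))
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts))

# DDP averages the (here identical) local gradients over num_gpus
g = grad_sum_mse(w) / num_gpus

# option 1: multiply the loss (hence the gradient) before backward()
update_a = lr * (num_gpus * g)
# option 2: multiply the gradient (or the LR) after backward()
update_b = (lr * num_gpus) * g
assert abs(update_a - update_b) < 1e-15
```

Both options simply move the same scalar factor to a different point in the multiplication chain, so the final step applied to w is identical.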

Good call, I will have to remember that! Thanks for your quick answer.
