Is averaging the correct way to reduce gradients in DistributedDataParallel with multiple nodes?

When I use DataParallel on one machine with two GPUs and a batch size of 8 (4 on each GPU), I get a satisfactory training result. But if I use DistributedDataParallel on two single-GPU machines with a batch size of 8 (4 on each node), the training result is unsatisfactory and convergence is slower than with DataParallel.

After checking the docs of DataParallel and DistributedDataParallel, I noticed that DataParallel sums the gradients from each GPU, while DistributedDataParallel averages the gradients from each node (each GPU, in my case).

I think this difference is the reason for the different training results.

Is averaging the correct way to reduce gradients in DistributedDataParallel with multiple nodes? Should I modify DistributedDataParallel to sum the gradients from each node so that I can reproduce the same training result in my experiment?

Yes, averaging across processes is the expected behavior here.

Right now this behavior is not configurable.

@GeoffreyChen777 Yes, averaging is the correct way to reduce gradients across nodes. The summing you see in DataParallel is correct as well.

The difference is that DataParallel splits the batch into sub-batches, one per GPU. When each GPU completes its computation, the gradients are reduced (added) onto the master GPU. Think of it this way: (1) this is a master-worker mode rather than true data parallelism, since only the master GPU scatters the batch and gathers the results; (2) we actually want the gradient of the total batch, which is why adding each worker's gradient is the expected behavior. By comparison, DistributedDataParallel is fully parallel across distributed processes. If a single process has more than one GPU, the same scatter/gather master-worker scheme as in DataParallel is used within that process, so gradients are added across the worker GPUs and then averaged across the distributed processes. The bottom line is that the gradient is averaged across data-parallel workers (processes), not across the worker GPUs within a single process.
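
Conceptually, the averaging DDP performs amounts to something like the following manual sketch (an illustration only, assuming a process group is already initialized; it is not DDP's actual bucketed implementation):

```python
import torch.distributed as dist

def average_gradients(model):
    # Sum each parameter's gradient over all processes, then divide by the
    # number of processes, so every replica ends up with the mean gradient.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```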


@teng-li Thank you!

I am reproducing a huge network. The authors train with a batch size of 16 on 4 GPUs, and they use DataParallel. I don't have a 4-GPU machine, so I want to use 2 machines (2 GPUs each) to train the network with a batch size of 16. If averaging is the default operation of DistributedDataParallel, is there no way to reproduce the training process?

@GeoffreyChen777

You can do this in one of three ways:

(1) If you can fit a batch size of 16 on 2 GPUs, do that.
(2) If you cannot, use DistributedDataParallel with two nodes (two processes), where each node (process) has a batch size of 8. Here you should use the base LR for a batch size of 8.
(3) Use four processes across two nodes with DistributedDataParallel (this is the fastest way of doing distributed training): each node runs two processes, and each process wraps DistributedDataParallel around one GPU (the local rank, which is rank % number_of_gpus_per_node; here your rank runs from 0 to 3, since you have four processes across two nodes). But then you have to use the base LR for a batch size of 4. A launch sketch for this setup is shown below.
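
For option (3), a minimal launch sketch might look like this (illustrative only: it assumes the script is started by a launcher such as `python -m torch.distributed.launch` that sets the rendezvous environment variables and passes `--local_rank`, and `MyModel` is a placeholder):

```python
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

# One process per GPU: 4 processes across 2 nodes, ranks 0-3.
dist.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)

model = MyModel().cuda(args.local_rank)           # MyModel is a placeholder
model = DDP(model, device_ids=[args.local_rank])  # each process drives one GPU

# Each process then loads its own sub-batch of 4 (16 in total across the
# 4 processes) and trains as usual; DDP averages gradients across processes.
```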

Hope this clarifies and helps


@teng-li Thank you! If the LR for batch size 16 is 0.01, should the LR for batch size 8 be set to 0.02?

@GeoffreyChen777 No, it should be 0.005

@teng-li Thank you very much!

Thank you for your reply. After reading your answer, I understand DataParallel but am still confused by DistributedDataParallel. In my case, I have one machine with 4 GPUs. According to the PyTorch 1.0 tutorial, DistributedDataParallel can also be used on a single machine. Now I have 4 processes, each with one GPU. So if batch_size is set to 128, does that mean each process (or single GPU) is allocated a batch size of 32? And should hyperparameters like the LR be set for a batch size of 32?

@Lausanne I think you should keep the original learning rate.

If you use DistributedDataParallel, the gradients are averaged across processes; DataParallel sums them, and the two end up equivalent. The reason is that in DataParallel the loss is averaged over the full batch of 128 before being backpropagated through the model, and the sub-batch gradients are then reduced (summed). In DistributedDataParallel, each process averages its loss over its own 32 samples and backpropagates through its own replica, so the per-process gradients need to be averaged across the distributed processes to match.
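
A tiny numeric check of this equivalence (no process group needed; a single linear weight and MSE loss are stand-ins for your model): the gradient of the mean loss over the full batch of 128 equals the average of the four per-shard gradients, each computed from a mean loss over 32 samples.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(128, 10)
y = torch.randn(128, 1)
w = torch.zeros(10, 1, requires_grad=True)

# Gradient of the mean loss over the full batch of 128 (single-process case).
F.mse_loss(x @ w, y, reduction='mean').backward()
full_grad = w.grad.clone()

# Gradient per "process": mean loss over each shard of 32, then average the
# four shard gradients, which is effectively what DistributedDataParallel does.
shard_grads = []
for xs, ys in zip(x.chunk(4), y.chunk(4)):
    w.grad.zero_()
    F.mse_loss(xs @ w, ys, reduction='mean').backward()
    shard_grads.append(w.grad.clone())
avg_grad = torch.stack(shard_grads).mean(dim=0)

print(torch.allclose(full_grad, avg_grad, atol=1e-6))  # True
```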

@GeoffreyChen777
Really, thank you! Briefly speaking, DataParallel first sums and then averages, because each GPU computes part of the 128-sample batch and must send its result to the master GPU to update the parameters. DistributedDataParallel has an independent model and parameters on each GPU, so the loss computed on one GPU is already the average over its batch of 32, and we then average across the different GPUs. That is, in DistributedDataParallel we average within each model and then average across GPUs. If I do not understand correctly, please let me know. Thank you again!

@Lausanne

You are right. I think the final gradient in DataParallel should be equal to the gradient in DistributedDataParallel. :slight_smile:

@GeoffreyChen777
Thank you for your timely reply, best wishes to you!

Hi,

Just to make sure I have understood correctly: if I train on one GPU with batch_size=16 and lr=0.01, what would be the correct lr if I train on two GPUs in torch.distributed mode with the same total batch size of 16 (8 on each GPU)?

@coincheung Your lr in torch.distributed mode should be 0.005

Thanks. Does this mean that in distributed mode the gradients of the different GPUs are summed up rather than averaged, so I should reduce the lr to compensate for the summation?

My guess: I think it depends on how you compute the loss and run backward.
– DataParallel: if you merge the 2 sub-batches at the end, compute a single loss = the average loss over all examples, and then call loss.backward(), then summing the per-GPU gradients is mathematically correct. With 1, 2, or more GPUs, the gradient computed this way should be the same.
– DistributedDataParallel: if you use 2 separate losses, one per GPU, with loss1 = the average over the examples in batch 1 and loss2 = the average over the examples in batch 2, then to simulate loss = the average over all examples = (loss1 + loss2) / 2 you can call loss1.backward() and loss2.backward() and then average the parameter gradients; that is equivalent to loss.backward(). A toy check is below.
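
A quick toy check of that loss1/loss2 argument (a single linear weight and MSE loss with the default reduction='mean' stand in for the per-GPU models):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(16, 10)
y = torch.randn(16, 1)
w = torch.zeros(10, 1, requires_grad=True)

# Single combined loss = average over all 16 examples.
F.mse_loss(x @ w, y).backward()
ref_grad = w.grad.clone()
w.grad.zero_()

# Two per-GPU-style losses: backward both (gradients accumulate), then halve.
(x1, x2), (y1, y2) = x.chunk(2), y.chunk(2)
loss1 = F.mse_loss(x1 @ w, y1)   # average over the first 8 examples
loss2 = F.mse_loss(x2 @ w, y2)   # average over the second 8 examples
loss1.backward()
loss2.backward()
w.grad /= 2                      # average of the two accumulated gradients

print(torch.allclose(ref_grad, w.grad, atol=1e-6))  # True
```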

According to @GeoffreyChen777's answer, I think the learning rate should stay the same, i.e. 0.01.

Why are the gradients averaged across distributed processes?

I guess it's because the vast majority of loss functions in PyTorch average the losses over all samples in the batch by default, i.e. they use reduction='mean'. To get gradients in a DDP experiment that are mathematically equivalent to the ones you'd get from the 1-GPU experiment, you have to average them. If your loss function uses reduction='sum', then you have to multiply the loss value in each GPU process by the world_size to cancel out this averaging.
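
For example (a sketch under the assumption that the same criterion is used in both the 1-GPU and the DDP run; `ddp_loss`, `output`, and `target` are illustrative names):

```python
import torch.distributed as dist
import torch.nn.functional as F

def ddp_loss(output, target):
    # reduction='mean': no scaling needed, DDP's gradient averaging already
    # reproduces the 1-GPU mean-loss gradient.
    # reduction='sum': DDP still averages the gradients across processes, so
    # multiply by world_size to recover the gradient of the global sum.
    return F.cross_entropy(output, target, reduction='sum') * dist.get_world_size()
```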
