Comparison: Data Parallel vs. Distributed Data Parallel

henry_Kang · August 18, 2020, 7:37pm

Hello. I hope you are well.
I am finalizing my experiments with PyTorch. When I finish my paper, I hope I can share it here.
Anyway, is there any detailed documentation about DataParallel (DP) and DistributedDataParallel (DDP)?
In my experiments, DP and DDP show a large accuracy difference with the same dataset, network, learning rate, and loss function. I would like to include these results in my paper, but my professor is asking for a detailed explanation of why this happens. My dataset is a very unusual image dataset, not ordinary objects as in ImageNet or Cityscapes, so the results may differ from those in a typical computer science paper. For this reason, I looked around and read some articles:
https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html
https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/

However, I am still confused about these two multi-GPU training strategies.

What does “reduce” mean? Is “reduce” the weight update or the loss reduction?
What is the major difference between DP and DDP in the weight update strategy? I think this is important.
Does DDP affect batch normalization (BN), or does DDP still need synchronized BN?
Thank you for reading my question.

mrshenli · August 18, 2020, 8:19pm

There is some comparison between DP and DDP here: PyTorch Distributed Overview (PyTorch Tutorials 2.1.1+cu121 documentation)

What does “reduce” mean? Is “reduce” the weight update or the loss reduction?

What’s the context here? If you mean all_reduce, it is a collective communication operation: every process contributes a tensor, and every process receives the same reduced (for example, summed) result. DDP uses it to synchronize gradients. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allreduce
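To make the term concrete, the sketch below simulates an all-reduce with a SUM operation in plain Python. It is a single-process stand-in, not the NCCL or torch.distributed call; the function name all_reduce_sum and the gradient values are made up for illustration.

```python
def all_reduce_sum(per_rank_values):
    """Simulate an all-reduce (SUM) over a list holding one vector per rank.

    Every rank contributes its vector, and every rank receives the same
    element-wise sum. This is an illustration only, not a real collective.
    """
    total = [sum(column) for column in zip(*per_rank_values)]
    # Each rank gets its own copy of the reduced result.
    return [list(total) for _ in per_rank_values]


# Hypothetical gradients held by 3 ranks for a 2-parameter model.
grads = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
reduced = all_reduce_sum(grads)

# Dividing the summed result by the number of ranks gives the mean gradient.
world_size = len(grads)
averaged = [[g / world_size for g in rank] for rank in reduced]
```

After the call, every simulated rank holds the same summed vector, which is the sense in which "reduce" combines values from many processes into one result.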

What is the major difference between DP and DDP in the weight update strategy? I think this is important.

Weight update is done by the optimizer, so if you are using the same optimizer the weight update strategy should be the same. The difference between DP and DDP is how they handle gradients. DP accumulates gradients to the same .grad field, while DDP first uses all_reduce to calculate the gradient sum across all processes and then divides that by world_size to compute the mean. More details can be found in this paper.
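As a rough illustration of the averaging step, the following plain-Python sketch computes per-rank gradients for a toy one-parameter model y = w * x with a mean squared error loss, then combines them with the sum-and-divide described above. The model, the data, and the helper name mse_grad are assumptions chosen for this example; no torch.distributed calls are involved.

```python
def mse_grad(w, xs, ts):
    """Gradient of mean((w*x - t)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)


w = 1.0
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]

# Gradient computed on the full batch in a single process.
full_grad = mse_grad(w, xs, ts)

# Two simulated ranks, each holding an equal-sized shard of the batch.
shards = [(xs[:2], ts[:2]), (xs[2:], ts[2:])]
local_grads = [mse_grad(w, sx, st) for sx, st in shards]

# Sum across ranks, then divide by world_size.
world_size = len(shards)
ddp_grad = sum(local_grads) / world_size

# With unequal shard sizes the plain average no longer matches the full batch.
uneven = [(xs[:1], ts[:1]), (xs[1:], ts[1:])]
uneven_grad = sum(mse_grad(w, sx, st) for sx, st in uneven) / len(uneven)
```

In this toy setting the averaged gradient equals the full-batch gradient only because the loss is a per-sample mean and the shards are the same size. Note also that the global batch here is world_size times the per-rank batch, which is one reason batch size and lr come up together.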

The above difference has an impact on how the lr should be configured. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel

Does DDP affect batch normalization (BN), or does DDP still need synchronized BN?

By default, DDP will broadcast buffers from rank 0 to all other ranks, so yes, it does affect BN.

BTW, for distributed-training-related questions, could you please add a “distributed” tag to the post? There is an on-call team monitoring that tag.

henry_Kang · August 18, 2020, 9:10pm

Well, many people talk about “reduce”, but in this context it does not seem to mean “reduce” in the literal sense. That is why I asked. The term keeps coming up, but no one defines it before using it.
What is the difference between reducing gradients and the weight update? Do you mean that DP and DDP update exactly the same weights, in the same way for each layer? This is also confusing to me.
Do you mean batch size or LR? The link you gave is about batch size.
I see no improvement when I use DDP with synchronized BN. That is why I asked the third question.

mrshenli · August 18, 2020, 9:40pm

What is the difference between reducing gradients and the weight update?

There are many weight-updating algorithms, e.g., Adam, SGD, Adagrad, etc. (see more here). They are all independent of DP and DDP. So even if the gradient is the same, different optimizers can update the weight to a different value.

Reducing gradients in DDP basically means communicating gradients across processes.

Do you mean that DP and DDP update exactly the same weights, in the same way for each layer?

Neither DP nor DDP touches the model weights. In the following code, it is opt.step() (the optimizer step) that updates the model weights. What DP and DDP do is prepare the .grad field for all parameters.

output = model(input)
output.sum().backward()
# DP and DDP are not involved below this point.
opt.step()
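To illustrate that the update rule is separate from how the gradient was produced, here is a minimal plain-Python sketch that feeds the same gradient to two update rules. The function names and numbers are hypothetical, and the Adagrad variant is a simplified single-parameter form without weight decay or learning-rate decay, not the library implementation.

```python
import math


def sgd_step(weight, grad, lr):
    """Plain SGD: step against the gradient, scaled by the learning rate."""
    return weight - lr * grad


def adagrad_step(weight, grad, state_sum, lr, eps=1e-10):
    """Adagrad: scale the step by the root of the accumulated squared gradients."""
    state_sum = state_sum + grad * grad
    new_weight = weight - lr * grad / (math.sqrt(state_sum) + eps)
    return new_weight, state_sum


# The same weight, gradient, and learning rate go into both rules.
weight, grad, lr = 1.0, 4.0, 0.1
w_sgd = sgd_step(weight, grad, lr)
w_adagrad, _ = adagrad_step(weight, grad, state_sum=0.0, lr=lr)
```

Both functions only read a gradient value. Whether that value was filled in by DP, DDP, or a single-GPU backward pass makes no difference to the update rule itself.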

This is also confusing to me. Do you mean batch size or LR? The link you gave is about batch size.

Quoting some discussion from that link. If you search for “lr”, you will find that almost all comments in that thread discuss how to configure LR and batch size.

I see no improvement when I use DDP with synchronized BN. That is why I asked the third question.

Right, SyncBatchNorm has its own communication, which is outside DDP’s control. Using DDP won’t change how SyncBatchNorm behaves.

github.com

pytorch/pytorch/blob/f64d24c941a00bc81b3017008ae212cca761d393/torch/nn/modules/_functions.py#L79-L81


      
          torch.distributed.all_reduce(
              combined, torch.distributed.ReduceOp.SUM, process_group, async_op=False)
          sum_dy, sum_dy_xmu = torch.split(combined, num_channels)

henry_Kang · August 18, 2020, 11:03pm

Thank you for the detailed explanation.
Also, the relationship between DDP and LR is interesting. I used to find the LR by trial and error…

I now understand reduce and DDP somewhat better. Please check my understanding.

So basically DP and DDP do not directly change the weights; “they are different ways to calculate the gradient in multi-GPU conditions”. If this is incorrect, please let me know.
The input data goes through the network, and the loss is calculated from the output and the ground truth.
During this loss calculation, DP and DDP work differently.
However, I thought that the gradient is basically calculated from the loss.
Each GPU has a different loss value.
DP uses the mean value because DP sends every output to the main GPU and calculates the loss there.
If my understanding is incorrect, please point it out.
However, DDP does it differently. I still do not understand this part. In the paper they also use the average value.
What is the difference between the mean calculation and the synchronized calculation?
To update the weights in the network, the optimizer uses the gradient values.

The update is done by the optimizer; neither DP nor DDP is involved in it.
So the performance difference might come from the LR difference, because the batch size becomes different? weight = previous weight - (gradient * learning_rate)

Really thank you for helping me.

mrshenli · August 19, 2020, 2:14am

correct.

The input data goes through the network, and the loss is calculated from the output and the ground truth.
During this loss calculation, DP and DDP work differently.

correct.

Each GPU has a different loss value.
DP uses the mean value because DP sends every output to the main GPU and calculates the loss there.

This is incorrect. DP’s forward pass 1) creates a model replica on every GPU, 2) scatters the input to every GPU, 3) feeds one input shard to each model replica, 4) uses one thread per model replica to create the output on each GPU, and 5) gathers all outputs from the different GPUs onto one GPU and returns them. The loss with DP is calculated from that gathered output, and hence there is only one loss with DP.

github.com

pytorch/pytorch/blob/d06f1818ada6405a30943f58548af958c2b83ff6/torch/nn/parallel/data_parallel.py#L147-L162


      
          def forward(self, *inputs, **kwargs):
              if not self.device_ids:
                  return self.module(*inputs, **kwargs)
          
              for t in chain(self.module.parameters(), self.module.buffers()):
                  if t.device != self.src_device_obj:
                      raise RuntimeError("module must have its parameters and buffers "
                                         "on device {} (device_ids[0]) but found one of "
                                         "them on device: {}".format(self.src_device_obj, t.device))
          
              inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
              if len(self.device_ids) == 1:
                  return self.module(*inputs[0], **kwargs[0])
              replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
              outputs = self.parallel_apply(replicas, inputs, kwargs)
              return self.gather(outputs, self.output_device)
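The scatter, replicate, parallel-apply, and gather flow can be mimicked in plain Python to see why only one gathered output (and so one loss) exists. This is a hedged single-machine sketch using threads and lists instead of GPUs and tensors; the names data_parallel_forward and toy_module are hypothetical and are not part of PyTorch.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(batch, num_devices):
    """Split a non-empty batch into contiguous chunks, one per simulated device."""
    chunk = -(-len(batch) // num_devices)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]


def data_parallel_forward(module, batch, num_devices):
    """Single-process stand-in for the scatter / replicate / apply / gather flow."""
    shards = scatter(batch, num_devices)
    # "Replicate": here every replica is simply the same Python callable.
    replicas = [module] * len(shards)
    # "Parallel apply": one thread per replica, results kept in shard order.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        outputs = list(pool.map(lambda pair: pair[0](pair[1]), zip(replicas, shards)))
    # "Gather": concatenate per-shard outputs back into one list.
    return [y for shard_out in outputs for y in shard_out]


def toy_module(xs):
    """Stand-in for a model: an element-wise affine map."""
    return [2 * x + 1 for x in xs]


batch = [0, 1, 2, 3, 4]
gathered = data_parallel_forward(toy_module, batch, num_devices=2)
```

The caller sees a single output in the original batch order, so a loss computed on it is a single loss over the whole batch.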

DDP is multi-process parallel, and hence it can scale across multiple machines. In this case, every process has its own loss, so there are multiple different losses. Gradients are synchronized during the backward pass using autograd hooks and allreduce. For more details, I recommend reading the paper I linked above.

What is the difference between the mean calculation and the synchronized calculation?

Because DP is single-process-multi-thread, the scatter, parallel_apply, and gather ops used in the forward pass are automatically recorded in the autograd graph. So during the backward pass, the gradients will be accumulated to the .grad field. There is no grad synchronization in DP, because the autograd engine already does all the grad accumulation.

As DDP is multi-process and every process has its own autograd engine, we need additional code to synchronize grads.

So the performance difference might come from the LR difference?

Yep, that’s one possible source. It also relates to what loss function you are using. If the loss function cannot guarantee f([x, y]) == (f(x) + f(y)) /2, then the result can also be different, as it is not compatible with gradient averaging used in DDP.
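One way to read the condition is with f as the loss, x and y as the equal-sized shards seen by two processes, and [x, y] as the combined batch. The plain-Python sketch below checks it for a mean absolute error and for one common soft IoU-style loss. The soft IoU definition and the sample values are assumptions for illustration; they are not taken from the experiment in this thread.

```python
def mean_abs_error(pred, target):
    """Per-sample mean, so it decomposes over equal-sized shards."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def soft_iou_loss(pred, target):
    """1 - intersection / union, with both sums taken over the whole batch."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - inter / union


# Shard x and shard y as (prediction, target) pairs; values are made up.
x_pred, x_tgt = [1.0, 0.0], [1.0, 1.0]
y_pred, y_tgt = [0.0, 0.0], [1.0, 0.0]

# f([x, y]): loss on the concatenated batch.
mae_joint = mean_abs_error(x_pred + y_pred, x_tgt + y_tgt)
iou_joint = soft_iou_loss(x_pred + y_pred, x_tgt + y_tgt)

# (f(x) + f(y)) / 2: average of the per-shard losses.
mae_split = (mean_abs_error(x_pred, x_tgt) + mean_abs_error(y_pred, y_tgt)) / 2
iou_split = (soft_iou_loss(x_pred, x_tgt) + soft_iou_loss(y_pred, y_tgt)) / 2
```

For the mean absolute error the two quantities agree. For the IoU-style loss they do not, because the ratio is taken over batch-level sums. Averaging per-process gradients then optimizes a different objective than the loss on the combined batch.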

henry_Kang · August 19, 2020, 4:41am

Well, what I am seeing is that DDP gives a better result than DP.
However, many questions here actually say that DP gives better results than DDP.
Second, our task is class-imbalanced binary semantic segmentation on real-world images with very complex backgrounds. In this case, DP gives 82% mIoU and DDP achieves 88% with the same loss function and the same learning rate.
The loss function is the IoU loss.
Grad synchronization and accumulation are another new question for me. I will read your paper first and then ask again. Thank you. It is really difficult, but I hope I can make it.

mrshenli · August 19, 2020, 2:14pm

No, this is not guaranteed. The only conclusion we can draw is that DP should be able to produce the same resulting model as non-parallel training, and DDP cannot guarantee this. But which one is better needs to be measured quantitatively, as it is affected by a lot of factors, e.g., batch size, lr, loss function, etc.

henry_Kang · August 19, 2020, 4:51pm

OK, I see. So it can be really dangerous to say that DDP is better or that DP is better. I will just keep this result and not put it into my paper. Anyway, I will cite your paper since I am using DDP.

sakh251 · December 4, 2020, 10:08am

Hey @mrshenli
About the loss function and LR: which loss functions are affected by LR?
Are optimizers also important?

CDhere · February 5, 2021, 2:31am

Hi Shen, I’m also encountering performance drop with DDP. Could you please elaborate on what f([x, y]) == (f(x) + f(y)) /2 means? I don’t quite understand the notation here. Thanks!

111469 · November 14, 2023, 4:48am

Hi! Thanks for the fantastic explanation!

From my understanding, DP is able to produce the same resulting model as non-parallel training only when the forward pass is fixed.

That is to say, if my forward function is like:

def forward(self, x):
    if random.random() > 0.5:
        # forward path 1
        loss = xxx
    else:
        # forward path 2
        loss = yyy
    return loss

Then DP may not be applicable, since it gathers the outputs and the gradients are calculated by only one autograd engine, while the DDP strategy won’t have any problem, since the gradients are calculated separately in each process.