Node-to-node communication traffic in Distributed Data Parallel & RPC Model Parallel

Hi @mrshenli,

I wonder if I can ask some questions on the implementation details of the communication traffic for both DDP data-parallel and RPC model-parallel using multiple machines.

In the documentation, https://pytorch.org/docs/stable/notes/ddp.html?highlight=ddp, if I understand correctly, .backward() will work on the local gradients and DDP will help to synchronize the local gradients with the gradients on other nodes (not sure, perhaps ring allreduce?). Taking the SGD optimizer as an example, the local gradients are first averaged over the batch size, and then the locally averaged gradients will be sent to other nodes in buckets by DDP. The traffic size would be roughly the size of the NN’s gradients for a batch size of 1. Do I understand this correctly? Or do some optimizers or networks need to send some size (gradients * batch size)?

https://pytorch.org/docs/stable/rpc.html For RPC, Distributed Autograd will take care of the communication in the backward direction at the RPC boundaries of each machine. So the communication happens when the user code calls .to_here() on the RRef, and Distributed Autograd passes the gradients in the backward pass? Does the traffic relate to the batch size * the size of the feature maps at the boundaries?

Please let me know if I understand the mechanisms of the communication behind DDP and RPC correctly. Thank you very much for your time!

Hey @ZiyiZhu

.backward() will work on the local gradients and DDP will help to synchronize the local gradients with the gradients on other nodes (not sure, perhaps ring allreduce?).

Yes, this is correct. By default, Gloo uses ring allreduce, IIUC. For NCCL, you can configure NCCL_TREE_THRESHOLD to tell NCCL when to enable tree allreduce (I haven’t tested this myself yet).
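As a rough sketch of where the backend choice and that env var fit into a DDP setup (the rendezvous values and the threshold below are just placeholders, and the exact NCCL_TREE_THRESHOLD semantics depend on your NCCL version):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(rank: int, world_size: int, model: torch.nn.Module):
    # Rendezvous info; normally provided by your launcher (e.g. torchrun).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # For NCCL, NCCL_TREE_THRESHOLD (in bytes) controls up to which message
    # size tree allreduce is used; 4 MB here is purely illustrative.
    os.environ.setdefault("NCCL_TREE_THRESHOLD", str(4 * 1024 * 1024))

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    if backend == "nccl":
        model = model.to(torch.device(f"cuda:{rank}"))
        # DDP hooks into backward() and allreduces gradient buckets.
        return DDP(model, device_ids=[rank])
    return DDP(model)  # CPU / Gloo path
```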

Taking the SGD optimizer as an example, the local gradients are first averaged over the batch size, and then the locally averaged gradients will be sent to other nodes in buckets by DDP. The traffic size would be roughly the size of the NN’s gradients for a batch size of 1. Do I understand this correctly?

Yes. One minor note: the batch size does not affect the amount of data sent over the network, because AccumulateGrad will accumulate gradients from different samples into the same param.grad tensor. So the data sent over the wire is always param.grad for all params in model.parameters(). And yes, gradients are organized into buckets to speed up communication.
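To make the param.grad point concrete, a quick back-of-the-envelope with a stand-in model:

```python
import torch.nn as nn

# Per-iteration allreduce payload ~= total size of param.grad over all
# parameters; it does not grow with the batch size.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
grad_bytes = sum(p.numel() * p.element_size()
                 for p in model.parameters() if p.requires_grad)
print(f"~{grad_bytes / 1e6:.1f} MB of gradients allreduced per iteration")
```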

Or do some optimizers or networks need to send some size (gradients * batch size)?

No, batch size does not change the amount of traffic.

For RPC, Distributed Autograd will take care of the communication in the backward direction at the RPC boundaries of each machine. So the communication happens when the user code calls .to_here() on the RRef, and Distributed Autograd passes the gradients in the backward pass?

Yes, when the application calls RRef.to_here() in the forward pass, it inserts a send autograd function on the callee and a recv autograd function on the caller. These two functions contain globally unique IDs to identify each other, and they help connect multiple local autograd graphs into one global autograd graph. In the backward pass, when the autograd engine reaches the recv function, it knows where to propagate the gradient, i.e., to the matching send autograd function, and the backward pass can then proceed there. So from the application’s perspective, calling backward on the final loss will trigger a global backward across RPC boundaries. This distributed autograd note might help explain more.
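A minimal sketch of that flow, assuming rpc.init_rpc() has already been called on two workers (the worker names below are placeholders):

```python
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

# Stage living on the remote worker; module-level so the callee can build it.
class Stage1(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)

def run_on_caller(batch: torch.Tensor):
    # Hold Stage1 on "worker1"; rpc.remote returns an RRef to the module.
    stage1_rref = rpc.remote("worker1", Stage1)

    with dist_autograd.context() as ctx_id:
        # Forward: run stage 1 remotely, then fetch the activation. The RPCs
        # insert matching send/recv autograd functions on the two workers,
        # stitching the local graphs into one distributed graph.
        act_rref = stage1_rref.remote().forward(batch)
        act = act_rref.to_here()
        loss = act.sum()

        # Backward from the final loss crosses the RPC boundary: the recv
        # function here forwards gradients to the send function on worker1.
        dist_autograd.backward(ctx_id, [loss])
```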

Does the traffic relate to the batch size * the size of the feature maps at the boundaries?

No, batch size is still irrelevant here. This is because the local autograd engine only runs a function in the backward pass once all of its input gradients are ready. More specifically, there is an input buffer for every autograd function. If a function x has multiple fanouts in the forward pass, those fanouts accumulate gradients into the same input buffer, and the local autograd engine only triggers x once all gradients from those fanouts have been accumulated in the input buffer. So the amount of data communicated through an RPC in the backward pass should be the same as the total tensor size that went through that RPC in the forward pass.
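The accumulate-then-run-once behavior can be seen even with the local autograd engine, e.g.:

```python
import torch

# x fans out to two consumers in the forward pass, but its gradient is
# accumulated into a single buffer of the same size as x; the number of
# fanouts does not multiply the gradient size.
x = torch.randn(4, 3, requires_grad=True)
y = (x * 2).sum() + (x ** 2).sum()  # two fanouts of x
y.backward()
print(x.grad.shape)  # torch.Size([4, 3]) -- same shape as x
```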

Hi @mrshenli,

Thank you very much for the detailed explanations; they are very helpful.

Perhaps I do not understand it correctly, or maybe I did not make it clear, but I hope you could explain a bit more about the last comment in your previous reply.

Take nn.Conv2d as an example. If this is the layer at the RPC boundary, then in the forward pass the total tensor size to be transmitted to the following node should be equal to BatchSize x Channels x W x H, IIUC. So in the backward pass, should it be the same size as this? Or is it accumulated to Channels x W x H?

Thanks.

My bad. If there is a batch dimension in the RPC argument tensor, then yes, the batch size will have an impact on the data size sent over the wire. It will be B * C * W * H.
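For example, with purely illustrative numbers and float32 activations:

```python
# Illustrative numbers only: B=32, C=64, H=W=56, float32 (4 bytes/element).
B, C, H, W = 32, 64, 56, 56
bytes_per_elem = 4
forward_bytes = B * C * H * W * bytes_per_elem
# The backward pass sends a gradient tensor of the same shape back.
print(f"~{forward_bytes / 1e6:.1f} MB each way per iteration")  # ~25.7 MB
```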


Sure. Thank you for the detailed explanations.

Best,
Ziyi