Hey @ZiyiZhu
the `.backward()` call will compute the local gradients, and DDP will help to synchronize the local gradients with the gradients on other nodes (not sure, perhaps ring allreduce?).
Yes, this is correct. By default, Gloo uses ring allreduce IIUC. For NCCL, you can configure `NCCL_TREE_THRESHOLD` to tell NCCL when to enable tree allreduce. (I haven't tested this myself yet.)
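To make the ring allreduce idea concrete, here is a toy single-process simulation in plain Python. This is only an illustration of the algorithm's two phases (reduce-scatter, then allgather); the real Gloo/NCCL implementations are native code and chunk tensors differently.

```python
def ring_allreduce(grads):
    """Toy ring allreduce over n simulated workers.

    grads[i][c] is worker i's value for chunk c of the gradient;
    each worker's gradient is split into n chunks (one per worker).
    """
    n = len(grads)
    # Phase 1: reduce-scatter. In step s, worker i sends chunk (i - s) % n
    # to its neighbor (i + 1) % n, which adds it into its own copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            grads[(i + 1) % n][c] += grads[i][c]
    # Now worker i holds the fully reduced chunk (i + 1) % n.
    # Phase 2: allgather. The reduced chunks circulate around the ring,
    # overwriting stale copies, until every worker has every reduced chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            grads[(i + 1) % n][c] = grads[i][c]
    return grads

# three workers, each holding a 3-chunk gradient
out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every worker ends up with the element-wise sum [12, 15, 18]
assert all(w == [12, 15, 18] for w in out)
```

Each worker only ever talks to its ring neighbor, which is why the per-worker traffic is independent of the number of workers (up to a constant factor).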
For the SGD optimizer as an example, the local gradients are first averaged over the batch size, and then the locally averaged gradients are sent to other nodes in buckets by DDP. The traffic size would be roughly the size of the NN's gradients for a batch size of 1. Do I understand this correctly?
Yes. One minor note: the batch size does not affect the amount of data sent over the network, because `AccumulateGrad` accumulates gradients from different samples into the same `param.grad` tensor. So the data sent over the wire is always `param.grad` for all params in `model.parameters()`. And yes, gradients are organized into buckets to speed up communication.
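A plain-Python sketch of the point above (no torch needed): per-sample gradients are summed into one buffer per parameter, mimicking what `AccumulateGrad` does to `param.grad`, so the number of values that would cross the wire equals the parameter count, regardless of batch size.

```python
def accumulate_and_measure(param_sizes, batch_size):
    """Return the number of floats DDP would allreduce after a backward pass.

    param_sizes: dict mapping parameter name -> number of elements.
    """
    # one grad buffer per parameter, same size as the parameter itself
    grads = {name: [0.0] * size for name, size in param_sizes.items()}
    for _ in range(batch_size):
        for name, size in param_sizes.items():
            per_sample = [0.1] * size  # stand-in for one sample's gradient
            for j in range(size):
                grads[name][j] += per_sample[j]  # AccumulateGrad-style +=
    # "traffic" = total size of the accumulated grad buffers
    return sum(len(g) for g in grads.values())

sizes = {"w": 6, "b": 2}
# same traffic for batch size 1 and batch size 32
assert accumulate_and_measure(sizes, 1) == accumulate_and_measure(sizes, 32) == 8
```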
Or do some optimizers or networks need to send more (gradients * batch size)?
No, batch size does not change the amount of traffic.
For RPC, Distributed Autograd takes care of the communication in the backward direction at the RPC boundaries of each machine. So the communication happens when the user code calls `.to_here()` on the RRef, and Distributed Autograd passes the gradients back in the backward pass?
Yes, when the application calls `RRef.to_here()` in the forward pass, it inserts a `send` autograd function on the callee and a `recv` autograd function on the caller. These two functions contain globally unique IDs to identify each other, and they help to connect multiple local autograd graphs into one global autograd graph. So in the backward pass, when the autograd engine reaches the `recv` function, it knows where to propagate the gradient, i.e., to the matching `send` autograd function, and the backward pass can then proceed there. So from the application's perspective, calling backward on the final loss will trigger a global backward across RPC boundaries. This distributed autograd note might help to explain more.
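A conceptual toy sketch of the send/recv pairing described above. The class and registry names here are made up for illustration; the real implementation lives in `torch.distributed.autograd` (in C++) and the "RPC" below is just a local function call.

```python
import itertools

_next_id = itertools.count()
send_functions = {}  # global ID -> send function registered on the "callee"

class SendRpcBackward:
    """Inserted on the callee when the caller fetches a tensor in forward."""
    def __init__(self):
        self.msg_id = next(_next_id)        # globally unique ID
        send_functions[self.msg_id] = self  # callee registers its send node
        self.received_grad = None
    def backward(self, grad):
        # from here, backward would continue in the callee's local graph
        self.received_grad = grad

class RecvRpcBackward:
    """Inserted on the caller; shares the send function's ID."""
    def __init__(self, msg_id):
        self.msg_id = msg_id  # this shared ID is what links the two graphs
    def backward(self, grad):
        # simulate RPC'ing the gradient back to the matching send function
        send_functions[self.msg_id].backward(grad)

# forward pass: to_here() pairs a send (callee) with a recv (caller)
send = SendRpcBackward()
recv = RecvRpcBackward(send.msg_id)
# backward pass: the engine reaches recv, and the gradient crosses the boundary
recv.backward([0.5, 0.5])
assert send.received_grad == [0.5, 0.5]
```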
Does the traffic relate to batch size * the size of the feature maps at the boundaries?
No, batch size is still irrelevant here. This is because the local autograd engine will only run a function once in the backward pass, when all its input gradients are ready. More specifically, there is an input buffer for every autograd function. If a function `x` has multiple fan-outs in the forward pass, those fan-outs will accumulate gradients into the same input buffer, and the local autograd engine will only trigger `x` when all gradients from those fan-outs have been accumulated in the buffer. So the amount of data communicated through an RPC in the backward pass should be the same as the total tensor size that went through that RPC in the forward pass.
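The input-buffer behavior can be sketched in a few lines of plain Python. The names here are illustrative, not the engine's real ones; the point is that a node with several fan-outs fires exactly once, after every consumer has contributed its gradient.

```python
class BufferedNode:
    """Toy model of one autograd function's input buffer."""
    def __init__(self, fan_out):
        self.fan_out = fan_out   # number of consumers in the forward pass
        self.pending = fan_out   # gradients still missing
        self.buffer = 0.0        # accumulated incoming gradient
        self.fired = 0           # how many times backward actually ran
    def accumulate(self, grad):
        self.buffer += grad
        self.pending -= 1
        if self.pending == 0:    # all fan-out gradients arrived:
            self.fired += 1      # run the function exactly once

x = BufferedNode(fan_out=3)
for g in [0.2, 0.3, 0.5]:        # gradients from the three consumers
    x.accumulate(g)
assert x.fired == 1
assert abs(x.buffer - 1.0) < 1e-9
```

Because the node fires once with the accumulated buffer, the backward traffic through a `recv`/`send` pair matches the forward traffic through that RPC, independent of how many downstream consumers (or samples) fed into it.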