How is Multiple node, Multiple worker Allreduce implemented in PyTorch?
I know that in a single node multi-worker setting, allreduce is implemented with a ring allreduce algorithm. How does this change in a multinode setting?
Hey @vineeths, PyTorch's distributed `all_reduce` calls into the allreduce API provided by the communication backend (Gloo, NCCL, or MPI), so the algorithm is decided by the backend rather than by PyTorch itself, in both single-node and multi-node settings. Gloo uses ring allreduce. NCCL implements both ring and tree allreduce and selects between them at runtime. See this discussion: https://github.com/NVIDIA/nccl/issues/256
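To make the ring allreduce mentioned above concrete, here is an educational single-process simulation of the algorithm (reduce-scatter followed by allgather). This is a sketch of the generic technique, not NCCL's or Gloo's actual C++ implementation; the worker layout and helper names are my own for illustration.

```python
def ring_allreduce(worker_data):
    """Simulate ring allreduce (sum) over n workers.

    worker_data: list of equal-length lists, one buffer per worker.
    Returns the per-worker buffers after allreduce; every worker
    ends up with the elementwise sum of all input buffers.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Split each buffer into n index-range chunks.
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(d) for d in worker_data]

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to its ring neighbor (r + 1) % n, which accumulates it. After n - 1
    # steps, worker r holds the fully reduced chunk (r + 1) % n.
    for s in range(n - 1):
        # Snapshot outgoing messages first to model simultaneous sends.
        msgs = []
        for r in range(n):
            c = (r - s) % n
            lo, hi = bounds[c]
            msgs.append((r, c, data[r][lo:hi]))
        for r, c, payload in msgs:
            dst = (r + 1) % n
            lo, _ = bounds[c]
            for i, v in enumerate(payload):
                data[dst][lo + i] += v

    # Phase 2: allgather. In step s, worker r forwards the fully reduced
    # chunk (r + 1 - s) % n to its neighbor, which overwrites its copy.
    for s in range(n - 1):
        msgs = []
        for r in range(n):
            c = (r + 1 - s) % n
            lo, hi = bounds[c]
            msgs.append((r, c, data[r][lo:hi]))
        for r, c, payload in msgs:
            dst = (r + 1) % n
            lo, hi = bounds[c]
            data[dst][lo:hi] = payload

    return data
```

Each worker sends and receives only 2·(n−1) chunk-sized messages regardless of buffer size, which is why the ring variant is bandwidth-optimal; tree allreduce trades some bandwidth for lower latency at large worker counts.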