Multi-node, multi-worker allreduce

How is multi-node, multi-worker allreduce implemented in PyTorch?

I know that in a single-node, multi-worker setting, allreduce is implemented with a ring allreduce algorithm. How does this change in a multi-node setting?

Hey @vineeths, torch.distributed.all_reduce calls into the allreduce API provided by the communication backend (Gloo, NCCL, or MPI). Gloo uses ring allreduce. NCCL has both ring and tree allreduce. See this discussion: https://github.com/NVIDIA/nccl/issues/256
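
To add to this: the algorithm choice is invisible at the PyTorch level, so the same `all_reduce` call works unchanged whether you run on one node or several. Here is a minimal sketch (not from the original thread), assuming the script is launched with `torchrun` so that rank/world-size/master-address environment variables are already set:

```python
# Minimal multi-node all_reduce sketch. Assumes a launch like:
#   torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d \
#            --rdzv_endpoint=<master_host>:29500 allreduce_demo.py
# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
import os

import torch
import torch.distributed as dist


def main():
    # NCCL backend for GPU tensors; Gloo would be used for CPU tensors.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each worker contributes its own tensor; after all_reduce every worker
    # (on every node) holds the elementwise sum across all ranks.
    x = torch.ones(10, device="cuda") * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: {x[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Whether NCCL uses ring or tree for a given call is decided internally based on message size and topology; as far as I know it can also be forced with the `NCCL_ALGO` environment variable (e.g. `NCCL_ALGO=Tree`), which is handy when comparing the two in the multi-node case.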