How is Multiple node, Multiple worker Allreduce implemented in PyTorch?
I know that in a single node multi-worker setting, allreduce is implemented with a ring allreduce algorithm. How does this change in a multinode setting?
Hey @vineeths, PyTorch's distributed `all_reduce` calls into the allreduce API provided by the communication backend (Gloo, NCCL, or MPI), so the algorithm is decided by the backend rather than by PyTorch itself, in both single-node and multi-node settings. Gloo uses ring allreduce. NCCL implements both ring and tree allreduce and selects between them at runtime. See this discussion: https://github.com/NVIDIA/nccl/issues/256
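To make the ring allreduce mentioned above concrete, here is an educational single-process simulation of the algorithm (reduce-scatter followed by allgather). This is a sketch of the generic technique, not NCCL's or Gloo's actual C++ implementation; the worker layout and helper names are my own for illustration.

```python
def ring_allreduce(worker_data):
    """Simulate ring allreduce (sum) over n workers.

    worker_data: list of equal-length lists, one buffer per worker.
    Returns the per-worker buffers after allreduce; every worker
    ends up with the elementwise sum of all input buffers.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Split each buffer into n index-range chunks.
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(d) for d in worker_data]

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to its ring neighbor (r + 1) % n, which accumulates it. After n - 1
    # steps, worker r holds the fully reduced chunk (r + 1) % n.
    for s in range(n - 1):
        # Snapshot outgoing messages first to model simultaneous sends.
        msgs = []
        for r in range(n):
            c = (r - s) % n
            lo, hi = bounds[c]
            msgs.append((r, c, data[r][lo:hi]))
        for r, c, payload in msgs:
            dst = (r + 1) % n
            lo, _ = bounds[c]
            for i, v in enumerate(payload):
                data[dst][lo + i] += v

    # Phase 2: allgather. In step s, worker r forwards the fully reduced
    # chunk (r + 1 - s) % n to its neighbor, which overwrites its copy.
    for s in range(n - 1):
        msgs = []
        for r in range(n):
            c = (r + 1 - s) % n
            lo, hi = bounds[c]
            msgs.append((r, c, data[r][lo:hi]))
        for r, c, payload in msgs:
            dst = (r + 1) % n
            lo, hi = bounds[c]
            data[dst][lo:hi] = payload

    return data
```

Each worker sends and receives only 2·(n−1) chunk-sized messages regardless of buffer size, which is why the ring variant is bandwidth-optimal; tree allreduce trades some bandwidth for lower latency at large worker counts.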