I followed the code in #12012 to implement a ring-allreduce algorithm, but I cannot see any improvement over the OpenMPI allreduce. Is there a way to achieve this using just the `send` and `recv` methods in PyTorch?
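For reference, a ring allreduce built on the point-to-point API might look roughly like the sketch below. This is a minimal illustration (not the code from #12012), assuming an already-initialized process group and a 1-D tensor with at least `world_size` elements:

```python
# Minimal sketch of ring allreduce on top of torch.distributed point-to-point ops.
# Assumes dist.init_process_group(...) has already been called and `tensor` is 1-D.
import torch
import torch.distributed as dist

def ring_allreduce(tensor):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    chunks = list(tensor.chunk(world_size))   # views into `tensor`
    left = (rank - 1) % world_size
    right = (rank + 1) % world_size

    # Phase 1: reduce-scatter. After world_size - 1 steps, chunk (rank + 1) % world_size
    # on each rank holds the fully reduced values for that chunk.
    for step in range(world_size - 1):
        send_idx = (rank - step) % world_size
        recv_idx = (rank - step - 1) % world_size
        recv_buf = torch.empty_like(chunks[recv_idx])
        send_req = dist.isend(chunks[send_idx].contiguous(), dst=right)
        dist.recv(recv_buf, src=left)
        send_req.wait()
        chunks[recv_idx] += recv_buf          # writes through the view into `tensor`

    # Phase 2: allgather. Circulate the reduced chunks around the ring.
    for step in range(world_size - 1):
        send_idx = (rank - step + 1) % world_size
        recv_idx = (rank - step) % world_size
        recv_buf = torch.empty_like(chunks[recv_idx])
        send_req = dist.isend(chunks[send_idx].contiguous(), dst=right)
        dist.recv(recv_buf, src=left)
        send_req.wait()
        chunks[recv_idx].copy_(recv_buf)

    return tensor
```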
Did you expect an improvement over the MPI implementation? If so, what kind of improvement?
Different MPI implementations use different algorithms. You can look at the OpenMPI configuration parameters/tunables to figure out how to tweak which algorithm it should use.
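For example, OpenMPI's tuned collective component exposes the allreduce algorithm choice as MCA parameters. A hedged sketch follows; the parameter names come from the `coll/tuned` component, and the algorithm numbering may differ between versions, so verify with `ompi_info --param coll tuned` for your installation:

```python
# Hedged sketch: ask OpenMPI's tuned collective component for a specific
# allreduce algorithm (ring is 4 in the versions I have seen; check ompi_info).
#
# Typical usage is on the launcher command line, e.g.:
#   mpirun -np 32 \
#     --mca coll_tuned_use_dynamic_rules 1 \
#     --mca coll_tuned_allreduce_algorithm 4 \
#     python train.py      # train.py is a placeholder script name
#
# Setting the corresponding OMPI_MCA_* environment variables before MPI_Init
# should have the same effect:
import os
os.environ["OMPI_MCA_coll_tuned_use_dynamic_rules"] = "1"
os.environ["OMPI_MCA_coll_tuned_allreduce_algorithm"] = "4"

import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")  # MPI_Init runs here, after the variables are set
t = torch.randn(1024)
dist.all_reduce(t)
```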
The gloo backend implements ring allreduce in C++. You can of course build it yourself on top of `send` and `recv` as well, but this won’t be faster than the existing implementation.
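For completeness, using that built-in implementation from Python is just the ordinary collective call; a minimal sketch, assuming the launcher sets the usual rendezvous environment variables:

```python
# Minimal sketch: the built-in (C++) gloo allreduce from Python.
# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the
# environment for the default env:// rendezvous.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
t = torch.ones(1024) * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # dispatches to gloo's C++ allreduce
```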
Thank you for your reply.
> Did you expect an improvement over the MPI implementation? If so, what kind of improvement?

Yes. I tested `gloo`, `NCCL`, and `MPI` across roughly 32 nodes (ResNet-50 data-parallel), and I found that `dist.all_reduce` with the MPI backend was the most time-consuming. I think the reason is that the MPI backend has not implemented ring allreduce yet.
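Roughly, the per-backend comparison looked like the sketch below. This is an illustrative micro-benchmark, not the actual ResNet-50 training run; the tensor size and iteration count are placeholders:

```python
# Illustrative sketch: time dist.all_reduce for one backend at a time.
# ~25M floats is roughly the ResNet-50 parameter count; for NCCL the tensor
# must live on a GPU.
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")  # or "gloo" / "nccl"
t = torch.randn(25_000_000)

dist.barrier()
start = time.perf_counter()
for _ in range(10):
    dist.all_reduce(t)
dist.barrier()

if dist.get_rank() == 0:
    print(f"avg all_reduce: {(time.perf_counter() - start) / 10:.4f} s")
```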
> Different MPI implementations use different algorithms. You can look at the OpenMPI configuration parameters/tunables to figure out how to tweak which algorithm it should use.

So which one may be the fastest? I am not familiar with MPI implementations.
> The gloo backend implements ring allreduce in C++. You can of course build it yourself on top of `send` and `recv` as well, but this won’t be faster than the existing implementation.

`gloo` is good, but it does not support `gather` and `scatter` on GPU; that's why I chose CUDA-aware MPI. By the way, you said that just using `send` and `recv` won't be faster than the existing implementation. Could you give me more details? I wonder why using `send` and `recv` couldn't reduce latency, since I used them to implement ring-allreduce.