How to implement ring-allreduce using MPI backend?

I followed the code in #12012 to implement a ring-allreduce algorithm, but I cannot see any improvement over the OpenMPI allreduce. Is there a way to do this using just the send and recv methods in PyTorch?
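For context, here is a minimal sketch of the kind of send/recv ring-allreduce I mean (simplified, not my exact code; it assumes an already-initialized process group, CPU tensors or CUDA-aware MPI, and `tensor.numel() >= world_size`; the name `ring_allreduce` is just for illustration):

```python
import torch
import torch.distributed as dist

def ring_allreduce(tensor):
    """In-place sum-allreduce over a ring, built only on point-to-point ops.

    Phase 1 (reduce-scatter): after world_size - 1 steps every rank holds
    the fully reduced values for one chunk. Phase 2 (allgather): the reduced
    chunks are passed around the ring so every rank ends up with the complete
    result. Assumes tensor.numel() >= world_size.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    left = (rank - 1) % world_size
    right = (rank + 1) % world_size

    # Work on a contiguous flat copy; the chunks below are views into it.
    flat = tensor.detach().clone().flatten()
    chunks = list(flat.chunk(world_size))

    # Reduce-scatter: send one chunk to the right neighbour, accumulate the
    # chunk arriving from the left neighbour.
    for step in range(world_size - 1):
        send_idx = (rank - step) % world_size
        recv_idx = (rank - step - 1) % world_size
        recv_buf = torch.empty_like(chunks[recv_idx])
        send_req = dist.isend(chunks[send_idx], right)  # non-blocking send avoids ring deadlock
        dist.recv(recv_buf, left)
        send_req.wait()
        chunks[recv_idx] += recv_buf

    # Allgather: circulate the fully reduced chunks around the ring.
    for step in range(world_size - 1):
        send_idx = (rank - step + 1) % world_size
        recv_idx = (rank - step) % world_size
        recv_buf = torch.empty_like(chunks[recv_idx])
        send_req = dist.isend(chunks[send_idx], right)
        dist.recv(recv_buf, left)
        send_req.wait()
        chunks[recv_idx].copy_(recv_buf)

    tensor.copy_(flat.view_as(tensor))
    return tensor
```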

Did you expect an improvement over the MPI implementation? If so, what kind of improvement?

Different MPI implementations use different algorithms. You can look at the OpenMPI configuration parameters/tunables to figure out how to tweak which algorithm it should use.
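For example, with OpenMPI something along these lines should pin the tuned collective component to a fixed allreduce algorithm (a hedged sketch: the exact parameter names and the numeric value for "ring" depend on your OpenMPI version, and `train.py` is just a placeholder for your script, so check `ompi_info` on your installation first):

```
# List the allreduce tunables your OpenMPI build actually exposes.
ompi_info --all | grep coll_tuned_allreduce

# Ask the tuned component to use a fixed algorithm (4 is ring on the
# versions I have seen; verify against the ompi_info output above).
mpirun -np 32 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    python train.py
```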

The gloo backend implements ring allreduce in C++. You can build it yourself on top of send and recv as well, of course, but this won’t be faster than the existing implementation.
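For reference, using the existing collective is a one-liner once the process group is up (the init details below are placeholders and depend on how you launch the job):

```python
import torch
import torch.distributed as dist

# Placeholder init: env:// expects MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE to be set by the launcher.
dist.init_process_group(backend="gloo", init_method="env://")

t = torch.ones(4) * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # gloo runs its C++ ring allreduce here
print(dist.get_rank(), t)
```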

Thank you for your reply.

  1. Did you expect an improvement over the MPI implementation? If so, what kind of improvement?
    Yes. I tested gloo, NCCL, and MPI across nearly 32 nodes (ResNet-50 data parallelism), and dist.all_reduce with the MPI backend was the most time-consuming. I think the reason is that the MPI backend has not implemented ring allreduce yet.

  2. Different MPI implementations use different algorithms. You can look at the OpenMPI configuration parameters/tunables to figure out how to tweak which algorithm it should use.
    So which one is likely to be the fastest? I am not familiar with MPI implementations.

  3. The gloo backend implements ring allreduce in C++. You can build it yourself on top of send and recv as well, of course, but this won’t be faster than the existing implementation.
    gloo is good, but it cannot support gather and scatter on the GPU, which is why I chose CUDA-aware MPI. By the way, you said that just using send and recv won’t be faster than the existing implementation; could you give me more details? I wonder why send and recv cannot reduce latency, since I used them to implement ring-allreduce (the timing sketch I use for this comparison is shown below this list).
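Here is roughly how I compare the two, in case it helps show what I am measuring (this assumes the `ring_allreduce` sketch from my first post and an already-initialized MPI process group; the tensor size is only meant to be roughly ResNet-50-sized):

```python
import time
import torch
import torch.distributed as dist

def time_op(fn, tensor, iters=20):
    """Average wall-clock time of one allreduce variant over `iters` calls."""
    dist.barrier()
    start = time.time()
    for _ in range(iters):
        fn(tensor)
    dist.barrier()
    return (time.time() - start) / iters

# ~25M floats, roughly the parameter count of ResNet-50 (~100 MB of fp32).
payload = torch.randn(25_000_000)

custom = time_op(ring_allreduce, payload.clone())  # the send/recv sketch above
builtin = time_op(lambda t: dist.all_reduce(t, op=dist.ReduceOp.SUM),  # backend's own allreduce
                  payload.clone())

if dist.get_rank() == 0:
    print(f"custom ring allreduce: {custom * 1000:.1f} ms/iter, "
          f"built-in all_reduce: {builtin * 1000:.1f} ms/iter")
```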