Does distributed RPC support collective communication?

Midhilesh · April 12, 2021, 6:30am

I am currently studying distributed RPC for hybrid parallelism. From the documentation, I understand that RPC uses the TensorPipe backend, which provides point-to-point communication. For hybrid parallelism, however, I need all-to-all collective communication. Is there a way to implement hybrid parallelism with collective communication using distributed RPC?

I would be grateful if anyone could suggest a solution.

agolynski · April 12, 2021, 5:15pm

cc Luca @lcw

I think the main use case of TensorPipe is not collectives. For those you can use other solutions, e.g. NCCL, Gloo, UCC.

Luca, are there plans for TensorPipe to be a backend for such collectives?

lcw · April 16, 2021, 9:10am

Yes, correct, we currently don’t provide a way to do collectives on top of RPC/TensorPipe. The rationale is that the “native” collective libraries (NCCL, Gloo, MPI) are already doing a much better job at this, hence we’re not optimizing TensorPipe and RPC for that use case. However, you should be able to combine RPC with the collective libraries very easily. Here is a tutorial showing how to do so with DDP, but if you prefer to use the “lower-level” API that should work too: “Combining Distributed DataParallel with Distributed RPC Framework” (PyTorch Tutorials 1.8.1+cu102 documentation).
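
For the “lower-level” route, a minimal sketch might look like the following. It assumes a single machine, CPU tensors, and the Gloo backend. The helper names `free_port` and `run_worker`, the `worker{rank}` naming scheme, and the port choices are made up for illustration and are not part of any PyTorch API. The idea is simply that each process initializes both a process group (for collectives) and RPC (for point-to-point calls), with each using its own rendezvous address.

```python
import socket

import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc


def free_port() -> int:
    """Hypothetical helper: ask the OS for an unused TCP port (single-machine demos only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def run_worker(rank: int, world_size: int, pg_port: int, rpc_port: int):
    """Hypothetical per-process entry point; every rank must receive the same two ports."""
    # 1) Collective side: a Gloo process group with its own rendezvous port.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{pg_port}",
        rank=rank,
        world_size=world_size,
    )

    # 2) Point-to-point side: RPC over TensorPipe, rendezvousing on a different port.
    options = rpc.TensorPipeRpcBackendOptions(
        init_method=f"tcp://127.0.0.1:{rpc_port}"
    )
    rpc.init_rpc(
        f"worker{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=options,
    )

    # Point-to-point: run torch.add on the next worker and fetch the result.
    peer = f"worker{(rank + 1) % world_size}"
    scaled = rpc.rpc_sync(peer, torch.add, args=(torch.ones(2), 3))

    # Collective: sum one tensor per rank across the whole group (in place).
    total = torch.full((2,), float(rank + 1))
    dist.all_reduce(total, op=dist.ReduceOp.SUM)

    # Collective: every rank receives one tensor from every other rank.
    gathered = [torch.zeros(2) for _ in range(world_size)]
    dist.all_gather(gathered, torch.full((2,), float(rank)))

    # Tear down RPC first (it waits for all workers), then the process group.
    rpc.shutdown()
    dist.destroy_process_group()
    return scaled, total, gathered
```

In practice you would start one process per rank (for example with `torch.multiprocessing.spawn`) and pass the same pair of ports to all of them; calling `run_worker(0, 1, free_port(), free_port())` in a single process is only a smoke test. For a true all-to-all there is `torch.distributed.all_to_all`, but as far as I know it requires the NCCL or MPI backend rather than Gloo, so the sketch sticks to `all_reduce` and `all_gather`.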