Run RPC over MPI for Parameter Server DRL

I am currently developing a DRL framework that can run on a cluster with MPI. I am able to perform synchronous training using DDP over MPI. Now I want to explore a different structure using a parameter server and MPI. I saw that RPC would be the right tool, but I cannot figure out how, or if, RPC can run with MPI.

I saw this example, but it only works when all ranks are running on the same node. Is there a way to accomplish this with PyTorch alone, or is an additional tool needed?

You do not have to run RPC over MPI. PyTorch distributed provides the Gloo and NCCL backends; you can pass `'gloo'` or `'nccl'` to `init_process_group()`.
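A minimal sketch of a multi-node Gloo setup, assuming your launcher (e.g. mpirun or torchrun) exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`; under Open MPI you may need to derive the rank from `OMPI_COMM_WORLD_RANK` yourself:

```python
import os
import torch.distributed as dist

# MASTER_ADDR/MASTER_PORT must point to a host reachable from every node.
# With init_method="env://", rank and world size are read from the environment;
# they are passed explicitly here for clarity.
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```

Because the rendezvous goes over TCP rather than MPI, this works across nodes as long as the master address is reachable from all of them.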

For RPC, to get better performance, you can use TensorPipe as the backend option.
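A minimal sketch of initializing RPC with the TensorPipe backend (the default in recent PyTorch versions) for a parameter-server layout; the names `"ps"` and `"worker{rank}"` and the `num_worker_threads` value are illustrative assumptions, and the same `MASTER_ADDR`/`MASTER_PORT` environment variables are assumed to be set:

```python
import os
import torch.distributed.rpc as rpc

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# TensorPipe is the default RPC backend; num_worker_threads is a tunable.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)

# In this sketch, rank 0 plays the parameter server and the rest are workers.
name = "ps" if rank == 0 else f"worker{rank}"
rpc.init_rpc(
    name=name,
    rank=rank,
    world_size=world_size,
    rpc_backend_options=options,
)

# ... issue rpc.rpc_sync / rpc.rpc_async / rpc.remote calls here ...

rpc.shutdown()  # blocks until all outstanding RPC work is done
```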