I am currently developing a DRL framework that can run on a cluster with MPI. I am able to perform synchronous training using DDP over MPI. Now I want to explore a different structure using a parameter server and MPI. RPC looks like the right tool, but I cannot figure out how, or whether, RPC can run with MPI.
I saw this example, but it only works when all ranks are running on the same node. Is there a way to accomplish this with PyTorch alone, or is an additional tool needed?
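For concreteness, here is a minimal sketch of the kind of setup being asked about. It is not a confirmed solution. It assumes the job is launched with `mpirun`, that the launcher exports `OMPI_COMM_WORLD_RANK`/`OMPI_COMM_WORLD_SIZE` (Open MPI) or `PMI_RANK`/`PMI_SIZE` (MPICH-style launchers), and that `MASTER_ADDR` and `MASTER_PORT` are set by the user to a host and port reachable from every node. The helper names `mpi_rpc_config` and `start_rpc` and the worker names `ps`/`workerN` are hypothetical. In this sketch MPI only starts the processes and supplies the ranks; the RPC traffic itself goes over the TensorPipe backend with a TCP rendezvous.

```python
import os


def mpi_rpc_config(env=None, default_port=29500):
    """Build RPC settings from variables exported by an MPI launcher.

    Assumption: rank and size come from Open MPI or PMI-style variables;
    MASTER_ADDR / MASTER_PORT must be provided by the user on every rank.
    """
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", env.get("PMI_RANK", "0")))
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", env.get("PMI_SIZE", "1")))
    master_addr = env.get("MASTER_ADDR", "127.0.0.1")
    master_port = env.get("MASTER_PORT", str(default_port))
    return {
        # Hypothetical naming scheme: rank 0 acts as the parameter server.
        "name": "ps" if rank == 0 else f"worker{rank}",
        "rank": rank,
        "world_size": world_size,
        "init_method": f"tcp://{master_addr}:{master_port}",
    }


def start_rpc(cfg):
    """Join the RPC group described by cfg, then shut down cleanly."""
    # Imported here so the config helper can be used without PyTorch installed.
    import torch.distributed.rpc as rpc

    options = rpc.TensorPipeRpcBackendOptions(init_method=cfg["init_method"])
    rpc.init_rpc(
        cfg["name"],
        rank=cfg["rank"],
        world_size=cfg["world_size"],
        rpc_backend_options=options,
    )
    # Parameter-server / worker logic (rpc.rpc_sync, rpc.remote, ...) would go here.
    rpc.shutdown()  # blocks until all ranks have finished their RPC work


# Usage inside the script launched by mpirun:
#     start_rpc(mpi_rpc_config())
```

Whether this works across nodes depends on `MASTER_ADDR` resolving to the node that hosts rank 0 and on the chosen port being open between nodes; those details are assumptions about the cluster, not something the sketch verifies.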