Manually specifying IP addresses and ports of all nodes for distributed data parallel

ruppesh · July 23, 2019, 11:21pm

I noticed that for distributed data parallel, you only need to specify the IP address and port of the rank 0 node, and during initialization all nodes discover each other through that node. Because of certain firewall restrictions, however, I want to manually specify the IP address and port that each node should use to communicate during all-reduce operations. Is there a way to do that? I am open to making changes to the PyTorch source code.
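For context, the rank 0 rendezvous described above can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the helper names `rendezvous_url` and `init_distributed`, the address `10.0.0.1`, and the port `29500` are all assumptions made for the example. Only `torch.distributed.init_process_group` and its `backend`, `init_method`, `rank`, and `world_size` arguments are PyTorch's own.

```python
def rendezvous_url(master_addr: str, master_port: int) -> str:
    """Build the TCP init_method URL that points every rank at rank 0.

    Hypothetical helper: only rank 0's address and port appear here;
    no per-node addresses are specified.
    """
    return f"tcp://{master_addr}:{master_port}"


def init_distributed(rank: int, world_size: int,
                     master_addr: str, master_port: int,
                     backend: str = "gloo") -> None:
    """Join the process group; every node passes the same rank 0 address."""
    # Imported lazily so the URL helper works without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,
        init_method=rendezvous_url(master_addr, master_port),
        rank=rank,
        world_size=world_size,
    )


if __name__ == "__main__":
    # Placeholder values: each node would run this with its own rank,
    # but with the same rank 0 address and port.
    print(rendezvous_url("10.0.0.1", 29500))
```

Each node would call something like `init_distributed(rank, world_size, "10.0.0.1", 29500)` with its own `rank`. Nothing in this call lets a node state which address and port its peers should use afterwards, which is the gap the question is about.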

pietern · July 24, 2019, 8:49am

There is no way to do this today. Whether it would be possible at all also depends on which distributed backend you’re using. If you’re using Gloo, it might be possible, but it would be quite a bit of work. If you’re using NCCL, it’s up to NVIDIA. If you’re using MPI, I don’t know.