I noticed that for distributed data parallel, you only need to specify the IP address and port of the rank-0 node; during initialization, all other nodes discover each other through that node. However, due to certain firewall restrictions, I want to manually specify the IP address and port via which each node should communicate for all-reduce operations. Is there a way to do that? I am open to making changes to the PyTorch source code.
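For context, this is a minimal sketch of the rendezvous being described: `init_process_group` takes only the rank-0 node's address and port via `init_method`, and that is the sole connectivity information you can specify. The address, port, and single-process `world_size=1` here are placeholder values so the snippet runs standalone.

```python
import torch
import torch.distributed as dist

# The tcp:// init_method names only the rank-0 node's address and port;
# peer-to-peer connections for collectives are set up behind the scenes.
# world_size=1 makes this runnable as a single process for illustration.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",  # placeholder rank-0 address/port
    rank=0,
    world_size=1,
)

t = torch.ones(3)
dist.all_reduce(t)  # sums across ranks; a no-op with a single rank
print(t.tolist())

dist.destroy_process_group()
```

Note there is no per-peer address parameter anywhere in this API, which is exactly the limitation the question is about.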
There is no way to do this today. Whether it would be possible at all also depends on which distributed backend you're using. With Gloo it might be possible, but it's quite a bit of work. With NCCL, it's up to NVIDIA. With MPI, I don't know.