Distributed Data Parallel over the Internet

I am trying to train a network with “Distributed Data Parallel” on multiple nodes, each with a different public IP address. I launch the job by ssh-ing into all of these nodes simultaneously with the “pdsh” coordination tool, as suggested in this tutorial.

Specifically, given a local machine with public IP address “Ip0” and 2 remote nodes with public IP addresses “Ip1” and “Ip2” respectively, where training runs on the remote nodes and is launched from the local machine, how do I go about making such a set-up?

Also, before running the training script on each remote node, how can I make sure that the 2 remote nodes can reach each other?

Thanks in advance.

Can you try to ssh into one of the remote machines and then ping/ssh another remote machine from there?

One of the remote machines can serve as the master, i.e., use its IP address as MASTER_ADDR and pick a port on it for MASTER_PORT.
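For reference, here is a minimal sketch of how that could look inside the training script on each node, assuming the env:// rendezvous. The helper names, the address 203.0.113.1 (a documentation-only placeholder standing in for the master's public IP), port 29500 and the gloo backend are my own assumptions for illustration, not something stated in this thread.

```python
import os


def rendezvous_env(master_addr, master_port, rank, world_size):
    """Build the variables that init_method="env://" reads (hypothetical helper)."""
    return {
        "MASTER_ADDR": master_addr,       # IP of the node chosen as master
        "MASTER_PORT": str(master_port),  # port picked on the master
        "RANK": str(rank),                # unique per process, 0 on the master
        "WORLD_SIZE": str(world_size),    # total number of processes
    }


def init_distributed(master_addr, master_port, rank, world_size, backend="gloo"):
    """Join the process group; waits for the other processes to join."""
    # Imported here so that rendezvous_env works on machines without torch.
    import torch.distributed as dist

    os.environ.update(rendezvous_env(master_addr, master_port, rank, world_size))
    dist.init_process_group(backend=backend, init_method="env://")


# Placeholder values: one process per node, two nodes in total.
env = rendezvous_env("203.0.113.1", 29500, rank=0, world_size=2)
print(env)
```

Every node would use the same MASTER_ADDR, MASTER_PORT and WORLD_SIZE; only RANK differs from node to node.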

If you mean ssh-ing into node-2 from node-1, yes I can do that. However, what if I were to do this with n (say 10) nodes? Should I then ssh into remote node-1 and then from the remote node-1 terminal use “pdsh” to ssh into all other nodes simultaneously? @mrshenli

No, you don’t need to ssh from node-1 to other nodes to launch the script. The ping/ssh I mentioned is only to check which IP would work. If you confirm that the IP of one node is reachable from all other nodes, you can set that node as master. This is only for rendezvous, and all nodes will use the rendezvous process to discover each other automatically.
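If you would rather script the reachability check than run ping/ssh by hand, something like the sketch below could be run on each node. The function name can_reach is hypothetical, and the check only succeeds when something is already listening on the probed port (for example sshd on its usual port 22, or the master process on MASTER_PORT once it has started).

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Local demonstration: listen on an ephemeral port on this machine, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable = can_reach("127.0.0.1", demo_port)
listener.close()
print(reachable)
```

On a real node you would call it with the candidate master's public IP instead of 127.0.0.1. A False result does not say why the connection failed; a firewall rule or a port with no listener look the same from here.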

Understood, thanks for such a subtle answer. @mrshenli