Training a model across multiple remote servers, each having multiple GPUs

Is it possible to train a model across multiple remote servers in my department? These servers are not connected to each other. I want to use the GPUs of both servers (which have different IP addresses) so that I can train with a larger batch size.

I have seen nn.DistributedDataParallel, but how do I specify the IP addresses of multiple servers?

What does this mean? Are their IPs not reachable from each other?

I have seen nn.DistributedDataParallel, but how do I specify the IP addresses of multiple servers?

If they can reach each other over the network, yes, DistributedDataParallel can work across multiple machines. You need to provide the master address and master port to all peers so that they can rendezvous. See this example.
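As a rough illustration (not the code from the linked example), the sketch below shows what each process on each machine would run. It assumes the launcher (e.g. torchrun) has already set MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK in the environment, and the tiny linear model is just a stand-in.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Rendezvous: every peer reads MASTER_ADDR/MASTER_PORT from the
    # environment and connects to the master to form the process group.
    dist.init_process_group(backend="nccl", init_method="env://")

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)   # stand-in model
    model = DDP(model, device_ids=[local_rank])  # gradients sync across all peers

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(5):                           # dummy training loop
        inputs = torch.randn(32, 10).cuda(local_rank)
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()                          # all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```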

If you want to choose a specific network interface, you can configure one of the following two env vars (more details); you only need the one matching the backend you are using. They can also be set from Python, as in the sketch after this list.

NCCL_SOCKET_IFNAME, for example export NCCL_SOCKET_IFNAME=eth0
GLOO_SOCKET_IFNAME, for example export GLOO_SOCKET_IFNAME=eth0
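For instance, the same interface selection could be done at the top of the training script before the process group is created; "eth0" here is just a placeholder for whichever interface on your servers can reach the other machine.

```python
import os

# Pick the network interface used for inter-node communication.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # when using the NCCL backend
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")  # when using the Gloo backend
```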

I think I didn’t describe my situation correctly. What I meant is that they are different systems. One has IP: a.b.c.d, and the other has IP: a.b.c.e.

Okay, thanks! I’ll try it out

I think I didn’t describe my situation correctly. What I meant is that they are different systems. One has IP: a.b.c.d, and the other has IP: a.b.c.e.

I see. This should be fine. You only need to pick one of them as the master and set MASTER_ADDR and MASTER_PORT on all peers to point to that master. This allows all peers to rendezvous, and the rendezvous process will then create connections between each pair of peers.
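For example, a minimal sketch of that setup, assuming the server a.b.c.d is chosen as master, both servers have the same number of GPUs, port 29500 is free, and the same script runs on both machines with only NODE_RANK differing (0 on a.b.c.d, 1 on a.b.c.e):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

GPUS_PER_NODE = torch.cuda.device_count()
NUM_NODES = 2
WORLD_SIZE = NUM_NODES * GPUS_PER_NODE


def worker(local_rank, node_rank):
    # All peers point at the same master to rendezvous.
    os.environ["MASTER_ADDR"] = "a.b.c.d"
    os.environ["MASTER_PORT"] = "29500"

    # Global rank is unique across both machines.
    global_rank = node_rank * GPUS_PER_NODE + local_rank
    dist.init_process_group(
        backend="nccl",
        rank=global_rank,
        world_size=WORLD_SIZE,
    )
    torch.cuda.set_device(local_rank)

    # ... build the model, wrap it in DistributedDataParallel, train ...

    dist.destroy_process_group()


if __name__ == "__main__":
    node_rank = int(os.environ.get("NODE_RANK", "0"))  # 0 on a.b.c.d, 1 on a.b.c.e
    mp.spawn(worker, args=(node_rank,), nprocs=GPUS_PER_NODE)
```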
