Use two GPUs from two different virtual machines in parallel

Does PyTorch provide any option to train a model in parallel using two GPUs installed in two different virtual machines? If so, could someone please help me implement it?
I found that DataParallel is supposed to run on multiple GPUs present on the same machine, but I couldn't find any suitable information for my requirement. TIA

Does PyTorch provide any option to train a model in parallel using two GPUs installed in two different virtual machines?

Relevant link: Distributed communication package - torch.distributed — PyTorch 2.1 documentation
Not sure if you can do this with virtual machines.

I found that DataParallel is supposed to run on multiple GPUs present on the same machine, but I couldn't find any suitable information for my requirement.

You are right: DataParallel is not designed for multi-node training.

Thanks for your response. I've checked the distributed documentation, but I don't really understand the multi-node multi-process distributed training section.
Why do the --master_addr="192.168.1.1" and --master_port=1234 arguments have the same values on both nodes? Why aren't the IP and port values different? Will it work if I assign the two virtual machines' respective IP addresses and port numbers to those arguments?

According to the "Environment variable initialization" section of the PyTorch distributed documentation, the rank-0 node is used to set up all connections, so MASTER_ADDR must be the address of the rank-0 node and MASTER_PORT a free port on it. Therefore, you should use the same values on all nodes in the same process group.
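
To make this concrete, here is a minimal sketch of that environment-variable initialization; the IP address and port are example values, and in practice torch.distributed.launch sets RANK and WORLD_SIZE for each process, so you would not hard-code the defaults shown here:

import os
import torch.distributed as dist

# Both VMs point MASTER_ADDR/MASTER_PORT at the rank-0 node; only RANK differs per process.
os.environ["MASTER_ADDR"] = "192.168.1.1"  # address of the rank-0 VM (example value)
os.environ["MASTER_PORT"] = "1234"         # a free port on the rank-0 VM (example value)

dist.init_process_group(
    backend="gloo",                                   # or "nccl" for GPU-to-GPU communication
    init_method="env://",                             # read MASTER_ADDR/MASTER_PORT from the environment
    rank=int(os.environ.get("RANK", 0)),              # 0 on the first VM, 1 on the second
    world_size=int(os.environ.get("WORLD_SIZE", 2)),  # total number of processes across both VMs
)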

How do the two virtual machines connect to each other and run the GPUs of the other machine in parallel? Of the two virtual machines, on which one should I run the launch commands below?

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
    --nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
    --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
    and all other arguments of your training script)

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
    --nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
    --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
    and all other arguments of your training script)

How do the two virtual machines connect to each other and run the GPUs of the other machine in parallel?

Once the two machines form a process group, you can use collective APIs such as all_reduce for communication.
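
As a small illustration (not from the original thread), once init_process_group has been called on both VMs, a collective call such as all_reduce looks like this; each process passes in its own tensor and gets back the sum over all processes:

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every VM (see above).
rank = dist.get_rank()

# Each process contributes its own tensor; after all_reduce every process holds the sum.
t = torch.tensor([float(rank + 1)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {t.item()}")  # with two processes this prints 3.0 on both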

Of the two virtual machines, on which one should I run the launch commands below?

You should run these commands on the two virtual machines respectively, one command on each; it is not a matter of picking one machine out of the two.

I don't have any idea about process groups or the allreduce API. Could you please elaborate on this process or share relevant links to follow up?

Do you mean to run the first launch command on one VM and the second command on the other VM, or to run both commands on both VMs?

I don't have any idea about process groups or the allreduce API. Could you please elaborate on this process or share relevant links to follow up?

Check out the "Collective functions" section in the documentation; these are lower-level APIs for communication. Ideally, you shouldn't need to do any explicit communication at all: just try DDP, which makes the training loop look no different from training in a single process.
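
For reference, here is a minimal sketch of what such a DDP training script might look like (the model, optimizer, and data are placeholders made up for illustration; torch.distributed.launch is assumed to provide LOCAL_RANK, RANK, and WORLD_SIZE to each process):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Pick the GPU assigned to this process on the current VM.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Join the process group spanning both VMs (RANK/WORLD_SIZE come from the launcher).
dist.init_process_group(backend="nccl", init_method="env://")

# Any model works here; a tiny linear layer keeps the sketch self-contained.
model = nn.Linear(10, 1).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# One illustrative training step: DDP averages gradients across all GPUs on both VMs
# during backward(), so the loop reads exactly like single-GPU code.
inputs = torch.randn(32, 10).cuda(local_rank)
targets = torch.randn(32, 1).cuda(local_rank)

optimizer.zero_grad()
loss = loss_fn(ddp_model(inputs), targets)
loss.backward()
optimizer.step()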

Do you mean to run the first launch command on one VM and the second command on the other VM, or to run both commands on both VMs?

Run the first launch command on one VM and the second command on the other VM.