DDP seems unable to find the second node

I am training my model on two nodes (4 GPUs) with DDP.

When I log in to the first node, everything seems to work: running ps aux | grep python shows two tasks running.

But when I log in to the second node, no tasks are running there.

So how does DDP find the second node?

If I understand correctly, you are trying to train with 4 GPUs, 2 on one machine and 2 on another machine? If this is the case, then you will need to launch your training script separately on each machine. The node_rank passed to the launch script on the first machine should be 0, and the node_rank passed to the launch script on the second machine should be 1. It looks like you are instead passing two different node_ranks to processes launched on the same machine, so nothing is ever started on the second machine.

See the multi-node multi-process distributed launch example here: Distributed communication package - torch.distributed — PyTorch 1.7.0 documentation
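To make the per-node launch concrete, here is a minimal sketch that builds the command you would run on each machine. The helper names (build_launch_command, global_rank), the master address 192.168.1.1, the port 29500, and the script name train.py are placeholders I made up for illustration; the flags themselves are the ones accepted by torch.distributed.launch.

```python
# Sketch only: the address, port and script name below are assumed placeholders.

def build_launch_command(node_rank, nnodes=2, nproc_per_node=2,
                         master_addr="192.168.1.1", master_port=29500,
                         script="train.py"):
    """Return the argv list to run on the machine with the given node_rank."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",  # processes (GPUs) on this machine
        f"--nnodes={nnodes}",                  # total number of machines
        f"--node_rank={node_rank}",            # 0 on the first machine, 1 on the second
        f"--master_addr={master_addr}",        # address of the node_rank 0 machine
        f"--master_port={master_port}",        # a free port on that machine
        script,
    ]


def global_rank(node_rank, local_rank, nproc_per_node=2):
    """Global rank that the launcher assigns to a process."""
    return node_rank * nproc_per_node + local_rank


# Print the command to run on each of the two machines.
for rank in range(2):
    print(f"node {rank}: {' '.join(build_launch_command(rank))}")
```

The two commands differ only in --node_rank. Both machines point at the same master address and port, which is how the processes on the second machine join the group hosted by the first.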

Thanks, I got it! By the way, is there any way to do that automatically? Manually launching the task on each node is not convenient.

We don’t provide a way of doing this natively in PyTorch. Writing a simple bash script to do this should be doable though (take a list of hostnames, ssh into each one, copy over the PyTorch program, and run it). You could also use Slurm or mpirun if those are available in your environment.
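The same idea as the bash script described above can be sketched in Python. Treat this as an illustration under assumptions, not something PyTorch ships: the helper names (build_remote_commands, launch_all), the hostnames, the remote directory ddp_job, and train.py are all hypothetical, and it assumes passwordless ssh, an existing ddp_job directory under the remote home directory, and a python with PyTorch on the remote PATH.

```python
import shlex
import subprocess


def build_remote_commands(hosts, script="train.py", remote_dir="ddp_job",
                          nproc_per_node=2, master_port=29500):
    """Return one (copy_cmd, run_cmd) pair per host; the first host is the master."""
    master_addr = hosts[0]
    plans = []
    for node_rank, host in enumerate(hosts):
        launch = [
            "python", "-m", "torch.distributed.launch",
            f"--nproc_per_node={nproc_per_node}",
            f"--nnodes={len(hosts)}",
            f"--node_rank={node_rank}",  # position of the host in the list
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            script,
        ]
        # Quote each argument so the remote shell sees it unchanged.
        remote = f"cd {shlex.quote(remote_dir)} && " + " ".join(
            shlex.quote(arg) for arg in launch)
        copy_cmd = ["scp", script, f"{host}:{remote_dir}/"]
        run_cmd = ["ssh", host, remote]
        plans.append((copy_cmd, run_cmd))
    return plans


def launch_all(hosts, **kwargs):
    """Copy the script to every host, start all nodes, and wait for them."""
    procs = []
    for copy_cmd, run_cmd in build_remote_commands(hosts, **kwargs):
        subprocess.run(copy_cmd, check=True)      # copy the program over
        procs.append(subprocess.Popen(run_cmd))   # start this node without blocking
    return [proc.wait() for proc in procs]        # exit code of each ssh session
```

Calling launch_all(["node1", "node2"]) from a machine that can reach both hosts would then start both nodes. The nodes are started with Popen rather than one after another because the node_rank 0 processes block until every other node has joined.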

If you decide to use torchelastic for distributed training, then there are some plugins available to simplify training in the cloud: elastic/kubernetes at master · pytorch/elastic · GitHub
