Migrating from torch.distributed.launch to torchrun

Hi guys, I am working on rewriting some previous multi-node training code on Azure ML.
The previous code uses the command
python -m torch.distributed.launch --nnodes $NUM_NODES --nproc_per_node $NUM_TRAINERS --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT $TRAINING_COMMAND
and works well now.
When I followed this to migrate the command to
torchrun --nnodes=$NUM_NODES --nproc_per_node $NUM_TRAINERS --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT $TRAINING_COMMAND
I got a RendezvousTimeoutError on the master node and a RendezvousConnectionError on the other node.
Could anyone help me find a solution or figure out what I did wrong? Thanks!

@aivanou @Kiuk_Chung ?

Could you provide us with the actual command (with the real values for nnodes, nproc_per_node, etc.)? Were you running across multiple hosts for both commands? torchrun and torch.distributed.launch both use the same underlying main entrypoint (torch.distributed.run), so I’m wondering whether the two commands were invoked in exactly the same setup.

Sure. The command for torch.distributed.launch is
python3 -m torch.distributed.launch --nnodes 2 --nproc_per_node 4 --node_rank 0 --master_addr 10.0.0.8 --master_port 6000 $TRAINING_COMMAND, where the node rank is changed to 1 on the other node.
For torchrun the command is
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=0 --rdzv_backend=c10d --rdzv_endpoint=10.0.0.8:6000 $TRAINING_COMMAND. I just used the node rank as the rendezvous ID.
Actually I’ve tried passing other rendezvous backends like etcd, but that also failed.

Please check the reply above. Sorry for not replying to your post.

Could you paste the full log output of both nodes? I believe what is happening is that the fully-qualified domain name (socket.getfqdn()) on node0 returns a hostname.domain that has no DNS route from node1 (hence the timeout); there is a quick way to check this sketched after the list below. There are two ways around this:

  1. Make sure that the 2 nodes have a public IP
  2. Run with “static” rendezvous (just skip the --rdzv_backend option, it will default to “static”), which has the same options as torch.distributed.launch:
    # use --node_rank=1 on the other node
    $ torchrun --node_rank=0 \
               --nnodes=2 \
               --nproc_per_node=4 \
               --master_addr=10.0.0.8 \
               --master_port=6000 \
               $TRAINING_CMD
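
In case it helps to confirm the DNS theory, below is a minimal check you could run on node1 (just a sketch, not part of torchrun; NODE0_FQDN and PORT are placeholders: fill them in with the value socket.getfqdn() prints on node0 and with your rendezvous port):

    # Sketch of a reachability check, meant to be run from node1.
    # NODE0_FQDN is a placeholder: paste the value printed by
    #   python3 -c "import socket; print(socket.getfqdn())"
    # on node0. PORT is the rendezvous port from your command (6000 here).
    import socket

    NODE0_FQDN = "node0.example.internal"   # placeholder, replace me
    PORT = 6000

    try:
        infos = socket.getaddrinfo(NODE0_FQDN, PORT, proto=socket.IPPROTO_TCP)
        print("resolves to:", sorted({info[4][0] for info in infos}))
        # The connect test only succeeds while the rendezvous on node0 is already
        # listening; a failure in the resolution step above is the key signal.
        with socket.create_connection((NODE0_FQDN, PORT), timeout=5):
            print("TCP connection ok")
    except OSError as exc:
        print(f"cannot resolve/reach {NODE0_FQDN}:{PORT}: {exc}")

If the name does not resolve from node1, that would line up with the errors you are seeing, and option 2 above sidesteps it because node1 then connects to the --master_addr IP directly.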