Torchrun-launched jobs hang on multiple machines

Hi,

I am trying to launch RPC-based jobs on multiple machines via torchrun, but they hang: the print statement is never reached.

machineA: MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 demo.py
machineB: MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 demo.py

Running the same demo.py on a single machine works as expected:
MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=1 --nproc_per_node=2 demo.py

demo.py is almost empty; it contains only print(os.environ['RANK']).
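
For concreteness, a minimal sketch of what such a demo.py looks like (only the import is added to what was described above):

```python
# demo.py -- minimal sketch: it only prints the rank that torchrun
# injects into each worker's environment.
import os

print(os.environ['RANK'])
```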

Hi, I am facing the same problem. Have you managed to solve it?

run on machine A: torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 --rdzv_id=0 --rdzv_endpoint=IP:12348 demo.py

run on machine B: torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 --rdzv_id=0 --rdzv_endpoint=IP:12348 demo.py
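
A minimal connectivity check along these lines can help isolate whether the rendezvous itself is what hangs. This is a sketch, assuming a demo.py-style script; the gloo backend is an assumption (it avoids needing GPUs for the test):

```python
# Hedged sketch of a rendezvous check: if the barrier completes on both
# machines, the multi-node rendezvous works and the hang lies elsewhere
# (e.g. firewall rules or hostname resolution).
import os

import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the
# default env:// init method needs no extra arguments.
dist.init_process_group(backend="gloo")
print(f"rank {os.environ['RANK']} joined, world size {dist.get_world_size()}")
dist.barrier()
dist.destroy_process_group()
```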


Thank you. Do I need to change the training script from the torch.distributed.launch example, besides the local_rank handling? I cannot find any torchrun example on the Internet.

Please refer to torchrun (Elastic Launch) — PyTorch 1.12 documentation.
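
Concretely, the main script-side change is reading LOCAL_RANK from the environment instead of parsing a --local_rank command-line argument. A minimal sketch (the nccl backend and one-GPU-per-process setup are assumptions about your environment):

```python
# Sketch of a torchrun-style entry point. Under torch.distributed.launch the
# script typically parsed --local_rank via argparse; torchrun instead exports
# LOCAL_RANK (plus RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) as env vars.
import os

import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)           # assumes one GPU per process
    dist.init_process_group(backend="nccl")     # env:// init from torchrun's vars
    # ... build model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```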