Torchrun-launched jobs hang on multiple machines

Hi,

I am trying to launch RPC-based jobs on multiple machines via torchrun, but they hang: the print statement is never reached.

machineA: MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 demo.py
machineB: MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 demo.py

Running the same demo.py on a single machine works as expected:
MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=1 --nproc_per_node=2 demo.py

demo.py is almost empty; it contains only print(os.environ['RANK']).
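
For concreteness, a minimal sketch of what such a demo.py looks like (only the import is added to what was described above):

```python
# demo.py -- minimal sketch: it only prints the rank that torchrun
# injects into each worker's environment.
import os

print(os.environ['RANK'])
```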

Hi, I am facing the same problem. Have you managed to solve it?

run on machine A: torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 --rdzv_id=0 --rdzv_endpoint=IP:12348 demo.py

run on machine B: torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 --rdzv_id=0 --rdzv_endpoint=IP:12348 demo.py
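
A minimal connectivity check along these lines can help isolate whether the rendezvous itself is what hangs. This is a sketch, assuming a demo.py-style script; the gloo backend is an assumption (it avoids needing GPUs for the test):

```python
# Hedged sketch of a rendezvous check: if the barrier completes on both
# machines, the multi-node rendezvous works and the hang lies elsewhere
# (e.g. firewall rules or hostname resolution).
import os

import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the
# default env:// init method needs no extra arguments.
dist.init_process_group(backend="gloo")
print(f"rank {os.environ['RANK']} joined, world size {dist.get_world_size()}")
dist.barrier()
dist.destroy_process_group()
```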


Thank you. Do I need to change the training script from the torch.distributed.launch example, besides the local_rank handling? I cannot find any torchrun example on the Internet.

Please refer to torchrun (Elastic Launch) — PyTorch 1.12 documentation.
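
Concretely, the main script-side change is reading LOCAL_RANK from the environment instead of parsing a --local_rank command-line argument. A minimal sketch (the nccl backend and one-GPU-per-process setup are assumptions about your environment):

```python
# Sketch of a torchrun-style entry point. Under torch.distributed.launch the
# script typically parsed --local_rank via argparse; torchrun instead exports
# LOCAL_RANK (plus RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) as env vars.
import os

import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)           # assumes one GPU per process
    dist.init_process_group(backend="nccl")     # env:// init from torchrun's vars
    # ... build model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```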