Hi,
I am trying to launch RPC-based jobs on multiple machines via torchrun, but the launch gets stuck: the print output never appears.
machineA: MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=2 --nproc_per_node=2 --node_rank=0 demo.py
machineB: MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=2 --nproc_per_node=2 --node_rank=1 demo.py
Running the same demo.py on a single machine works as expected:
MASTER_ADDR='xxxxx' MASTER_PORT=12348 torchrun --nnodes=1 --nproc_per_node=2 demo.py
demo.py is almost empty: it contains only print(os.environ['RANK']).
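For reference, demo.py looks roughly like the sketch below. The "unset" fallback is an assumption added here so the script also runs outside torchrun; the actual file just indexes os.environ['RANK'] directly.

```python
import os

# torchrun sets RANK for every worker process it spawns.
# The "unset" fallback is an assumption for illustration only, so that the
# script does not raise KeyError when started without torchrun.
rank = os.environ.get("RANK", "unset")
print(rank)
```

With the two-machine commands above I would expect four lines of output in total (ranks 0 to 3, two per machine), but nothing is printed.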