Hi,
First, my code is available at the link, and I can train on a single node with torchrun --nproc_per_node=8 train.py. However, to train on multiple nodes, I run the following command on each of the 4 nodes separately:
IP=10.128.11.6
PORT=33221
NNODES=4
NRANK=0
EXE=train.py
JOB_ID="01"
ENDPORT=${IP}:${PORT}
echo ${ENDPORT}
# python -m torch.distributed.launch --nproc_per_node=8 --master_addr=${IP} --nnodes=${NNODES} --node_rank ${NRANK} ${EXE}
# this works on a single node, which shows that the training script itself is not the problem
# torchrun \
# --nproc_per_node=8 \
# --rdzv_id=$JOB_ID \
# --rdzv_backend=c10d \
# --max_restarts=1 \
# ${EXE}
# run this on each of the 4 nodes, setting a different node_rank on each
torchrun \
--nnodes=$NNODES \
--node_rank=$NRANK \
--nproc_per_node=8 \
--rdzv_id=$JOB_ID \
--rdzv_backend=c10d \
--rdzv_endpoint=${ENDPORT} \
--max_restarts=1 \
${EXE}
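For concreteness, the fully expanded command on the node with NRANK=1 looks like this (the other nodes are identical except for --node_rank):
torchrun \
--nnodes=4 \
--node_rank=1 \
--nproc_per_node=8 \
--rdzv_id=01 \
--rdzv_backend=c10d \
--rdzv_endpoint=10.128.11.6:33221 \
--max_restarts=1 \
train.py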
The training never starts. I only set a different --node_rank on each node; all other parameters are the same. How can I make this work?