Hi,
First, my code is available at the link, and I can train on a single node with torchrun --nproc_per_node=8 train.py. However, to train on multiple nodes, I run the following command on each of the 4 nodes separately:
IP=10.128.11.6
PORT=33221
NNODES=4
NRANK=0
EXE=train.py
JOB_ID="01"
ENDPORT=${IP}:${PORT}
echo ${ENDPORT}
# python -m torch.distributed.launch --nproc_per_node=8 --master_addr=${IP} --nnodes=${NNODES} --node_rank ${NRANK} ${EXE}
# this works on a single node, which shows that the training script itself is not the problem
# torchrun \
# --nproc_per_node=8 \
# --rdzv_id=$JOB_ID \
# --rdzv_backend=c10d \
# --max_restarts=1 \
# ${EXE}
# run this on each of the 4 nodes, setting a different node_rank on each
torchrun \
--nnodes=$NNODES \
--node_rank=$NRANK \
--nproc_per_node=8 \
--rdzv_id=$JOB_ID \
--rdzv_backend=c10d \
--rdzv_endpoint=${ENDPORT} \
--max_restarts=1 \
${EXE}
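For concreteness, the fully expanded command on the node with NRANK=1 looks like this (the other nodes are identical except for --node_rank):
torchrun \
--nnodes=4 \
--node_rank=1 \
--nproc_per_node=8 \
--rdzv_id=01 \
--rdzv_backend=c10d \
--rdzv_endpoint=10.128.11.6:33221 \
--max_restarts=1 \
train.py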
The training never starts. I only set a different --node_rank on each node; all other parameters are the same. How can I make this work?