Hi,
I'm trying to run a PyTorch DDP job on 2 nodes with 8 GPUs each using mpirun. I want to use 1 MPI rank per node to launch the DDP job on that node, and let torch.distributed.launch spawn 8 worker processes on each node. The command I'm using is:
let N_NODES=2
let N_RANKS=2
for (( node_rank=0; node_rank<$N_NODES; node_rank++ ))
do
    mpirun -np $N_RANKS -npernode 1 -hostfile $HOSTFILE --map-by node \
        python -u -m torch.distributed.launch --master_addr=$master_addr \
            --nproc_per_node=8 --nnodes=$N_NODES --node_rank=$node_rank train.py &
    pids[${node_rank}]=$!
done
When I check the activity on the GPUs with nvidia-smi, I see the code running on only a few GPUs on both nodes (6 on the first node and 4 on the second). What is missing here to run on all 8 GPUs in each node, 16 in total? Any help is kindly appreciated.
Also, if I just launch 2 ranks with 1 rank per node using the command below, I see the expected execution: one instance running on a single GPU in each node.
mpirun -np 2 -npernode 1 -hostfile $HOSTFILE --map-by node \
python -u -m torch.distributed.launch --master_addr=$master_addr train.py
With the following, I see 8 processes on the 1st node and 16 processes on the 2nd node:
mpirun -np 2 -npernode 1 -hostfile $HOSTFILE --map-by node \
python -u -m torch.distributed.launch --nproc_per_node=8 train.py
Or is there another way to launch the distributed implementation across a multi-node, multi-GPU system with mpirun?
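For reference, this is the kind of single-invocation launch I was hoping would work. It is only a sketch, assuming Open MPI (which exports OMPI_COMM_WORLD_RANK to each rank), so each of the 2 MPI ranks can use its own rank number as --node_rank instead of the outer bash loop; MASTER_ADDR and HOSTFILE are assumed to be set as in the commands above:

```shell
# One mpirun call, 1 rank per node; each rank derives its --node_rank
# from the Open MPI rank environment variable instead of a host-side loop.
# -x MASTER_ADDR forwards the variable to the remote nodes' environments.
mpirun -np 2 -npernode 1 -hostfile $HOSTFILE --map-by node -x MASTER_ADDR \
    bash -c 'python -u -m torch.distributed.launch \
        --master_addr=$MASTER_ADDR --nnodes=2 --nproc_per_node=8 \
        --node_rank=$OMPI_COMM_WORLD_RANK train.py'
```

The single quotes matter: they keep $OMPI_COMM_WORLD_RANK from being expanded on the submitting host, so each rank's own value is used on its node.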
Thanks,
K