PyTorch distributed with MPI on Multi-node Multi-GPUs

Hi,
I’m trying to run a PyTorch DDP training script on 2 nodes with 8 GPUs each using mpirun. I want to use 1 MPI rank per node to launch torch.distributed.launch on that node and let it spawn 8 worker processes, one per GPU. The command I’m using is

let N_NODES=2
let N_RANKS=2
for (( node_rank=0; node_rank<$N_NODES; node_rank++ ))
do
        mpirun -np $N_RANKS -npernode 1 -hostfile $HOSTFILE --map-by node \
        python -u -m torch.distributed.launch --master_addr=$master_addr --nproc_per_node=8 --nnodes=$N_NODES \
        --node_rank=$node_rank train.py &

        pids[${node_rank}]=$!
done

When I check the activity on the GPUs with nvidia-smi, I see the code running on only some of the GPUs on each node (6 on the first node and 4 on the second). What is missing here to run on all 8 GPUs in each node, 16 in total? Any help is kindly appreciated.
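
For reference, the single-invocation layout I have in mind instead of the shell loop looks roughly like this (sketch only: launch_node.sh is just a placeholder wrapper name, and I’m assuming Open MPI exports OMPI_COMM_WORLD_RANK, which each of the 2 launcher ranks could reuse as its --node_rank):

#!/bin/bash
# launch_node.sh -- hypothetical per-node wrapper; mpirun starts one copy on each node
python -u -m torch.distributed.launch \
        --master_addr=$master_addr \
        --nnodes=2 --nproc_per_node=8 \
        --node_rank=$OMPI_COMM_WORLD_RANK \
        train.py

# launched once, not inside a for loop; -x forwards master_addr to the remote node
mpirun -np 2 -npernode 1 -hostfile $HOSTFILE --map-by node -x master_addr ./launch_node.sh

The idea is that each of the 2 launcher ranks would read a unique node rank from its own MPI environment instead of getting it from the for loop.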

Also, if I just launch 2 ranks with 1 rank per node using the command below, I see the expected behaviour: one instance running on a single GPU in each node.

mpirun -np 2 -npernode 1 -hostfile $HOSTFILE --map-by node \
python -u -m torch.distributed.launch --master_addr=$master_addr train.py

With the following command, I see 8 processes on the 1st node and 16 processes on the 2nd node:

mpirun -np 2 -npernode 1 -hostfile $HOSTFILE --map-by node \
python -u -m torch.distributed.launch --nproc_per_node=8 train.py

Or is there another way to launch the distributed job across a multi-node multi-GPU system with mpirun?
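
For example, would a one-rank-per-GPU layout like the following, bypassing torch.distributed.launch entirely, be the recommended pattern? (Again just a sketch: run_rank.sh is a placeholder name, the port is arbitrary, and it assumes train.py calls init_process_group with init_method="env://" and selects its GPU from LOCAL_RANK.)

#!/bin/bash
# run_rank.sh -- hypothetical per-process wrapper; mpirun starts one copy per GPU
export MASTER_ADDR=$master_addr
export MASTER_PORT=29500                        # any free port
export WORLD_SIZE=$OMPI_COMM_WORLD_SIZE         # 2 nodes x 8 = 16
export RANK=$OMPI_COMM_WORLD_RANK               # 0..15
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK   # 0..7 within each node
exec python -u train.py

mpirun -np 16 -npernode 8 -hostfile $HOSTFILE -x master_addr ./run_rank.sh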

Thanks,
K


Did you find any drop in data loader performance while running through mpirun?