Cannot launch more than one process per node

I am trying to run the elastic_ddp.py example from here, with a print statement added at the end to indicate completion. I have 2 nodes with 4 GPUs each. The following is my job script:

#!/bin/bash
...
cd $PBS_O_WORKDIR

NPROC_PER_NODE=1
MASTER=`/bin/hostname -s`
MPORT=`ss -tan | awk '{print $5}' | cut -d':' -f2 | \
        grep "[2-9][0-9]\{3,3\}" | sort | uniq | shuf -n 1`
cat $PBS_NODEFILE > nodelist
#Make sure this node (MASTER) comes first
SLAVES=`cat nodelist | grep -v $MASTER | uniq`
#We want names of master and slave nodes
HOSTLIST="$MASTER $SLAVES"

module load conda
conda activate base

RANK=0
for node in $HOSTLIST; do
        ssh -q $node \
                module load conda
                conda activate base
                torchrun \
                --nproc_per_node=$NPROC_PER_NODE \
                --nnodes=2 \
                --node_rank=$RANK \
                --master_addr="$MASTER" --master_port="$MPORT" \
                elastic_ddp.py &
        RANK=$((RANK+1))
        export NCCL_DEBUG=INFO
done
wait

Running this job script returns:

Start running basic DDP example on rank 1.
Start running basic DDP example on rank 0.
x3111c0s19b1n0:16338:16338 [1] NCCL INFO Bootstrap : Using hsn1:10.201.3.176<0>
x3111c0s19b1n0:16338:16338 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
x3111c0s19b1n0:16338:16338 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE ; OOB hsn1:10.201.3.176<0>
x3111c0s19b1n0:16338:16338 [1] NCCL INFO Using network IB
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Setting affinity for GPU 1 to ff0000,00ff0000
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 00 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 01 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 02 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 03 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 04 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 05 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 06 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Channel 07 : 1[46000] -> 0[7000] via P2P/IPC/read
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Connected all rings
x3111c0s19b1n0:16338:16368 [1] NCCL INFO Connected all trees
x3111c0s19b1n0:16338:16368 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
x3111c0s19b1n0:16338:16368 [1] NCCL INFO 8 coll channels, 8 p2p channels, 8 p2p channels per peer
x3111c0s19b1n0:16338:16368 [1] NCCL INFO comm 0x14b238002fb0 rank 1 nranks 2 cudaDev 1 busId 46000 - Init COMPLETE
rank 0 finished.
rank 1 finished.
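This matches my understanding of how torchrun lays out ranks (an assumption on my part, not something the script above enforces: world size = nnodes * nproc_per_node, and global rank = node_rank * nproc_per_node + local_rank), so with NPROC_PER_NODE=4 I expect ranks 0 through 7 split across the two nodes:

```python
# My assumption about torchrun's rank layout:
# world_size = nnodes * nproc_per_node, and each worker is assigned
# global_rank = node_rank * nproc_per_node + local_rank.
def global_ranks(nnodes, nproc_per_node):
    return [node * nproc_per_node + local
            for node in range(nnodes)
            for local in range(nproc_per_node)]

print(global_ranks(2, 1))  # [0, 1] -- matches the two ranks above
print(global_ranks(2, 4))  # [0, 1, 2, 3, 4, 5, 6, 7]
```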

However, if I change NPROC_PER_NODE=4, then I see the following:

Start running basic DDP example on rank 7.
Start running basic DDP example on rank 0.
Start running basic DDP example on rank 1.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 5.
Start running basic DDP example on rank 3.
x3111c0s19b0n0:39576:39576 [0] NCCL INFO Bootstrap : Using hsn1:10.201.3.174<0>
x3111c0s19b0n0:39576:39576 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
x3111c0s19b0n0:39576:39576 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE ; OOB hsn1:10.201.3.174<0>
x3111c0s19b0n0:39576:39576 [0] NCCL INFO Using network IB
x3111c0s19b0n0:39578:39578 [1] NCCL INFO Bootstrap : Using hsn1:10.201.3.174<0>
x3111c0s19b0n0:39578:39578 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
x3111c0s19b0n0:39578:39578 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE ; OOB hsn1:10.201.3.174<0>
x3111c0s19b0n0:39578:39578 [1] NCCL INFO Using network IB
x3111c0s19b0n0:39580:39580 [2] NCCL INFO Bootstrap : Using hsn1:10.201.3.174<0>
x3111c0s19b0n0:39580:39580 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
x3111c0s19b0n0:39580:39580 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE ; OOB hsn1:10.201.3.174<0>
x3111c0s19b0n0:39580:39580 [2] NCCL INFO Using network IB
x3111c0s19b0n0:39582:39582 [3] NCCL INFO Bootstrap : Using hsn1:10.201.3.174<0>
x3111c0s19b0n0:39582:39582 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
x3111c0s19b0n0:39582:39582 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE ; OOB hsn1:10.201.3.174<0>
x3111c0s19b0n0:39582:39582 [3] NCCL INFO Using network IB

x3111c0s19b0n0:39582:39671 [3] init.cc:521 NCCL WARN Duplicate GPU detected : rank 7 and rank 3 both on CUDA device c7000

x3111c0s19b0n0:39578:39656 [1] init.cc:521 NCCL WARN Duplicate GPU detected : rank 5 and rank 1 both on CUDA device 46000

x3111c0s19b0n0:39580:39665 [2] init.cc:521 NCCL WARN Duplicate GPU detected : rank 6 and rank 2 both on CUDA device 85000

x3111c0s19b0n0:39576:39653 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 4 and rank 0 both on CUDA device 7000
x3111c0s19b0n0:39582:39671 [3] NCCL INFO init.cc:904 -> 5
x3111c0s19b0n0:39578:39656 [1] NCCL INFO init.cc:904 -> 5
x3111c0s19b0n0:39580:39665 [2] NCCL INFO init.cc:904 -> 5
x3111c0s19b0n0:39576:39653 [0] NCCL INFO init.cc:904 -> 5
x3111c0s19b0n0:39582:39671 [3] NCCL INFO group.cc:72 -> 5 [Async thread]
x3111c0s19b0n0:39578:39656 [1] NCCL INFO group.cc:72 -> 5 [Async thread]
x3111c0s19b0n0:39580:39665 [2] NCCL INFO group.cc:72 -> 5 [Async thread]
x3111c0s19b0n0:39576:39653 [0] NCCL INFO group.cc:72 -> 5 [Async thread]

and the job fails with NCCL error 5.
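For what it's worth, the pairing in the warnings (0 with 4, 1 with 5, 2 with 6, 3 with 7) is exactly what I would expect if all 8 ranks landed on a single 4-GPU node, given that the tutorial script (as I read it) picks its device as rank % torch.cuda.device_count(). A quick sketch of that mapping (gpus_per_node=4 matches my nodes; the modulo rule is my assumption about elastic_ddp.py):

```python
# Assumed device selection in elastic_ddp.py: device = rank % visible_gpus.
def device_for_rank(rank, gpus_per_node=4):
    return rank % gpus_per_node

# If all 8 ranks run on one 4-GPU node, every device is claimed twice:
for rank in range(8):
    print(f"rank {rank} -> cuda:{device_for_rank(rank)}")
```

So it looks to me as though both torchrun instances are effectively starting on the same node, rather than one per node.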
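To be explicit about my intent: each ssh call in the loop is supposed to run the whole module-load / activate / torchrun sequence on the remote node. Written out as one quoted command, this is what I mean each node to execute (the snippet below is only a dry run that assembles and echoes the command; the hostname and port are placeholders, not my real values):

```shell
#!/bin/sh
# Dry run: build the remote command I intend ssh to carry as ONE argument.
# MASTER/MPORT/RANK values here are placeholders for illustration.
MASTER=nodeA
MPORT=29500
NPROC_PER_NODE=4
RANK=0
REMOTE_CMD="module load conda && conda activate base && \
torchrun --nproc_per_node=$NPROC_PER_NODE --nnodes=2 --node_rank=$RANK \
--master_addr=$MASTER --master_port=$MPORT elastic_ddp.py"
echo "$REMOTE_CMD"
# The actual launch would then be:  ssh -q "$node" "$REMOTE_CMD" &
```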