DDP with SLURM hangs when --ntasks-per-node > 1

Hello @ptrblck,

I am trying to run a DDP training script with SLURM. With the sbatch directives below, the model trains in parallel:

#!/bin/bash
#SBATCH --cpus-per-task=4      # CPU cores per task
#SBATCH --gres=gpu:2             # Number of allocated GPUs per node
#SBATCH --ntasks-per-node=1              # Number of tasks (processes)
#SBATCH --nodes=1                # Number of nodes
#SBATCH --job-name=videomae      # Job name
#SBATCH --output=videomae_%j.log # Standard output and error log

However, when I set --ntasks-per-node=2, which is the usual one-task-per-GPU setup, the code gets stuck at distributed init, as shown in the NCCL log below.

World_size 2
| distributed init (rank 0): env://, gpu 1
World_size 2
node04:163609:163609 [0] NCCL INFO Bootstrap : Using ib0:XXX.XXX.XXX.XXX<0>
node04:163609:163609 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
node04:163609:163609 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda11.8
node04:163610:163610 [1] NCCL INFO Bootstrap : Using ib0:XXX.XXX.XXX.XXX<0>
node04:163610:163610 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
node04:163610:163610 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda11.8
node04:163609:164050 [0] NCCL INFO NET/IB : Using [0]hfi1_0:1/IB ; OOB ib0:XXX.XXX.XXX.XXX<0>
node04:163609:164050 [0] NCCL INFO Using non-device net plugin version 0
node04:163609:164050 [0] NCCL INFO Using network IB
node04:163610:164059 [0] NCCL INFO NET/IB : Using [0]hfi1_0:1/IB ; OOB ib0:XXX.XXX.XXX.XXX<0>
node04:163610:164059 [0] NCCL INFO Using non-device net plugin version 0
node04:163610:164059 [0] NCCL INFO Using network IB
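
For context, my init code follows the common SLURM env:// pattern: each task reads its rank and world size from the SLURM environment and pins itself to one GPU. Below is a minimal sketch of that pattern (illustrative only, not my exact code; the MASTER_ADDR/MASTER_PORT handling is simplified):

import os
import subprocess

import torch
import torch.distributed as dist

def init_distributed():
    # srun starts one process per task; with --ntasks-per-node=2 each
    # process sees its own SLURM_PROCID / SLURM_LOCALID.
    rank = int(os.environ["SLURM_PROCID"])         # global rank
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

    # env:// needs a rendezvous address; use the first node in the job.
    if "MASTER_ADDR" not in os.environ:
        hostnames = subprocess.run(
            ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        os.environ["MASTER_ADDR"] = hostnames[0]
    os.environ.setdefault("MASTER_PORT", "29500")

    # Pin each task to its own GPU before init_process_group so the
    # two ranks do not collide on the same device.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    print(f"| distributed init (rank {rank}): env://, gpu {local_rank}")

As the log shows, the world size is picked up correctly (World_size 2 is printed for both tasks), but only rank 0 ever prints its distributed init line before everything stalls in the NCCL bootstrap.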

What could be causing the hang only when --ntasks-per-node=2? Please help.