Torchrun assigns the same LOCAL_RANK to processes on the same node

Hello, I am attempting to do DDP on a SLURM cluster. More specifically, each node has 2 GPUs, and I have 4 nodes I can assign to my job. Because of this, I am submitting an sbatch script that runs 8 tasks, with 2 tasks per node:

#!/bin/bash

#SBATCH --job-name=csinet-rewrite
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --constraint=gpu_32gb

export OMP_NUM_THREADS=4
export NCCL_DEBUG=INFO
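# Use the first node in the allocation as the rendezvous host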
export MASTER_ADDR=$(scontrol show hostname ${SLURM_JOB_NODELIST} | head -n 1)

module load anaconda
conda activate torch

# Check which ports are listening on the master node
netstat -lnt

srun nvidia-smi -L
srun torchrun \
	--nnodes 4 \
	--nproc_per_node 2 \
	--rdzv_id $RANDOM \
	--rdzv_backend c10d \
	--rdzv_endpoint $MASTER_ADDR:29400 \
	main.py

While running this, training fails with an NCCL error stating that separate ranks are attempting to access the same GPU. This should not be possible, as I am calling torch.cuda.set_device(int(os.environ["LOCAL_RANK"])) as the first thing in main.py.
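Roughly, the start of main.py looks like the sketch below (simplified; the process-group init and DDP wrapping are standard boilerplate, and build_model() is just a stand-in for my actual model construction):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # set per worker by torchrun
torch.cuda.set_device(local_rank)           # pin this process to a single GPU
dist.init_process_group(backend="nccl")     # rank/world size come from torchrun's env vars

model = build_model().cuda(local_rank)      # build_model() is a placeholder for the real model
model = DDP(model, device_ids=[local_rank])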

Looking at the debug info, it appears that torchrun is assigning LOCAL_RANK values without regard to which node each task is on. How do I fix this?

Hi, try setting #SBATCH --ntasks-per-node=1 instead of 2! torchrun is itself a launcher: each srun task starts its own copy of torchrun, and every copy spawns --nproc_per_node=2 workers with LOCAL_RANK 0 and 1. With two tasks per node you end up with two torchrun instances per node, so their workers collide on the same two GPUs. With one task per node, a single torchrun spawns both workers on that node and the local ranks map cleanly onto its GPUs.