Hello, I am trying to run DistributedDataParallel (DDP) training on a SLURM cluster. Each node has 2 GPUs, and I can assign 4 nodes to my job, so I am submitting an sbatch script that runs 8 tasks, 2 per node:
```bash
#!/bin/bash
#SBATCH --job-name=csinet-rewrite
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --constraint=gpu_32gb
export OMP_NUM_THREADS=4
export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostname ${SLURM_JOB_NODELIST} | head -n 1)
module load anaconda
conda activate torch
# Checking ports on master
netstat -lnt
srun nvidia-smi -L
srun torchrun \
	--nnodes 4 \
	--nproc_per_node 2 \
	--rdzv_id $RANDOM \
	--rdzv_backend c10d \
	--rdzv_endpoint $MASTER_ADDR:29400 \
	main.py
```
When I run this, my model fails with an NCCL error stating that separate ranks are attempting to access the same GPU. This should not be possible, as I am running torch.cuda.set_device(environ["LOCAL_RANK"]) first thing.
Looking at the debug info, it appears that torchrun is assigning LOCAL_RANK values without regard to which node each task is on. How do I fix this?
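To check how ranks map to nodes, a minimal diagnostic sketch along these lines could be run at the very top of main.py. It assumes the process was started by torchrun under srun, so that RANK, LOCAL_RANK and WORLD_SIZE (set by torchrun) and SLURM_PROCID, SLURM_LOCALID and SLURM_NODEID (set by srun) are present in the environment; anything missing is reported as `<unset>`. The helper names `describe_process` and `format_process` are hypothetical and not part of the original script.

```python
import os
import socket


def describe_process(env=None):
    """Collect the launcher-provided variables that identify this process."""
    env = os.environ if env is None else env
    keys = (
        "RANK", "LOCAL_RANK", "WORLD_SIZE",                # set by torchrun
        "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID",   # set by srun
        "CUDA_VISIBLE_DEVICES",
    )
    # Missing variables are reported explicitly rather than raising KeyError.
    info = {key: env.get(key, "<unset>") for key in keys}
    info["host"] = socket.gethostname()
    info["pid"] = os.getpid()
    return info


def format_process(info):
    """Render the collected values as one sorted key=value line."""
    return " ".join(f"{key}={value}" for key, value in sorted(info.items()))


if __name__ == "__main__":
    # One line per worker; flush so output is not lost if the job crashes.
    print(format_process(describe_process()), flush=True)
```

Grouping the printed lines by host shows how many worker processes were actually started on each node and which LOCAL_RANK each one received, which can then be compared against the 2 GPUs available per node.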