DDP with Slurm: CUDA error: invalid device ordinal

Hi, I am doing DistributedDataParallel with Slurm. My code is written below. The error I am getting is `RuntimeError: CUDA error: invalid device ordinal`, raised at the `torch.cuda.set_device(local_rank)` call (line 14).

```python
import os
import hostlist
import torch
import torch.distributed as dist

# get SLURM variables
rank = int(os.environ['SLURM_PROCID'])
local_rank = int(os.environ['SLURM_LOCALID'])
size = int(os.environ['SLURM_NTASKS'])
cpus_per_task = int(os.environ['SLURM_CPUS_PER_TASK'])

# get node list from Slurm
hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])

# get IDs of the reserved GPUs
gpu_ids = os.environ['SLURM_STEP_GPUS'].split(",")

# define MASTER_ADDR & MASTER_PORT
os.environ['MASTER_ADDR'] = hostnames[0]
os.environ['MASTER_PORT'] = str(12345 + int(min(gpu_ids)))  # to avoid port conflict on the same node
NODE_ID = os.environ['SLURM_NODEID']
MASTER_ADDR = os.environ['MASTER_ADDR']

dist.init_process_group("nccl", init_method='env://', world_size=size, rank=rank)
torch.cuda.set_device(local_rank)  # line 14: fails with "invalid device ordinal"
```
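For reference, the rendezvous arithmetic above can be dry-run without a cluster or GPU. This is only a sketch: the SLURM values in the fake environment are copied from the output further down, and `hostlist` is not needed for this part.

```python
# Dry-run of the env-var parsing and port derivation above, using a fake
# SLURM environment with the values this job actually printed (see output
# below). No cluster, GPU, or hostlist package is required.
env = {
    "SLURM_PROCID": "0",
    "SLURM_LOCALID": "0",
    "SLURM_NTASKS": "8",
    "SLURM_CPUS_PER_TASK": "4",
    "SLURM_STEP_GPUS": "2,3",
}

rank = int(env["SLURM_PROCID"])
local_rank = int(env["SLURM_LOCALID"])
size = int(env["SLURM_NTASKS"])
gpu_ids = env["SLURM_STEP_GPUS"].split(",")

# Same arithmetic as in the snippet; note min() on strings compares
# lexicographically, which happens to work for these single-digit IDs.
master_port = 12345 + int(min(gpu_ids))
print(rank, local_rank, size, gpu_ids, master_port)  # -> 0 0 8 ['2', '3'] 12347
```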

I am using the configuration below in Slurm:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --hint=nomultithread
#SBATCH --gres=gpu:V100:2
```
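One thing I noticed while debugging, as a pure-Python sanity check (no CUDA needed; the values are taken from the config above and the output below): with `--ntasks-per-node=4` but only 2 GPUs per node allocated, `local_rank` runs 0–3 while only 2 device ordinals are visible to each task.

```python
# Sanity-check sketch: does local_rank stay within the number of GPUs
# Slurm allocated per node? Values copied from this post, not queried live.
gpu_ids = ["2", "3"]   # SLURM_STEP_GPUS from the output below
ntasks_per_node = 4    # from --ntasks-per-node above

results = {}
for local_rank in range(ntasks_per_node):
    # torch.cuda.set_device(local_rank) can only succeed if local_rank is a
    # valid visible device ordinal, i.e. local_rank < number of GPUs per node
    results[local_rank] = local_rank < len(gpu_ids)
print(results)  # local_rank 2 and 3 fall outside the visible ordinals
```

Not sure if this mismatch is the actual cause, but it matches the line where the error is raised.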

I get the following output:

```
rank 0
local_rank 0
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Training on 2 nodes and 8 processes, master node is acidsgcn001
Process 0 corresponds to GPU 0 of node 0

rank 3
local_rank 3
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 3 corresponds to GPU 3 of node 0

rank 2
local_rank 2
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 2 corresponds to GPU 2 of node 0

rank 1
local_rank 1
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 1 corresponds to GPU 1 of node 0

rank 4
local_rank 0
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 4 corresponds to GPU 0 of node 1

rank 6
local_rank 2
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 6 corresponds to GPU 2 of node 1

rank 7
local_rank 3
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 7 corresponds to GPU 3 of node 1

rank 5
local_rank 1
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 5 corresponds to GPU 1 of node 1
```