Hi, I am using DistributedDataParallel with SLURM. My code is written below. The error I am getting is `RuntimeError: CUDA error: invalid device ordinal` at line 14, the `torch.cuda.set_device(local_rank)` call.
```python
import os
import hostlist
import torch
import torch.distributed as dist

# get SLURM variables
rank = int(os.environ['SLURM_PROCID'])
local_rank = int(os.environ['SLURM_LOCALID'])
size = int(os.environ['SLURM_NTASKS'])
cpus_per_task = int(os.environ['SLURM_CPUS_PER_TASK'])

# get node list from slurm
hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])

# get IDs of reserved GPUs
gpu_ids = os.environ['SLURM_STEP_GPUS'].split(",")

# define MASTER_ADDR & MASTER_PORT
os.environ['MASTER_ADDR'] = hostnames[0]
os.environ['MASTER_PORT'] = str(12345 + int(min(gpu_ids)))  # to avoid port conflict on the same node

NODE_ID = os.environ['SLURM_NODEID']
MASTER_ADDR = os.environ['MASTER_ADDR']

dist.init_process_group("nccl", init_method='env://', world_size=size, rank=rank)
torch.cuda.set_device(local_rank)
```
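To make the values concrete, here is a plain-Python sketch (no torch or CUDA needed) that mimics the environment of one task using the numbers from the output below. It reproduces the `MASTER_PORT` derivation and checks `local_rank` against the number of reserved GPUs, since `torch.cuda.set_device` only accepts ordinals smaller than the visible device count:

```python
# Values copied from the job output below, for the task with rank 3.
env = {
    "SLURM_PROCID": "3",       # global rank
    "SLURM_LOCALID": "3",      # rank within the node
    "SLURM_NTASKS": "8",
    "SLURM_STEP_GPUS": "2,3",  # physical GPU IDs reserved on this node
}

local_rank = int(env["SLURM_LOCALID"])
gpu_ids = env["SLURM_STEP_GPUS"].split(",")

# MASTER_PORT derivation, as in the script above
master_port = 12345 + int(min(gpu_ids))
print(master_port)  # 12347

# torch.cuda.set_device(n) requires n < number of visible devices.
# With two reserved GPUs per node, local_rank values 2 and 3 are out of range.
visible_devices = len(gpu_ids)
print(local_rank < visible_devices)  # False
```

This matches the symptom: with only two GPUs per node visible, `set_device(2)` or `set_device(3)` would raise the "invalid device ordinal" error shown above.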
I am using the configuration below in SLURM:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --hint=nomultithread
#SBATCH --gres=gpu:V100:2
```
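The process and GPU counts implied by these directives can be checked with quick arithmetic (just multiplying the numbers in the header above):

```python
nodes = 2
ntasks_per_node = 4
gpus_per_node = 2  # from --gres=gpu:2 / --gres=gpu:V100:2

world_size = nodes * ntasks_per_node  # matches SLURM_NTASKS in the output
total_gpus = nodes * gpus_per_node

print(world_size, total_gpus)  # 8 4
# SLURM_LOCALID ranges over 0..ntasks_per_node-1 = 0..3 on each node,
# while only gpus_per_node = 2 device ordinals exist per node.
```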
I get the following output:

```text
rank 0
local_rank 0
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Training on 2 nodes and 8 processes, master node is acidsgcn001
Process 0 corresponds to GPU 0 of node 0
rank 3
local_rank 3
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 3 corresponds to GPU 3 of node 0
rank 2
local_rank 2
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 2 corresponds to GPU 2 of node 0
rank 1
local_rank 1
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 1 corresponds to GPU 1 of node 0
rank 4
local_rank 0
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 4 corresponds to GPU 0 of node 1
rank 6
local_rank 2
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 6 corresponds to GPU 2 of node 1
rank 7
local_rank 3
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 7 corresponds to GPU 3 of node 1
rank 5
local_rank 1
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 5 corresponds to GPU 1 of node 1
```