Hi, I am using DistributedDataParallel with SLURM. My code is written below. The error I am getting is `RuntimeError: CUDA error: invalid device ordinal` at line 14, the `torch.cuda.set_device(local_rank)` call.
```python
import os
import hostlist
import torch
import torch.distributed as dist

# get SLURM variables
rank = int(os.environ['SLURM_PROCID'])
local_rank = int(os.environ['SLURM_LOCALID'])
size = int(os.environ['SLURM_NTASKS'])
cpus_per_task = int(os.environ['SLURM_CPUS_PER_TASK'])

# get node list from slurm
hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])

# get IDs of reserved GPUs
gpu_ids = os.environ['SLURM_STEP_GPUS'].split(",")

# define MASTER_ADDR & MASTER_PORT
os.environ['MASTER_ADDR'] = hostnames[0]
os.environ['MASTER_PORT'] = str(12345 + int(min(gpu_ids)))  # to avoid port conflict on the same node

NODE_ID = os.environ['SLURM_NODEID']
MASTER_ADDR = os.environ['MASTER_ADDR']

dist.init_process_group("nccl", init_method='env://', world_size=size, rank=rank)
torch.cuda.set_device(local_rank)
```
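To make the values concrete, here is a plain-Python sketch (no torch or CUDA needed) that mimics the environment of one task using the numbers from the output below. It reproduces the `MASTER_PORT` derivation and checks `local_rank` against the number of reserved GPUs, since `torch.cuda.set_device` only accepts ordinals smaller than the visible device count:

```python
# Values copied from the job output below, for the task with rank 3.
env = {
    "SLURM_PROCID": "3",       # global rank
    "SLURM_LOCALID": "3",      # rank within the node
    "SLURM_NTASKS": "8",
    "SLURM_STEP_GPUS": "2,3",  # physical GPU IDs reserved on this node
}

local_rank = int(env["SLURM_LOCALID"])
gpu_ids = env["SLURM_STEP_GPUS"].split(",")

# MASTER_PORT derivation, as in the script above
master_port = 12345 + int(min(gpu_ids))
print(master_port)  # 12347

# torch.cuda.set_device(n) requires n < number of visible devices.
# With two reserved GPUs per node, local_rank values 2 and 3 are out of range.
visible_devices = len(gpu_ids)
print(local_rank < visible_devices)  # False
```

This matches the symptom: with only two GPUs per node visible, `set_device(2)` or `set_device(3)` would raise the "invalid device ordinal" error shown above.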
I am using the configuration below in SLURM:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --hint=nomultithread
#SBATCH --gres=gpu:V100:2
```
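The process and GPU counts implied by these directives can be checked with quick arithmetic (just multiplying the numbers in the header above):

```python
nodes = 2
ntasks_per_node = 4
gpus_per_node = 2  # from --gres=gpu:2 / --gres=gpu:V100:2

world_size = nodes * ntasks_per_node  # matches SLURM_NTASKS in the output
total_gpus = nodes * gpus_per_node

print(world_size, total_gpus)  # 8 4
# SLURM_LOCALID ranges over 0..ntasks_per_node-1 = 0..3 on each node,
# while only gpus_per_node = 2 device ordinals exist per node.
```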
I get the following output:

```text
rank 0
local_rank 0
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Training on 2 nodes and 8 processes, master node is acidsgcn001
Process 0 corresponds to GPU 0 of node 0
rank 3
local_rank 3
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 3 corresponds to GPU 3 of node 0
rank 2
local_rank 2
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 2 corresponds to GPU 2 of node 0
rank 1
local_rank 1
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 1 corresponds to GPU 1 of node 0
rank 4
local_rank 0
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 4 corresponds to GPU 0 of node 1
rank 6
local_rank 2
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 6 corresponds to GPU 2 of node 1
rank 7
local_rank 3
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 7 corresponds to GPU 3 of node 1
rank 5
local_rank 1
size 8
cpus_per_task 4
gpu_ids ['2', '3']
hostnames ['acidsgcn001', 'acidsgcn002']
Process 5 corresponds to GPU 1 of node 1
```