I am running PyTorch DDP (multi-node) with Horovod and Slurm. When I use
- 2 nodes, 1 GPU each, it works perfectly,
but when I try
- 2 nodes, 2 GPUs each,
each process only sees:
Number of available CUDA devices: 1
and when I do:
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
the first GPU takes cuda:0 and works fine, but the other GPU on the same node throws:

packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
I am guessing that Slurm internally assigns each task a single GPU (masking the others via CUDA_VISIBLE_DEVICES, so every process sees its one GPU as device 0), while we are passing LOCAL_RANK to set_device, so the device ordinal does not match.
1: How can we solve this issue?
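As a possible workaround for question 1 (assuming Slurm has already masked the GPUs via CUDA_VISIBLE_DEVICES so each process sees exactly one device), the ordinal can be clamped to what the process can actually see. This is a sketch; the helper name `pick_device_ordinal` is made up, not a PyTorch or Slurm API:

```python
import os

def pick_device_ordinal(local_rank: int, visible_count: int) -> int:
    """Map LOCAL_RANK onto a device ordinal that actually exists.

    If Slurm (e.g. with --gpus-per-task=1) has already restricted the GPUs
    via CUDA_VISIBLE_DEVICES, every process sees exactly one device, which
    is always ordinal 0 -- so passing LOCAL_RANK 1 to set_device raises
    "invalid device ordinal".
    """
    if visible_count == 1:
        return 0  # Slurm already picked the GPU for this task
    return local_rank % visible_count  # otherwise spread local ranks over GPUs

local_rank = int(os.environ.get("LOCAL_RANK", 0))
# In the real script, visible_count would come from torch.cuda.device_count():
#   import torch
#   torch.cuda.set_device(pick_device_ordinal(local_rank, torch.cuda.device_count()))
```

Alternatively, requesting GPUs per node rather than per task (e.g. --gres=gpu:2 with 2 tasks per node) should leave both GPUs visible to each process, so LOCAL_RANK 0/1 map to valid ordinals.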
2: How can we get the available device IDs on each machine in DDP? (I am only able to get the names of the GPUs, but that does not help since all the GPUs have the same name.)
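For question 2: PyTorch renumbers whatever devices CUDA_VISIBLE_DEVICES leaves visible as 0..N-1, so one way to recover the physical GPU IDs a process was given is to read that variable directly. A minimal sketch (the helper `visible_gpu_ids` is a hypothetical name):

```python
import os

def visible_gpu_ids() -> list:
    """Return the physical GPU indices this process is allowed to use.

    CUDA_VISIBLE_DEVICES holds a comma-separated list of physical GPU
    indices (or UUIDs); CUDA then renumbers them 0..N-1, which is what
    torch.cuda sees. An unset/empty variable means no masking was applied.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [tok.strip() for tok in raw.split(",") if tok.strip()]

# Example: under Slurm with --gpus-per-task=1, the second task on a node
# might get CUDA_VISIBLE_DEVICES=1 -- torch.cuda.device_count() reports 1,
# and that single visible device (cuda:0) is physical GPU 1.
```

Physical indices can be cross-checked against nvidia-smi on the node; torch.cuda.get_device_name() only returns the model name, which, as noted, is identical for every GPU.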