I am running PyTorch DDP (multi-node) with Horovod and Slurm. When I use
- 2 nodes, 1 GPU each, it works perfectly,
but when I try
- 2 nodes, 2 GPUs each,
each process only sees:
Number of available CUDA devices: 1
and when I do:
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
the first GPU takes cuda:0 and works fine, but the other GPU on the same node throws:

packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
I am guessing that Slurm internally assigns each task a single GPU (masking the others via CUDA_VISIBLE_DEVICES, so every process sees its one GPU as device 0), while we are passing LOCAL_RANK to set_device, so the device ordinal does not match.
1: How can we solve this issue?
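As a possible workaround for question 1 (assuming Slurm has already masked the GPUs via CUDA_VISIBLE_DEVICES so each process sees exactly one device), the ordinal can be clamped to what the process can actually see. This is a sketch; the helper name `pick_device_ordinal` is made up, not a PyTorch or Slurm API:

```python
import os

def pick_device_ordinal(local_rank: int, visible_count: int) -> int:
    """Map LOCAL_RANK onto a device ordinal that actually exists.

    If Slurm (e.g. with --gpus-per-task=1) has already restricted the GPUs
    via CUDA_VISIBLE_DEVICES, every process sees exactly one device, which
    is always ordinal 0 -- so passing LOCAL_RANK 1 to set_device raises
    "invalid device ordinal".
    """
    if visible_count == 1:
        return 0  # Slurm already picked the GPU for this task
    return local_rank % visible_count  # otherwise spread local ranks over GPUs

local_rank = int(os.environ.get("LOCAL_RANK", 0))
# In the real script, visible_count would come from torch.cuda.device_count():
#   import torch
#   torch.cuda.set_device(pick_device_ordinal(local_rank, torch.cuda.device_count()))
```

Alternatively, requesting GPUs per node rather than per task (e.g. --gres=gpu:2 with 2 tasks per node) should leave both GPUs visible to each process, so LOCAL_RANK 0/1 map to valid ordinals.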
2: How can we get the available device IDs on each machine in DDP? (I am only able to get the names of the GPUs, but that does not help since all the GPUs have the same name.)
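For question 2: PyTorch renumbers whatever devices CUDA_VISIBLE_DEVICES leaves visible as 0..N-1, so one way to recover the physical GPU IDs a process was given is to read that variable directly. A minimal sketch (the helper `visible_gpu_ids` is a hypothetical name):

```python
import os

def visible_gpu_ids() -> list:
    """Return the physical GPU indices this process is allowed to use.

    CUDA_VISIBLE_DEVICES holds a comma-separated list of physical GPU
    indices (or UUIDs); CUDA then renumbers them 0..N-1, which is what
    torch.cuda sees. An unset/empty variable means no masking was applied.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [tok.strip() for tok in raw.split(",") if tok.strip()]

# Example: under Slurm with --gpus-per-task=1, the second task on a node
# might get CUDA_VISIBLE_DEVICES=1 -- torch.cuda.device_count() reports 1,
# and that single visible device (cuda:0) is physical GPU 1.
```

Physical indices can be cross-checked against nvidia-smi on the node; torch.cuda.get_device_name() only returns the model name, which, as noted, is identical for every GPU.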