PyTorch DDP with torchrun and SLURM: invalid device ordinal error

I am running PyTorch DDP (multi-node) with torchrun and SLURM. When I use

  1. 2 nodes, 1 GPU each: it works perfectly

but when I try

  1. 2 nodes, 2 GPUs each: each process only sees

    Number of available CUDA devices: 1

    and when I do:
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    the first GPU takes cuda:0 and works fine, but the second GPU on the same node throws (a sketch of my setup code follows this list):

    packages/torch/cuda/__init__.py", line 350, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: CUDA error: invalid device ordinal
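
For context, this is roughly what the setup in my training script looks like (a minimal sketch; I am assuming the standard torchrun environment variables, everything else is trimmed):

    import os
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT per process
    local_rank = int(os.environ["LOCAL_RANK"])
    print("Number of available CUDA devices:", torch.cuda.device_count())

    # this is the line that fails for the second process on each node
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")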

My guess is that SLURM internally assigns one GPU (at random) to each process, while we pass a different index to set_device, so the device ID does not match any visible device.
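
To check this guess, something like the following could be printed from every process (a small diagnostic sketch; the hostname is only there to tell the nodes apart):

    import os
    import socket
    import torch

    # compare what SLURM/torchrun hand each process with what CUDA can see
    print(f"host={socket.gethostname()} "
          f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
          f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')} "
          f"device_count={torch.cuda.device_count()}")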

  1. How can we solve this issue?
  2. How can we get the available device IDs on each machine in DDP? (I am only able to get the names of the GPUs, but that does not help as all GPUs have the same name; see the snippet below.)
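
For reference, this is the kind of lookup I meant in question 2, and it only returns the (identical) model names rather than usable IDs:

    import torch

    # every GPU on the node reports the same model name, so this does not
    # tell me which ordinal a given process is actually allowed to use
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))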