No option to choose GPUs when using torch.distributed.init_process_group

When we run the NVIDIA Apex example main_amp.py using the following command:

$ python -m torch.distributed.launch --nproc_per_node=n main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./

as soon as torch.distributed.init_process_group(backend='nccl', init_method='env://') is executed, PyTorch spawns n processes on n GPUs, because --nproc_per_node=n in the torch.distributed.launch command tells it to do so. All of these processes are bound to GPU indices 0 through n-1. Unfortunately, there seems to be no way to make the indices start at anything other than zero, or to choose an arbitrary subset of GPUs.
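For context, this is roughly the per-process setup that torch.distributed.launch expects a script to perform (a minimal sketch, not the actual main_amp.py code); it shows why the GPU index always equals the local rank and therefore always starts at 0:

```python
# Minimal sketch of the per-process setup expected by torch.distributed.launch
# (not the actual main_amp.py code). The launcher starts n copies of the script
# and passes each one its own --local_rank in the range 0 .. n-1.
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# The GPU index is simply the local rank, which is why the processes
# always occupy devices 0 .. n-1.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")
```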
I have also tried to use the following:

$ CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 main_amp.py -a resnet50 --b 224 --workers 4 --opt-level O2 ./

and the call to torch._C._cuda_setDevice(device) fails with RuntimeError: CUDA error: invalid device ordinal.
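If it helps, here is a minimal reproduction of the same error outside the launcher; my guess (which may be wrong about what main_amp.py does internally) is that the masked GPUs are re-indexed 0-3, so any code that still asks for an absolute index of 4 or higher fails:

```python
# Run as: CUDA_VISIBLE_DEVICES=4,5,6,7 python check_devices.py  (hypothetical file name)
# With the mask in place, only four GPUs are visible and they are re-indexed 0-3.
import torch

print(torch.cuda.device_count())  # prints 4, not 8
torch.cuda.set_device(0)          # fine: first visible GPU (physical GPU 4)
torch.cuda.set_device(4)          # RuntimeError: CUDA error: invalid device ordinal
```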

Can you please help me with this?

We recommend using native mixed-precision training via torch.cuda.amp together with the native DistributedDataParallel implementation, as described here.
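A minimal sketch of what that looks like (assuming a launcher such as torchrun that exports LOCAL_RANK; the model, batch, and hyperparameters are placeholders, not taken from main_amp.py):

```python
# Minimal native DDP + torch.cuda.amp sketch (placeholder model and data).
import os
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler

local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher, e.g. torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torchvision.models.resnet50().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

# Dummy batch just to show the AMP training step; use a real DataLoader
# with a DistributedSampler in practice.
images = torch.randn(32, 3, 224, 224, device=f"cuda:{local_rank}")
targets = torch.randint(0, 1000, (32,), device=f"cuda:{local_rank}")

optimizer.zero_grad()
with autocast():
    output = model(images)
    loss = criterion(output, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Restricting the job to specific GPUs then works by setting CUDA_VISIBLE_DEVICES before the launcher (e.g. the 4,5,6,7 mask from your second command), since the masked devices are re-indexed 0-3 and the local ranks map onto them directly.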