Should local rank be equal to torch.cuda.current_device()?

I’m running models using DistributedDataParallel, and I noticed that while the local rank correctly ranges up to the number of GPUs I’m using, torch.cuda.current_device() always prints 0. Printing torch.cuda.current_device() correctly outputs 4. What does this mean?

Printing torch.cuda.current_device() correctly outputs 4. What does this mean?

I assume you mean torch.cuda.device_count() outputs 4?

I’m running models using DistributedDataParallel, and I noticed that while the local rank correctly ranges up to the number of GPUs I’m using, torch.cuda.current_device() always prints 0.

The current device defaults to cuda:0 unless you explicitly call torch.cuda.set_device(local_rank), which is why torch.cuda.current_device() prints 0 in every process. When using DDP, it is recommended to either a) call set_device upfront, or b) set CUDA_VISIBLE_DEVICES to local_rank before calling any torch.cuda APIs or creating CUDA tensors. If you do a), each process still sees all devices, but tensor.cuda() moves the tensor to the specified device by default. If you do b), each process sees only the one physical GPU that corresponds to its local_rank, i.e., cuda:0 in different processes maps to a different physical device.

Both solutions work with DDP, as long as you make sure that each DDP instance operates exclusively on a dedicated device.
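
For illustration, here is a minimal sketch of options a) and b). It assumes the job is started by a launcher such as torchrun that sets the LOCAL_RANK environment variable, and that CUDA_VISIBLE_DEVICES is not already restricted by the launcher. The helper names (read_local_rank, pin_with_set_device, pin_with_visible_devices, report) are made up for this example and are not part of PyTorch.

```python
import os


def read_local_rank(default: int = 0) -> int:
    """Read the per-node rank from LOCAL_RANK (assumed to be set by the launcher)."""
    return int(os.environ.get("LOCAL_RANK", default))


def pin_with_set_device(local_rank: int):
    """Option a): the process still sees every GPU, but local_rank becomes current."""
    import torch  # imported lazily so the other helpers work without PyTorch

    torch.cuda.set_device(local_rank)
    # After this call, torch.cuda.current_device() returns local_rank and
    # tensor.cuda() with no argument targets that GPU.
    return torch.device("cuda", local_rank)


def pin_with_visible_devices(local_rank: int) -> None:
    """Option b): expose a single physical GPU to this process.

    This has to run before the first CUDA call in the process. Afterwards the
    process sees one device, addressed as cuda:0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)


def report() -> None:
    """Hypothetical per-process check; call it from a script started by the launcher."""
    import torch

    local_rank = read_local_rank()
    if torch.cuda.is_available():
        pin_with_set_device(local_rank)
        print(
            f"local_rank={local_rank} "
            f"current_device={torch.cuda.current_device()} "
            f"device_count={torch.cuda.device_count()}"
        )
```

With option a), the model would typically be moved to the returned device and wrapped with device_ids=[local_rank]. With option b), every process would use cuda:0 and device_ids=[0], since that is the only device it can see.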