Does DDP with torchrun need torch.cuda.set_device(device)?

I have a training script that runs on a single node with multiple GPUs, implemented following the PyTorch DDP tutorial. I run the script with torchrun --standalone --nproc_per_node=8 main.py

According to the docs:

To use DistributedDataParallel on a host with N GPUs, you should spawn up N processes, ensuring that each process exclusively works on a single GPU from 0 to N-1. This can be done by either setting CUDA_VISIBLE_DEVICES for every process or by calling:
torch.cuda.set_device(i) where i is from 0 to N-1.

However, this is not mentioned at all in the PyTorch tutorial on DDP.
Is this really needed? What does set_device() do in practice? What are the implications of not calling it? Should I set CUDA_VISIBLE_DEVICES instead?

I am trying to understand more on this topic, thank you in advance!

@coni Tutorial author here, thanks for bringing up this discrepancy! Although scripts can work without set_device(i), you should include it in your training script. Doing so changes the default GPU from cuda:0 to cuda:i. Not calling it can in some cases result in your training job hanging, or in excessive memory utilization on cuda:0.

I will update the tutorial to explicitly call this out.
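For reference, here is a minimal sketch of where the call typically goes in a torchrun-launched script (MyModel is just a placeholder; torchrun sets the LOCAL_RANK env var for each spawned process):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)           # default device becomes cuda:local_rank
dist.init_process_group(backend="nccl")

model = MyModel().cuda()                    # now lands on cuda:local_rank instead of cuda:0
model = DDP(model, device_ids=[local_rank])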


Thank you for the prompt response and for the tutorial!

@suraj.pt Thanks for the response, I have a follow-up.
What if I wanted to set env variable CUDA_VISIBLE_DEVICES, instead of calling set_device?

I have tried setting it on the command line by calling

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nproc_per_node=8 main.py

However, when checking the current device inside main.py with print(torch.cuda.current_device()), I always get 0. Is this behavior correct? Does cuda:0 in different processes map to a different physical device?

cuda:0 across all the processes maps to the same GPU. You are always getting 0 because cuda:0 is the default device. Unless you call set_device (or something equivalent like torch.device), this will not change.

See Adds torch.cuda.set_device calls to DDP examples by subramen · Pull Request #1142 · pytorch/examples · GitHub.

You can set CUDA_VISIBLE_DEVICES=i from each ith process, which will have the same effect as set_device. Replace torch.cuda.set_device(rank) with os.environ["CUDA_VISIBLE_DEVICES"] = str(rank) and it will set the visible devices for that specific process.
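A minimal sketch of that variant, assuming the env var is set before the first CUDA call in the process (MyModel is again just a placeholder):

import os

# must run before CUDA is initialized in this process
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
model = MyModel().cuda()            # the single visible GPU shows up as cuda:0
model = DDP(model, device_ids=[0])  # hence device_ids=[0], not [local_rank]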


Setting os.environ["CUDA_VISIBLE_DEVICES"]=os.environ["LOCAL_RANK"] still results in torch.cuda.current_device() equal to 0.

However, from this topic this behavior seems correct. It seems that by setting CUDA_VISIBLE_DEVICES directly “each process will only see one physical GPU that corresponds to its local_rank, i.e., cuda:0 in different processes will map to a different physical device”.

Is this correct?

each process will only see one physical GPU

Given the naming of the env var, that does make sense… I am not sure if set_device works differently than setting CUDA_VISIBLE_DEVICES.

Paging @rvarm1 / @mrshenli to chime in here

Adding something more to this: torch.device() produces the same behavior as manually setting CUDA_VISIBLE_DEVICES.

Summarizing:

torch.cuda.set_device(rank) # results in torch.cuda.current_device() equal to rank
torch.device(rank) # results in torch.cuda.current_device() equal to 0
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank) # results in torch.cuda.current_device() equal to 0

Setting CUDA_VISIBLE_DEVICES=id masks the available GPUs and allows the process to use only the visible device. This visible device will be mapped to cuda:0 inside your Python script and no other devices can be accessed, so torch.cuda.device_count() will also return 1. (Multiple visible devices will map to cuda:0, cuda:1, etc., in the same order specified in the env variable.)
Using torch.cuda.set_device will set the specified device as the current device and will execute all operations on it if no cuda:ID is specified explicitly. All other visible devices are still accessible in this process and your script.
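To illustrate the difference, assuming an 8-GPU node and using GPU index 3 as an arbitrary example:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # mask before CUDA init: only physical GPU 3 is visible
import torch
print(torch.cuda.device_count())          # 1
print(torch.cuda.current_device())        # 0 (the masked GPU appears as cuda:0)

# In a separate process without the mask:
# torch.cuda.set_device(3)
# torch.cuda.device_count()    -> 8 (all GPUs remain visible)
# torch.cuda.current_device()  -> 3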


@coni your findings match with @ptrblck’s explanation here.

Looks like CUDA_VISIBLE_DEVICES / torch.device() are actually more straightforward to use for typical DDP applications.

The only time you would need to use torch.cuda.set_device() instead of the above is when your script needs to access multiple GPUs from the same process - not a typical need when using torchrun.
That's probably why the docs also mention using CUDA_VISIBLE_DEVICES / torch.device() instead of torch.cuda.set_device().
