How to specify GPUs with the torchrun command

When I use the `torchrun` command to run a .sh file in single-node multi-worker mode, it seems to start training on the first n GPUs by default when using `--nproc-per-node=n`. For some reason, my GPU 1 is already in use. I'm wondering how to use the torchrun command to train on specific GPUs (e.g. GPUs 2, 3, 4, 5).

To my knowledge, torchrun doesn't choose which device to use; it treats the Python script you are running as a black box. So your user code is probably doing something like `.cuda()`, which defaults to the first device. Instead, you should do something like `.to(local_rank)` to pin each worker process to its own device (one rank per device).
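A minimal sketch of what that looks like inside the training script: torchrun exports a `LOCAL_RANK` environment variable for each worker, which you can use to pick the device. The `Linear` layer here is just a stand-in for your model.

```python
import os

import torch

# torchrun sets LOCAL_RANK (0 .. nproc_per_node - 1) for each worker process
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Map this worker to its own GPU; fall back to CPU so the snippet
# also runs on a machine without CUDA
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

# Stand-in model: use .to(device) instead of a bare .cuda()
model = torch.nn.Linear(16, 4).to(device)
batch = torch.randn(8, 16, device=device)
out = model(batch)
print(out.shape)
```

With `--nproc-per-node=4`, rank 0 lands on `cuda:0`, rank 1 on `cuda:1`, and so on, so no two workers fight over the same device.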

Alternatively, you can restrict which GPUs are visible to the process via the `CUDA_VISIBLE_DEVICES` environment variable, or set the default device in code with `torch.cuda.set_device` (see the PyTorch documentation).
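For the original question (train on GPUs 2, 3, 4, 5 while GPU 1 is busy), the environment-variable route needs no code changes. A sketch of the launch command, assuming a training script named `train.py`:

```shell
# Expose only GPUs 2-5 to the job; inside the process they are
# renumbered as cuda:0 .. cuda:3, so --nproc-per-node=4 matches them
CUDA_VISIBLE_DEVICES=2,3,4,5 torchrun --standalone --nproc-per-node=4 train.py
```

Because CUDA renumbers the visible devices starting from 0, the `.to(local_rank)` pattern works unchanged under this launch command.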