How to specify GPUs with the torchrun command

When I use the `torchrun` command to run a .sh file in single-node multi-worker mode, it seems to start training on the first n GPUs by default when using `--nproc-per-node=n`. For some reason, my GPU 1 is already in use. I'm wondering how to use the torchrun command to train on specific GPUs (e.g. GPUs 2, 3, 4, 5).

To my knowledge, torchrun doesn't choose which device to use; it treats the Python script you are running as a black box. So your user code is probably doing something like `.cuda()`, which defaults to the first device. Instead, you should do something like `.to(local_rank)` to pin each worker process to its own device (one rank per device).
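A minimal sketch of what that looks like inside the training script: torchrun exports a `LOCAL_RANK` environment variable for each worker, which you can use to pick the device. The `Linear` layer here is just a stand-in for your model.

```python
import os

import torch

# torchrun sets LOCAL_RANK (0 .. nproc_per_node - 1) for each worker process
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Map this worker to its own GPU; fall back to CPU so the snippet
# also runs on a machine without CUDA
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

# Stand-in model: use .to(device) instead of a bare .cuda()
model = torch.nn.Linear(16, 4).to(device)
batch = torch.randn(8, 16, device=device)
out = model(batch)
print(out.shape)
```

With `--nproc-per-node=4`, rank 0 lands on `cuda:0`, rank 1 on `cuda:1`, and so on, so no two workers fight over the same device.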

Alternatively, you can restrict which GPUs are visible to the process via the `CUDA_VISIBLE_DEVICES` environment variable, or set the default device in code with `torch.cuda.set_device` (see the PyTorch documentation).
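For the original question (train on GPUs 2, 3, 4, 5 while GPU 1 is busy), the environment-variable route needs no code changes. A sketch of the launch command, assuming a training script named `train.py`:

```shell
# Expose only GPUs 2-5 to the job; inside the process they are
# renumbered as cuda:0 .. cuda:3, so --nproc-per-node=4 matches them
CUDA_VISIBLE_DEVICES=2,3,4,5 torchrun --standalone --nproc-per-node=4 train.py
```

Because CUDA renumbers the visible devices starting from 0, the `.to(local_rank)` pattern works unchanged under this launch command.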