Use first available GPU

I am using PyTorch on a GPU cluster. When we submit a job, it runs on a machine with 4 GPUs. In the code, I do torch.device('cuda'). The problem is that this uses the GPU with id 0, but sometimes only the second or third GPU is available.

I’d like my code to choose the first available GPU. I could iterate over torch.device('cuda:1'), torch.device('cuda:2'), etc., but it’s not very clean.

Do you have any solutions?

There is a difference between the global GPU index (the physical GPU on the machine) and the local one PyTorch uses. If your GPU cluster is configured correctly, it should set the environment variable CUDA_VISIBLE_DEVICES to track which GPUs are available.

The PyTorch index must be given with respect to the visible GPUs.

For example, if GPUs 0 and 3 were available, the environment variable CUDA_VISIBLE_DEVICES would be set to "0,3". If you now specify a device with "cuda:0", you get the first device in that list (GPU 0); with "cuda:1" you get the second device in that list, which is GPU 3 in the global context.
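This mapping can be sketched in plain Python (an illustration of the indexing logic, not PyTorch's actual internals; the function name is made up for this example):

```python
import os

def local_to_global(local_index: int) -> int:
    """Map a PyTorch-local device index (e.g. the 1 in 'cuda:1')
    to the global GPU id, based on CUDA_VISIBLE_DEVICES."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No restriction set: local and global indices coincide.
        return local_index
    ids = [int(x) for x in visible.split(",") if x.strip()]
    return ids[local_index]

# With CUDA_VISIBLE_DEVICES="0,3", 'cuda:1' refers to global GPU 3.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,3"
print(local_to_global(0))  # 0
print(local_to_global(1))  # 3
```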

After this short explanation: if this environment variable is set, you should always be able to use "cuda:0", as long as you don’t overwrite CUDA_VISIBLE_DEVICES manually.


Thanks for your answer. It seems that the environment variable is not set on the cluster; I’ll ask the admins.

I’d say it’s set, as it’s a standard CUDA variable.
Actually, it’s more like a flag: there is no default value, so if you check it, it probably won’t be defined. However, if you launch with
CUDA_VISIBLE_DEVICES=id1,…,idn python
that Python process will only see those devices.
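The same restriction can also be applied from inside Python, provided it happens before CUDA is initialized (a sketch; the device ids here are just examples, not a recommendation for a scheduled cluster):

```python
import os

# Must be set before the first CUDA call (i.e. before torch.cuda is
# initialized); the process will then only see global GPUs 1 and 3,
# exposed locally as 'cuda:0' and 'cuda:1'.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1,3
```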

Yeah, but this won’t work if the cluster schedules the resources itself, because then you would overwrite its choice, which could cause other jobs (and your own) to crash.

Yes, the cluster schedules resources by itself.

But in that case you will be assigned exactly the number of GPUs you need, and the environment won’t be able to see the rest of them. So?