Use first available GPU

I am using PyTorch on a GPU cluster. When we submit a job, it runs on a machine with 4 GPUs. In the code, I do torch.device('cuda'). The problem is that this uses the GPU with id 0, but sometimes only the second or third GPU is available.

I’d like my code to choose the first available GPU. I could iterate over torch.device('cuda:1'), torch.device('cuda:2'), etc., but it’s not very clean.

Do you have any solutions?

There is a difference between the global GPU index (the physical GPU on the machine) and the local one PyTorch uses. If your GPU cluster is configured correctly, it should set the environment variable CUDA_VISIBLE_DEVICES to track which GPUs are available.

The PyTorch index must be given with respect to the visible GPUs.

For example, if GPUs 0 and 3 were available, the environment variable CUDA_VISIBLE_DEVICES would be set to "0,3". If you now specify a device with "cuda:0", you get the first device in that list (GPU 0); with "cuda:1" you get the second device in that list, which is GPU 3 in the global context.
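This mapping can be sketched in plain Python (an illustration of the indexing logic, not PyTorch's actual internals; the function name is made up for this example):

```python
import os

def local_to_global(local_index: int) -> int:
    """Map a PyTorch-local device index (e.g. the 1 in 'cuda:1')
    to the global GPU id, based on CUDA_VISIBLE_DEVICES."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No restriction set: local and global indices coincide.
        return local_index
    ids = [int(x) for x in visible.split(",") if x.strip()]
    return ids[local_index]

# With CUDA_VISIBLE_DEVICES="0,3", 'cuda:1' refers to global GPU 3.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,3"
print(local_to_global(0))  # 0
print(local_to_global(1))  # 3
```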

After this short explanation: if this environment variable is set, you should always be able to use "cuda:0", as long as you don’t overwrite CUDA_VISIBLE_DEVICES manually.


Thanks for your answer. It seems that the environment variable is not set on the cluster; I’ll ask the admins.

I’d say it’s set, as it’s a standard CUDA variable.
Actually, it’s more like a flag: there is no default value, so if you check it, it probably won’t be defined. However, if you launch with
CUDA_VISIBLE_DEVICES=id1,…,idn python
that Python process will only see those devices.
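The same restriction can also be applied from inside Python, provided it happens before CUDA is initialized (a sketch; the device ids here are just examples, not a recommendation for a scheduled cluster):

```python
import os

# Must be set before the first CUDA call (i.e. before torch.cuda is
# initialized); the process will then only see global GPUs 1 and 3,
# exposed locally as 'cuda:0' and 'cuda:1'.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1,3
```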

Yeah, but this won’t work if the cluster schedules the resources itself, because then you would overwrite its choice, which could cause other jobs (and your own) to crash.

Yes, the cluster schedules resources by itself.

But in that case you will be assigned exactly the number of GPUs you need, and the environment won’t be able to see the rest of them. So?