torch.cuda.device_count() returns 1 even though the environment variable CUDA_VISIBLE_DEVICES is set to 1,2,3,4,5,6,7,8,9,10

I have about 10 GPU host indices to run in distributed mode, and I need to use all the GPU machines available. The problem is that torch.cuda.device_count() returns 1. I verified that the environment variable has the proper value (1,2,3,4,5,6,7,8,9,10, indicating all 10 device indices).
Can anyone tell me what's going wrong here? I really appreciate your time.

Hi,

Just to clarify: do you have 10 nodes, each with a few GPUs (how many)? (torch.cuda.device_count() returns the number of GPU devices on a given machine.)
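
Since the count is per machine, a quick check is to log, from inside the script on each node, what that process actually sees. The sketch below is only a suggestion: the helper name report_cuda_env is hypothetical, and it assumes nothing beyond os.environ and torch.cuda.device_count(). If torch cannot be imported, the count is left as None.

```python
import os
import socket


def report_cuda_env():
    """Collect the GPU-related settings seen by this process (hypothetical helper)."""
    info = {
        "host": socket.gethostname(),
        # None means the variable is not set at all for this process.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Stays None if torch is not importable in this environment.
        "device_count": None,
    }
    try:
        import torch

        info["device_count"] = torch.cuda.device_count()
    except ImportError:
        pass
    return info


if __name__ == "__main__":
    print(report_cuda_env())
```

Running this both from the console and from the launched job on the same node should show whether the two processes see the same variable and the same count.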

There are 4 nodes and 10 GPU indices in total.
Two nodes have 3 GPU indices each.
Two nodes have 2 GPU indices each.

If everything is set up properly, torch.cuda.device_count() should return 2 or 3 respectively, not 1 or 10.

What environment variables do you mean here?

I’m referring to the environment variable CUDA_VISIBLE_DEVICES.
It is set to 1,2,3,4,5,6,7,8,9,10, and I verified this while debugging.

So, I’m not sure what is going wrong here. The device count is correct if I print it from the console, but I get 1 when the code runs. There’s nothing in the code that could mess up the device count.

  1. Is there any way the device count can get modified? (For example, through the use of CUDA_VISIBLE_DEVICES)
  2. If everything is set up properly, torch.cuda.device_count() should return 2 or 3 respectively

Can you elaborate more on the setup? I’m trying to figure out what is causing the issue.

A couple of things: I think the indices in CUDA_VISIBLE_DEVICES are 0-based, so it should be set to something like “0, 1, …”.

You have 4 machines with 3, 3, 2, and 2 GPUs respectively, so CUDA_VISIBLE_DEVICES should be set on each machine independently. Alternatively, you can simply omit CUDA_VISIBLE_DEVICES and it should work as well (you’ll get all available devices by default).
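
To illustrate why a 1-based list spanning all 10 indices can produce a small count, here is a rough model of how the list is interpreted on one machine. It assumes the behavior described in the CUDA documentation: indices refer to the local machine's GPUs starting at 0, and the first invalid index hides itself and every entry after it. The function visible_device_ids is a hypothetical helper written for this example; it only models the rule and does not query the driver.

```python
def visible_device_ids(cuda_visible_devices, physical_gpu_count):
    """Model which local GPU indices stay visible (hypothetical helper).

    Assumptions: entries are integer indices (UUID entries are not handled),
    and the first invalid entry hides itself and everything listed after it.
    """
    visible = []
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        if not entry.isdigit():
            break  # non-integer entry: treated as invalid in this model
        index = int(entry)
        if index >= physical_gpu_count:
            break  # index does not exist on this machine
        visible.append(index)
    return visible


# The value from the question, applied to a 3-GPU node and a 2-GPU node.
for gpus in (3, 2):
    ids = visible_device_ids("1,2,3,4,5,6,7,8,9,10", gpus)
    print(gpus, "physical GPUs ->", ids, "-> device count", len(ids))
```

Under this model, a 3-GPU node keeps indices 1 and 2, and a 2-GPU node keeps only index 1, which would match a reported count of 1. Setting the variable per machine (for example 0,1,2 on the 3-GPU nodes and 0,1 on the 2-GPU nodes) or leaving it unset avoids the problem.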