I have about 10 GPU device indices to run in distributed mode, and I need to use all the GPU machines available. The problem is that torch.cuda.device_count() returns 1. I verified that the environment variable does have the proper values (1,2,3,4,5,6,7,8,9,10, indicating all 10 device indices).
Can anyone tell me what's going wrong here? I really appreciate your time.
Hi,
Just to clarify: you have 10 nodes, and each node has a few GPUs (how many)? (torch.cuda.device_count() returns the number of GPU devices on a given machine.)
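For reference, a minimal check you can run on each node (assuming PyTorch is installed with CUDA support):

```python
import torch

# Counts only the GPUs visible to this process on this machine;
# it does not aggregate devices across the cluster.
print(torch.cuda.device_count())

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```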
There are 4 nodes and 10 GPU indices in total.
Two nodes have 3 GPU indices each.
Two nodes have 2 GPU indices each.
If everything is set up properly, torch.cuda.device_count() should return 2 or 3 (depending on the node), not 1 or 10.
What environment variables do you mean here?
I’m referring to the environment variable CUDA_VISIBLE_DEVICES.
It is set to 1,2,3,4,5,6,7,8,9,10, and I verified this while debugging.
So I’m not sure what is going wrong here. The device count is correct if I print it from the console, but I get 1 during code execution. There’s nothing in the code that could mess up the device count.
- Is there any way the device count could get modified somewhere? (For example, through CUDA_VISIBLE_DEVICES?)
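Here is roughly what the check in my code looks like (simplified):

```python
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # prints 1,2,3,4,5,6,7,8,9,10
print(torch.cuda.device_count())               # prints 1, not the expected count
```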
> If everything is set up properly, torch.cuda.device_count() should return 2 or 3 (depending on the node)
Can you elaborate more on the setup? I’m trying to figure out what is causing the issue.
A couple of things: CUDA_VISIBLE_DEVICES is 0-based, so it should be set to something like “0,1,…”. Also, CUDA stops honoring the list at the first invalid index, so on a 2-GPU node the value “1,2,…,10” leaves only device 1 visible, which would explain the count of 1 you’re seeing.
You have 4 machines with 3, 3, 2, and 2 GPUs respectively, so CUDA_VISIBLE_DEVICES should be set on each machine independently (using that machine’s local indices), or you can simply omit it and things should work as well (you get all available devices by default).
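A minimal sketch of the per-node setup (the index lists here are assumptions based on your 3/3/2/2 layout):

```python
# Run on each node; set the visible devices before the first CUDA call
# (e.g., before importing torch), using that node's local 0-based indices.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"  # "0,1" on the 2-GPU nodes

import torch
print(torch.cuda.device_count())  # expected: 3 here, 2 on the 2-GPU nodes
```

Equivalently, export the variable in the shell before launching the script on each node; either way, each node only ever sees its own local GPUs, numbered from 0.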