I am trying to split my data across multiple GPUs, but my program only finds 1 GPU. Here is what I have so far:
os.environ["CUDA_VISIBLE_DEVICES"]
gives 0,1, which is correct, as I have 2 GPUs in the node I want to train on. However,
torch.cuda.device_count()
gives 1, which is not what I was expecting. I am setting the torch device as cuda and not specifying a device ID there.
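For reference, here is a minimal, self-contained version of the check I am describing (the CUDA_VISIBLE_DEVICES value is hard-coded purely for illustration; on the cluster it is set by the scheduler):

```python
import os

# CUDA_VISIBLE_DEVICES is only read when the CUDA context is first
# created, so it must be set before the first CUDA call in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # illustrative value only

# Number of devices the process should see, parsed from the variable.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # 2

try:
    import torch  # imported after setting the variable on purpose
    if torch.cuda.is_available():
        # With both GPUs exposed, this is expected to match len(visible).
        print(torch.cuda.device_count())
except ImportError:
    pass  # PyTorch not installed in this environment
```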
Are you able to use both devices in other applications? Also, what does nvidia-smi show?
So, this is a compute node in our HPC cluster, which I cannot access directly. nvidia-smi is able to see 2 GPUs (I am using this module to get the available GPUs).
This is my first time trying to leverage both GPUs, so I am not sure how to test that both are being used.
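The data splitting I have in mind follows what torch.nn.DataParallel does: it scatters each input batch into per-GPU chunks. A pure-Python sketch of that chunking, which needs no GPU to run (split_batch is my own illustrative helper, not a PyTorch API):

```python
def split_batch(batch, n_devices):
    # Split a batch into nearly equal chunks, one per device --
    # roughly how torch.nn.DataParallel scatters inputs across GPUs.
    chunk = (len(batch) + n_devices - 1) // n_devices  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

print(split_batch(list(range(8)), 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```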
I don’t think you can use PyTorch for this test, since it is currently returning a device count of 1. I was wondering whether you are able to use more than the one detected device in any other application.
Yes, I am able to use the two GPUs in another application.
Unfortunately, I don’t know what might be going wrong in that case. I would recommend either updating the drivers or trying a Docker container and checking whether both GPUs are found there.
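To compare the two environments, you could print the same short report on bare metal and inside the container; a minimal sketch (cuda_report is just an illustrative helper name):

```python
def cuda_report(torch_version, cuda_version, n_devices):
    # Format a one-line diagnostic to compare across environments
    # (bare metal vs. container).
    return f"torch={torch_version} cuda={cuda_version} gpus={n_devices}"

try:
    import torch
    print(cuda_report(torch.__version__, torch.version.cuda,
                      torch.cuda.device_count()))
except ImportError:
    # PyTorch not installed; show the expected shape with placeholders.
    print(cuda_report("2.x", "12.x", 2))
```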
Okay, thanks. If I manage to figure it out, I will post the solution here.