Multi-GPU training

I am trying to split my data to run on multiple GPUs, but my program is only able to find 1 GPU. Here is what I have so far:

os.environ["CUDA_VISIBLE_DEVICES"]

This gives 0,1, which is correct, as I have 2 GPUs on the node I want to train on. However,

torch.cuda.device_count()

This gives 1, which is not what I was expecting. I am setting the torch device to cuda and not specifying a device ID there.
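
For completeness, here is roughly what the check looks like in my script (a simplified sketch; the prints just reproduce the values mentioned above):

import os
import torch

# Reports "0,1" in my case, as expected for a 2-GPU node
print(os.environ.get("CUDA_VISIBLE_DEVICES"))

# Unexpectedly reports 1
print(torch.cuda.device_count())

# I set the device generically, without a device ID
device = torch.device("cuda")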

Are you able to use both devices in other applications?
Also, what does nvidia-smi show?

So, this is a compute node in our HPC cluster, which I cannot access directly. nvidia-smi is able to see 2 GPUs (I am using this module to get the available GPUs).

This is my first time trying to use both GPUs, so I am not sure how I can test that both are actually being used.
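
For reference, this is the kind of check I had in mind; a rough sketch that just allocates a tensor on every device PyTorch reports, so it can only cover whatever device_count() returns:

import torch

for i in range(torch.cuda.device_count()):
    # Allocate a small tensor on each visible GPU and run a matmul on it
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    y = x @ x
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}, result on {y.device}")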

I don’t think you can use PyTorch for this test, since it’s currently returning a device count of 1.
I was wondering whether you are able to use more than the one detected device from any other application.

Yes, I am able to use the two GPUs in another application.

Unfortunately, I don’t know what might be going wrong in that case and would recommend either updating the drivers or trying a Docker container and checking whether both GPUs are found there.
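
A quick way to run that check would be something like the following (assuming Docker 19.03+ with the NVIDIA Container Toolkit and the public pytorch/pytorch image; adapt it to whatever container setup your cluster supports):

docker run --rm --gpus all pytorch/pytorch python -c "import torch; print(torch.cuda.device_count())"

If this prints 2 inside the container, that would point to the drivers or the local PyTorch install rather than the hardware.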

Okay, thanks. If I manage to figure it out, I will post the solution here.