Multi-GPU training

I am trying to split my data to run on multiple GPUs, but my program is only able to find 1 GPU. Here is what I have so far:

os.environ["CUDA_VISIBLE_DEVICES"]

This gives 0,1, which is correct, as I have 2 GPUs on the node I want to train on. However,

torch.cuda.device_count()

This gives 1, which is not what I was expecting. I am setting the torch device to cuda and not specifying a device ID there.
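
For completeness, here is roughly what the check looks like in my script (a simplified sketch; the prints just reproduce the values mentioned above):

import os
import torch

# Reports "0,1" in my case, as expected for a 2-GPU node
print(os.environ.get("CUDA_VISIBLE_DEVICES"))

# Unexpectedly reports 1
print(torch.cuda.device_count())

# I set the device generically, without a device ID
device = torch.device("cuda")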

Are you able to use both devices in other applications?
Also, what does nvidia-smi show?

So, this is a compute node in our HPC cluster, which I cannot access directly. nvidia-smi is able to see 2 GPUs (I am using this module to get the available GPUs).

This is my first time trying to use both GPUs, so I am not sure how I can test that both are actually being used.
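
For reference, this is the kind of check I had in mind; a rough sketch that just allocates a tensor on every device PyTorch reports, so it can only cover whatever device_count() returns:

import torch

for i in range(torch.cuda.device_count()):
    # Allocate a small tensor on each visible GPU and run a matmul on it
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    y = x @ x
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}, result on {y.device}")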

I don’t think you can use PyTorch for this test, since it’s currently returning a device count of 1.
I was wondering whether you are able to use more than the one detected device from any other application.

Yes, I am able to use the two GPUs in another application.

Unfortunately, I don’t know what might be going wrong in that case and would recommend either updating the drivers or trying a Docker container and checking whether both GPUs are found there.
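
A quick way to run that check would be something like the following (assuming Docker 19.03+ with the NVIDIA Container Toolkit and the public pytorch/pytorch image; adapt it to whatever container setup your cluster supports):

docker run --rm --gpus all pytorch/pytorch python -c "import torch; print(torch.cuda.device_count())"

If this prints 2 inside the container, that would point to the drivers or the local PyTorch install rather than the hardware.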

Okay, thanks. If I manage to figure it out, I will post the solution here.