Difference between torch.device("cuda") and torch.device("cuda:0")

Hi, I am using a computation server with multiple nodes, each of which has 4 GPUs, managed with SLURM. I want my code to send the data and model to one or multiple GPUs. I assumed that torch.device("cuda") selects a GPU without specifying a particular device index (0, 1, 2, 3). I would like to make sure I understand the difference between these two commands correctly.

torch.device("cuda") # without specifying the cuda device number 
torch.device("cuda:0") # use cuda device 0

Is that correct?


If you use torch.device("cuda"), your tensor will be placed on the current CUDA device, which is device 0 by default.
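The difference is visible in the device object itself. A small sketch (it only constructs torch.device objects, so it runs even without a GPU):

```python
import torch

# "cuda" alone carries no index: PyTorch resolves it to the current
# CUDA device (device 0 unless changed) at the point of use.
unspecified = torch.device("cuda")
explicit = torch.device("cuda:0")

print(unspecified.index)  # None: resolved lazily to the current device
print(explicit.index)     # 0: pinned to device 0
```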

Is there any way to automatically choose whatever GPU index is available?

What do you mean by available? Do you mean in terms of GPU utilization? Memory? Or simply existence?

Existence. I launch a job with SLURM, so I cannot set CUDA_VISIBLE_DEVICES; it simply sends my code to a node but does not specify which GPU to use, and as you said it goes automatically to index 0. In other words, I can never use the other GPUs on the node.

You could try torch.cuda.device_count() to get the number of GPUs available, and maybe torch.cuda.get_device_name(device_id) to get the name of a given device.

What will be the device_id in torch.cuda.get_device_name(device_id) then?

Any number in range(torch.cuda.device_count()).
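Putting those two calls together, a minimal sketch that lists every visible device (written so it also runs cleanly on a CPU-only machine, where the loop simply does nothing):

```python
import torch

num_gpus = torch.cuda.device_count()   # number of visible devices
device_ids = list(range(num_gpus))     # valid ids: 0 .. num_gpus - 1

for device_id in device_ids:
    # get_device_name expects one integer id at a time
    print(device_id, torch.cuda.get_device_name(device_id))
```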


Okay, so I am trying to run my model on the 2 GPUs I have. The following is my code, and it is giving me
TypeError: '<' not supported between instances of 'range' and 'int'

device_id = torch.cuda.device_count()
device = torch.cuda.get_device_name(range(device_id))

if torch.cuda.device_count()>1:
   model = nn.DataParallel(model)
   model = model.to(device)
elif train_on_gpu:
   model = model.to(device)

torch.cuda.device_count() will give you the number of available devices, not a device id.
range(n) will give you all the integers from 0 to n-1 (inclusive), which are all the valid device ids.


Yes, I am doing the same:

device_id = torch.cuda.device_count()
device = torch.cuda.get_device_name(range(device_id))

but it is throwing an error:
TypeError: '<' not supported between instances of 'range' and 'int'

get_device_name expects a single number corresponding to one device.
range() returns a range object (a sequence of numbers), not a single number, which is why the comparison fails.
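The error is reproducible without any GPU at all, since it comes from comparing a range object to an int:

```python
# get_device_name internally checks the id against device_count(),
# and a range object cannot be compared to an int with '<'.
try:
    range(2) < 2
except TypeError as e:
    print(e)  # '<' not supported between instances of 'range' and 'int'

# Passing one integer at a time is what the API expects:
for device_id in range(2):
    print(device_id)  # 0, then 1
```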


Ohh, sorry I am stupid, I forgot to include it in a loop

Okay, another doubt: after I stored the names of the GPUs in device with
device = torch.cuda.get_device_name(range(device_id)), how do I use it in the model, as in the following call?

if torch.cuda.device_count()>1:
   model = nn.DataParallel(model)
   model = model.to(device)
elif train_on_gpu:
   model = model.to(device)

You can just use the device id; you don't need the name.
You can do .cuda(device_id) or .to("cuda:{}".format(device_id)).
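For example, with a hypothetical device_id of 0 and a stand-in model, the two forms below are equivalent (guarded with is_available so the sketch runs anywhere):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in model for illustration
device_id = 0            # any valid id from range(torch.cuda.device_count())

if torch.cuda.is_available():
    model = model.cuda(device_id)                  # option 1
    model = model.to("cuda:{}".format(device_id))  # option 2, equivalent
```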

I am getting
RuntimeError: CUDA error: invalid device ordinal
when running it both ways.

If you have torch.cuda.device_count() == 1, then you can only use 0 as a device id.
If you have torch.cuda.device_count() == 2, then 0 and 1 are the valid device ids, and so on.
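In other words, a device id is valid only if it is strictly less than the device count; a quick check before using one (falling back to the CPU here is just an illustrative choice):

```python
import torch

device_id = 0  # the id you intend to use

if device_id < torch.cuda.device_count():
    device = torch.device("cuda:{}".format(device_id))
else:
    # an id >= device_count() triggers "invalid device ordinal"
    device = torch.device("cpu")

print(device)
```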

This is how I am doing -

device_id = torch.cuda.device_count()
if torch.cuda.device_count()>1:
   model = nn.DataParallel(model)

model = model.to("cuda:{}".format(device_id))

device_id = torch.cuda.device_count() does not give you a valid device id but the number of available devices!

Okay, I am going to cry now.
Then how do I include multiple GPUs in this code, like in the following way?

device_id = torch.cuda.device_count()
if torch.cuda.device_count() == 1:
   model = model.to("cuda:{}".format(0))
elif torch.cuda.device_count() == 2:
   model = nn.DataParallel(model)
   model = model.to("cuda:{}".format(1))

But it isn’t too much hard-coded way to do things?