Hi, I am using a computation server managed with SLURM, where each node has 4 GPUs. I want my code to send the data and model to one or more GPUs. I assumed that torch.device("cuda") makes the device a GPU without specifying a particular device index (0, 1, 2, 3). I would like to make sure I understand the difference between these two commands correctly:
torch.device("cuda") # without specifying the cuda device number
torch.device("cuda:0") # use cuda device 0
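A minimal sketch of the difference: "cuda" with no index refers to the *current* CUDA device, which is index 0 unless it is changed (e.g. with torch.cuda.set_device), while "cuda:0" pins the index explicitly. Constructing the device objects does not require a GPU:

```python
import torch

dev_generic = torch.device("cuda")     # index left unset
dev_explicit = torch.device("cuda:0")  # index fixed to 0

print(dev_generic.index)   # None -- resolved to the current device when used
print(dev_explicit.index)  # 0
```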
Exactly. I launch a job with SLURM, so I cannot define CUDA_VISIBLE_DEVICES myself; SLURM simply sends my code to a node but does not define which GPU to use, and as you said it goes automatically to index 0. In other words, I can never use the other GPUs on the node.
You could try torch.cuda.device_count() to get the number of available GPUs, and maybe torch.cuda.get_device_name(device_id) to get the name of a given device.
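A quick check along those lines, assuming PyTorch is installed (device_count() is 0 on a CPU-only machine, so guard before asking for a device name):

```python
import torch

print(torch.cuda.device_count())          # number of visible GPUs
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of device 0
```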
Okay, so I am trying to run my model on the 2 GPUs I have. The following is my code, and it gives me TypeError: '<' not supported between instances of 'range' and 'int'
device_id = torch.cuda.device_count()
device = torch.cuda.get_device_name(range(device_id))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    model = model.to(device)
elif train_on_gpu:
    model = model.to(device)
torch.cuda.device_count() will give you the number of available devices, not a device number. range(n) will give you all the integers between 0 and n-1 (inclusive), which are all the valid device numbers.
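That is the source of the TypeError above: get_device_name expects a single device index (an int), not a range. A minimal sketch of querying each valid device number separately:

```python
import torch

n = torch.cuda.device_count()
device_ids = list(range(n))  # the valid device numbers, e.g. [0, 1] for n == 2

# get_device_name takes ONE index; passing range(n) raises the TypeError,
# so query each id on its own instead.
names = [torch.cuda.get_device_name(i) for i in device_ids]
print(names)
```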
Okay, another doubt: after I stored the GPU names in device with device = torch.cuda.get_device_name(range(device_id)), how do I use it for the model, as in the following call:
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
    model = model.to(device)
elif train_on_gpu:
    model = model.to(device)
If you have torch.cuda.device_count() == 1, then you can use only 0 as a device id.
If you have torch.cuda.device_count() == 2, then you can use 0 and 1 as valid device ids, etc.
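As a small illustration, the valid device strings can be built from those ids; constructing the device objects is just addressing and touches no GPU memory:

```python
import torch

n = torch.cuda.device_count()
valid_ids = list(range(n))  # e.g. [0, 1] when n == 2

# "cuda:i" is only meaningful for i in valid_ids; nothing is allocated
# until a tensor or model is actually moved to one of these devices.
devices = [torch.device("cuda:{}".format(i)) for i in valid_ids]
print(devices)
```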
okay, I am going to cry now.
Then how do I include multiple GPUs in this code, like in the following way:
device_id = torch.cuda.device_count()
if torch.cuda.device_count() == 1:
    model = model.to("cuda:{}".format(0))
elif torch.cuda.device_count() == 2:
    model = nn.DataParallel(model)
    model = model.to("cuda:{}".format(1))
But isn't it too hard-coded a way to do things?
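A less hard-coded sketch (using a toy nn.Linear as a stand-in for the real model): nn.DataParallel replicates over all visible GPUs by default, so no per-count branching is needed, and the same code falls back to the CPU cleanly:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy stand-in for the real model

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    # DataParallel uses all visible GPUs by default;
    # its "primary" copy must live on cuda:0
    model = nn.DataParallel(model)
model = model.to(device)
print(next(model.parameters()).device.type)
```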