How to specify a GPU as the "main" GPU in DataParallel?

111414 · June 17, 2022, 6:26am

I train a CNN based on torch.nn.DataParallel and specify GPUs by the following code:

...
import os
...
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
...
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
...
model = torch.nn.DataParallel(model).to(device)
...

By default, the program uses GPU-0 as the “main” GPU (i.e., the outputs are finally gathered on GPU-0).

How can I manually change the “main” GPU to GPU-1?

I tried the following code, but it did not work as I expected: GPU-0 is still the “main” GPU.

...
import os
...
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '1,0'
...
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
...
model = torch.nn.DataParallel(model).to(device)
...

ptrblck · June 17, 2022, 6:42am

You would have to assign the index to the device as:

device = 'cuda:1'
model = torch.nn.DataParallel(model).to(device)

111414 · June 17, 2022, 7:13am

However, I want to use 2 GPUs (e.g. GPU-3 and GPU-4), and specify GPU-4 as the main GPU.

I encountered the following error with device = 'cuda:1':

device = cuda:1
gpu_num = 2
reading files...
training_image_num 91 read time 0.0007698535919189453
start training...
  0%|          | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 170, in <module>
    x_out = model(x, q)
  File "/home/ubuntu/anaconda3/envs/chenbin/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/chenbin/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

It seems that torch.nn.DataParallel requires the module's parameters and buffers to be on the first device in its device_ids list (device_ids[0]).

Is there a way to change the “main” GPU to be GPU-4, instead of GPU-3 (by default), when simultaneously using GPU-3 and GPU-4?

ptrblck · June 17, 2022, 7:19am

Yes, you can change the order of device_ids in this case and put the desired main GPU first (GPU-3 in this example):

import torch
import torch.nn as nn
import torchvision.models as models


device = 'cuda:3'
model = models.resnet18()
model = nn.DataParallel(model, device_ids=[3, 0, 1, 2, 4, 5, 6, 7]).to(device)

x = torch.randn(8, 3, 224, 224, device=device)
out = model(x)
print(out.device)
> cuda:3
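
Applied to the two-GPU case from the question (GPU-3 and GPU-4, with GPU-4 as the main one), the same idea could look roughly like the sketch below. The helper names main_first and wrap_data_parallel are hypothetical and not part of PyTorch. The sketch also assumes that indices 3 and 4 are visible to the process as-is, i.e. CUDA_VISIBLE_DEVICES does not remap them.

```python
def main_first(device_ids, main):
    """Return device_ids reordered so that `main` comes first (hypothetical helper)."""
    if main not in device_ids:
        raise ValueError(f"main device {main} is not in {device_ids}")
    return [main] + [d for d in device_ids if d != main]


def wrap_data_parallel(model, device_ids, main):
    """Wrap `model` in DataParallel with `main` as device_ids[0] (hypothetical helper)."""
    import torch  # imported lazily so main_first() can be used without PyTorch

    ordered = main_first(device_ids, main)
    # DataParallel expects parameters and buffers on device_ids[0],
    # so the wrapped model is moved to that device.
    return torch.nn.DataParallel(model, device_ids=ordered).to(f"cuda:{ordered[0]}")
```

With this, wrap_data_parallel(models.resnet18(), device_ids=[3, 4], main=4) would pass device_ids=[4, 3] and move the model to cuda:4. The inputs should then be created on cuda:4 as well. Since output_device defaults to device_ids[0], the outputs should also end up on cuda:4.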