My code does not run on multiple GPUs


I trained PyTorch's Faster R-CNN on a dataset, and it works well on one GPU. I have access to a machine with 4 GPUs and want to use all of them, but when I check GPU usage, only one GPU is being used.

I select device in this manner:

    if not torch.cuda.is_available() and device_name == 'gpu':
        raise ValueError('GPU is not available!')
    elif device_name == 'cpu':
        device = torch.device('cpu')
    elif device_name == 'gpu':
        # each GPU must receive an equal share of the batch
        if batch_size % torch.cuda.device_count() != 0:
            raise ValueError('Batch size is not divisible by the number of GPUs')
        device = torch.device('cuda')

After that I do this:

    # multi GPUs
    if torch.cuda.device_count() > 1 and device_name == 'gpu':
        print('=' * 50)
        print('=' * 50)
        print('=' * 50)
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
        # model = nn.DataParallel(model, device_ids=[i for i in range(torch.cuda.device_count())])
        model = nn.DataParallel(model)
        print('=' * 50)
        print('=' * 50)
        print('=' * 50)

    # transfer model to selected device
    model.to(device)

I move data to the device in this way:

    # iterate over all batches
    counter_batches = 0
    for images, targets in metric_logger.log_every(data_loader, print_freq, header):

        # transfer tensors to device(gpu, if not available cpu)
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # in train mode, faster r-cnn gives losses
        loss_dict = model(images, targets)

        # sum of losses
        losses = sum(loss for loss in loss_dict.values())

I do not know what I did wrong.

Also, I get this warning:

    /site-packages/torch/nn/parallel/ UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
    warnings.warn('Was asked to gather along dimension 0, but all '
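
For context, this warning typically shows up when nn.DataParallel gathers scalar outputs (here, the per-GPU losses) from several replicas: each loss comes back as a vector with one entry per GPU instead of a single scalar. A minimal sketch of reducing such a loss dict to one scalar is below. The helper name `total_loss` and the simulated values are illustrative assumptions, not part of the original code.

```python
import torch


def total_loss(loss_dict):
    """Sum all losses after averaging any per-GPU vectors down to scalars."""
    # Under nn.DataParallel each value may have shape [num_gpus];
    # on a single device it is already 0-dim, and mean() leaves it unchanged.
    return sum(loss.mean() for loss in loss_dict.values())


# Simulated gathered output for 2 GPUs (hypothetical values, for illustration only).
gathered = {
    "loss_classifier": torch.tensor([0.5, 0.7]),
    "loss_box_reg": torch.tensor([0.2, 0.4]),
}
losses = total_loss(gathered)
print(losses.dim())  # number of dimensions of the reduced loss
```

Averaging each entry before summing keeps the result a 0-dim tensor, which is what a plain `.backward()` call expects.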

I think you’d want to wrap the model in DistributedDataParallel or somesuch.
TorchVision's git repository contains, in the references directory, the scripts used to train the pretrained models (an underused/underappreciated resource IMHO): vision/references/detection at main · pytorch/vision · GitHub. These do include multi-GPU training.

Best regards


DistributedDataParallel is for distributed systems, but all of my GPUs are in a single machine. Am I right?

Not really, see “use DDP instead” in CUDA semantics (PyTorch 1.10.0 documentation). DDP is also the recommended choice for multiple GPUs on a single machine.
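
A rough sketch of what single-machine DDP training could look like for the loop in the question is below. It assumes one process per GPU started by a launcher such as torchrun (which sets the LOCAL_RANK environment variable), the NCCL backend, and a dataset that yields (image, target) pairs. The names `setup_ddp`, `train`, `collate_fn`, and `batch_size_per_gpu` are hypothetical, and this is not the TorchVision reference script itself.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def collate_fn(batch):
    # Detection samples differ in size, so keep images and targets as tuples
    # instead of stacking them into one tensor.
    return tuple(zip(*batch))


def setup_ddp():
    """Join the process group using the env vars set by the launcher (assumed: torchrun)."""
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    return local_rank


def train(model, dataset, batch_size_per_gpu, num_epochs, lr=0.01):
    local_rank = setup_ddp()
    device = torch.device("cuda", local_rank)

    # Each process owns exactly one GPU and one model replica.
    model = DDP(model.to(device), device_ids=[local_rank])

    # The sampler hands every process a different shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size_per_gpu,
                        sampler=sampler, collate_fn=collate_fn)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    model.train()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for images, targets in loader:
            images = [image.to(device) for image in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            # In train mode, Faster R-CNN returns a dict of scalar losses.
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())

            optimizer.zero_grad()
            losses.backward()  # DDP averages gradients across processes here
            optimizer.step()

    dist.destroy_process_group()
```

With 4 GPUs, a script built around this (say, a hypothetical train_ddp.py that calls `train`) would be started with something like `torchrun --nproc_per_node=4 train_ddp.py`. Note that the batch size here is per GPU, so the effective batch size is 4 times `batch_size_per_gpu`.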

Hi @ptrblck,

Can you help me in this case?