Arguments are located on different GPUs

While training a model on multiple GPUs, I get this error:

  File "/home/ashwin/anaconda3/lib/python3.6/site-packages/torch/autograd/", line 120, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ashwin/anaconda3/lib/python3.6/site-packages/torch/autograd/", line 81, in backward
    variables, grad_variables, retain_graph, create_graph)
RuntimeError: arguments are located on different GPUs at /home/ashwin/pytorch/aten/src/THC/generated/../generic/

I just call model.cuda(), and when loading variables I use Variable(<param_name>).cuda().
Training on a single GPU works fine, but on multiple GPUs it gives this error. Any help is appreciated.
I can also see that torch.cuda.device_count() returns 4 and torch.cuda.current_device() returns 0.


Could you post a small code snippet creating this error?

I am not sure exactly where this error happens. Since it is thrown after loss.backward(), let me show how I calculate the loss:

inputs = Variable(inputs).cuda()
net(inputs, truth_boxes, truth_labels, truth_instances)
loss = net.loss(inputs, truth_boxes, truth_labels, truth_instances)

I do not use nn.DataParallel to parallelize the whole model. Instead, I parallelize each component module with the data_parallel function from nn.parallel.data_parallel, because some components are C wrappers that throw an error when I wrap the whole model in nn.DataParallel.

Could you re-run the code with CUDA_LAUNCH_BLOCKING=1 python
Using this, you should find the exact location where the error was thrown.
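
The same setting can also be applied from inside the script instead of on the command line. A minimal sketch, assuming the training entry point is a Python script; `enable_blocking_launches` is a hypothetical helper name, not part of PyTorch:

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    The variable is read when CUDA is initialized, so it should be set
    before the process does any CUDA work (to be safe, before importing torch).
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the training script, before any CUDA work.
enable_blocking_launches()
```

With kernel launches made synchronous, the Python traceback should point at the operation that actually failed rather than at a later call.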


Thanks. I tried running the command but the program hangs for a long time.

Maybe it hangs because of the blocking statement combined with data parallel.
Sorry if that is the case.

Could you post a small code snippet reproducing this error?
Currently I have no clue how to debug your error.

Can you run with CUDA_VISIBLE_DEVICES=0 python and report the result?

Setting CUDA_VISIBLE_DEVICES=0 works, but it uses only one GPU (obviously). If I set CUDA_VISIBLE_DEVICES=0,1,2,3, it throws the error mentioned in the post.
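
To narrow down how many GPUs are needed to trigger the error, the same script can be run with different CUDA_VISIBLE_DEVICES values. A rough sketch, assuming the script can be launched as a child process; `env_for_gpus`, `run_with_gpus` and `train.py` are placeholder names, not something from the original post:

```python
import os
import subprocess
import sys


def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment restricted to the given GPU ids."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_with_gpus(script, gpu_ids):
    """Run `script` in a child process that only sees `gpu_ids`; return its exit code."""
    result = subprocess.run([sys.executable, script], env=env_for_gpus(gpu_ids))
    return result.returncode


# Example sweep (train.py is a placeholder for the actual script):
# for ids in ([0], [0, 1], [0, 1, 2, 3]):
#     print(ids, run_with_gpus("train.py", ids))
```

If the run with two GPUs already fails, the problem is not specific to using all four devices.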

I think this is the snippet that causes the error. If I don't use it, the model works on multiple GPUs.

class RcnnMultiHead(nn.Module):
    def __init__(self, scales, cfg, in_channels):
        super(RcnnMultiHead, self).__init__()
        self.scales = scales
        self.num_classes = cfg.num_classes
        self.crop_size = cfg.rcnn_crop_size
        self.fc1s = nn.ModuleList()
        self.fc2s = nn.ModuleList()
        self.logits = nn.ModuleList()
        self.deltas = nn.ModuleList()
        for i in range(self.scales):
            self.fc1s.append(nn.Linear(in_channels * cfg.rcnn_scales[i] * cfg.rcnn_scales[i], 784))
            self.fc2s.append(nn.Linear(784, 784))
            self.logits.append(nn.Linear(784, self.num_classes))
            self.deltas.append(nn.Linear(784, self.num_classes * 4))
    def forward(self, crops):
        logits_flat = []
        deltas_flat = []
        for i in range(self.scales):
            crop = crops[i]
            x = crop.view(crop.size(0), -1)
            x = F.relu(self.fc1s[i](x), inplace=True)
            x = F.relu(self.fc2s[i](x), inplace=True)
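            x = F.dropout(x, 0.5, training=self.training)  # reconstructed: the argument was cut off in the post
            logit = self.logits[i](x)
            delta = self.deltas[i](x)
            logits_flat.append(logit)  # reconstructed: collect the per-scale outputs
            deltas_flat.append(delta)
        logits_flat = torch.cat(logits_flat, 0)  # reconstructed: the call name was lost in the post
        deltas_flat = torch.cat(deltas_flat, 0)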
        return logits_flat, deltas_flat

class RcnnFpn(nn.Module):
    def __init__(self, in_dim):
        super(RcnnFpn, self).__init__()
        self.in_dim = in_dim
        self.c1 = nn.Conv2d(256, 384, kernel_size=3, stride=2, padding=1)
        self.p1 = nn.Conv2d(512, 256, kernel_size=1, stride=1, padding=0)
        self.c1_2 = LateralBlock(384, 256, self.in_dim)
        self.c2 = nn.Conv2d(384, 512, kernel_size=3, stride=2, padding=1)
        self.c2_2 = LateralBlock(256, self.in_dim, self.in_dim)
    def forward(self, x):
        c1_out = F.leaky_relu(self.c1(x))
        c2_out = F.leaky_relu(self.c2(c1_out))
        p1 = self.p1(c2_out)
        p2 = self.c1_2(c1_out, p1)
        p3 = self.c2_2(x, p2)
        features = [p1, p2, p3]
        return features

class Model(nn.Module):
    def __init__(self, cfg):
        super(Model, self).__init__()
        self.rcnn_crop = CropRoi(cfg, cfg.rcnn_crop_size)
        self.rcnn_head = RcnnMultiHead(3, cfg, crop_channels)
        self.rcnn_fpn = RcnnFpn(in_dim=256)
    def forward(self, x):
        rcnn_crops = self.rcnn_crop(features, self.rpn_proposals)
        rcnn_features = data_parallel(self.rcnn_fpn, rcnn_crops)
        self.rcnn_logits, self.rcnn_deltas = data_parallel(self.rcnn_head, rcnn_features)  

self.rcnn_fpn and self.rcnn_head are called through the data_parallel function from nn.parallel.data_parallel, and that is what causes the error.

I cannot run the code, since some classes are undefined.
Could you post a placeholder for LateralBlock and CropRoi?
Also could you provide the settings for cfg?

I believe it is Heng’s implementation of Mask R-CNN for the Kaggle DS Bowl 2018. I ran into exactly the same error but couldn’t find the root cause. My situation is even stranger, as loss.backward() fails only occasionally. My assumption is that the gather operation in data_parallel was not performed correctly and some data was left on other GPUs, which leads to this error. Any suggestions on how to debug a complicated network, and on what the cause might be, are appreciated.
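
One way to test the assumption about data being left on other GPUs is to list which device each tensor lives on just before loss.backward(). A minimal sketch, assuming the tensors expose a `.device` attribute (available in PyTorch 0.4.0 and later); `group_by_device` and `assert_single_device` are hypothetical helper names:

```python
from collections import defaultdict


def group_by_device(named_tensors):
    """Group tensor names by the string form of their `.device` attribute.

    `named_tensors` is any iterable of (name, tensor) pairs, for example
    `model.named_parameters()` or the `.items()` of a dict of outputs.
    """
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.device)].append(name)
    return dict(groups)


def assert_single_device(named_tensors):
    """Raise a readable error if the tensors live on more than one device."""
    groups = group_by_device(named_tensors)
    if len(groups) > 1:
        raise RuntimeError("tensors are spread over several devices: %r" % groups)
    return groups
```

Passing the gathered outputs and the targets that feed the loss to `assert_single_device` would show whether anything stayed on a non-default GPU. This only checks the tensors it is given, so it cannot rule out a mismatch deeper inside the graph.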

I have the same problem with this code (PyTorch built from source):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np

class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.conv = nn.Conv2d(3, 3, 3, 1, 0)

    def forward(self, x):
        size = [int(s * 0.5) for s in x.shape[2:]]
        a = self.conv(x)
        b = F.upsample(x, size=size, mode='bilinear', align_corners=True)
        b = self.conv(b)
        c = F.upsample(b, size=a.shape[2:], mode='bilinear', align_corners=True)

        return a, b, c

data = torch.rand(5, 3, 32, 32).cuda()

data = Variable(data)

model = MyModule()
model = nn.DataParallel(model)

outputs = model(data)

loss = 0

target_a = np.random.randint(0, 3, size=(5, 30, 30))
target_a = torch.from_numpy(target_a).long().cuda()
target_a = Variable(target_a, requires_grad=False)
loss += F.nll_loss(F.log_softmax(outputs[0], dim=1), target_a, ignore_index=-1)

target_b = np.random.randint(0, 3, size=(5, 14, 14))
target_b = torch.from_numpy(target_b).long().cuda()
target_b = Variable(target_b, requires_grad=False)
loss += F.nll_loss(F.log_softmax(outputs[1], dim=1), target_b, ignore_index=-1)

target_c = np.random.randint(0, 3, size=(5, 30, 30))
target_c = torch.from_numpy(target_c).long().cuda()
target_c = Variable(target_c, requires_grad=False)
loss += F.nll_loss(F.log_softmax(outputs[2], dim=1), target_c, ignore_index=-1)

loss.backward()  # reconstructed: the traceback below is raised inside backward



Traceback (most recent call last):
  File "", line 68, in <module>
  File "/opt/conda/envs/pytorch-py36/lib/python3.6/site-packages/torch/", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/pytorch-py36/lib/python3.6/site-packages/torch/autograd/", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: arguments are located on different GPUs at /root/pytorch/aten/src/THC/generated/../generic/

Process finished with exit code 1

Any idea? @ptrblck

I tried your code on my machine and it’s working.
What PyTorch version are you using?

I use the source version (0.4.0a0+e46043a) and multiple GPUs (with one it works fine…)

Ok, I tested it with 4 GPUs on a server running 0.3.1_post2 and it worked fine.
I’ll try to compile your version and run it again.

I compiled 0.4.0a0+e46043a and got the same error.
The code runs fine in the current master (0.5.0a0+a4dbd37).

Probably it’s a known bug that has already been fixed. Could you try to update PyTorch?


Works fine with 0.5.0a0+3b63be0 :+1:
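
The builds mentioned in this thread can be told apart by their version strings. A minimal sketch, assuming version strings shaped like the ones quoted above; `release_tuple`, `is_reported_failing` and `REPORTED_FAILING_BUILD` are hypothetical names, and the check only records what was reported here, not an official list of affected versions:

```python
import re

# The single build reported as failing in this thread (not an official list).
REPORTED_FAILING_BUILD = "0.4.0a0+e46043a"


def release_tuple(version):
    """Extract (major, minor, patch) from strings like '0.5.0a0+a4dbd37'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def is_reported_failing(version):
    """True only for the exact build that reproduced the error in this thread."""
    return version == REPORTED_FAILING_BUILD
```

In a script, `torch.__version__` can be passed to both functions to see whether the installed build is the one that reproduced the error.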


Have you solved this problem? I have run into the same problem.

Hi ptrblck, I also came across this problem. Is there an easy way to update PyTorch from 0.4.0a0+e46043a to 0.5.0a0+a4dbd37? Every time I try to update PyTorch, it takes me a lot of time. Thank you.