DataParallel: Arguments are located on different GPUs

Hi, all,

I got the following error when using parallel_model = nn.DataParallel(model). Running on one GPU works fine.

Traceback (most recent call last):
  File "main_m.py", line 113, in <module>
    train(epoch, train_batch_logger, train_loader)
  File "main_m.py", line 47, in train
    end_point = parallel_model(sample)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:253

My PyTorch version is 1.0. I checked all the similar previous topics, but none of them solved my issue. Any clue how to solve this problem?

Do you use any cuda() or to() calls inside your forward method?
Would it be possible to post a small executable code snippet so that we could debug this issue?

Thanks for your reply.

Yeah, I use data.to(device) for the input data inside forward().

I set the batch size to 24 and use 2 GPUs. I printed the label of each sample in this batch: the first label is on 'cuda:0' and the other 23 labels are on 'cuda:1'. When I torch.cat() these labels into one tensor, it ends up on 'cuda:1', and the output of the final layer is also on 'cuda:1'.

The error occurs at end_point, y = parallel_model(sample), where end_point is the output of the final layer and y is the concatenated labels.

Traceback (most recent call last):
  File "main_m.py", line 113, in <module>
    train(epoch, train_batch_logger, train_loader)
  File "main_m.py", line 47, in train
    end_point, y = parallel_model(sample)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:253

nn.DataParallel will create replicas of your model and chunk the input in dim0 into smaller batches for each GPU. If you use to(device) in your forward method, you force this tensor to be on a specific device, which will throw this error.
Usually you would push the input data to the default device in your training loop, not the forward method.
Is there a specific reason why you are using it in forward?
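
For reference, here is a minimal, self-contained sketch of that pattern (nn.Linear stands in for your model and the random tensor for your batch; with no GPU it simply runs on the CPU):

    import torch
    import torch.nn as nn

    # toy stand-in for the real model; any nn.Module behaves the same way
    model = nn.Linear(10, 2)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    parallel_model = nn.DataParallel(model).to(device)

    # push the batch to the default device in the training loop, not inside forward;
    # DataParallel then scatters it along dim0 to the available GPUs
    sample = torch.randn(24, 10).to(device)
    output = parallel_model(sample)
    print(output.shape)  # torch.Size([24, 2]); the output is gathered back on the default device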

Thanks. The input data is saved as a custom class object (it creates a data object from a Python dictionary), and the values have different dimensions for each key. When I generated the data loader using the default collate_fn, I got this error:
Traceback (most recent call last):
  File "main_m.py", line 118, in <module>
    train(epoch, train_batch_logger, train_loader)
  File "main_m.py", line 44, in train
    for i, sample in enumerate(train_loader):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 615, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 234, in default_collate
    raise TypeError((error_msg.format(type(batch[0]))))
TypeError: batch must contain tensors, numbers, dicts or lists; found <class 'MyData.Data'>

Then I modified the collate_fn as follows to get train_loader:
train_loader = DataLoader(train_dataset, batch_size=24, shuffle=True, collate_fn=lambda x:x)
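
For reference, this is equivalent to a small named collate_fn (list_collate is just a name for this sketch; train_dataset is my dataset from above):

    from torch.utils.data import DataLoader

    def list_collate(batch):
        # keep the batch as a plain Python list of Data objects
        # instead of letting default_collate try to stack them
        return batch

    train_loader = DataLoader(train_dataset, batch_size=24, shuffle=True,
                              collate_fn=list_collate)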

I tried to push the data to the device in the training loop using .cuda(), but I got this error:
Traceback (most recent call last):
  File "main_m.py", line 118, in <module>
    train(epoch, train_batch_logger, train_loader)
  File "main_m.py", line 46, in train
    sample = sample.cuda()
AttributeError: 'list' object has no attribute 'cuda'

So I used data.to(device) in the forward() method of the model as follows to push each sample to the device:
for i in range(len(sample)):
    data = sample[i].to(device)

Hopefully I explained the problem clearly. Any suggestions for this case?

One possible approach would be to use the device of another parameter of your model. E.g. assuming your model contains another layer called self.fc:

for i in range(len(sample)):
    data = sample[i].to(self.fc.weight.device)

Alternatively, you could also apply the for loop inside the training loop, but that might need some code changes.
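
A rough sketch of that alternative, assuming sample is the plain Python list returned by your collate_fn=lambda x: x and that each element has a .to() method (how DataParallel then scatters the list still needs checking, so treat this as a starting point):

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i, sample in enumerate(train_loader):
        # move every element of the list to the default device before the forward call,
        # so that forward itself contains no to()/cuda() calls
        sample = [s.to(device) for s in sample]
        optimizer.zero_grad()
        end_point, y = parallel_model(sample)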

Thanks, I used data = sample[i].to(self.fc.weight.device) as you suggested, but the error still occurs.

The following is a part of my code:

Define and parallelize the model:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net()
parallel_model = nn.DataParallel(model).to(device)

Training loop:
for i, sample in enumerate(train_loader):
    with autograd.detect_anomaly():
        optimizer.zero_grad()
        end_point, y = parallel_model(sample)

Model:
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.layer = layer(XXXXXXXX)
        self.conv2 = Conv2(XXXXXXXX)
        self.fc1 = torch.nn.Linear(16*512, 1024)

    def forward(self, sample):
        print(self.fc1.weight.device)

        for i in range(len(sample)):
            data = sample[i].to(self.fc1.weight.device)
            print(data.y)

            x_1 = self.layer(data.x, data.add, data.fea)
            x_2 = self.conv2(x_1)
            x_2 = x_2.view(-1, self.fc1.weight.size(1))
            x_2 = self.fc1(x_2)

            if i == 0:       # problem is here? how to return a y_1 in different GPUs
                y_1 = data.y
                end_point = x_2
            else:
                y_1 = torch.cat((y_1, data.y), 0)
                end_point = torch.cat((end_point, x_2), 0)

        return end_point, y_1

When I print(self.fc1.weight.device), I get the following (I am using 2 GPUs):
cuda:0
cuda:1

When I print(data.y), only one sample is on cuda:0:
tensor([47], device='cuda:0')
tensor([47], device='cuda:1')
tensor([25], device='cuda:1')
tensor([1], device='cuda:1')
tensor([33], device='cuda:1')
tensor([31], device='cuda:1')
tensor([78], device='cuda:1')
tensor([64], device='cuda:1')
tensor([16], device='cuda:1')
tensor([65], device='cuda:1')
tensor([73], device='cuda:1')
tensor([29], device='cuda:1')
tensor([77], device='cuda:1')

I guess the problem happens at torch.cat(), which returns results on different GPUs.

Do you have any suggestions? Or did I understand you correctly?

I struggled a lot with this error over the last few days. In the end I found that using self.register_buffer fixes the problem (see here for a similar thread that I learned from).

In detail, let's assume we have a tensor temp_tensor which is torch.ones(1, 2). To put temp_tensor on the right GPU under parallel_model, we need to register it in the module's __init__ like this:

self.register_buffer("temp_tensor", torch.ones(1, 2))
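
For example, here is a minimal sketch (AddOnes and all shapes are made up for illustration; with no GPU it just falls back to the wrapped module):

    import torch
    import torch.nn as nn

    class AddOnes(nn.Module):
        def __init__(self):
            super(AddOnes, self).__init__()
            # registered as a buffer, so every DataParallel replica
            # gets its own copy on its own GPU
            self.register_buffer("temp_tensor", torch.ones(1, 2))

        def forward(self, x):
            # temp_tensor already lives on the same device as x inside each replica
            return x + self.temp_tensor

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    parallel_model = nn.DataParallel(AddOnes()).to(device)
    out = parallel_model(torch.randn(8, 2).to(device))
    print(out.device)  # gathered back on the default device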

Hope it helps!