Traceback (most recent call last):
  File "trainer.py", line 53, in <module>
    outputs = resnet(rmap)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 92, in forward
    outputs = self.parallel_apply(replicas, scattered, gpu_dicts)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 102, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 50, in parallel_apply
    raise output
RuntimeError: tensors are on different GPUs
I ran into the same problem in my program yesterday. It seems that this error only happens when device_ids[0] is not 0.
I tried to debug my code with pdb and found that DataParallel.forward may fail to replicate the original model’s parameters onto GPU device_ids[0] when device_ids[0] != 0. You can check line 33 of torch/nn/parallel/replicate.py: param_copies[0].get_device() is always 0 after executing param_copies = Broadcast(device_ids)(param), no matter what device_ids[0] is.
(Pdb) l
 29          for param in module.parameters():
 30              if param in seen_params:
 31                  continue
 32              seen_params.add(param)
 33 B            param_copies = Broadcast(device_ids)(param)
 34  ->          for param_copy, remap in zip(param_copies, param_remap):
 35                  remap[param] = param_copy
(Pdb) p param_copies[0].get_device()
0
(Pdb) p param_copies[1].get_device()
3
(Pdb) p device_ids
[2, 3]
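For reference, here is a minimal sketch of the kind of setup that hits this error; the model, input shape, and the assumption of a machine with at least four GPUs are mine, not from the original post:

import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.models as models

# Model parameters land on GPU 0, but DataParallel is told to use GPUs 2 and 3,
# i.e. device_ids[0] != 0.
resnet = models.resnet18().cuda()
resnet = nn.DataParallel(resnet, device_ids=[2, 3])

# .cuda() with no argument also puts the input on GPU 0; on affected PyTorch
# versions the first replica then ends up on the wrong device and forward()
# raises "RuntimeError: tensors are on different GPUs".
rmap = Variable(torch.randn(8, 3, 224, 224).cuda())
outputs = resnet(rmap)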
I am still working on this and have not found a solution so far.
I just found where the bug is and opened a new issue asking about it here.
If you can’t wait and need to run your code right now, you can change every xx.cuda() to xx.cuda(device=gpus[0]) in your training function. This avoids the problem mentioned in the issue, but I have to say it’s not really a good solution.
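Roughly, the workaround looks like this; this is only a sketch, assuming gpus = [2, 3] and a stand-in nn.Linear instead of the real network:

import torch
import torch.nn as nn
from torch.autograd import Variable

gpus = [2, 3]                                    # assumed device list; gpus[0] != 0

model = nn.Linear(128, 10)                       # stand-in for the real model
model = model.cuda(device=gpus[0])               # was: model.cuda()
model = nn.DataParallel(model, device_ids=gpus)

inputs = Variable(torch.randn(16, 128).cuda(device=gpus[0]))   # was: .cuda()
outputs = model(inputs)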
Hi!
I am having the same problem trying to run a module on 2 GPUs. I tried changing every cuda() reference to cuda(device_id=0), but I’m still getting the same error: RuntimeError: tensors are on different GPUs.
I am not sure I even understand the reason why this error happens, but would you mind telling me why moving all cuda tensors to GPU 0 fixes it in your case?
Please change your code to xx.cuda(device=gpus[0]) instead of xx.cuda(device=0).
In my case, this problem only happens if I run the parallel model with gpus[0] != 0 (e.g. gpus=[1,2,3]).
As I mentioned in the issue, the broadcast function in PyTorch will ignore gpus[0] if the tensor is already on a GPU device. This leaves the data and the model parameters on different GPU devices by default (since cuda() without arguments usually puts the tensor on GPU 0).
By moving all Variable tensors and models onto gpus[0] (not GPU 0), the first model replica ends up on the device you selected (gpus=[xx, xx, xx]), even though the broadcast function still ignores it.
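As a quick sanity check (a sketch, reusing the hypothetical gpus and model names from the workaround sketch above), you can confirm where the wrapped model’s parameters actually live:

# Assuming model = nn.DataParallel(net.cuda(device=gpus[0]), device_ids=gpus)
# as in the workaround sketch above:
first_param = next(model.module.parameters())
print(first_param.get_device())    # prints gpus[0] (e.g. 2), not 0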
My 2 cents’ worth: I’m used to creating lists of Conv objects, as I did in Keras, and placing them in a class attribute. It turns out this doesn’t work if you’re using PyTorch and DataParallel, since each submodule has to be registered with add_module.
@jarrelscy you can create an nn.ModuleList as part of the class and put all your Conv objects inside it. It behaves like a Python list, but registers its contents as submodules, so it works with DataParallel and parameters().
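A minimal sketch of that pattern; the channel counts, number of layers, and device IDs are arbitrary:

import torch
import torch.nn as nn
from torch.autograd import Variable

class ConvStack(nn.Module):
    def __init__(self):
        super(ConvStack, self).__init__()
        # nn.ModuleList registers each Conv2d as a submodule, so DataParallel,
        # parameters() and cuda() all see them; a plain Python list would not.
        self.convs = nn.ModuleList(
            [nn.Conv2d(16, 16, kernel_size=3, padding=1) for _ in range(4)]
        )

    def forward(self, x):
        for conv in self.convs:
            x = conv(x)
        return x

net = nn.DataParallel(ConvStack().cuda(), device_ids=[0, 1])
out = net(Variable(torch.randn(4, 16, 32, 32).cuda()))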