Tensors are on different GPUs

When I run my code with nn.DataParallel(resnet, device_ids=[1, 2, 3]).cuda(), I run into this problem. My input and target are set up as shown below:

img = Variable(img).cuda()
label = Variable(label).cuda()
Traceback (most recent call last):
  File "trainer.py", line 53, in <module>
    outputs = resnet(rmap)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 92, in forward
    outputs = self.parallel_apply(replicas, scattered, gpu_dicts)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 102, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 50, in parallel_apply
    raise output
RuntimeError: tensors are on different GPUs
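
For reference, here is roughly the shape of the failing setup (the resnet and the fake data below are just placeholders for my real code; the point is that DataParallel gets device_ids=[1, 2, 3] while plain .cuda() puts everything on GPU 0):

import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.models as models

resnet = models.resnet18()
resnet = nn.DataParallel(resnet, device_ids=[1, 2, 3]).cuda()  # .cuda() puts the module on GPU 0

img = Variable(torch.randn(8, 3, 224, 224)).cuda()    # also GPU 0
label = Variable(torch.LongTensor(8).zero_()).cuda()  # also GPU 0

outputs = resnet(img)  # raises "tensors are on different GPUs" on the affected version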

Did you mean to do device_ids=[0, 1, 2]?

Yes. I used device_ids=[0, 1, 2]. It gives me this problem.

I ran into the same problem in my program yesterday. It seems that this error only happens when device_ids[0] is not 0.

I tried to debug my code with pdb and found that DataParallel.forward may fail to replicate the original model’s parameters onto GPU device_ids[0] if device_ids[0] != 0. You can check it at line 33 of torch/nn/parallel/replicate.py: param_copies[0].get_device() is always 0 after executing param_copies = Broadcast(device_ids)(param), no matter what device_ids[0] is.

(Pdb) l
 29         for param in module.parameters():
 30             if param in seen_params:
 31                 continue
 32             seen_params.add(param)
 33 B           param_copies = Broadcast(device_ids)(param)
 34  ->         for param_copy, remap in zip(param_copies, param_remap):
 35                 remap[param] = param_copy
(Pdb) p param_copies[0].get_device()
0
(Pdb) p param_copies[1].get_device()
3
(Pdb) p device_ids
[2, 3]
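
If you want to reproduce the check outside of pdb, roughly the same thing can be done with torch.cuda.comm.broadcast, which (as far as I can tell) is what the Broadcast function uses under the hood; the device ids [2, 3] below just mirror the session above:

import torch
import torch.cuda.comm as comm

param = torch.randn(4, 4).cuda()         # .cuda() puts it on GPU 0 by default
copies = comm.broadcast(param, [2, 3])   # ask for copies on GPUs 2 and 3
print([c.get_device() for c in copies])  # prints [0, 3] on the affected version, not [2, 3]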

I am still working on this and have not found a solution so far.

I got the same problem. Is this a bug in PyTorch?

I just found where the bug is and opened a new issue asking about it here.
If you can’t wait and need to run your code right now, you can change every xx.cuda() to xx.cuda(device=gpus[0]) in your training function. This avoids the problem mentioned in the issue, but I have to say it’s not really a good solution.
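
Something like this is what I mean (gpus is just whatever list you pass as device_ids; resnet and loader are placeholders for your own model and DataLoader):

import torch.nn as nn
from torch.autograd import Variable

gpus = [1, 2, 3]
model = nn.DataParallel(resnet, device_ids=gpus).cuda(gpus[0])

for img, label in loader:
    img = Variable(img).cuda(gpus[0])      # instead of plain .cuda()
    label = Variable(label).cuda(gpus[0])  # instead of plain .cuda()
    outputs = model(img)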


Hi!
I am having the same problem trying to run a module on 2 GPUs. I tried changing every cuda() reference to cuda(device_id=0) but I’m still getting the same error.
RuntimeError: tensors are on different GPUs
I am not sure I even understand the reason why this error happens, but would you mind telling me why moving all cuda tensors to GPU 0 fixes it in your case?

Please change your code to xx.cuda(device=gpus[0]) instead of xx.cuda(device=0).

In my case, this problem only happens if I run the parallel model but gpus[0] != 0 (i.e. gpus=[1, 2, 3]).
As I mentioned in the issue, the broadcast function in PyTorch will ignore gpus[0] if the tensor is already on a GPU device. This leaves the data and the model parameters on different GPU devices by default (since cuda() usually puts a tensor on GPU 0).


By moving all Variables/tensors and the model onto gpus[0] (not GPU 0), the first model replica ends up on the device you actually selected (gpus=[xx, xx, xx]), even though the broadcast function still ignores it.

Got it! Thanks a lot for the explanation :slight_smile:

Has this been fixed in PyTorch?

I did model=nn.DataParallel(model, device_ids=gpus).cuda(device_id=gpus[0])

All the data I pass to the model is on device 0 (gpus[0]). I still get “arguments are located on different GPUs”.

What’s wrong??

Does your model have any Variable or Tensor buffers that you are using on self?
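
For example, a (made-up) module like this will hit the error under DataParallel, because the tensor stored directly on self is neither a parameter nor a registered buffer, so the replicas on the other GPUs still point at the copy on GPU 0:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.mask = torch.ones(16, 1, 1).cuda()  # plain attribute: stays on GPU 0
        # self.register_buffer('mask', torch.ones(16, 1, 1))  # a buffer would be moved/replicated

    def forward(self, x):
        return self.conv(x) * self.mask  # replicas on other GPUs still multiply by the GPU-0 tensor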

It also seems to happen when the Variable is a CUDA tensor but cuda() was not called on the network.


In case I do have such buffers, what is the solution?


My 2 cents’ worth: I’m used to creating lists of Conv objects, as I did in Keras, and placing them in a class attribute. It turns out this doesn’t work with PyTorch and DataParallel, since each module has to be added with add_module.


@jarrelscy you can create an nn.ModuleList as part of the class and put all your Conv objects inside it. It behaves like a Python list, but the modules are properly registered, so it works with DataParallel and parameters().
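
A small (made-up) example of the pattern; the commented-out plain list is invisible to parameters() and to DataParallel’s replication, the ModuleList version is not:

import torch.nn as nn

class ConvStack(nn.Module):
    def __init__(self):
        super(ConvStack, self).__init__()
        # a plain Python list would NOT be registered:
        # self.convs = [nn.Conv2d(64, 64, 3, padding=1) for _ in range(4)]
        self.convs = nn.ModuleList([nn.Conv2d(64, 64, 3, padding=1) for _ in range(4)])

    def forward(self, x):
        for conv in self.convs:
            x = conv(x)
        return x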


Thanks! That is a great tip.

I am also curious: what should I do if my model has tensors and Variables stored on self?


Has this been answered somewhere? (For the case of Variables defined on self.)