Traceback (most recent call last):
  File "trainer.py", line 53, in <module>
    outputs = resnet(rmap)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 202, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 92, in forward
    outputs = self.parallel_apply(replicas, scattered, gpu_dicts)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 102, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 50, in parallel_apply
    raise output
RuntimeError: tensors are on different GPUs
I ran into the same problem in my program yesterday. It seems that this error only happens when device_ids[0] is not 0.
I tried to debug my code with pdb and found that DataParallel.forward may fail to replicate the original model’s parameters onto GPU device_ids[0] when device_ids[0] != 0. You can check line 33 of torch/nn/parallel/replicate.py: param_copies[0].get_device() is always 0 after executing param_copies = Broadcast(device_ids)(param), no matter what device_ids[0] is.
(Pdb) l
 29          for param in module.parameters():
 30              if param in seen_params:
 31                  continue
 32              seen_params.add(param)
 33 B            param_copies = Broadcast(device_ids)(param)
 34  ->          for param_copy, remap in zip(param_copies, param_remap):
 35                  remap[param] = param_copy
(Pdb) p param_copies[0].get_device()
0
(Pdb) p param_copies[1].get_device()
3
(Pdb) p device_ids
[2, 3]
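For reference, here is a minimal sketch of the kind of setup that hits this error; the model, input shape, and the assumption of a machine with at least four GPUs are mine, not from the original post:

import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.models as models

# Model parameters land on GPU 0, but DataParallel is told to use GPUs 2 and 3,
# i.e. device_ids[0] != 0.
resnet = models.resnet18().cuda()
resnet = nn.DataParallel(resnet, device_ids=[2, 3])

# .cuda() with no argument also puts the input on GPU 0; on affected PyTorch
# versions the first replica then ends up on the wrong device and forward()
# raises "RuntimeError: tensors are on different GPUs".
rmap = Variable(torch.randn(8, 3, 224, 224).cuda())
outputs = resnet(rmap)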
I am still working on this and have not found a solution so far.
I just found where the bug is and opened a new issue asking about it here.
If you can’t wait and need to run your code right now, you can change every xx.cuda() to xx.cuda(device=gpus[0]) in your training function. This avoids the problem mentioned in the issue, but I have to say it’s not really a good solution.
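Roughly, the workaround looks like this; this is only a sketch, assuming gpus = [2, 3] and a stand-in nn.Linear instead of the real network:

import torch
import torch.nn as nn
from torch.autograd import Variable

gpus = [2, 3]                                    # assumed device list; gpus[0] != 0

model = nn.Linear(128, 10)                       # stand-in for the real model
model = model.cuda(device=gpus[0])               # was: model.cuda()
model = nn.DataParallel(model, device_ids=gpus)

inputs = Variable(torch.randn(16, 128).cuda(device=gpus[0]))   # was: .cuda()
outputs = model(inputs)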
Hi!
I am having the same problem trying to run a module on 2 GPUs. I tried changing every cuda() reference to cuda(device_id=0), but I’m still getting the same error: RuntimeError: tensors are on different GPUs.
I am not sure I even understand the reason why this error happens, but would you mind telling me why moving all cuda tensors to GPU 0 fixes it in your case?
Please change your code to xx.cuda(device=gpus[0]) instead of xx.cuda(device=0).
In my case, this problem only happens if I run the parallel model with gpus[0] != 0 (e.g. gpus=[1,2,3]).
As I mentioned in the issue, the broadcast function in PyTorch will ignore gpus[0] if the tensor is already on a GPU device. This leaves the data and the model parameters on different GPU devices by default (since cuda() without arguments usually puts the tensor on GPU 0).
By moving all Variable tensors and models onto gpus[0] (not GPU 0), the first model replica ends up on the device you selected (gpus=[xx, xx, xx]), even though the broadcast function still ignores it.
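As a quick sanity check (a sketch, reusing the hypothetical gpus and model names from the workaround sketch above), you can confirm where the wrapped model’s parameters actually live:

# Assuming model = nn.DataParallel(net.cuda(device=gpus[0]), device_ids=gpus)
# as in the workaround sketch above:
first_param = next(model.module.parameters())
print(first_param.get_device())    # prints gpus[0] (e.g. 2), not 0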
My 2 cents’ worth: I’m used to creating lists of Conv objects, as I did in Keras, and placing them in a class attribute. It turns out this doesn’t work if you’re using PyTorch and DataParallel, since each submodule has to be registered with add_module.
@jarrelscy you can create an nn.ModuleList as part of the class and put all your Conv objects inside it. It behaves like a Python list, but registers its contents as submodules, so it works with DataParallel and parameters().
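A minimal sketch of that pattern; the channel counts, number of layers, and device IDs are arbitrary:

import torch
import torch.nn as nn
from torch.autograd import Variable

class ConvStack(nn.Module):
    def __init__(self):
        super(ConvStack, self).__init__()
        # nn.ModuleList registers each Conv2d as a submodule, so DataParallel,
        # parameters() and cuda() all see them; a plain Python list would not.
        self.convs = nn.ModuleList(
            [nn.Conv2d(16, 16, kernel_size=3, padding=1) for _ in range(4)]
        )

    def forward(self, x):
        for conv in self.convs:
            x = conv(x)
        return x

net = nn.DataParallel(ConvStack().cuda(), device_ids=[0, 1])
out = net(Variable(torch.randn(4, 16, 32, 32).cuda()))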