Got it! Thanks a lot for the explanation.
Has this been fixed in PyTorch?
I did model = nn.DataParallel(model, device_ids=gpus).cuda(device_id=gpus[0]).
All the data I pass to the model exists on device 0 (gpus[0]). I still get "arguments are located on different GPUs".
What's wrong?
Does your model have any Variable or Tensor buffers that you are using on self?
It also seems to happen when the Variable is a CUDA tensor but cuda() was not called on the network.
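For reference, here is a minimal sketch of the intended setup (MyModel is a placeholder, and this uses the modern tensor API rather than the older Variable API from this thread): both the wrapped model and the inputs should live on the first device in device_ids.

```python
import torch
import torch.nn as nn

gpus = [0, 1]                      # ids of the GPUs you want to use
model = MyModel()                  # placeholder for your own nn.Module

# Move the parameters to gpus[0]; DataParallel replicates them across gpus.
model = nn.DataParallel(model, device_ids=gpus).cuda(gpus[0])

# Inputs must also be on gpus[0]; DataParallel scatters them to the replicas.
x = torch.randn(8, 3, 32, 32).cuda(gpus[0])
out = model(x)
```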
In case I do have such buffers, what is the solution?
My 2 cents worth. I'm used to creating lists of Conv objects, as I did in Keras, and placing them into a class attribute. It turns out this doesn't work with PyTorch and DataParallel, since each module has to be registered using add_module.
@jarrelscy you can create an nn.ModuleList as part of the class and put all your Conv objects inside it. It behaves like a Python list, but registers its modules, so it works with DataParallel and parameters().
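For example, a minimal sketch of the difference (layer names and shapes are just for illustration): a plain Python list hides the convs from PyTorch, while nn.ModuleList registers them so .cuda(), parameters(), and DataParallel all see them.

```python
import torch
import torch.nn as nn

class ConvStack(nn.Module):
    def __init__(self, channels=16, depth=3):
        super().__init__()
        # A plain Python list would NOT register the convs as submodules,
        # so .cuda(), .parameters(), and DataParallel would miss them:
        #   self.convs = [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(depth)]
        # nn.ModuleList registers each one, so they move and replicate correctly.
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(depth)]
        )

    def forward(self, x):
        for conv in self.convs:
            x = torch.relu(conv(x))
        return x
```

Wrapped as nn.DataParallel(ConvStack().cuda(), device_ids=gpus), every conv in the list is replicated onto each GPU.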
Thanks! That is a great tip.
I am also curious: what should I do if my model has tensors and Variables stored on self?
Has this been answered anywhere, for the case of variables defined on self?
For .cuda(device_id=gpus[0]), what do you mean by gpus? As written it obviously throws an undefined-name error.
gpus is a list of the GPU ids you want to use, e.g. gpus=[1, 3].
I am also wondering if this issue is related to the problem I just posted here: [Solved] nn.DataParallel with ModuleList of custom modules fails on Multiple GPUs
Yes, in my case the tensors were placed on the GPU but the net was not moved to CUDA, so I got this error.
Hi, I also encountered this error, did you find any solution? Thanks!
Hi, can you please help me with how to use nn.ModuleList to get around the following error: "RuntimeError: tensors are on different GPUs"?
Hi @smth, I am facing this issue. I have a custom layer that undoes the standardization of the output, i.e. it performs x*std + mean. The tensors std and mean are attributes of this layer, and an error is thrown when the input reaches it: 'RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:1 and input b is on cuda:0'.
Do I need to copy both tensors to gpu:0, or is there another way?
Thanks.
Did you register mean and std as buffers using .register_buffer? That will move them to GPU-x when the module is wrapped in a DataParallel; otherwise PyTorch wouldn't know that they have to be moved.
Reference: https://pytorch.org/docs/stable/nn.html#torch.nn.Module.register_buffer
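For example, a minimal sketch of such a de-standardization layer using register_buffer (the class name and tensor shapes are illustrative, not the poster's actual code):

```python
import torch
import torch.nn as nn

class Destandardize(nn.Module):
    """Undoes standardization: x * std + mean."""
    def __init__(self, mean, std):
        super().__init__()
        # Buffers become part of the module's state, so .cuda()/.to() and the
        # per-device replication done by DataParallel move them automatically.
        self.register_buffer("mean", torch.as_tensor(mean))
        self.register_buffer("std", torch.as_tensor(std))

    def forward(self, x):
        return x * self.std + self.mean
```

With the buffers registered, each DataParallel replica gets its own copy of mean and std on its own device, so the cuda:0 vs cuda:1 mismatch disappears.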
I missed this. So this is the correct way to have variables in custom layers. Thanks @smth, it works.