Tensors are on different GPUs

Got it! Thanks a lot for the explanation :slight_smile:

Has this been fixed in PyTorch?

I did model=nn.DataParallel(model, device_ids=gpus).cuda(device_id=gpus[0])

All the data I pass to the model exists on device 0 (gpus[0]). I still get "arguments are located on different GPUs".

What’s wrong??

Does your model have any Variable or Tensor buffers that you are using on self?

It also seems to happen when the Variable is a CUDA tensor, but .cuda() was not called on the network.
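
For illustration, a minimal sketch of that situation (the model and tensor shapes here are hypothetical): the input already lives on the GPU, but the network's parameters stay on the CPU until .cuda() is called on the module.

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 2)          # hypothetical model, parameters still on the CPU
x = torch.randn(4, 10).cuda()   # the input is already a CUDA tensor

# Calling net(x) at this point would fail with a device mismatch,
# because the weights and the input do not live on the same device.
net = net.cuda()                # move the parameters to the GPU as well
out = net(x)                    # now both sides are on the same device
```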

In case I do have such buffers, what is the solution?

My 2 cents' worth: I'm used to creating lists of Conv objects, as I did in Keras, and storing them in a class attribute. It turns out this doesn't work with PyTorch and DataParallel; each module in the list has to be registered with add_module.
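
As an illustration, a hedged sketch of that pattern (the StackedConvs class, layer sizes, and names are made up): the layers are built in a plain Python list and then registered one by one with add_module so PyTorch actually sees them.

```python
import torch.nn as nn

class StackedConvs(nn.Module):
    """Hypothetical stack of conv layers, only to illustrate add_module."""
    def __init__(self, n_layers=3):
        super().__init__()
        self.n_layers = n_layers
        # A plain Python list of modules is invisible to PyTorch: the convs
        # would not appear in .parameters(), would not move with .cuda(),
        # and would not be replicated by DataParallel.
        convs = [nn.Conv2d(16, 16, 3, padding=1) for _ in range(n_layers)]
        # add_module registers each one as a proper submodule.
        for i, conv in enumerate(convs):
            self.add_module('conv{}'.format(i), conv)

    def forward(self, x):
        for i in range(self.n_layers):
            x = getattr(self, 'conv{}'.format(i))(x)
        return x
```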

@jarrelscy you can create an nn.ModuleList as part of the class and put all your Conv objects inside it. It behaves like a Python list, but it works with DataParallel and parameters().
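
A minimal sketch of the same idea with nn.ModuleList (again, the class name and shapes are hypothetical):

```python
import torch.nn as nn

class StackedConvs(nn.Module):
    """Same hypothetical stack, but with nn.ModuleList doing the registration."""
    def __init__(self, n_layers=3):
        super().__init__()
        # nn.ModuleList registers every element as a submodule, so the convs
        # are seen by .parameters(), .cuda(), and DataParallel.
        self.convs = nn.ModuleList(
            [nn.Conv2d(16, 16, 3, padding=1) for _ in range(n_layers)])

    def forward(self, x):
        for conv in self.convs:
            x = conv(x)
        return x
```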

Thanks! That is a great tip.

I am also curious what to do if my model has tensors and Variables stored on self.

Has this been answered somewhere? I mean the case of Variables defined on self.

For:
.cuda(device_id=gpus[0])

What do you mean by gpus? As written, it throws an undefined-name error.

gpus is a list of the GPU ids you want to use, e.g. gpus=[1, 3].
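
Putting the pieces together, a small sketch of how such a gpus list is typically used (the model, shapes, and device ids are made up; this assumes the machine exposes the listed GPUs):

```python
import torch
import torch.nn as nn

gpus = [1, 3]                                    # ids of the GPUs to use
model = nn.Linear(10, 2)                         # hypothetical model
model = model.cuda(gpus[0])                      # parameters go to the first listed GPU
model = nn.DataParallel(model, device_ids=gpus)  # replicas run on GPUs 1 and 3
x = torch.randn(8, 10).cuda(gpus[0])             # inputs should also sit on gpus[0]
out = model(x)                                   # DataParallel scatters x across device_ids
```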

I am also wondering if this issue is related to the problem I just posted here: [Solved] nn.DataParallel with ModuleList of custom modules fails on Multiple GPUs

Yes, in my case the tensors were placed on the GPU but the network was not moved to CUDA, so I got this error.

Hi, I also encountered this error. Did you find any solution? Thanks!

Hi, can you please help me with how to use nn.ModuleList to get around the following error: “RuntimeError: tensors are on different GPUs”?

Hi @smth, I am facing this issue. I have a custom layer that undoes the standardization of the output, i.e. it computes x*std+mean (the opposite of standardization). The tensors std and mean are class attributes of this layer, and an error is thrown when the input passes through it: ‘RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:1 and input b is on cuda:0’.
Do I need to copy both tensors to gpu:0, or is there another way?
Thanks.

Did you register mean and std as buffers using .register_buffer? That will move them to the right GPU when the module is wrapped in DataParallel; otherwise PyTorch wouldn't know that they have to be moved.

Reference: https://pytorch.org/docs/stable/nn.html#torch.nn.Module.register_buffer
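
As an illustration, a hedged sketch of a de-standardization layer like the one described above, with mean and std registered as buffers (the class name and the use of torch.as_tensor are my assumptions, not from the thread):

```python
import torch
import torch.nn as nn

class DeStandardize(nn.Module):
    """Hypothetical layer that undoes standardization: x * std + mean."""
    def __init__(self, mean, std):
        super().__init__()
        # register_buffer makes mean and std part of the module's state, so
        # they follow the module through .cuda()/.to() and are copied onto
        # each replica's device when the module is wrapped in nn.DataParallel.
        self.register_buffer('mean', torch.as_tensor(mean, dtype=torch.float32))
        self.register_buffer('std', torch.as_tensor(std, dtype=torch.float32))

    def forward(self, x):
        return x * self.std + self.mean
```

When this layer sits inside a model wrapped in nn.DataParallel, each replica receives its own copy of mean and std on its own GPU, so the multiply/add no longer mixes cuda:0 and cuda:1.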

I missed this. So this is the correct way to have variables in custom layers. Thanks @smth, it works.