DataParallel - runtime error: input and weights on different devices

Hello,

I’m trying to train a model on two GPUs using nn.DataParallel, but I get the following error:

RuntimeError: Expected tensor for argument #1 ‘input’ to have the same device as tensor for argument #2 ‘weight’; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

The network has an encoder-decoder architecture, with each part defined in its own module. The encoder is a resnet18, while the decoder is made of a few conv2d layers.

Here is the forward function of the encoder:

def forward(self, input_image):  # encoder
    self.features = []
    x = (input_image - 0.45) / 0.225
    x = self.encoder.conv1(x)
    x = self.encoder.bn1(x)
    self.features.append(self.encoder.relu(x))
    self.features.append(self.encoder.layer1(self.encoder.maxpool(self.features[-1])))
    self.features.append(self.encoder.layer2(self.features[-1]))
    self.features.append(self.encoder.layer3(self.features[-1]))
    self.features.append(self.encoder.layer4(self.features[-1]))
    return self.features

and the forward function of the decoder:

def forward(self, input_features):  # decoder
    self.outputs = {}
    x = input_features[-1]
    for i in range(4, -1, -1):
        x = self.convs[('pad', i, 0)](x)
        x = self.convs[('upconv', i, 0)](x)  # line causing the error in the decoder
        x = self.convs[('nonlin', i, 0)](x)
        x = [upsample(x)]

        if self.use_skips and i > 0:
            x += [input_features[i - 1]]
        x = torch.cat(x, 1)
        x = self.convs[('pad', i, 1)](x)
        x = self.convs[('upconv', i, 1)](x)
        x = self.convs[('nonlin', i, 1)](x)
        if i in self.scales:
            xx = self.convs[('pad', i, 2)](x)
            xx = self.convs[('dispconv', i)](xx)
            self.outputs[('disp', i)] = self.sigmoid(xx)
    return self.outputs

and here is how the two modules are created:

self.device = torch.device("cpu" if self.opt.no_cuda else "cuda:0")
...
self.models["encoder"] = networks.ResnetEncoder(
    self.opt.num_layers, self.opt.weights_init == "pretrained")

if torch.cuda.device_count() > 1:
    self.models["encoder"] = nn.DataParallel(self.models["encoder"], device_ids=[0,1])

    self.models["depth"] = networks.DepthDecoderParallel(
        self.models["encoder"].module.num_ch_enc, self.opt.scales)

    self.models["depth"] = nn.DataParallel(self.models["depth"], device_ids=[0,1])
else:
    self.models["depth"] = networks.DepthDecoder(
        self.models["encoder"].num_ch_enc, self.opt.scales)

self.models["encoder"].to(self.device)
self.models["depth"].to(self.device)

then, these two modules are used as follows:

features = self.models["encoder"](inputs["color_aug", 0, 0])
outputs = self.models["depth"](features)

The last line is where the problem occurs.

I believe this happens because the encoder returns a list rather than a tensor, so the decoder’s input and weights end up on different devices, but I don’t know how to solve it.


OK, I tried changing a few things in the DataParallel example, such as storing the fc layer in an OrderedDict or in a list, and the example stopped working. So I guess the problem is that the layers in the decoder from my previous post are stored in an OrderedDict.

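If you would like to store modules in a list, use nn.ModuleList, so that the modules are properly registered as submodules of the parent module. Only registered submodules are moved by .to() and replicated onto each GPU by nn.DataParallel.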
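
To illustrate the registration point, here is a minimal sketch. The class names PlainListDecoder and ModuleListDecoder are made up for this example and are not taken from the code in the question. Layers kept in a plain Python list do not show up in parameters(), while the same layers in an nn.ModuleList do:

```python
import torch
import torch.nn as nn


class PlainListDecoder(nn.Module):
    """Hypothetical decoder: layers in a plain list are NOT registered."""

    def __init__(self):
        super().__init__()
        self.convs = [nn.Conv2d(8, 8, kernel_size=3, padding=1) for _ in range(2)]

    def forward(self, x):
        for conv in self.convs:
            x = conv(x)
        return x


class ModuleListDecoder(nn.Module):
    """Hypothetical decoder: layers in nn.ModuleList ARE registered."""

    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(8, 8, kernel_size=3, padding=1) for _ in range(2)]
        )

    def forward(self, x):
        # Look the layers up through the registered container.
        for conv in self.convs:
            x = conv(x)
        return x


# Count the parameter tensors each module reports (weight + bias per conv).
num_plain = len(list(PlainListDecoder().parameters()))
num_registered = len(list(ModuleListDecoder().parameters()))
print(num_plain, num_registered)
```

As far as I understand, the forward pass also has to reach the layers through the registered container. A separate plain dict or list pointing at the same layers would still hand every replica the original copies on device 0.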

Thanks, nn.ModuleList solved the problem!
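
For a decoder that looks layers up by key, as in the question, nn.ModuleDict may be a closer fit than a list. It only accepts string keys, so tuple keys such as ('upconv', i, 0) have to be flattened first. A rough sketch, assuming a hypothetical layer_name helper, a much smaller decoder than the original, and nn.ELU as a stand-in for the 'nonlin' layers:

```python
import torch
import torch.nn as nn


def layer_name(kind, scale, index):
    """Hypothetical helper: flatten a tuple-style key into a string key."""
    return f"{kind}_{scale}_{index}"


class TinyKeyedDecoder(nn.Module):
    """Hypothetical two-scale decoder with layers registered in nn.ModuleDict."""

    def __init__(self, channels=8):
        super().__init__()
        self.convs = nn.ModuleDict()
        for i in range(1, -1, -1):
            self.convs[layer_name("upconv", i, 0)] = nn.Conv2d(
                channels, channels, kernel_size=3, padding=1
            )
            # nn.ELU is an assumption; the original 'nonlin' layer is not shown.
            self.convs[layer_name("nonlin", i, 0)] = nn.ELU()

    def forward(self, x):
        for i in range(1, -1, -1):
            x = self.convs[layer_name("upconv", i, 0)](x)
            x = self.convs[layer_name("nonlin", i, 0)](x)
        return x


decoder = TinyKeyedDecoder()
# Registered layers appear under their string keys.
print([name for name, _ in decoder.named_parameters()])
```

The string keys keep the lookup pattern from the original forward function, while the layers stay visible to .to() and nn.DataParallel.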