Hello,
I’m trying to train a model on two GPUs using nn.DataParallel, but I get the following error:
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
The network has an encoder-decoder architecture, with each part defined in its own module. The encoder is that of a ResNet-18, while the decoder is made of a few Conv2d layers.
Here is the forward function of the encoder:
def forward(self, input_image):  # encoder
    self.features = []
    x = (input_image - 0.45) / 0.225
    x = self.encoder.conv1(x)
    x = self.encoder.bn1(x)
    self.features.append(self.encoder.relu(x))
    self.features.append(self.encoder.layer1(self.encoder.maxpool(self.features[-1])))
    self.features.append(self.encoder.layer2(self.features[-1]))
    self.features.append(self.encoder.layer3(self.features[-1]))
    self.features.append(self.encoder.layer4(self.features[-1]))
    return self.features
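As an aside, I also wondered whether appending to self.features inside forward is safe once nn.DataParallel replicates the module. A sketch of the same forward using a local list instead (with toy stand-in layers, since I haven't pasted the full encoder class) would be:

```python
import torch
import torch.nn as nn

# Toy stand-in for the ResNet-18 encoder (layer names assumed to
# mirror mine: conv1, bn1, relu, maxpool, layer1).
class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(2)
        self.layer1 = nn.Conv2d(8, 8, 3, padding=1)

    def forward(self, input_image):
        # Local list instead of self.features, so replicated copies
        # under DataParallel don't write to shared module state.
        features = []
        x = (input_image - 0.45) / 0.225
        x = self.conv1(x)
        x = self.bn1(x)
        features.append(self.relu(x))
        features.append(self.layer1(self.maxpool(features[-1])))
        return features

feats = ToyEncoder()(torch.randn(2, 3, 32, 32))
```

This probably isn't the cause of the device mismatch, but it seemed worth cleaning up anyway.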
And the forward function of the decoder:
def forward(self, input_features):  # decoder
    self.outputs = {}
    x = input_features[-1]
    for i in range(4, -1, -1):
        x = self.convs[('pad', i, 0)](x)
        x = self.convs[('upconv', i, 0)](x)  # line causing the error in the decoder
        x = self.convs[('nonlin', i, 0)](x)
        x = [upsample(x)]
        if self.use_skips and i > 0:
            x += [input_features[i - 1]]
        x = torch.cat(x, 1)
        x = self.convs[('pad', i, 1)](x)
        x = self.convs[('upconv', i, 1)](x)
        x = self.convs[('nonlin', i, 1)](x)
        if i in self.scales:
            xx = self.convs[('pad', i, 2)](x)
            xx = self.convs[('dispconv', i)](xx)
            self.outputs[('disp', i)] = self.sigmoid(xx)
    return self.outputs
And here is how the two modules are created:
self.device = torch.device("cpu" if self.opt.no_cuda else "cuda:0")
...
self.models["encoder"] = networks.ResnetEncoder(
    self.opt.num_layers, self.opt.weights_init == "pretrained")
if torch.cuda.device_count() > 1:
    self.models["encoder"] = nn.DataParallel(self.models["encoder"], device_ids=[0, 1])
    self.models["depth"] = networks.DepthDecoderParallel(
        self.models["encoder"].module.num_ch_enc, self.opt.scales)
    self.models["depth"] = nn.DataParallel(self.models["depth"], device_ids=[0, 1])
else:
    self.models["depth"] = networks.DepthDecoder(
        self.models["encoder"].num_ch_enc, self.opt.scales)
self.models["encoder"].to(self.device)
self.models["depth"].to(self.device)
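To narrow things down, I used a small helper (my own, not part of PyTorch) to check which devices each sub-model's parameters live on:

```python
import torch
import torch.nn as nn

def param_devices(module: nn.Module):
    """Return the set of devices that a module's parameters live on."""
    return {str(p.device) for p in module.parameters()}

# In my trainer I call e.g. param_devices(self.models["depth"]);
# here is just a standalone example on a CPU-only box:
print(param_devices(nn.Linear(4, 2)))  # prints {'cpu'}
```

Both models report their parameters on cuda:0 before the forward pass, so the mismatch seems to happen only once DataParallel scatters the inputs.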
Then, these two modules are used as follows:
features = self.models["encoder"](inputs["color_aug", 0, 0])
outputs = self.models["depth"](features)
The last line is where the problem occurs.
I suspect this is because the encoder's output is a list of tensors rather than a single tensor, and that is why the decoder's input and weights end up on different devices, but I don't know how to solve this.
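One workaround I'm considering, though I haven't been able to verify it on a multi-GPU machine yet, is to wrap the encoder and decoder in a single module and apply nn.DataParallel once, so the feature list never has to pass between two separately wrapped modules. The layers below are toys just to check the wiring:

```python
import torch
import torch.nn as nn

# Hypothetical combined wrapper: one DataParallel over both parts,
# so intermediate features stay on the replica's device end to end.
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy encoder/decoder standing in for my real modules.
encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(8, 1, 3, padding=1)
model = EncoderDecoder(encoder, decoder)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

x = torch.randn(2, 3, 32, 32)
if next(model.parameters()).is_cuda:
    x = x.cuda()
out = model(x)
```

Would this be the right way to go, or is there a way to keep the two DataParallel wrappers separate?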