I get a device mismatch error when training on multiple GPUs with `nn.DataParallel`.
After some debugging, it looks like DataParallel isn't working with my submodules. The model is essentially InceptionNet v1 / GoogLeNet. My model design template is as below:
class Submodule(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.a = nn.Sequential(...)
        self.b = nn.Sequential(...)
        ...

    def forward(self, x):
        return torch.cat([self.a(x), ...])
class Model(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.conv = conv
        self.sublayer1 = Submodule(...)
        self.sublayer2 = Submodule(...)
        ...

    def forward(self, x):
        x = self.conv(x)
        x = self.sublayer1(x)
        x = self.sublayer2(x)
        ...
        return x
model = Model()
model = nn.DataParallel(model).cuda()
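For reference, here is a minimal runnable sketch of the same pattern (the layer sizes and branch contents are my assumptions, filled in just to make it executable). Submodules assigned as plain attributes, as in the template above, are registered with the parent module and should be replicated by `nn.DataParallel`; a common cause of device-mismatch errors is instead storing submodules in a plain Python list rather than `nn.ModuleList`, which hides them from `.cuda()` and from DataParallel's replication.

```python
import torch
import torch.nn as nn

class Submodule(nn.Module):
    # Two parallel branches concatenated along the channel dim,
    # as in an Inception block. Channel sizes are illustrative.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.a = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.ReLU())
        self.b = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.a(x), self.b(x)], dim=1)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        # Plain attributes: registered submodules, visible to DataParallel.
        self.sublayer1 = Submodule(8, 8)
        self.sublayer2 = Submodule(16, 8)

    def forward(self, x):
        x = self.conv(x)
        x = self.sublayer1(x)
        x = self.sublayer2(x)
        return x

model = Model()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()
elif torch.cuda.is_available():
    model = model.cuda()

# All parameters (conv and sublayer branches) end up on one device.
devices = {p.device for p in model.parameters()}
print(len(devices))  # 1
```

If this pattern works but the real model doesn't, the difference between them (e.g. a list of submodules, or a tensor created inside `forward` without `.to(x.device)`) is usually where the mismatch comes from.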
DataParallel works with the conv layers but not with the sublayers. It would be great if anyone could give me a pointer on how to debug this; I couldn't find anything on this in previous PyTorch forum threads.