I get a device mismatch error when training on multiple GPUs with `nn.DataParallel`.
After some debugging, it looks like DataParallel isn't working with my submodules. The model is essentially InceptionNet v1 / GoogLeNet. My model design template is as below:
class Submodule(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.a = nn.Sequential(...)
        self.b = nn.Sequential(...)
        ...

    def forward(self, x):
        return torch.cat([self.a(x), ...])
class Model(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.conv = conv
        self.sublayer1 = Submodule(...)
        self.sublayer2 = Submodule(...)
        ...

    def forward(self, x):
        x = self.conv(x)
        x = self.sublayer1(x)
        x = self.sublayer2(x)
        ...
        return x
model = Model()
model = nn.DataParallel(model).cuda()
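For reference, here is a minimal runnable sketch of the same pattern (the layer sizes and branch contents are my assumptions, filled in just to make it executable). Submodules assigned as plain attributes, as in the template above, are registered with the parent module and should be replicated by `nn.DataParallel`; a common cause of device-mismatch errors is instead storing submodules in a plain Python list rather than `nn.ModuleList`, which hides them from `.cuda()` and from DataParallel's replication.

```python
import torch
import torch.nn as nn

class Submodule(nn.Module):
    # Two parallel branches concatenated along the channel dim,
    # as in an Inception block. Channel sizes are illustrative.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.a = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.ReLU())
        self.b = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.a(x), self.b(x)], dim=1)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        # Plain attributes: registered submodules, visible to DataParallel.
        self.sublayer1 = Submodule(8, 8)
        self.sublayer2 = Submodule(16, 8)

    def forward(self, x):
        x = self.conv(x)
        x = self.sublayer1(x)
        x = self.sublayer2(x)
        return x

model = Model()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()
elif torch.cuda.is_available():
    model = model.cuda()

# All parameters (conv and sublayer branches) end up on one device.
devices = {p.device for p in model.parameters()}
print(len(devices))  # 1
```

If this pattern works but the real model doesn't, the difference between them (e.g. a list of submodules, or a tensor created inside `forward` without `.to(x.device)`) is usually where the mismatch comes from.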
DataParallel works with the conv layers but not with the sublayers. It would be great if anyone could give me a pointer on how to debug this; I couldn't find anything on this in previous PyTorch forum threads.