DataParallel and network with custom forward function

I’m working on retraining CLIP on several new tasks.
I’m still new to DataParallel (I’m not using DDP since DP seems simpler to implement).
This question is related to this thread.

Since I want to use data parallel on non-forward methods of the network, I decided to write this:

class CustomModel1(nn.Module):
    def __init__(self, model):
        super(CustomModel1, self).__init__()
        self.model = model

    def forward(self, x, y):
        output1 = self.model.custom_methodA(x)
        output2 = self.model.custom_methodA(y)
        # <do_something>  (produces result1, result2)
        return result1, result2

class CustomModel2(nn.Module):
    def __init__(self, model):
        super(CustomModel2, self).__init__()
        self.model = model

    def forward(self, x, y):
        output1 = self.model.custom_methodB(x)
        output2 = self.model.custom_methodB(y)
        # <do_something>  (produces result1, result2)
        return result1, result2

model = CLIP()
model1 = CustomModel1(model)
model2 = CustomModel2(model)

model1 = nn.DataParallel(model1)
model2 = nn.DataParallel(model2)

optimizer = torch.optim.Adam(model.parameters())
loss1 = nn.BCELoss()
loss2 = nn.BCELoss()

for batch in dataloader:
    a, b, c, d, y = batch
    result1 = model1(a, b)
    result2 = model2(c, d)
    loss = loss1(result1, y) + loss2(result2, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

This code raises RuntimeError: CUDA error: device-side assert triggered at loss.backward().
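For reference, a minimal range check (only a sketch, assuming result1, result2, and y are the tensors from the loop above) that could be dropped in right before computing the loss, since nn.BCELoss requires its inputs and targets to lie in [0, 1]:

with torch.no_grad():
    for name, t in [("result1", result1), ("result2", result2), ("y", y)]:
        # print shape and value range; anything outside [0, 1] (e.g. raw logits)
        # would be one possible explanation for a device-side assert
        print(name, tuple(t.shape), t.min().item(), t.max().item())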

Which step in my code is wrong? I feel something is off but can’t figure out how to handle it the right way. For example, is the computational graph of the CLIP model inside CustomModel1 and CustomModel2 still connected to the original model? Should my optimizer be built from model.parameters(), or should I create separate optimizers for model1 and model2?
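For context, since both wrappers hold the same CLIP instance, I would expect the parameters to be shared rather than copied; a quick check along these lines (just a sketch, reusing the names from the code above) is how I understand it could be verified:

clip_params = {id(p) for p in model.parameters()}
wrapped1 = {id(p) for p in model1.module.parameters()}  # .module unwraps DataParallel
wrapped2 = {id(p) for p in model2.module.parameters()}
# If the sets match, model1 and model2 reuse the original CLIP parameters,
# so a single optimizer over model.parameters() would cover both.
print(clip_params == wrapped1 == wrapped2)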

Thanks

Could you rerun the script via CUDA_LAUNCH_BLOCKING=1 python script.py args and post the complete stack trace here, please?
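In case it helps, the same setting can be applied from inside the script (a sketch; CUDA_LAUNCH_BLOCKING has to be set before anything touches the GPU, so before importing torch is the safe place):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous

import torch  # must come after setting the variable

With blocking launches the stack trace points at the operation that actually triggers the device-side assert instead of a later call such as loss.backward().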