I’m working on retraining CLIP on several new tasks.
I’m still new to DataParallel (I’m not using DDP since DP seems simpler to implement).
This question is related to this thread.
Since I want to use data parallel on non-forward methods of the network, I decided to write this:
import torch
import torch.nn as nn

class CustomModel1(nn.Module):
    def __init__(self, model):
        super(CustomModel1, self).__init__()
        self.model = model

    def forward(self, x, y):
        output1 = self.model.custom_methodA(x)
        output2 = self.model.custom_methodA(y)
        # <do_something>
        return result1, result2

class CustomModel2(nn.Module):
    def __init__(self, model):
        super(CustomModel2, self).__init__()
        self.model = model

    def forward(self, x, y):
        output1 = self.model.custom_methodB(x)
        output2 = self.model.custom_methodB(y)
        # <do_something>
        return result1, result2

model = CLIP()
model1 = CustomModel1(model)
model2 = CustomModel2(model)
model1 = nn.DataParallel(model1)
model2 = nn.DataParallel(model2)

optimizer = torch.optim.Adam(model.parameters())
loss1 = nn.BCELoss()
loss2 = nn.BCELoss()

for batch in dataloader:
    a, b, c, d, y = batch
    result1 = model1(a, b)
    result2 = model2(c, d)
    loss = loss1(result1, y) + loss2(result2, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
This code gives RuntimeError: CUDA error: device-side assert triggered at loss.backward().
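In case it helps, this is how I plan to narrow down the assert (just a debugging sketch; train.py is a placeholder for my script name). Setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the traceback points at the op that actually fails, and a CPU run usually turns the device-side assert into a readable Python error:

# Debugging sketch: get a more precise error location for the device-side assert
#   CUDA_LAUNCH_BLOCKING=1 python train.py   # synchronous launches -> accurate traceback
# Or run one batch on CPU, where the same problem usually raises a normal Python error:
model_cpu = CustomModel1(CLIP())             # no .cuda(), no DataParallel
a, b, c, d, y = next(iter(dataloader))
result1 = model_cpu(a, b)                    # should fail with a readable message if something is off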
Which step in my code is wrong? I feel something is off but can’t figure out how to handle it the right way. For example, is the computational graph of the CLIP model inside CustomModel1 and CustomModel2 still connected to the original model? Should my optimizer be tied to model, or should I create separate optimizers for model1 and model2?
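To make the “still connected” part concrete, this is the kind of check I have in mind (a minimal sketch; it only verifies that the wrappers reference the exact same parameter tensors as the original CLIP instance):

# Sketch: verify the wrappers hold the same parameter tensors as `model`
base = dict(model.named_parameters())
wrapped = dict(model1.module.model.named_parameters())   # .module unwraps DataParallel
same = all(base[n].data_ptr() == p.data_ptr() for n, p in wrapped.items())
print("model1 shares parameters with model:", same)      # expect True, since self.model is a reference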
Thanks