Using intermediate model outputs in a loss function combining multiple models

Until now I have been working with TensorFlow, but for various reasons I want to port the code to PyTorch. I am working on this problem (tensorflow - Keras multioutput custom loss with intermediate layers output - Stack Overflow) and I don't know whether the code I have written in PyTorch does what I actually want it to do, since the loss is stuck from the beginning. I have tried to replicate the code this way:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.backends.cudnn as cudnn
from torch.autograd import Variable

submodel2 = submodel2()
submodel2.load_state_dict(torch.load('pretrained_submodel2.pt'))

for param in submodel2.parameters():
    param.requires_grad = False


submodel3 = submodel3()
submodel3.load_state_dict(torch.load('pretrained_submodel3.pt'))
    
for param in submodel3.parameters():
    param.requires_grad = False
    
    
all_model = all_model(submodel2, submodel3)  # submodel1 (with trainable weights) + submodel2 and submodel3 with frozen weights
criterion1 = nn.SomeLoss()
criterion2 = nn.SomeLoss()
criterion3 = nn.SomeLoss()

optimizer = optim.Adam(all_model.parameters(), lr=0.001)
cudnn.benchmark = True        
all_model = all_model.cuda()

for epoch in range(2500):  # loop over the dataset multiple times

    for i, data in enumerate(trainloader, 0):
  
        input1, input2, label1, label2 = data['input1'], data['input2'], data['label1'], data['label2']

        input1 = Variable(input1.cuda().type(torch.cuda.FloatTensor))
        input2 = Variable(input2.cuda().type(torch.cuda.FloatTensor))
        label1 = Variable(label1.cuda().type(torch.cuda.FloatTensor))
        label2 = Variable(label2.cuda().type(torch.cuda.FloatTensor))

        # zero the parameter gradients
        optimizer.zero_grad()
        
        # forward + backward + optimize
        outputs = all_model(input1,input2)
        output1, output2, output3 = outputs['output1'], outputs['output2'], outputs['output3']

        loss1 = criterion1(input1.float(), output1.float().detach())
        loss2 = criterion2(input2.float(), output2.float().flatten().detach())
        loss3 = criterion3(output2.float(), output2.float().detach())

        loss = loss1+loss2+loss3

        loss.backward()
   
        optimizer.step()

But it doesn't seem to work well. I have also tried calling backward separately for each loss, like this:

loss1.backward()
loss2.backward()
loss3.backward()
optimizer.step()

But I get this error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Based on the code snippet it seems that loss1 and loss2 would be constants, since you are detaching the output tensors, and thus will not be used in the gradient calculation (they will also raise the error in your second approach).
I'm not familiar with your use case, so could you explain why the detaching is used?
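
As a minimal, standalone sketch (a toy nn.Linear model and nn.MSELoss, not your actual setup) of why a loss built only from detached tensors cannot backpropagate:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
out = model(torch.randn(4, 3))
target = torch.zeros_like(out)

# Detaching removes the output from the graph, so the loss is a constant
# w.r.t. the model parameters and backward() fails:
loss_detached = nn.MSELoss()(target, out.detach())
print(loss_detached.requires_grad)  # False
# loss_detached.backward()  # RuntimeError: element 0 of tensors does not
#                           # require grad and does not have a grad_fn

# Without detach the loss stays connected to the graph:
loss = nn.MSELoss()(out, target)
loss.backward()  # works; model.weight.grad is now populated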

Hello, thank you for replying. I am using detach because I was getting this error:

RuntimeError: the derivative for 'target' is not implemented

While looking for a solution I found detach(). As you say, loss1 and loss2 are constants, and only loss3 starts to go down.

Usually loss functions expect the model output as the first argument and the target as the second.
I don't know which loss functions you are using, but you might want to swap the order of the arguments.
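
For example, assuming your criteria are standard losses such as nn.MSELoss and keeping the variable names from your snippet, the first two calls would become something like this (no detach needed once the model output comes first):

loss1 = criterion1(output1.float(), input1.float())
loss2 = criterion2(output2.float().flatten(), input2.float())

loss3 would follow the same pattern: the corresponding model output first, its target second.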

Ohh thank you, it's working now! Thank you very much!

And another question: are the first and the second approach the same? That is, is it the same to call loss.backward() as to call loss1.backward(), loss2.backward() and loss3.backward()?

The same gradients would be accumulated, but the combined call should be faster.
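
A quick self-contained sketch (a toy model, not your training loop) showing that both approaches accumulate the same gradients; note that when the losses share one graph, every separate backward call except the last needs retain_graph=True:

import torch
import torch.nn as nn

def param_grads(separate):
    torch.manual_seed(0)                   # same init and data in both runs
    model = nn.Linear(3, 1)
    out = model(torch.randn(4, 3))
    loss1 = out.mean()
    loss2 = out.pow(2).mean()
    if separate:
        loss1.backward(retain_graph=True)  # gradients accumulate in .grad
        loss2.backward()
    else:
        (loss1 + loss2).backward()         # single backward over the sum
    return model.weight.grad.clone()

print(torch.allclose(param_grads(False), param_grads(True)))  # True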
