Multi-model training and using the loss of one model to guide the other

Hello,

I have 2 models, with two separate loss functions (loss1 and loss2) and optimizers. The output of one model (a linear layer) is used by the other as an input. Based on some previous questions asked here, I was able to get this model training by repackaging the output of the first model as a Variable :

op1 = model1(data1)
op2 = Variable(op1.data)
op3 = model2(data2, op2)

loss1 = criterion1(op1, target1)
loss1.backward()
optimizer1.step()

loss2 = criterion2(op3, target2)
loss2.backward()
optimizer2.step()

Q. This seems to work, but how do I know whether each optimizer is taking the correct set of gradients? I assume it is, since each optimizer is associated with its own set of parameters (none shared), so the gradients it applies should correspond to exactly those parameters.
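For context, the two optimizers are built over disjoint parameter sets, along these lines (the optimizer type and learning rate here are just placeholders):

import torch.optim as optim

# each optimizer only ever sees its own model's parameters,
# so a step() on one can never touch the other model
optimizer1 = optim.SGD(model1.parameters(), lr=0.01)
optimizer2 = optim.SGD(model2.parameters(), lr=0.01)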

Q. I am now trying to use loss2 to guide the optimization of model1. Does this make sense?

loss1 = criterion1(op1, target1)
loss1.backward()
loss2 = criterion2(op3, target2)
loss2 = loss2 + lam * loss1
loss2.backward()

This gives me an error telling me to use retain_variables=True in my first backward() call. When I do that, training does proceed.

Q. What is retain_variables doing? Is this the correct way to do what I want (i.e., guide the parameter updates of one model using the loss of the other), or does this just not make sense?

Thanks,
Gautam

In case 2, you don't need to call loss1.backward(). Because loss2 = loss2 + lam * loss1, calling loss2.backward() also backpropagates through loss1.

You can use retain_variables=True, which keeps all the intermediate buffers around so that loss1's backward can run twice, but in your case that will lead to double gradients.
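Concretely, without the extra loss1.backward() call, case 2 becomes something like this (a sketch using your variable names; lam stands for the weighting factor, since lambda is a reserved word in Python, and the zero_grad calls are added for completeness):

op1 = model1(data1)
op2 = Variable(op1.data)   # detached copy of model1's output
op3 = model2(data2, op2)

optimizer1.zero_grad()
optimizer2.zero_grad()

loss1 = criterion1(op1, target1)
loss2 = criterion2(op3, target2) + lam * loss1

# a single backward call: the criterion2 term backpropagates into model2,
# and the lam * loss1 term backpropagates into model1
loss2.backward()

optimizer1.step()
optimizer2.step()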

Hi @smth, thanks for the reply.

I have been seeing some strange behavior (mostly good).
I am calling optimizer1.step() after loss1.backward() and optimizer2.step() after loss2.backward(), so the double gradient shouldn't affect anything, since I have already done the associated optimizer step by then. Is that correct?
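To be explicit, my case-2 loop currently looks roughly like this (illustrative, using the same names as above):

loss1 = criterion1(op1, target1)
loss1.backward(retain_variables=True)  # first backward through model1
optimizer1.step()                      # model1 updated with grad(loss1)

loss2 = criterion2(op3, target2) + lam * loss1
loss2.backward()                       # adds lam * grad(loss1) to model1's grads again
optimizer2.step()                      # only model2's parameters are updated here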

Since loss1 and loss2 come from different models (no shared parameters), does adding lam * loss1 actually make any difference to loss2? I was starting to think it doesn't, but the two setups (case 1 and case 2) give quite different results.

Thanks,
Gautam