Optimizing based on another model's output

TrevP_1 · August 31, 2017, 2:58pm

Hi, sorry, I’m new to PyTorch: if I have Model1 producing some output, which is fed into Model2 (which is pre-trained), is there a simple way to optimize Model1’s weights based on Model2’s outputs? I could of course just backprop through Model2 into Model1, but I want to assume Model2 is not accessible to Model1 other than through its outputs.

If I do:

for p in Model2.parameters():
    p.requires_grad = False

Does that have the desired effect, i.e. that Model2’s weights and gradients are not accessible to Model1 during backprop?

albanD · August 31, 2017, 3:06pm

Hi,

If I understand properly, you have this:

input = ...  # your input tensor
out1 = Model1(input)
out2 = Model2(out1)
loss = LossFunc(out2)

If you want to optimize the parameters of Model1, you can just use loss.backward() and create your optimizer to only update the first model with optimizer = torch.optim.SGD(Model1.parameters(), other_args). That way, Model1 will be updated but not Model2.

Note that whatever you’re gonna do, to be able to get gradients for Model1, you will have to backpropagate through Model2 (this is how backprop works).
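To make the setup above concrete, here is a minimal sketch. The two `nn.Linear` layers, the random data, and the MSE loss are hypothetical stand-ins for Model1, Model2, and LossFunc; only the pattern (freeze Model2, give the optimizer Model1's parameters only, call `loss.backward()`) comes from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: model1 is trainable, model2 plays the pre-trained model.
model1 = nn.Linear(4, 3)
model2 = nn.Linear(3, 2)

# Freeze Model2 so no gradients are stored for its weights.
for p in model2.parameters():
    p.requires_grad = False

# The optimizer only ever sees Model1's parameters.
optimizer = torch.optim.SGD(model1.parameters(), lr=0.1)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

w1_before = model1.weight.detach().clone()
w2_before = model2.weight.detach().clone()

optimizer.zero_grad()
out1 = model1(x)
out2 = model2(out1)  # gradients still flow through Model2's operations
loss = nn.functional.mse_loss(out2, target)
loss.backward()
optimizer.step()

model1_changed = not torch.equal(model1.weight, w1_before)
model2_changed = not torch.equal(model2.weight, w2_before)
model2_has_grads = any(p.grad is not None for p in model2.parameters())
```

Note that freezing Model2 does not hide it from autograd: the backward pass still runs through its operations to reach Model1, it just does not store gradients for Model2's weights.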

TrevP_1 · August 31, 2017, 3:12pm

Thanks very much! Are there any gradient estimation techniques I could use then, if I assume I don’t have access to Model2’s internals?

albanD · August 31, 2017, 3:15pm

If Model2 is completely unknown, then you have to start using “Black Box Optimization”, which is a branch of optimization that studies functions for which you have absolutely no information (you can only evaluate them). Unfortunately, the quality of these approaches is worse (as expected) than for problems where you have information about the internals of your function.

TrevP_1 · August 31, 2017, 3:38pm

Fantastic thank you, last question - is there an easy method to pass this gradient approximation to Model1? i.e. combine the approximated gradient of Model2 and then continue with real backprop for Model1?

albanD · August 31, 2017, 3:43pm

These methods are not really implemented in PyTorch, so there is no built-in way.
If you have a black-box optimization method that gives you d(loss) / d(out1), i.e. how your loss varies wrt the input of Model2, then you can simply do:

grad = BBOpt(out1, out2, loss, ...)
optimizer.zero_grad()
out1.backward(grad)
optimizer.step()

Basically, you directly specify the gradients that correspond to out1.
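As an illustration of this pattern, the sketch below uses central finite differences as one possible (slow but simple) stand-in for `BBOpt`. The functions `black_box_loss` and `fd_grad`, the layer sizes, and the use of float64 are all assumptions made for the example; the only part taken from the thread is feeding the estimated gradient into `out1.backward(grad)`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical Model1; float64 keeps the finite-difference estimate accurate.
model1 = nn.Linear(4, 3).double()

def black_box_loss(out1: torch.Tensor) -> float:
    """Hypothetical stand-in for Model2 + loss: we may only evaluate it."""
    with torch.no_grad():
        return float((out1.sin().sum(dim=1) ** 2).mean())

def fd_grad(f, point: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Central finite-difference estimate of d f / d point, one entry at a time."""
    base = point.detach().clone()
    grad = torch.zeros_like(base)
    flat_base = base.view(-1)   # views share storage with base / grad
    flat_grad = grad.view(-1)
    for i in range(flat_base.numel()):
        orig = flat_base[i].item()
        flat_base[i] = orig + eps
        f_plus = f(base)
        flat_base[i] = orig - eps
        f_minus = f(base)
        flat_base[i] = orig     # restore the entry before moving on
        flat_grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad

optimizer = torch.optim.SGD(model1.parameters(), lr=0.1)
x = torch.randn(5, 4, dtype=torch.float64)

out1 = model1(x)
grad = fd_grad(black_box_loss, out1)  # estimate of d(loss)/d(out1)

optimizer.zero_grad()
out1.backward(grad)                   # real backprop, but only through Model1
optimizer.step()
```

This costs two evaluations of the black box per entry of out1, so it only scales to small outputs; other black-box estimators trade accuracy for fewer evaluations.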

zakaria_laskar · September 1, 2017, 8:36pm

Hi, just a follow-up question to your reply. If one wants to update both models, does one need to create two optimizers, one for each model, and then call opt1.step() and opt2.step() after loss.backward()?

albanD · September 4, 2017, 9:28am

You can do that or create an optimizer that works with both models:
full_opt = torch.optim.SGD(list(Model1.parameters())+list(Model2.parameters()), other_args) and then you can just call full_opt.step() to update both models.

jeffreycordero · November 16, 2018, 2:25am

Hello! So, mostly related: I am attempting to set up mutual learning with PyTorch and am unclear about how. I have two models whose loss values are related (each affects the loss value of the other model, but they are not the same), and each model needs to be back-propagated with its corresponding loss value.

Is there a way to backprop each loss value to its corresponding model?

Deeply · May 21, 2019, 2:20pm

You really saved my day!
Just one question: would the + sign between the two models’ parameters work, or will I have to use itertools.chain, as follows?

full_opt = torch.optim.Adam( itertools.chain( model1.parameters(), model2.parameters(), model3.parameters() ), other_args )

Thanks

albanD · May 29, 2019, 3:16pm

You can chain the iterators to be clean, or be lazy and just create lists that you can then add up.
Both will work!
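Both options can be sketched side by side. The layer sizes and learning rate here are placeholders. One detail worth keeping in mind: the `+` only works after wrapping each `parameters()` call in `list(...)`, because `parameters()` returns a generator and generators cannot be added.

```python
import itertools

import torch
import torch.nn as nn

# Hypothetical small models standing in for Model1 and Model2.
model1 = nn.Linear(4, 3)
model2 = nn.Linear(3, 2)

# Option 1: build lists and concatenate them.
opt_list = torch.optim.SGD(
    list(model1.parameters()) + list(model2.parameters()), lr=0.1
)

# Option 2: chain the parameter iterators without building the lists yourself.
opt_chain = torch.optim.SGD(
    itertools.chain(model1.parameters(), model2.parameters()), lr=0.1
)

# Each optimizer ends up with one parameter group holding all four tensors
# (a weight and a bias per model).
n_list = len(opt_list.param_groups[0]["params"])
n_chain = len(opt_chain.param_groups[0]["params"])
```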

YoonYeong · January 9, 2020, 12:33pm

Hi, I have a related question regarding this problem. If I want to update the parameters of model1 while keeping model2 fixed, do I have to set requires_grad=False for the params in model2.parameters()? Actually, I guess that setting the optimizer to torch.optim.SGD(model1.parameters()) will not change model2’s parameters. I’m so confused.

albanD · January 9, 2020, 2:48pm

Both will work. But there are subtle differences.
The short answer is to do both: set requires_grad to False and do not give these parameters to the optimizer.

If you only set the requires_grad field but still give both models to the optimizer, the weights might still be updated even though the gradient is 0, due to L2 regularization or momentum.
If you only leave them out of the optimizer, then you might be doing extra computation to compute gradients for Tensors that don’t need them.
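The weight-decay effect can be checked with a small experiment. This is a sketch with a hypothetical single layer and plain SGD; it shows that a gradient of exactly zero still moves the weights when `weight_decay` is set, while a gradient of `None` makes the optimizer skip the parameter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layer standing in for the model that should stay fixed.
layer = nn.Linear(3, 2)
opt = torch.optim.SGD(layer.parameters(), lr=0.1, weight_decay=0.1)

# Case 1: gradients exist but are exactly zero.
for p in layer.parameters():
    p.grad = torch.zeros_like(p)
before = layer.weight.detach().clone()
opt.step()
moved_with_zero_grad = not torch.equal(layer.weight, before)

# Case 2: gradients are None, as for parameters that never received one.
for p in layer.parameters():
    p.grad = None
before = layer.weight.detach().clone()
opt.step()
moved_with_none_grad = not torch.equal(layer.weight, before)
```

Leaving the frozen parameters out of the optimizer avoids having to reason about which of the two cases applies.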

Qu_Yukun · June 2, 2020, 7:48am

Hi, I have a related question regarding this problem. Could you help me?
I have two models, and I would like to use the sum of their losses to update both of them.

The code is as follows:

spat_out = spat_model(spat_data)
temp_out = temp_model(temp_data)

spat_loss = spat_criterion(spat_out, labels)
temp_loss = temp_criterion(temp_out, labels)

loss = spat_loss + temp_loss
loss.backward()

spat_optimizer.step()
temp_optimizer.step()

Which model’s gradients will loss.backward() compute: only the last one’s, or both?

albanD · June 2, 2020, 2:14pm

If you don’t use .detach() or torch.no_grad(), it will backprop on everything you used. So here both.
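A quick way to see the effect of `.detach()` is to chain two models and inspect the `.grad` fields. The models and input below are hypothetical; this is only a sketch of the rule stated above, not code from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical models: model_a feeds into model_b.
model_a = nn.Linear(4, 3)
model_b = nn.Linear(3, 2)
x = torch.randn(6, 4)

# Without detach: the loss on model_b's output backpropagates into model_a too.
model_b(model_a(x)).sum().backward()
a_grad_connected = model_a.weight.grad is not None

# Reset all gradients to None before the second experiment.
for p in list(model_a.parameters()) + list(model_b.parameters()):
    p.grad = None

# With detach: model_b still gets gradients, model_a gets none from this loss.
model_b(model_a(x).detach()).sum().backward()
a_grad_detached = model_a.weight.grad is None
b_grad_detached = model_b.weight.grad is not None
```

When two losses are summed, each parameter receives the gradient of every loss term it is still connected to, so detaching the input of the second model leaves the first model with the gradient of its own loss only.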

arinaruck · October 15, 2020, 8:30am

Hi!
I am trying to train two models in stages:

model1 = Model1()
model2 = Model2()
optimizer1 = Adam(model1.parameters())
optimizer2 = Adam(model2.parameters())
for epoch in range(epochs):
    for stage in range(stages):
        model2.eval()
        model1.train()
        for step in range(steps1):
            ....
            pred = model1()
            ....
            loss1.backward() # depends on model1 only
            optimizer1.step()
            optimizer1.zero_grad()
        model2.train()
        for step in range(steps2):
            ....
            with torch.no_grad():
                encoded = model1.encoder(inp)
            projection = model1.proj(encoded)
            pred = model2(projection)
            ....
            loss2.backward()
            optimizer1.step()
            optimizer2.step()
            optimizer1.zero_grad()
            optimizer2.zero_grad()

And I am getting a CUDA error: an illegal memory access was encountered (embedding_dense_backward_cuda at /pytorch/aten/src/ATen/native/cuda/Embedding.cu:267)
(model1 is a BERT)
Is this an incorrect way to do this? How should I approach updating the model’s weights in this setting?

ptrblck · October 15, 2020, 10:50am

If you are using PyTorch 1.5, update to 1.6 or the nightly binaries, as assert statements were not properly working in 1.5.
Based on the illegal memory access in the embedding layer, I guess you are passing an input containing invalid indices to this layer. Instead of this memory violation you should get a proper indexing error.
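One way to catch this kind of problem before it reaches the GPU is to validate the indices explicitly. The helper `check_indices` below is hypothetical (it is not a PyTorch function), and the embedding size is a placeholder; running the same batch on the CPU is another common way to get a readable error.

```python
import torch
import torch.nn as nn

def check_indices(ids: torch.Tensor, embedding: nn.Embedding) -> None:
    """Hypothetical helper: raise a readable error if any index is out of range."""
    lo, hi = int(ids.min()), int(ids.max())
    if lo < 0 or hi >= embedding.num_embeddings:
        raise ValueError(
            f"index range [{lo}, {hi}] is outside "
            f"[0, {embedding.num_embeddings - 1}]"
        )

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
good = torch.tensor([[0, 3, 9]])
bad = torch.tensor([[0, 3, 10]])  # 10 is one past the last valid index

check_indices(good, emb)  # passes silently
out = emb(good)

try:
    check_indices(bad, emb)
    bad_detected = False
except ValueError:
    bad_detected = True
```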

arinaruck · October 15, 2020, 5:18pm

Thanks a lot! That actually helped me trace the error.
Indeed, one of the asserts in the loss function started failing as soon as I updated the torch version.

Hongjun_Choi · April 12, 2021, 11:54pm

As an additional question for the above one, let’s suppose that we have this.

spat_out = spat_model(spat_data)
temp_out = temp_model(spat_out.detach())

spat_loss = spat_criterion(spat_out, labels)
temp_loss = temp_criterion(temp_out, labels)

loss = spat_loss + temp_loss
loss.backward()

spat_optimizer.step()
temp_optimizer.step()

Basically, the output of the first model becomes the input for the second model. If I create the input for the second model using ‘spat_out.detach()’ to treat it as an independent input, will this backprop on the first model and the second model independently, i.e. the second model’s gradient does not flow into the first model?