Optimizing based on another model's output

Hi, sorry, I’m new to PyTorch: if I have Model1 producing some output, which is fed into Model2 (which is pre-trained), is there a simple way to optimize Model1’s weights based on Model2’s outputs? I could of course just backprop through Model2 into Model1, but I want to assume Model2 is not accessible to Model1 other than its outputs.

If I do:

for p in Model2.parameters():
    p.requires_grad = False

Does that have the desired effect, i.e. that Model2’s weights and gradients are not accessible to Model1 during backprop?



If I understand properly, you have this:

input = ...  # your input batch
out1 = Model1(input)
out2 = Model2(out1)
loss = LossFunc(out2)

If you want to optimize the parameters of Model1, you can just use loss.backward() and create your optimizer to only update the first model with optimizer = torch.optim.SGD(Model1.parameters(), other_args). That way, Model1 will be updated but not Model2.

Note that whatever you’re gonna do, to be able to get gradients for Model1, you will have to backpropagate through Model2 (this is how backprop works).
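
A minimal sketch of this setup, assuming two small nn.Linear layers as hypothetical stand-ins for Model1 and Model2 and a placeholder squared-output loss (none of these come from the original post). It freezes Model2, gives only Model1's parameters to the optimizer, and checks that one step changes Model1 but leaves Model2 untouched:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: Model1 is trainable, Model2 plays the pre-trained model.
model1 = nn.Linear(4, 3)
model2 = nn.Linear(3, 1)

# Freeze Model2 so autograd does not accumulate gradients for its parameters.
for p in model2.parameters():
    p.requires_grad = False

# The optimizer only ever sees Model1's parameters.
optimizer = torch.optim.SGD(model1.parameters(), lr=0.1)

model1_before = [p.detach().clone() for p in model1.parameters()]
model2_before = [p.detach().clone() for p in model2.parameters()]

inp = torch.randn(8, 4)
out1 = model1(inp)
out2 = model2(out1)
loss = out2.pow(2).mean()  # placeholder for LossFunc(out2)

optimizer.zero_grad()
loss.backward()   # gradients still flow *through* Model2 to reach Model1
optimizer.step()

model1_changed = any(
    not torch.equal(a, b) for a, b in zip(model1_before, model1.parameters())
)
model2_unchanged = all(
    torch.equal(a, b) for a, b in zip(model2_before, model2.parameters())
)
model2_has_no_grads = all(p.grad is None for p in model2.parameters())
```

Freezing Model2 only stops gradients from being stored for its weights; the backward pass still uses those weights to compute the gradient with respect to out1.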


Thanks very much! Are there any gradient estimation techniques I could use then, if I assume I don’t have access to Model2’s internals?

If Model2 is completely unknown, then you have to start using “Black Box Optimization”, a branch of optimization that studies functions for which you have absolutely no information (you can only evaluate them). Unfortunately, the quality of these approaches is worse (as expected) than for problems where you have information about the internals of your function.

Fantastic thank you, last question - is there an easy method to pass this gradient approximation to Model1? i.e. combine the approximated gradient of Model2 and then continue with real backprop for Model1?

These methods are not really implemented in pytorch so there is no builtin way.
If you have a black-box optimization method that gives you d(loss) / d(out1), i.e. how your loss varies with respect to the input of Model2, then you can simply do:

grad = BBOpt(out1, out2, loss, ...)
out1.backward(grad)

Basically, you directly specify the gradients that correspond to out1.
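
As a rough illustration, here is a sketch where a central finite-difference estimate plays the role of the black-box method. The helper names black_box_loss and fd_grad, the nn.Linear stand-ins, and the squared-output loss are all assumptions made for this example, and finite differences are just one simple choice of estimator. Model2 is only ever evaluated under torch.no_grad(), and the estimate is then handed to out1.backward(...) so that ordinary backprop continues through Model1:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins; double precision keeps the finite differences accurate.
model1 = nn.Linear(4, 3).double()
model2 = nn.Linear(3, 1).double()  # treated as a black box: evaluated, never differentiated

def black_box_loss(x):
    """Evaluate Model2 and a placeholder loss without recording an autograd graph."""
    with torch.no_grad():
        return model2(x).pow(2).mean()

def fd_grad(x, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(x), one element at a time."""
    x = x.detach().clone()
    grad = torch.zeros_like(x)
    flat_x, flat_g = x.view(-1), grad.view(-1)
    for i in range(flat_x.numel()):
        orig = flat_x[i].item()
        flat_x[i] = orig + eps
        up = black_box_loss(x)
        flat_x[i] = orig - eps
        down = black_box_loss(x)
        flat_x[i] = orig  # restore the perturbed element
        flat_g[i] = (up - down) / (2 * eps)
    return grad

inp = torch.randn(5, 4, dtype=torch.double)
out1 = model1(inp)

grad = fd_grad(out1)   # stands in for BBOpt(out1, out2, loss, ...)
out1.backward(grad)    # real backprop through Model1 starts from the estimate

estimated_weight_grad = model1.weight.grad.clone()
```

This needs two evaluations of Model2 per element of out1, so it only scales to small outputs; other black-box estimators trade accuracy for fewer evaluations.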


Hi, just a follow-up question to your reply. If one wants to update both models, does one need to create two optimizers, one for each model, and then call opt1.step() and opt2.step() after loss.backward()?

You can do that or create an optimizer that works with both models:
full_opt = torch.optim.SGD(list(Model1.parameters())+list(Model2.parameters()), other_args) and then you can just call full_opt.step() to update both models.


Hello! This is mostly related: I am attempting to set up mutual learning with PyTorch and am unclear about how. I have two models whose loss values are related (each affects the loss value of the other model, but they are not the same), and each model needs to be back-propagated with its corresponding loss value.

Is there a way to backprop each loss value to its corresponding model?

You really saved my day!
Just one question: would the + sign between the two models’ parameters work, or will I have to use itertools.chain, as follows?

full_opt = torch.optim.Adam( itertools.chain( model1.parameters(), model2.parameters(), model3.parameters() ), other_args )


You can chain the iterators to be clean or be lazy and just create lists that you can then add up :smiley:
Both will work!
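
A small sketch comparing the two options, using nn.Linear layers as hypothetical stand-ins for the models and lr=0.1 in place of other_args. Note that + only works on the lists: model.parameters() returns a generator, and adding two generators raises a TypeError.

```python
import itertools

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two models.
model1 = nn.Linear(4, 3)
model2 = nn.Linear(3, 1)

# Option A: turn the generators into lists and concatenate them.
opt_list = torch.optim.SGD(
    list(model1.parameters()) + list(model2.parameters()), lr=0.1
)

# Option B: chain the generators without building intermediate lists.
opt_chain = torch.optim.SGD(
    itertools.chain(model1.parameters(), model2.parameters()), lr=0.1
)

# Both optimizers end up tracking the same number of parameter tensors.
n_list = sum(len(g["params"]) for g in opt_list.param_groups)
n_chain = sum(len(g["params"]) for g in opt_chain.param_groups)

# Using + directly on the generators does not work.
try:
    model1.parameters() + model2.parameters()
    plus_on_generators_works = True
except TypeError:
    plus_on_generators_works = False

# One step with a single optimizer updates both models.
w1_before = model1.weight.detach().clone()
w2_before = model2.weight.detach().clone()
loss = model2(model1(torch.randn(8, 4))).pow(2).mean()
opt_chain.zero_grad()
loss.backward()
opt_chain.step()
both_updated = (
    not torch.equal(w1_before, model1.weight)
    and not torch.equal(w2_before, model2.weight)
)
```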

Hi, I have a related question regarding this problem. If I want to update the parameters of model1 while keeping model2 fixed, do I have to set requires_grad=False for the params in model2.parameters()? Actually, I guess that setting the optimizer to torch.optim.SGD(model1.parameters()) will not change model2’s parameters. I’m so confused :frowning:

Both will work, but there are subtle differences.
The short answer is to do both: set requires_grad to False and do not give these parameters to the optimizer.

If you only set requires_grad to False but still give the parameters to the optimizer, the weights might still be updated even though the gradient is 0, due to L2 regularization (weight decay) or momentum.
If you only leave them out of the optimizer, then you might be doing extra computation to compute gradients for Tensors that don’t need them.
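
To make the first point concrete, here is a tiny sketch with a single hypothetical parameter and arbitrarily chosen lr and weight_decay values. With a gradient of exactly zero, SGD with weight decay still shrinks the weight; as far as I know, the built-in optimizers only skip a parameter when its .grad is None, which is the case for a parameter that never received a gradient.

```python
import torch

# Case 1: the gradient exists but is all zeros (e.g. left over from an earlier step).
w_zero_grad = torch.nn.Parameter(torch.ones(3))
opt_a = torch.optim.SGD([w_zero_grad], lr=0.1, weight_decay=0.5)
w_zero_grad.grad = torch.zeros_like(w_zero_grad)
opt_a.step()
# Update rule: w <- w - lr * (grad + weight_decay * w) = 1 - 0.1 * 0.5 = 0.95

# Case 2: the parameter never received a gradient, so .grad is None.
w_none_grad = torch.nn.Parameter(torch.ones(3), requires_grad=False)
opt_b = torch.optim.SGD([w_none_grad], lr=0.1, weight_decay=0.5)
opt_b.step()  # skipped: there is no gradient to apply weight decay to
```

Leaving the frozen parameters out of the optimizer avoids having to reason about which of the two cases applies.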

Hi, I have a related question regarding this problem. Could you help me?
I have two models, and I would like to use the sum of their losses to update both of them.

The code is as follows:

spat_out = spat_model(spat_data)
temp_out = temp_model(temp_data)

spat_loss = spat_criterion(spat_out, labels)
temp_loss = temp_criterion(temp_out, labels)

loss = spat_loss + temp_loss


Which models’ gradients will loss.backward() compute? Only the last one, or both of them?

If you don’t use .detach() or torch.no_grad(), it will backprop on everything you used. So here both.
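
A quick way to check this, sketched with hypothetical nn.Linear stand-ins for spat_model and temp_model, random data, and a shared cross-entropy criterion:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two models and their data.
spat_model = nn.Linear(4, 2)
temp_model = nn.Linear(6, 2)
criterion = nn.CrossEntropyLoss()

spat_data = torch.randn(8, 4)
temp_data = torch.randn(8, 6)
labels = torch.randint(0, 2, (8,))

spat_loss = criterion(spat_model(spat_data), labels)
temp_loss = criterion(temp_model(temp_data), labels)

loss = spat_loss + temp_loss
loss.backward()  # fills .grad for the parameters of *both* models

spat_has_grads = all(p.grad is not None for p in spat_model.parameters())
temp_has_grads = all(p.grad is not None for p in temp_model.parameters())
```

Note that loss.backward() only computes the gradients; the weights change once an optimizer holding those parameters calls step().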

I am trying to train two models in stages:

model1 = Model1()
model2 = Model2()
optimizer1 = Adam(model1.parameters())
optimizer2 = Adam(model2.parameters())
for epoch in range(epochs):
    for stage in range(stages):
        for step in range(steps1):
            pred = model1()
            loss1.backward() # depends on model1 only
        for step in range(steps2):
            with torch.no_grad():
                encoded = model1.encoder(inp)
            projection = model1.proj(encoded)
            pred = model2(projection)

And I am getting a CUDA error: an illegal memory access was encountered (embedding_dense_backward_cuda at /pytorch/aten/src/ATen/native/cuda/Embedding.cu:267)
(model1 is a BERT)
Is this an incorrect way to do this? How should I approach updating the model’s weights in this setting?

If you are using PyTorch 1.5, update to 1.6 or the nightly binaries, as assert statements were not properly working in 1.5.
Based on the illegal memory access in the embedding layer, I guess you are passing an input containing invalid indices to this layer. Instead of this memory violation you should get a proper indexing error.
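
One way to test this guess is to check the indices against the embedding size before the forward pass, or to run the same batch on the CPU, where an out-of-range index raises a readable Python exception. A minimal sketch, where the helper check_indices and the toy sizes are made up for illustration:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

def check_indices(ids, embedding):
    """Return True if every index is a valid row of the embedding table."""
    return bool(((ids >= 0) & (ids < embedding.num_embeddings)).all())

good_ids = torch.tensor([1, 5, 9])
bad_ids = torch.tensor([1, 5, 12])  # 12 is out of range for 10 embeddings

good_ok = check_indices(good_ids, emb)
bad_ok = check_indices(bad_ids, emb)

# On the CPU the invalid lookup fails with an exception instead of a memory error.
# The exact exception type may differ between PyTorch versions, so both are caught.
try:
    emb(bad_ids)
    cpu_raised = False
except (IndexError, RuntimeError):
    cpu_raised = True
```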

Thanks a lot! That actually helped me trace the error.
Indeed, one of the asserts in the loss function started failing as soon as I updated the torch version.

As a follow-up to the question above, let’s suppose that we have this:

spat_out = spat_model(spat_data)
temp_out = temp_model(spat_out.detach())

spat_loss = spat_criterion(spat_out, labels)
temp_loss = temp_criterion(temp_out, labels)

loss = spat_loss + temp_loss


Basically, the output of the first model becomes the input for the second model. If I create the input for the second model with ‘spat_out.detach()’, so that it is treated as an independent input, will this backprop through the first and second models independently? I.e. the second model’s gradient does not flow into the first model?
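
This can be checked directly. In the sketch below (hypothetical nn.Linear stand-ins, random data, a shared cross-entropy criterion), the gradient that spat_model receives from the combined loss is compared with the gradient it receives from spat_loss alone. With the detach() in place the two should match, meaning temp_loss contributes nothing to the first model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the two models and their data.
spat_model = nn.Linear(4, 2)
temp_model = nn.Linear(2, 2)
criterion = nn.CrossEntropyLoss()

spat_data = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))

spat_out = spat_model(spat_data)
temp_out = temp_model(spat_out.detach())  # cut the graph between the two models

spat_loss = criterion(spat_out, labels)
temp_loss = criterion(temp_out, labels)

loss = spat_loss + temp_loss
loss.backward()
grad_from_combined = spat_model.weight.grad.clone()
temp_has_grads = all(p.grad is not None for p in temp_model.parameters())

# Reference: gradient of spat_loss alone with respect to spat_model.
spat_model.zero_grad()
criterion(spat_model(spat_data), labels).backward()
grad_from_spat_only = spat_model.weight.grad.clone()

first_model_unaffected_by_temp_loss = torch.allclose(
    grad_from_combined, grad_from_spat_only
)
```

The second model still gets its own gradients from temp_loss; only the path back into the first model is cut.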