Different optimizers for shared parameters

Hi,

I’ve seen some similar questions, but none that match exactly, and I haven’t found an answer.
I have an encoder-decoder model with a reconstruction loss, and the encoder’s outputs also go to another model with its own loss –
encoded = encoder(x)
decoded = decoder(encoded)
loss_recon = F.mse_loss(decoded, x)   # ||decoded - x||^2
y = model2(encoded)
loss_model = F.mse_loss(y, target)    # ||y - target||^2

Now, I want to update the encoder and decoder with respect to the reconstruction loss via one optimizer (say optimizer1, which holds the encoder and decoder parameters), and in addition I want to update the encoder with the loss from the second model via a second optimizer (say optimizer2, which holds only the encoder parameters).
So optimizer1 and optimizer2 share the encoder’s parameters. How can I do both optimization steps with a single pass through the graph?
For instance, using
optimizer1.zero_grad()
loss_recon.backward(retain_graph=True)
optimizer1.step()

optimizer2.zero_grad()
loss_model.backward()
optimizer2.step()

doesn’t work, because optimizer1.step() has already changed the encoder’s parameters, so the gradients computed by loss_model.backward() no longer correspond to the parameters that optimizer2 updates.

using
optimizer1.zero_grad()
optimizer2.zero_grad()

loss_recon.backward(retain_graph=True)
loss_model.backward()

optimizer1.step()
optimizer2.step()

doesn’t work either, since the two backward calls accumulate into the same .grad attributes, so each optimizer would step with the combined gradients rather than only its own.

Any ideas?

Edit:
To simplify the question – assume I have a single model with two losses, and I want to use a different optimizer for each loss. Is there any way I can do this with a single pass through the model?

optimizer1 = Adam(model.parameters(), lr=lr1)
optimizer2 = Adam(model.parameters(), lr=lr2)
out = model(x)
loss1 = criterion1(out, labels1) # update with respect to this loss with optimizer 1
loss2 = criterion2(out, labels2) # update with respect to this loss with optimizer 2

Your use case is a bit tricky, and you’ve already described some of the complications.
Let’s check your summary:

optimizer1 = Adam(model.parameters(), lr=lr1)
optimizer2 = Adam(model.parameters(), lr=lr2)
out = model(x)
loss1 = criterion1(out, labels1) # update with respect to this loss with optimizer 1
loss2 = criterion2(out, labels2) # update with respect to this loss with optimizer 2

Your use case uses two different losses, loss1 and loss2. These losses would compute different gradients (grad) for the same model parameters (param), assuming the forward activations (act) are still alive and valid.
Each optimizer, however, should use only the “corresponding” gradients for its update.

A potential approach (sketched in code below) could be to:

  • calculate grad1 from loss1 using act1 and param1
  • store grad1 separately and zero out the param.grad attributes
  • calculate grad2 from loss2 using act1 and param1
  • update the model using opt2 to create parameter set param2
  • zero out the gradients, restore the .grad attributes from the previously stored grad1 tensors
  • update !!! param2 !!! using opt1 and grad1
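
In code, the sequence could look like this – a minimal sketch with a toy nn.Linear model and placeholder MSE losses, just to illustrate the order of operations (all names and shapes here are illustrative, not from your code):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 10)  # toy stand-in for the real model
optimizer1 = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer2 = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 10)
labels1 = torch.randn(8, 10)
labels2 = torch.randn(8, 10)

out = model(x)
loss1 = F.mse_loss(out, labels1)
loss2 = F.mse_loss(out, labels2)

# grad1 from loss1; keep the graph alive for the second backward
loss1.backward(retain_graph=True)

# stash grad1 and zero out the .grad attributes
grads1 = [p.grad.clone() for p in model.parameters()]
optimizer1.zero_grad()

# grad2 from loss2, using the same activations and parameters
loss2.backward()

# update with optimizer2 -> parameter set param2
optimizer2.step()
optimizer2.zero_grad()

# restore grad1 and update param2 with optimizer1 (the mathematically questionable step)
for p, g in zip(model.parameters(), grads1):
    p.grad = g
optimizer1.step()
optimizer1.zero_grad()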

Note that the last step is mathematically wrong, since grad1 was never calculated using these updated parameters.
I don’t know what your exact use case is, but it seems that you want to update the model twice using stale gradients and losses, so could you explain the idea a bit more?

Hi,

Thanks for the quick reply! Yes, I guess what you suggested will work, although it’s a bit hacky.

Maybe I’ll describe my use case from a different angle, and perhaps there will be a better solution:
The problematic point is that I have two losses to train the encoder, and one of the losses is also used to train another model that takes the encoder’s output, but enters with a different scale –
I have an encoder-decoder model trained with some loss (say loss1). The encoded vectors are then used
as input to another model, and there’s another loss at the output of that model (say loss2).
Now, I want to update the encoder parameters using loss1 + 0.001*loss2, and to update the parameters of the second model (the one that uses the encoder output) using loss2. The problem is that loss2 enters with a different scale in the encoder update (*0.001) than in the second model’s update (*1).
So I can’t simply use 3 optimizers (encoder, decoder, model) with loss = loss1 + 0.001*loss2, since that would update the second model with 0.001*loss2 instead of the full loss2. I can compensate by increasing the learning rate of model 2, but that is not equivalent, and it seems to give bad results.
Any ideas here?

And another question: how would you store and restore the gradients of the model efficiently?

I could think of the following solution. See if it works for your use case.

  1. First, call loss1.backward(retain_graph=True). This will accumulate loss1’s gradients in the encoder and decoder, and keep the computation graph intact.
  2. Next, register a backward hook on the encoded tensor that multiplies its gradient by 0.001.
  3. Now, call loss2.backward(). Conceptually, this will backpropagate loss2’s gradients to the second model and backpropagate 0.001 * loss2’s gradients to the encoder.

Hope it helps!


Yes, this was one of the first things I tried, but the second backward (loss2.backward()) replaces the gradients from the first backward (rather than accumulating them), so it ends up updating only with the gradients from loss2.

I am not sure why it didn’t work for you.
PyTorch will not replace the gradients, but will accumulate them. Perhaps you should check whether your code calls zero_grad() somewhere in between?
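
For example, a tiny standalone check (just an illustration, not from your code) shows the accumulation:

import torch

p = torch.ones(1, requires_grad=True)
(2 * p).sum().backward()
print(p.grad)   # tensor([2.])
(3 * p).sum().backward()
print(p.grad)   # tensor([5.]) -> the gradients were accumulated, not replaced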

The following code shows the solution that I meant:
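
(A minimal sketch, with toy linear modules standing in for the encoder, the decoder and the second model; shapes, learning rates and loss functions are just placeholders.)

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(10, 4)   # toy stand-ins for the real modules
decoder = nn.Linear(4, 10)
model2 = nn.Linear(4, 2)

opt_enc_dec = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_model2 = torch.optim.Adam(model2.parameters(), lr=1e-3)

x = torch.randn(8, 10)
target = torch.randn(8, 2)

opt_enc_dec.zero_grad()
opt_model2.zero_grad()

encoded = encoder(x)
decoded = decoder(encoded)
y = model2(encoded)

loss1 = F.mse_loss(decoded, x)
loss2 = F.mse_loss(y, target)

# 1. backprop loss1 into the encoder and decoder; keep the graph alive
loss1.backward(retain_graph=True)

# 2. register the hook *after* the first backward, so it only scales
#    the gradients of the second backward pass
encoded.register_hook(lambda grad: 0.001 * grad)

# 3. backprop loss2: model2 gets the full gradient, while the encoder
#    accumulates 0.001 * loss2's gradients on top of loss1's
loss2.backward()

opt_enc_dec.step()  # encoder: grad(loss1) + 0.001 * grad(loss2); decoder: grad(loss1)
opt_model2.step()   # model2: grad(loss2)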
