Updating two sets of parameters using two optimizers FAILS

I’m trying to train a model under two different loss functions using two optimizers: the first updates the whole model’s parameters, and the second updates only the decoder’s. The following is a simplified snippet of what I mean:

opt1 = Adam(model.parameters())          # updates encoder + decoder
opt2 = Adam(model.decoder.parameters())  # updates decoder only

opt1.step()
opt2.step()

The problem is that when the model is initialized, I see the following error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons:

1) Use of a module parameter outside the `forward` function.
Please make sure model parameters are not shared across multiple concurrent
forward-backward passes. or try to use _set_static_graph() as
a workaround if this module graph does not change during training loop.

2) Reused parameters in multiple reentrant backward passes.
For example, if you use multiple `checkpoint` functions to wrap the same part of your model,
it would result in the same set of parameters been used by different reentrant backward passes
multiple times, and hence marking a variable ready multiple times. DDP does not support such use
cases in default. You can try to use _set_static_graph() as a workaround
if your module graph does not change over iterations.
Parameter at index 95 has been marked as ready twice. This means that multiple autograd engine 
hooks have fired for this particular parameter during this iteration. You can set the environment variable
TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further
debugging.

I know this is because decoder.parameters() are used by both optimizers, but I don’t know how to fix this error.

Any help is very much appreciated

Are you trying to have a larger effective learning rate for the decoder of your model? In that case, multiple param groups would be more suitable:

opt = Adam([
    {'params': model.encoder.parameters(), 'lr': encoder_lr},
    {'params': model.decoder.parameters(), 'lr': decoder_lr},
])

No, I’m using two optimizers to update two sets of parameters based on two different loss functions.

Why is summing the two losses and using a single optimizer not an option here?

I want to update the (encoder+decoder) parameters based on the first loss… and update the (decoder) parameters based on the second loss. Can I do that using one optimizer?

Hey @Anwarvic, will it work if you manually exclude the decoder parameters from the first optimizer? Something like:

decoder_param_ids = [id(p) for p in decoder.parameters()]

opt1 = Adam([p for p in model.parameters() if id(p) not in decoder_param_ids])
opt2 = Adam(decoder.parameters())
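For reference, a quick self-contained check of that exclusion trick (the toy model below is hypothetical, just to make the snippet runnable):

```python
import torch.nn as nn
from torch.optim import Adam

# Hypothetical toy model, standing in for the real encoder-decoder.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)

model = ToyModel()

# Collect ids of the decoder's parameters so we can filter them out.
decoder_param_ids = {id(p) for p in model.decoder.parameters()}

opt1 = Adam([p for p in model.parameters() if id(p) not in decoder_param_ids])
opt2 = Adam(model.decoder.parameters())

# opt1 ends up with only the encoder's weight and bias.
print(len(opt1.param_groups[0]['params']))  # 2
```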

Thank you for your answer. But in your code, opt1 has only the encoder’s parameters while opt2 has the decoder’s parameters. In my use case, I need opt1 to have all (encoder + decoder) parameters.

I should have added this to my question, but my code works just fine on CPU. The problem only occurs when I use multiple GPUs (DDP).

Yes, I have finally made it work thanks to God.

The problem was in using two optimizers, though I don’t know why. The reason behind using them was to update two sets of parameters separately, and that can be done with just one optimizer. Here is what I did:

First, I created two loss functions:

criterion = Criterion()
decoder_criterion = AnotherCriterion()

Then, I created one optimizer over the whole model’s parameters:

opt = Optim(model.parameters())

Then, you can compute the two losses separately like so:

model_loss = criterion(model_output, model_target)
decoder_loss = decoder_criterion(decoder_output, decoder_target)

Finally, you can backpropagate through the two loss functions at the same time like so:

loss = model_loss + decoder_loss
loss.backward()
opt.step()
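Putting the pieces together, here is a minimal self-contained sketch of this single-optimizer setup (the toy encoder/decoder, the MSE criteria, and the random targets are placeholders for illustration):

```python
import torch
import torch.nn as nn
from torch.optim import Adam

# Hypothetical toy encoder-decoder, standing in for the real model.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 8)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ToyModel()
criterion = nn.MSELoss()          # loss over the whole model's output
decoder_criterion = nn.MSELoss()  # loss involving only the decoder

# One optimizer over ALL parameters, instead of two optimizers.
opt = Adam(model.parameters())

x = torch.randn(2, 8)
model_target = torch.randn(2, 8)
decoder_target = torch.randn(2, 8)

model_output = model(x)
# A second, decoder-only pass (here: decoding a fresh hidden vector).
decoder_output = model.decoder(torch.randn(2, 4))

model_loss = criterion(model_output, model_target)
decoder_loss = decoder_criterion(decoder_output, decoder_target)

opt.zero_grad()
(model_loss + decoder_loss).backward()  # one backward pass; gradients accumulate
opt.step()
```

Because there is only a single backward pass, DDP’s gradient hooks fire exactly once per parameter, which is what avoids the “marked as ready twice” error.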

This is how I fixed my problem. Here are a few more details in case you are interested:

  1. loss.backward(): all it does is calculate the gradient of every parameter with requires_grad=True that was used to compute the loss. Given a parameter x, it accumulates the gradient of the loss w.r.t. x into the x.grad attribute.
  2. opt.step(): all it does is update these parameters using their gradients.
  3. (loss1 + loss2).backward(): this is equivalent to calling backward() on loss1 and loss2 separately and accumulating the gradients. You don’t have to worry.
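A quick sanity check of point 3, showing that backward on the summed loss yields the same gradients as two separate backward calls (toy tensors, for illustration):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def losses(p):
    # Two toy losses sharing the same parameter.
    return (p ** 2).sum(), (3 * p).sum()

# Separate backward calls: gradients accumulate in w.grad.
l1, l2 = losses(w)
l1.backward()
l2.backward()
grad_separate = w.grad.clone()

# Single backward over the summed loss.
w.grad = None
l1, l2 = losses(w)
(l1 + l2).backward()
grad_summed = w.grad.clone()

print(torch.allclose(grad_separate, grad_summed))  # True
```

Here the gradient is 2*w + 3 either way, i.e. [5.0, 7.0] for both variants.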

This is what I learned so far. Please, don’t hesitate to correct me if I’m wrong.