Using two optimizers to train a model with two outputs


I have been browsing the forum and could not find an answer to my question.
I have a model that has two outputs. At first, I combined the two losses and used one optimizer to calculate the gradients.

Now, I want to try using two separate optimizers. However, the two branches of the model (that result in two outputs) have overlapping parameters. How can I calculate and update the gradients of the model without updating the same parameters twice?

So what do you want to do for the parameters that appear in both?

If you want to update them twice, you don’t really need to do anything except keep the zero_grad-backward-step bits of each model disentangled. For the first backward, you probably want retain_graph=True or so.
An alternative could be to update each part alternately, on every second input batch.
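A minimal sketch of the "update the shared parameters twice" idea. The model, losses, and data below are hypothetical placeholders. Rather than reusing one forward pass with retain_graph=True (where the first step() would mutate parameters the retained graph still refers to), each branch gets its own forward / zero_grad / backward / step, which keeps the two updates fully disentangled:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TwoHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)  # shared ("overlapping") parameters
        self.head_a = nn.Linear(8, 1)
        self.head_b = nn.Linear(8, 1)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.head_a(h), self.head_b(h)

model = TwoHeadNet()
shared = list(model.backbone.parameters())
opt_a = torch.optim.SGD(shared + list(model.head_a.parameters()), lr=0.1)
opt_b = torch.optim.SGD(shared + list(model.head_b.parameters()), lr=0.1)

x = torch.randn(16, 4)
target_a, target_b = torch.randn(16, 1), torch.randn(16, 1)
w_before = model.backbone.weight.detach().clone()

# Branch A: its own zero_grad / backward / step.
out_a, _ = model(x)
loss_a = F.mse_loss(out_a, target_a)
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# Branch B: a fresh forward pass, so the graph matches the just-updated
# weights, then its own zero_grad / backward / step. The shared backbone
# is updated a second time here.
_, out_b = model(x)
loss_b = F.mse_loss(out_b, target_b)
opt_b.zero_grad()
loss_b.backward()
opt_b.step()
```

The cost of the second forward pass buys safety: each step() sees exactly the gradient of its own loss, with no stale graph involved.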

That said, I’d probably try your first approach. From my experience, what is important in such a case is some equilibration, i.e. you likely want the losses to be of the same order of magnitude, or in some fixed relation to each other.
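A sketch of that combined-loss approach with explicit weighting to bring the two terms to a similar order. The model, data, and the weights lambda_a / lambda_b are all hypothetical; in practice the weights are tuning knobs you pick (or anneal) based on the observed loss magnitudes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(4, 2)  # toy stand-in for a two-output model
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(8, 4)
target_a, target_b = torch.randn(8, 1), torch.randn(8, 1)

out = model(x)
loss_a = F.mse_loss(out[:, :1], target_a)
loss_b = F.mse_loss(out[:, 1:], target_b)

lambda_a, lambda_b = 1.0, 0.1                 # hypothetical weights
loss = lambda_a * loss_a + lambda_b * loss_b  # one combined objective
opt.zero_grad()
loss.backward()
opt.step()
```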

Best regards


I just thought of one possible solution (I don’t know whether it is feasible): I would do the two backward() calls first, which would accumulate the computed gradients on the overlapping parameters, right?
Then I could call one of the optimizers’ step() functions, which would apply the changes to every parameter on that branch (including the overlapping parameters), and then call zero_grad(), so that the other optimizer’s step() would not update the overlapping parameters again, because their gradients are 0 thanks to the previous zero_grad() call.

Is this a possibility, in your opinion?
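The sequence described above could be sketched as follows (model and data are hypothetical placeholders). Two caveats worth noting: the first step() applies the *summed* gradient (grad_a + grad_b) to the shared parameters, not just grad_a, and "zero gradient means no update" only holds for plain SGD — Adam with nonzero state, or any optimizer with weight decay, would still move the shared parameters in the second step():

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
backbone, head_a, head_b = nn.Linear(4, 8), nn.Linear(8, 1), nn.Linear(8, 1)
shared = list(backbone.parameters())
opt_a = torch.optim.SGD(shared + list(head_a.parameters()), lr=0.1)
opt_b = torch.optim.SGD(shared + list(head_b.parameters()), lr=0.1)

x = torch.randn(16, 4)
h = torch.relu(backbone(x))
loss_a = F.mse_loss(head_a(h), torch.randn(16, 1))
loss_b = F.mse_loss(head_b(h), torch.randn(16, 1))

# Both backward() calls first; gradients accumulate on the shared backbone.
loss_a.backward(retain_graph=True)  # keep the graph for the second backward
loss_b.backward()                   # backbone grads now hold grad_a + grad_b

opt_a.step()       # updates backbone (with the summed grad) and head_a
opt_a.zero_grad()  # clears backbone and head_a grads

w_shared = backbone.weight.detach().clone()
hb_before = head_b.weight.detach().clone()
opt_b.step()       # head_b is updated; the backbone grad is cleared, so
                   # plain SGD leaves the backbone untouched
```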

If you want to do it this way, I would just leave out the duplicate parameters in one of the step functions.
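A sketch of that variant: give the second optimizer only the non-shared parameters, so the shared part is stepped exactly once, here by opt_a, using the gradient accumulated from both losses. Model and data are hypothetical placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
backbone, head_a, head_b = nn.Linear(4, 8), nn.Linear(8, 1), nn.Linear(8, 1)

opt_a = torch.optim.Adam(
    list(backbone.parameters()) + list(head_a.parameters()), lr=1e-3
)
opt_b = torch.optim.Adam(head_b.parameters(), lr=1e-3)  # duplicates left out

x = torch.randn(16, 4)
h = torch.relu(backbone(x))
loss_a = F.mse_loss(head_a(h), torch.randn(16, 1))
loss_b = F.mse_loss(head_b(h), torch.randn(16, 1))

opt_a.zero_grad()
opt_b.zero_grad()
loss_a.backward(retain_graph=True)
loss_b.backward()   # backbone grads accumulate from both losses
opt_a.step()        # steps the backbone once, plus head_a
opt_b.step()        # steps head_b only
```

Because the parameter sets of the two optimizers are disjoint, no parameter can ever be stepped twice, whatever order the step() calls run in.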

But in general, it makes me wonder what you intend to gain over summing the losses. If you do the backprop and step separately and keep the grads for both, you get the adaptivity of the optimizer (if there is any, e.g. because you use Adam or so) tailored to each loss.
But even then: in the end, you optimize some loss function, and you likely implicitly optimize a combination of the two losses, so I wonder what there is to gain by making this implicit rather than explicit.

All that said, you can run the experiments; if the result works better for you, reality beats my limited intuition.

Best regards