How to perform optimization with multiple optimizers?

SKYHOWIE25 · October 19, 2017, 3:31am

Hi

Suppose I have three optimizers and I want to optimize a graph three times within one batch. Which of the following two orderings is the right way to call these functions?

First option:

optimizer1.zero_grad()
optimizer2.zero_grad()
optimizer3.zero_grad()

loss1.backward(retain_graph=True)
loss2.backward(retain_graph=True)
loss3.backward()

optimizer1.step()
optimizer2.step()
optimizer3.step()

Second option:

optimizer1.zero_grad()
loss1.backward(retain_graph=True)
optimizer1.step()

optimizer2.zero_grad()
loss2.backward(retain_graph=True)
optimizer2.step()

optimizer3.zero_grad()
loss3.backward()
optimizer3.step()

Thanks

ptrblck · October 19, 2017, 12:47pm

I don’t know which order is correct, but I’m curious why you would like to use 3 different optimizers sequentially.
Note that after a weight update your model has already moved on the loss surface, so the loss has to be recalculated for the current weights. Applying the “old” gradients seems wrong to me, so what is your approach?
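To make the point about stale gradients concrete, here is a minimal sketch of the step-by-step ordering with a fresh forward pass before every update, so each loss is computed for the current weights. The model, data, and the three loss functions are made up for illustration and are not from the original question.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)          # hypothetical model
x = torch.randn(8, 4)                  # hypothetical batch
target = torch.randn(8, 2)

# Hypothetical setup: three optimizers over the same parameters.
optimizers = [torch.optim.SGD(model.parameters(), lr=0.01) for _ in range(3)]

# Hypothetical losses standing in for loss1, loss2, loss3.
loss_fns = [
    lambda out: F.mse_loss(out, target),
    lambda out: F.l1_loss(out, target),
    lambda out: out.pow(2).mean(),
]

history = []
for opt, loss_fn in zip(optimizers, loss_fns):
    opt.zero_grad()
    out = model(x)        # fresh forward pass with the current weights
    loss = loss_fn(out)
    loss.backward()       # the graph is rebuilt each time, so no retain_graph
    opt.step()
    history.append(loss.item())
```

The cost is one extra forward pass per optimizer, but no gradient is ever applied to weights it was not computed for.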

SimonW · October 19, 2017, 3:52pm

First option: each optimizer will see the sum of the gradients from the three losses. In fact, you can do (loss1 + loss2 + loss3).backward(), which is more efficient.
Second option: each optimizer will see only the gradients from its own loss.
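A small check of the first option, as a sketch: three separate backward() calls accumulate into .grad, which gives the same result as a single backward() on the summed loss. The model and the three losses here are placeholders chosen for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)   # hypothetical model
x = torch.randn(8, 4)           # hypothetical batch

def make_losses():
    # Three hypothetical losses that share one forward graph.
    out = model(x)
    return out.pow(2).mean(), out.abs().mean(), out.sum()

# Three backward calls: gradients accumulate in .grad.
model.zero_grad()
loss1, loss2, loss3 = make_losses()
loss1.backward(retain_graph=True)
loss2.backward(retain_graph=True)
loss3.backward()
accumulated = [p.grad.clone() for p in model.parameters()]

# One backward call on the sum of the losses.
model.zero_grad()
loss1, loss2, loss3 = make_losses()
(loss1 + loss2 + loss3).backward()
summed = [p.grad.clone() for p in model.parameters()]

grads_match = all(
    torch.allclose(a, s, atol=1e-6) for a, s in zip(accumulated, summed)
)
```

The summed version traverses the graph once instead of three times and does not need retain_graph.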

SKYHOWIE25 · October 19, 2017, 11:29pm

Hi

The three losses and optimizers each have their own purpose. For example, one is for the normal backpropagation that updates the parameters of the whole model. Another is used to update only certain layers with a specific gradient.

SKYHOWIE25 · October 19, 2017, 11:32pm

Hi Simon

One optimizer is used to update the whole graph and the others are only used to update part of it. So I guess the second one is the right choice for me?
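For reference, a sketch of that kind of setup, assuming a toy two-layer model: one optimizer is built over all parameters and another over the last layer only. The names opt_all and opt_head are invented for this example. backward() fills .grad for every parameter in the graph, but step() only touches the parameters that the optimizer was constructed with.

```python
import torch

torch.manual_seed(0)
# Hypothetical model: a small MLP whose last layer plays the "certain layers" role.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
x = torch.randn(16, 4)

opt_all = torch.optim.SGD(model.parameters(), lr=0.1)       # whole model
opt_head = torch.optim.SGD(model[2].parameters(), lr=0.1)   # last layer only

before_first = model[0].weight.detach().clone()
before_head = model[2].weight.detach().clone()

# optimizer.zero_grad() clears only that optimizer's own parameters,
# so model.zero_grad() is used here to clear every gradient.
model.zero_grad()
loss = model(x).pow(2).mean()   # hypothetical loss
loss.backward()                 # populates .grad on all parameters
opt_head.step()                 # updates only the last layer

first_unchanged = torch.equal(model[0].weight, before_first)
head_changed = not torch.equal(model[2].weight, before_head)
```

One thing to watch with partial optimizers: gradients left on the other layers keep accumulating unless something clears them before the next backward().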

SimonW · October 19, 2017, 11:34pm

It still depends on which gradients you want your optimizers to see.

Satish1901 · July 23, 2020, 12:57am

Hello

Were you able to do it? I am trying something similar but am getting this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
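One common cause of this message with the step-by-step ordering, though not necessarily the cause in this case: optimizer.step() modifies the weights in place, and a later backward() through the retained graph still needs the old values. The sketch below first attempts to reproduce that on a toy model, which is expected to raise on recent PyTorch versions, and then shows one possible workaround that computes every gradient with torch.autograd.grad before any step. The model, losses, and variable names are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
# Hypothetical model and batch.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2)
)
x = torch.randn(8, 4)
params = list(model.parameters())
opt1 = torch.optim.SGD(params, lr=0.01)
opt2 = torch.optim.SGD(params, lr=0.01)

def two_losses():
    # Two hypothetical losses sharing one forward graph.
    out = model(x)
    return out.pow(2).mean(), out.abs().mean()

# Reproduction attempt: step between two backward calls on the same graph.
loss1, loss2 = two_losses()
opt1.zero_grad()
loss1.backward(retain_graph=True)
opt1.step()                      # in-place weight update
raised = False
try:
    opt2.zero_grad()
    loss2.backward()             # graph still refers to the pre-step weights
except RuntimeError:
    raised = True

# Possible workaround: get all gradients first, then step.
loss1, loss2 = two_losses()      # fresh forward pass
grads1 = torch.autograd.grad(loss1, params, retain_graph=True)
grads2 = torch.autograd.grad(loss2, params)
before = [p.detach().clone() for p in params]
for opt, grads in ((opt1, grads1), (opt2, grads2)):
    for p, g in zip(params, grads):
        p.grad = g               # hand each optimizer only its own loss's gradient
    opt.step()
changed = any(not torch.equal(b, p) for b, p in zip(before, params))
```

Note that in the workaround the second step still uses gradients computed for the weights before the first step, which is the concern raised earlier in the thread. Recomputing the forward pass before each backward() avoids both the error and the stale gradients.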