Suppose I have three optimizers and I want to optimize a graph three times within one batch. Which one is the right order to call those functions?
I don’t know which order is correct, but I’m curious why you would like to use 3 different optimizers sequentially.
Note that after a weight update your model has already moved on the loss surface, so the loss has to be recalculated for the current weights. Applying the “old” gradients seems wrong to me, so what is your approach?
First option: each optimizer will see the sum of the gradients from the three losses. In fact, you can call
(loss1 + loss2 + loss3).backward(), which is more efficient.
Second option: each optimizer will see the gradients only from its specific loss.
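To make the first option concrete, here is a minimal sketch. The toy model, the input, and the three loss definitions are placeholders made up for illustration, not taken from the original post. It checks that calling backward() on each loss separately accumulates the same gradients as a single backward() on the summed loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 3)   # hypothetical toy model
x = torch.randn(8, 4)     # hypothetical input batch

def three_losses(out):
    # Hypothetical stand-ins for loss1, loss2, loss3.
    return out[:, 0].pow(2).mean(), out[:, 1].abs().mean(), out[:, 2].mean()

# Separate backward calls: gradients accumulate in each parameter's .grad.
model.zero_grad()
loss1, loss2, loss3 = three_losses(model(x))
loss1.backward(retain_graph=True)   # keep the shared graph for the next calls
loss2.backward(retain_graph=True)
loss3.backward()
grads_separate = [p.grad.clone() for p in model.parameters()]

# A single backward call on the summed loss.
model.zero_grad()
loss1, loss2, loss3 = three_losses(model(x))
(loss1 + loss2 + loss3).backward()
grads_summed = [p.grad.clone() for p in model.parameters()]

same = all(torch.allclose(a, b) for a, b in zip(grads_separate, grads_summed))
```

In this sketch, same should come out True, which is why every optimizer that steps afterwards sees the summed gradient. The summed version also needs only one pass through the graph and no retain_graph.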
The three losses and optimizers each have their own purpose. For example, one is for the normal backpropagation that updates the parameters of the whole model. Another is used to update only certain layers with a specific gradient.
One optimizer is used to update the whole graph and the others are only used to update part of it. So I guess the second one is the right choice for me?
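A setup like this could look roughly as follows. This is only a sketch under assumptions: the names backbone, head, opt_all and opt_head, the two losses, and the learning rates are all hypothetical. Each optimizer steps on gradients from its own loss, and the second loss is computed from a fresh forward pass so that it matches the already updated weights.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical model split into two parts.
backbone = nn.Linear(4, 8)
head = nn.Linear(8, 1)
model = nn.Sequential(backbone, nn.ReLU(), head)

opt_all = torch.optim.SGD(model.parameters(), lr=0.1)   # updates the whole model
opt_head = torch.optim.SGD(head.parameters(), lr=0.1)   # updates only the head

x = torch.randn(16, 4)
y = torch.randn(16, 1)

# Step 1: ordinary update of every parameter from the first loss.
model.zero_grad()
loss1 = nn.functional.mse_loss(model(x), y)
loss1.backward()
opt_all.step()

# Snapshot the weights so the effect of step 2 can be inspected.
backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = [p.detach().clone() for p in head.parameters()]

# Step 2: fresh forward pass with the updated weights, then update only the head.
model.zero_grad()               # drop the gradients left over from loss1
loss2 = model(x).abs().mean()   # placeholder for the second loss
loss2.backward()
opt_head.step()

backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
head_changed = any(
    not torch.equal(a, b) for a, b in zip(head_before, head.parameters())
)
```

Note that loss2.backward() still fills in gradients for the backbone; they are simply ignored because opt_head only holds the head parameters. The price of this approach is one extra forward pass per loss.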
It still depends on which gradients you want each optimizer to see.
Were you able to do it? I am trying something similar but am getting this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor ] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
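One common cause of this error, which may or may not be what is happening here, is calling optimizer.step() between two backward() calls that share the same graph: step() updates the weights in place, while the graph still needs the old values. The sketch below uses a made-up two-layer model and placeholder losses to show that ordering, followed by one possible fix that recomputes the forward pass after each step.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))   # hypothetical toy model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

# Problematic order: step() between two backward passes over the same graph.
out = model(x)
loss1, loss2 = out.mean(), out.pow(2).mean()   # placeholder losses
opt.zero_grad()
loss1.backward(retain_graph=True)
opt.step()                 # modifies the weights in place
try:
    loss2.backward()       # the graph still refers to the pre-step weights
    raised = False
except RuntimeError:
    raised = True

# One possible fix: run a fresh forward pass after each step().
opt.zero_grad()
model(x).mean().backward()
opt.step()

opt.zero_grad()
model(x).pow(2).mean().backward()
opt.step()

fixed_ok = all(torch.isfinite(p).all().item() for p in model.parameters())
```

Another way around it is to finish every backward() call before the first step(), but then the gradients accumulate, so each optimizer sees the summed gradient as in the first option above.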