Use Multiple Optimizers in one Model

Hi everyone,

I have a model consisting of 3 parts. Part 1 is an encoder; parts 2 and 3 are classifiers that both take the output of the encoder (part 1) as input. I have two optimizers: the first holds the parameters of parts 1 and 2 (named optimizer12), the second only the parameters of part 3 (optimizer3). I am calculating a loss for the output of part 2 (named loss2) and one for the output of part 3 (named loss3). Now I want to update part 3 and parts 1 & 2 alternately. I used the code below in my training method:

loss2 = criterion2(output2, target2)
loss3 = criterion3(output3, target3)
loss3.backward(retain_graph=True)
optimizer3.step() 
loss12 = loss2 + some_value*loss3
loss12.backward()
optimizer12.step()

However, this gives me the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2, 2000]] is at version 2; expected version 1 instead.

I think the variable mentioned in the error message is the output of part 3 (at least it has the same dimensions). Based on this answer, I assume that the error is caused by the multiplication of loss3 with some_value. However, I cannot get rid of this multiplication, so I need to find a way around it.
Can you please help me?

Thank you!

I think you might be running into this error because optimizer3.step() updates parameters that could have been used to calculate loss2, so loss12.backward() would then try to compute the gradients using stale intermediate forward activations (since the corresponding parameters were already updated, as described in the linked post).
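A generic sketch of that failure mode, using a hypothetical toy setup (one shared layer feeding one head, not your actual model): stepping an optimizer between the two backward calls modifies a tensor in-place that the second backward still needs.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical toy setup: a shared layer feeding one head, with two
# losses computed from the same forward pass.
shared = nn.Linear(4, 4)
head = nn.Linear(4, 2, bias=False)
opt = optim.SGD(head.parameters(), lr=0.1)

x = torch.randn(1, 4)
out = head(shared(x))
loss_a = out.sum()
loss_b = (out ** 2).sum()

loss_a.backward(retain_graph=True)
opt.step()  # in-place update of head.weight, which loss_b's backward still needs

err = None
try:
    loss_b.backward()  # backprop from head to shared uses the saved head.weight
except RuntimeError as e:
    err = e
print(err is not None)  # True: same "modified by an inplace operation" error
```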

Thank you for your fast answer! optimizer3 only updates the parameters of part 3, so the new parameters should not affect loss2 (which only depends on the parameters of parts 1 and 2).
If I understood the post you mentioned correctly, the solution would be to set retain_graph=False? That won't work for me, since it results in the following error:

RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed).

In this case, I think autograd is trying to backpropagate through part 3 a second time (which is what I want). Even if I calculate loss3 again after the optimizer3 step (so that the loss depends on the new parameters), I still get this error when calling loss12.backward().

Am I misunderstanding an important concept here or where is my problem?

Thank you!

Thanks for the update, and yes, you are right: based on the description, loss3 should be causing the issue, not loss2.

That is unexpected. Could you post an executable code snippet to reproduce this issue?

import torch
import torch.nn as nn
import torch.optim as optim

class myModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100,200)
        self.fc2 = nn.Linear(200, 2, bias=False)
        self.fc3 = nn.Linear(200, 3, bias=False)

    def forward(self,x):
        h = self.fc1(x)
        out2 = self.fc2(h)
        out3 = self.fc3(h)
        return out2, out3

    

if __name__ == '__main__':
    torch.autograd.set_detect_anomaly(True)
    some_value = 0.5
    model = myModel()
    optimizer12 = optim.Adam(list(model.fc1.parameters()) + list(model.fc2.parameters()), lr=1e-3, weight_decay=1e-4)
    optimizer3 = optim.Adam(model.fc3.parameters(), lr=1e-3, weight_decay=1e-4)
    criterion2 = nn.CrossEntropyLoss()
    criterion3 = nn.CrossEntropyLoss()
    input = torch.rand((1,100))
    label2 = torch.randint(0,2,(1,))
    label3 = torch.randint(0,3,(1,))
    

    output2, output3 = model(input)
    
    loss2 = criterion2(output2, label2)
    loss3 = criterion3(output3, label3)

    #loss3.backward() # produces  RuntimeError: Trying to backward through the graph a second time
    loss3.backward(retain_graph=True) # produces: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
    optimizer3.step() 
    
    loss3 = criterion3(output3, label3)
    loss12 = loss2 + some_value*loss3
    loss12.backward()
    optimizer12.step()

Thanks for the code snippet.
The same root cause can be seen in your code, as output2 and output3 share self.fc1, which is also the reason why you would need to use loss3.backward(retain_graph=True).
optimizer3.step() will update the parameters of fc3 in-place, which would then lead to a wrong gradient during the backpropagation through fc3 to fc1.
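This can be made visible with the internal Tensor._version counter that autograd checks whenever it uses a saved tensor (an undocumented attribute, so for illustration only):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Same fc3 shape as in the repro snippet.
fc3 = nn.Linear(200, 3, bias=False)
opt3 = optim.Adam(fc3.parameters(), lr=1e-3)

h = torch.randn(2, 200, requires_grad=True)
out3 = fc3(h)                     # fc3.weight is saved for the backward pass
v_before = fc3.weight._version    # internal version counter

out3.sum().backward(retain_graph=True)
opt3.step()                       # in-place parameter update bumps the counter
v_after = fc3.weight._version

print(v_after > v_before)         # True: the saved weight is now "stale"
```

Any later backward that needs the saved fc3.weight compares these versions and raises the RuntimeError you saw.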

Thank you very much!
So if I got that right, the solution would be as follows:

    output2, output3 = model(input)
    loss3 = criterion3(output3, label3)
    loss3.backward()
    optimizer3.step() 
    
    optimizer12.zero_grad()
    output2, output3 = model(input)
    loss2 = criterion2(output2, label2)
    loss3 = criterion3(output3, label3)
    loss12 = loss2 + some_value*loss3
    loss12.backward()
    optimizer12.step()

Which means one has to forward the input through the network a second time, in order to get the “new” activations for the shared layer.
I am not sure about the optimizer12.zero_grad(). I think I will have to zero the gradients before the second backward pass, at least for fc1 (which already received gradients from loss3.backward()), as otherwise the gradients would be summed up over both backward passes. Is that right?
Again, thanks a lot for your help!

Yes, executing another forward pass should work. Another approach would be to compute the gradients for both losses first and call optimizerX.step() afterwards, but it depends on your actual use case whether that's possible.
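For the toy model from the repro snippet, that alternative could look like the sketch below. It uses torch.autograd.grad to grab fc3's gradients from loss3 alone and then overwrites fc3's .grad after loss12.backward() (my own workaround for the accumulation into fc3.grad; whether overwriting is acceptable depends on your use case):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Same toy model as the repro snippet.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 200)
        self.fc2 = nn.Linear(200, 2, bias=False)
        self.fc3 = nn.Linear(200, 3, bias=False)

    def forward(self, x):
        h = self.fc1(x)
        return self.fc2(h), self.fc3(h)

model = MyModel()
optimizer12 = optim.Adam(list(model.fc1.parameters()) + list(model.fc2.parameters()), lr=1e-3)
optimizer3 = optim.Adam(model.fc3.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
some_value = 0.5

x = torch.rand(1, 100)
label2 = torch.randint(0, 2, (1,))
label3 = torch.randint(0, 3, (1,))

optimizer12.zero_grad()
optimizer3.zero_grad()
output2, output3 = model(x)
loss3 = criterion(output3, label3)
loss12 = criterion(output2, label2) + some_value * loss3

# Gradients of loss3 w.r.t. fc3 only; keep the graph for the second backward.
grads3 = torch.autograd.grad(loss3, list(model.fc3.parameters()), retain_graph=True)
loss12.backward()  # fills .grad for fc1 and fc2 (and fc3, overwritten below)
for p, g in zip(model.fc3.parameters(), grads3):
    p.grad = g     # fc3 is updated from loss3 alone, not from loss12

w3_before = model.fc3.weight.detach().clone()
optimizer3.step()    # both steps happen only after all backward passes,
optimizer12.step()   # so no saved tensor is modified before it is used
```

Because no optimizer steps in between the backward calls, no in-place modification can invalidate a saved tensor, and only a single forward pass is needed.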

Zeroing out the gradients of optimizer12 looks valid, but note that the forward pass itself does not create any gradients; they are computed in the backward call.
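A small check of the accumulation behaviour discussed above (a hypothetical minimal layer, not your model): .grad sums across backward calls until it is zeroed.

```python
import torch
import torch.nn as nn

# .grad accumulates over backward calls until zeroed.
fc = nn.Linear(3, 1, bias=False)
x = torch.ones(1, 3)

fc(x).sum().backward()
g1 = fc.weight.grad.clone()     # gradient after the first backward

fc(x).sum().backward()          # second backward: gradients are added, not replaced
print(torch.allclose(fc.weight.grad, 2 * g1))  # True

fc.weight.grad = None           # equivalent in effect to optimizer.zero_grad()
```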