Suppose we have two networks with parameters
And they all have the same input x such that
y1, y2=f(x) [multi-task network with two outputs] and
What I want to do is this: Assume
y2=f2(x), to update
f we have two losses in which one is based on the prediction from
g(x), after one gradient update, we want to update
g which based on the performance of
f’ = f + \alpha * Grad(L(f1(x), y1_gt) + L(f2(x), g(x)))
g‘ = g + \beta * Grad(L(f1’(x), y1_gt))). # use the updated f’
\alpha, \beta are learning rates.
The current issue is: I don’t know how to implement the
My current code is (proved to be wrong):
# define my model1 with two outputs class model1(...): ... def forward(x): ... return y1, y2 # define my model2 with one output class model2(...): ... def forward(x): ... return z def model_fit(pred, gt): return [cross_entropy_loss(pred, gt)] # define optmizers: optimizer1 = optim.SGD(model1.parameters(), lr=1e-5) optimizer2 = optim.SGD(model2.parameters(), lr=1e-5) # run one iteration as a simple example: y1, y2 = model1(train_data) # y1,y2=f(x) z = model2(train_data) # z=g(x) optimizer1.zero_grad() optimizer2.zero_grad() train_loss1 = model_fit(y1, y1_gt) # compute loss1 L(f1(x), y1_gt) train_loss2 = model_fit(y2, z) # compute loss2 L(f2(x), g(x)) (train_loss1+train_loss2).backward(create_graph=True) # Grad(L(f1(x), y1_gt) + L(f2(x), g(x)) optimizer1.step() # f’ = f + \alpha * Grad(L(f1(x), y1_gt) + L(f2(x), g(x))) # 2nd forward pass based on the updated network 1. y1,y2 = model1(train_data) # y1,y2 = f'(x) train_loss1 = model1.model_fit(y1, y1_gt) # L(f1’(x), y1_gt)) train_loss1.backward() # Grad(L(f1’(x), y1_gt))) optimizer2.step() # g‘ = g + \beta * Grad(L(f1’(x), y1_gt)))
However, the performance didn’t go as expected. It only calculates the gradient on train_loss2 in the first step() and retains the same gradient on the second step().
Update: I think I have understood what my code is doing: in the first backward function: it calculates the gradient w.r.t. both model1 and model2, the optimizer1.step() updates the gradients in model1 (Grad(L(f1(x), y1_gt) + L(f2(x), g(x))) w.r.t. f). However, in the second forward pass, the model2 recreates the computational graph and it cannot retain the previous graph. Thus, the gradient of train_loss1 w.r.t. model2 would be simply zeros. Since I put a create_graph=True in the first gradient update, the second backward pass will cumulate the gradients, thus the same as train_loss2 gradient w.r.t. model2 in the first gradient update (Grad(L(f2(x), g(x)) w.r.t. g, (accumulates grad(L(f1(x), y1_gt) w.r.t. g which is zero) ).
Now the problem becomes: how to retain the graph in the second forward pass so that we can compute the higher order gradients.
Any comments are welcomed. Thanks!