Suppose we have two networks with parameters `f` and `g` respectively. Both take the same input `x`, such that `y1, y2 = f(x)` (a multi-task network with two outputs) and `z = g(x)`.

What I want to do is this: writing `y1 = f1(x)` and `y2 = f2(x)`, to update `f` we have two losses, one of which is based on the prediction from `g(x)`. After that gradient update, we want to update `g` based on the performance of `f1(x)`:

f' = f - \alpha * Grad(L(f1(x), y1_gt) + L(f2(x), g(x)))

g' = g - \beta * Grad(L(f1'(x), y1_gt))  # use the updated f'

\alpha and \beta are the learning rates.
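To make the two updates concrete, here is a minimal scalar sketch (the toy functions `f1(x) = f*x`, `f2(x) = f*x`, `g(x) = g*x`, the squared-error losses, and all the numbers are my own stand-ins, not the real networks). The key point is that `torch.autograd.grad(..., create_graph=True)` keeps `f'` inside the graph, so it remains differentiable w.r.t. `g`:

```python
import torch

# Toy scalar stand-ins (hypothetical): f1(x) = f*x, f2(x) = f*x, g(x) = g*x,
# with squared-error losses in place of the real L.
f = torch.tensor(1.0, requires_grad=True)
g = torch.tensor(2.0, requires_grad=True)
x, y1_gt = torch.tensor(3.0), torch.tensor(6.0)
alpha, beta = 0.1, 0.1

# Inner step: f' = f - alpha * d/df [ L(f1(x), y1_gt) + L(f2(x), g(x)) ]
inner_loss = (f * x - y1_gt) ** 2 + (f * x - g * x) ** 2
(df,) = torch.autograd.grad(inner_loss, f, create_graph=True)
f_prime = f - alpha * df  # stays in the graph and still depends on g

# Outer step: g' = g - beta * d/dg [ L(f1'(x), y1_gt) ]
outer_loss = (f_prime * x - y1_gt) ** 2
(dg,) = torch.autograd.grad(outer_loss, g)
g_new = g - beta * dg  # dg is nonzero because f' carries the graph back to g
```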

The current issue is: I don't know how to implement `Grad(L(f1'(x), y1_gt))`.

My current code (which proved to be wrong) is:

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# define my model1 with two outputs
class Model1(nn.Module):
    ...
    def forward(self, x):
        ...
        return y1, y2

# define my model2 with one output
class Model2(nn.Module):
    ...
    def forward(self, x):
        ...
        return z

def model_fit(pred, gt):
    return F.cross_entropy(pred, gt)  # return a scalar loss, not a list

# define optimizers:
model1, model2 = Model1(), Model2()
optimizer1 = optim.SGD(model1.parameters(), lr=1e-5)
optimizer2 = optim.SGD(model2.parameters(), lr=1e-5)

# run one iteration as a simple example:
y1, y2 = model1(train_data)  # y1, y2 = f(x)
z = model2(train_data)       # z = g(x)
optimizer1.zero_grad()
optimizer2.zero_grad()
train_loss1 = model_fit(y1, y1_gt)  # compute loss1 = L(f1(x), y1_gt)
train_loss2 = model_fit(y2, z)      # compute loss2 = L(f2(x), g(x))
(train_loss1 + train_loss2).backward(create_graph=True)  # Grad(L(f1(x), y1_gt) + L(f2(x), g(x)))
optimizer1.step()  # f' = f - \alpha * Grad(L(f1(x), y1_gt) + L(f2(x), g(x)))

# 2nd forward pass based on the updated network 1:
y1, y2 = model1(train_data)         # y1, y2 = f'(x)
train_loss1 = model_fit(y1, y1_gt)  # L(f1'(x), y1_gt)
train_loss1.backward()              # Grad(L(f1'(x), y1_gt))
optimizer2.step()  # g' = g - \beta * Grad(L(f1'(x), y1_gt))
```

However, the performance didn't go as expected: it only computes the gradient from train_loss2 in the first step() and retains that same gradient at the second step().

Update: I think I now understand what my code is doing. The first backward() computes gradients w.r.t. both model1 and model2, and optimizer1.step() updates the parameters of model1 **(Grad(L(f1(x), y1_gt) + L(f2(x), g(x))) w.r.t. f)**. However, the second forward pass recreates the computational graph from scratch and cannot retain the previous one, so the gradient of train_loss1 w.r.t. model2 is simply zero. Since I passed create_graph=True in the first backward, the second backward accumulates gradients, leaving model2's gradient the same as train_loss2's gradient from the first update **(Grad(L(f2(x), g(x))) w.r.t. g, accumulated with Grad(L(f1(x), y1_gt)) w.r.t. g, which is zero)**.
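A tiny illustration of that accumulation behaviour (my own toy example, not the post's models): without zero_grad() in between, successive backward() calls sum into `.grad`, so a second backward whose graph never reaches a parameter leaves that parameter's earlier gradient in place:

```python
import torch

# Toy demo of .grad accumulation across backward() calls without zero_grad().
w = torch.tensor(1.0, requires_grad=True)

((2 * w) ** 2).backward()  # d/dw of 4*w^2 is 8*w -> 8
first = w.grad.clone()

# A second loss whose graph never touches w contributes zero, so the
# accumulated gradient is unchanged -- just like train_loss1 w.r.t. model2.
unrelated = torch.tensor(5.0, requires_grad=True)
((3 * unrelated) ** 2).backward()
assert torch.equal(w.grad, first)  # still 8: only zero was accumulated
```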

Now the problem becomes: how do we retain the graph in the second forward pass so that we can compute the higher-order gradients?
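One possible direction, sketched here with toy stand-ins for the two networks (the shapes, the shared trunk, and the squared-error losses are my own assumptions; `torch.func.functional_call` needs a recent PyTorch): skip `optimizer1.step()` and instead build `f'` as plain differentiable tensors, then run the second forward pass functionally, so the graph from `f'` back to `g` is preserved:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Toy stand-ins: a shared trunk so f1 and f2 really share parameters.
torch.manual_seed(0)
model1 = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))  # f
model2 = nn.Linear(4, 1)                                             # g
optimizer2 = torch.optim.SGD(model2.parameters(), lr=1e-2)
alpha = 1e-2
train_data = torch.randn(16, 4)
y1_gt = torch.randn(16, 1)

out = model1(train_data)
y1, y2 = out[:, :1], out[:, 1:]  # f1(x), f2(x)
z = model2(train_data)           # g(x)
inner = ((y1 - y1_gt) ** 2).mean() + ((y2 - z) ** 2).mean()

# Differentiable inner update f' = f - alpha * grad: no optimizer1.step();
# the updated parameters stay as tensors inside the autograd graph.
names, params = zip(*model1.named_parameters())
grads = torch.autograd.grad(inner, params, create_graph=True)
f_prime = {n: p - alpha * dp for n, p, dp in zip(names, params, grads)}

# Second forward pass runs model1 with f' WITHOUT rebuilding a detached graph.
out2 = functional_call(model1, f_prime, (train_data,))
outer = ((out2[:, :1] - y1_gt) ** 2).mean()  # L(f1'(x), y1_gt)

optimizer2.zero_grad()
outer.backward()   # gradients now reach model2 through f'
optimizer2.step()  # g' = g - beta * Grad(L(f1'(x), y1_gt))
```

The difference from the original code is that `optimizer1.step()` mutates the parameters in place and severs them from the graph, whereas `f_prime` above is an output of differentiable operations, so `outer.backward()` can flow back through the inner update into `model2`.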

Any comments are welcome. Thanks!