Suppose we have two networks with parameters f and g, respectively.

Both take the same input x, so that y1, y2 = f(x) (f is a multi-task network with two outputs) and z = g(x).

What I want to do is this: let y1 = f1(x) and y2 = f2(x). To update f, we have two losses, one of which is based on the prediction from g(x). After one gradient update of f, we want to update g based on the performance of the updated f1(x):

f' = f - \alpha * Grad(L(f1(x), y1_gt) + L(f2(x), g(x)))

g' = g - \beta * Grad(L(f1'(x), y1_gt))  # use the updated f'

\alpha and \beta are learning rates.

The current issue is: I don't know how to implement Grad(L(f1'(x), y1_gt)).

My current code (which turned out to be wrong) is:

# define my model1 with two outputs
class Model1(nn.Module):
    ...
    def forward(self, x):
        ...
        return y1, y2

# define my model2 with one output
class Model2(nn.Module):
    ...
    def forward(self, x):
        ...
        return z

model1 = Model1(...)
model2 = Model2(...)

def model_fit(pred, gt):
    return cross_entropy_loss(pred, gt)

# define optimizers:
optimizer1 = optim.SGD(model1.parameters(), lr=1e-5)
optimizer2 = optim.SGD(model2.parameters(), lr=1e-5)

# run one iteration as a simple example:
y1, y2 = model1(train_data)  # y1,y2=f(x)
z = model2(train_data)   # z=g(x)

train_loss1 = model_fit(y1, y1_gt) # compute loss1 L(f1(x), y1_gt)
train_loss2 = model_fit(y2, z) # compute loss2  L(f2(x), g(x))

optimizer1.zero_grad()
(train_loss1 + train_loss2).backward(create_graph=True)  # Grad(L(f1(x), y1_gt) + L(f2(x), g(x)))
optimizer1.step()  # intended: f' = f - \alpha * Grad(L(f1(x), y1_gt) + L(f2(x), g(x)))

# 2nd forward pass based on the updated network 1.
y1, y2 = model1(train_data)  # y1, y2 = f'(x)
train_loss1 = model_fit(y1, y1_gt)  # L(f1'(x), y1_gt)
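For reference, here is a minimal, self-contained sketch of one way Grad(L(f1'(x), y1_gt)) can be computed: instead of calling optimizer1.step() (an in-place, non-differentiable update), take the inner gradients with torch.autograd.grad(..., create_graph=True) and apply the update functionally, so the outer loss stays differentiable with respect to g. The tiny nn.Linear stand-ins, the shapes, and the MSE loss standing in for L(f2(x), g(x)) are illustrative assumptions, not the original models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (assumptions): a single head for f, and g matching its output shape.
x = torch.randn(4, 3)
y1_gt = torch.randint(0, 2, (4,))

f = nn.Linear(3, 2)   # plays the role of f1 (model1's supervised head)
g = nn.Linear(3, 2)   # plays the role of g  (model2)
alpha, beta = 1e-2, 1e-2

# --- inner step: f' = f - alpha * Grad(L1 + L2) ---
y1 = f(x)
z = g(x)
inner_loss = F.cross_entropy(y1, y1_gt) + F.mse_loss(y1, z)  # MSE stands in for L(f2(x), g(x))

# create_graph=True keeps the dependence of these gradients on g in the graph
grads = torch.autograd.grad(inner_loss, [f.weight, f.bias], create_graph=True)
fast_weight = f.weight - alpha * grads[0]   # plain tensors, NOT an in-place parameter update
fast_bias = f.bias - alpha * grads[1]

# --- outer step: differentiate L(f1'(x), y1_gt) w.r.t. g through the inner update ---
y1_fast = F.linear(x, fast_weight, fast_bias)   # forward pass "as if" f had been updated
outer_loss = F.cross_entropy(y1_fast, y1_gt)

g_grads = torch.autograd.grad(outer_loss, list(g.parameters()))
with torch.no_grad():
    for p, dp in zip(g.parameters(), g_grads):
        p -= beta * dp   # g' = g - beta * Grad(L(f1'(x), y1_gt))
```

For real multi-layer models, writing the functional forward pass by hand becomes tedious; the `higher` library (higher.innerloop_ctx) automates exactly this pattern of differentiable inner-loop updates for arbitrary nn.Module instances.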