Hi,
I’m trying to train 2 networks in the following scheme:

```python
netA, netB = Net(), Net()
netA_opt = torch.optim.SGD(netA.parameters(), lr=0.1)
netB_opt = torch.optim.SGD(netB.parameters(), lr=0.1)

# Phase 1:
y1_A = netA(X1)
l1 = criterion1(y, y1_A)
l1.backward()

# Phase 2:
y2_A = netA(X2)
y2_B = netB(X2)
l2 = criterion2(y2_A, y2_B)
l2.backward(retain_graph=True)
netB_opt.step()

# Phase 3:
y1_B = netB(X1)
l3 = criterion3(y, y1_B)
l3.backward()
netA_opt.step()
```

Now, in phase 2 `netB` is updated as a function of `netA`'s predictions. In phase 3 I want to optimize `netA` so that it gives better updates in phase 2, i.e. so that the updated `netB` minimizes `l3`. In practice I see that `l3.backward()` does not calculate any gradients w.r.t. `netA`. What should I change?

Phase 3 doesn’t use any parameters of `netA` to calculate the loss `l3`, so no gradients will be computed for the parameters of `netA`.
If you want to optimize `netA` in phase 3, you would have to pass `y1_A` to `netB` so that the computation graph includes both models.
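A minimal sketch of that suggestion (the `nn.Linear` models, shapes, and MSE loss are placeholders, since the actual `Net` and criteria aren’t shown):

```python
import torch
import torch.nn as nn

netA = nn.Linear(4, 4)  # stand-in for Net()
netB = nn.Linear(4, 1)  # stand-in for Net()

X1 = torch.randn(8, 4)
y = torch.randn(8, 1)

y1_A = netA(X1)      # netA's output is part of the graph
y1_B = netB(y1_A)    # feed netA's prediction into netB
l3 = nn.functional.mse_loss(y1_B, y)
l3.backward()        # gradients now flow through netB into netA

print(netA.weight.grad is not None)  # True
```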

PS: you can post code snippets by wrapping them into three backticks ```, which makes debugging easier.

Thanks for the answer and the tip! This is the key problem.
In phase 2, `netB` is trained as a function of `netA`, so I want `netA` to be part of the computation graph that produced the update to `netB`. Then, when `netB` gives predictions, `netA` will get gradients.

In other words,

Phase 2 calculates the gradients of `netB` w.r.t. `netA`'s predictions and updates `netB`, but the optimizer step does not record this update as a function of `netA` (which is what I want it to do…)
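One way to get that behaviour (a sketch, not from the thread): replace `netB_opt.step()` with a manual, differentiable parameter update built from `torch.autograd.grad(..., create_graph=True)`, then evaluate `netB` with the updated parameters via `torch.func.functional_call`. The `nn.Linear` models, shapes, and losses below are placeholders:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

netA = nn.Linear(4, 4)  # stand-in for Net()
netB = nn.Linear(4, 1)  # stand-in for Net()
X1, X2 = torch.randn(8, 4), torch.randn(8, 4)
y = torch.randn(8, 1)
lr = 0.1

# Phase 2: update netB as a *differentiable* function of netA's output.
y2_A = netA(X2)
y2_B = netB(X2)
l2 = ((y2_B - y2_A) ** 2).mean()
grads = torch.autograd.grad(l2, list(netB.parameters()), create_graph=True)
# The updated parameters stay in the graph: they depend on netA.
new_params = {name: p - lr * g
              for (name, p), g in zip(netB.named_parameters(), grads)}

# Phase 3: run netB with the updated params; l3 now depends on netA.
y1_B = functional_call(netB, new_params, (X1,))
l3 = ((y1_B - y) ** 2).mean()
l3.backward()

print(netA.weight.grad is not None)  # True: gradients flow back to netA
```

This is the same unrolled-inner-step pattern used in meta-learning (e.g. MAML): the optimizer step becomes part of the graph, so the outer loss `l3` can backpropagate through it into `netA`.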