Now, in phase 2, netB is updated as a function of netA's predictions. In phase 3 I want to optimize netA so that it gives better updates in phase 2, i.e. so that netB minimizes l3. In practice I see that l3.backward() does not compute any gradients w.r.t. netA. What should I change?

Phase 3 doesn't use any parameters of netA to calculate the loss l3, so no gradients will be computed for netA's parameters.
If you want to optimize netA in phase 3, you would have to pass y1_a to netB so that the computation graph includes both models.
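As a minimal sketch of that idea (the models, shapes, and loss here are hypothetical stand-ins, since no code was posted): as long as netA's output is fed into netB without detaching it, l3.backward() will populate gradients in both models.

```python
import torch
import torch.nn as nn

# Hypothetical small models standing in for netA and netB
netA = nn.Linear(4, 4)
netB = nn.Linear(4, 1)

x = torch.randn(8, 4)
target = torch.randn(8, 1)

y1_a = netA(x)            # keep the graph: do NOT call .detach() here
l3 = ((netB(y1_a) - target) ** 2).mean()
l3.backward()             # the graph spans both models

print(netA.weight.grad is not None)  # → True
```

If y1_a were detached (or converted to a plain tensor) before being passed to netB, the graph would be cut and netA would receive no gradients.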

PS: you can post code snippets by wrapping them into three backticks ```, which makes debugging easier.

This is the key problem.
In phase 2, netB is trained as a function of netA, so I want netA to be part of the computation graph created when netB is updated. That way, when netB makes predictions, netA will receive gradients.

In other words,

Calculate the gradients of netB w.r.t. netA's predictions and update netB, but this update is not saved as a function of netA (which is what I want to do…)
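One common way to make the phase-2 update itself differentiable w.r.t. netA is a MAML-style inner step: compute netB's gradients with create_graph=True, build the updated parameters out-of-place, and evaluate them with torch.func.functional_call. This is a hedged sketch under the same hypothetical models and losses as above, not the poster's actual setup:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Hypothetical stand-ins for netA and netB
netA = nn.Linear(4, 4)
netB = nn.Linear(4, 1)

x = torch.randn(8, 4)
target = torch.randn(8, 1)
inner_lr = 0.1

# Phase 2: update netB as a differentiable function of netA's predictions.
y1_a = netA(x)                                   # graph reaches back into netA
l2 = ((netB(y1_a) - target) ** 2).mean()
grads = torch.autograd.grad(l2, list(netB.parameters()), create_graph=True)
updated_params = {
    name: p - inner_lr * g
    for (name, p), g in zip(netB.named_parameters(), grads)
}

# Phase 3: run the *updated* netB functionally. Because the update step is
# part of the graph, l3.backward() now produces gradients for netA.
l3 = ((functional_call(netB, updated_params, (netA(x),)) - target) ** 2).mean()
l3.backward()

print(netA.weight.grad is not None)  # → True
```

The key differences from a normal optimizer step: the update is done out-of-place (an in-place `optimizer.step()` on leaf parameters breaks the graph), and `create_graph=True` keeps the inner gradients differentiable so they can carry gradients back into netA.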