I have 2 neural networks (N1 and N2) and 2 loss functions (L1 and L2). The loss functions of the two networks depend on each other, so each loss has to be backpropagated with respect to the weights of both networks (say w1 and w2).

Note: I cannot just backpropagate the two loss functions separately and record the gradients to form the Jacobian matrix.

D = Jacobian Matrix =

partial derivative of L1 wrt weights of N1, partial derivative of L1 wrt weights of N2
partial derivative of L2 wrt weights of N1, partial derivative of L2 wrt weights of N2

Yes, you can use loss.backward() to calculate gradients with respect
to the weights of multiple networks.

PyTorch’s autograd facility calculates gradients with respect to tensors
that have requires_grad = True. It doesn’t matter whether such
tensors are wrapped in Parameters or belong to a Module or do not
all belong to the same network.
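
To make that concrete, here is a minimal sketch (the names free_weight and layer are made up for illustration) in which one trainable tensor is a bare leaf tensor and the others live inside a Module. A single backward() call fills in gradients for both.

```python
import torch

torch.manual_seed(0)

# A bare leaf tensor that is not wrapped in a Parameter or a Module.
free_weight = torch.randn(3, requires_grad=True)

# Weights that do live inside a Module.
layer = torch.nn.Linear(3, 1)

x = torch.randn(5, 3)
# The scalar loss depends on both the bare tensor and the Module's weights.
loss = layer(x * free_weight).pow(2).mean()
loss.backward()

print(free_weight.grad.shape)   # gradient for the bare tensor
print(layer.weight.grad.shape)  # gradient for the Module's weight
```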

So far so good.

I don’t understand specifically what you mean by this.

This can certainly be done. Consider a made-up example:

pred1 = N1(input1)
pred2 = N2(torch.cat((input2, pred1), dim=1))  # the output of N1 is fed into N2 (dim=1 assumes batch-first 2-D tensors)
loss1 = L1(pred1, targ1)  # depends only on weights of N1
loss2 = L2(pred2, targ2)  # depends on the weights of both N1 and N2
loss = loss1 + loss2  # also depends on the weights of both N1 and N2
loss.backward()  # calculates the gradient of loss with respect to the weights of both N1 and N2

This will work fine. Again, autograd only knows and cares that the
weights of N1 and N2 are leaves of the computation graph it is
processing (with requires_grad = True). It doesn’t matter that
these leaves happen to belong to different networks.
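
Here is a runnable version of that made-up example, as a sketch only: the layer sizes and the choice of mse_loss are arbitrary assumptions. It also shows how, if the individual blocks of the Jacobian in the question are wanted rather than the gradient of the summed loss, torch.autograd.grad can return one row per loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two tiny stand-in networks; the sizes are arbitrary assumptions.
N1 = nn.Linear(4, 2)
N2 = nn.Linear(3 + 2, 1)   # takes input2 concatenated with N1's output

input1, targ1 = torch.randn(8, 4), torch.randn(8, 2)
input2, targ2 = torch.randn(8, 3), torch.randn(8, 1)

pred1 = N1(input1)
pred2 = N2(torch.cat((input2, pred1), dim=1))
loss1 = nn.functional.mse_loss(pred1, targ1)   # depends only on N1
loss2 = nn.functional.mse_loss(pred2, targ2)   # depends on N1 and N2

w1, w2 = list(N1.parameters()), list(N2.parameters())

# One row of the block Jacobian per loss. retain_graph=True keeps the graph
# alive for later calls; allow_unused=True returns None for weights that a
# loss does not depend on (loss1 does not depend on N2).
row1 = torch.autograd.grad(loss1, w1 + w2, retain_graph=True, allow_unused=True)
row2 = torch.autograd.grad(loss2, w1 + w2, retain_graph=True, allow_unused=True)

# Summing the losses and calling backward() stores row1 + row2 in .grad.
(loss1 + loss2).backward()
```

The None entries in row1 correspond to the "partial derivative of L1 wrt weights of N2" block, which is identically zero in this particular setup.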

(As an aside, the PyTorch API doesn’t really have the concept of a
network, per se. The closest thing would be a Module. What we
call a network would often be packaged as a Module, but it doesn’t
have to be. Also, there are plenty of Modules that we would not
think of as being networks and some commonly used Modules
contain no Parameters or other trainable weights. So autograd
can neither know nor care whether certain weights are in this “network”
or that “network” because, at a technical level, a “network” isn’t really
a thing in PyTorch.)
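
A small illustration of the aside, with made-up names net_a, net_b and combined: ReLU is a Module with no trainable weights, and two things we might call separate "networks" can be wrapped in one container that is again just a Module.

```python
import torch.nn as nn

relu = nn.ReLU()                       # a Module with no trainable weights
n_relu_params = len(list(relu.parameters()))

net_a = nn.Linear(3, 4)                # something we might call "network A"
net_b = nn.Linear(4, 1)                # something we might call "network B"
combined = nn.Sequential(net_a, relu, net_b)   # also just a Module

# PyTorch only sees Parameters; the "network" boundary is our own bookkeeping.
n_combined = len(list(combined.parameters()))
print(n_relu_params, n_combined)       # prints: 0 4
```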

Hi Frank, thanks a lot for the detailed explanation. I am actually interested in applying this concept to a generative adversarial network (GAN), where we have 2 loss functions and 2 neural networks, the generator and the discriminator. I am training the generator and the discriminator separately with loss.backward() and have also stored the weights of the two neural networks in a large vector. Can I just add both losses here as well and call loss.backward() to do the job? Or do I need to work on the loss functions separately and generate the partial derivatives with respect to the large vector containing the weights of both neural networks?

Almost, but with an important detail. When you backpropagate from
the discriminator back up into the generator, you need to flip the sign
of the gradient.

(Also, I see no benefit to storing the network weights in a separate
“large vector.” Just leave them in the networks themselves.)
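
If a flat vector really is needed for some downstream computation, one option is to build it on demand and keep the networks as the source of truth. This is only a sketch: gen and disc are stand-in single-layer models, and the loss is a placeholder.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

gen = nn.Linear(2, 3)    # stand-in generator (hypothetical)
disc = nn.Linear(3, 1)   # stand-in discriminator (hypothetical)

params = list(gen.parameters()) + list(disc.parameters())

loss = disc(gen(torch.randn(5, 2))).mean()   # placeholder loss
loss.backward()

# Flat copies built on demand from the weights stored in the networks.
flat_weights = parameters_to_vector(params)
flat_grads = torch.cat([p.grad.reshape(-1) for p in params])
print(flat_weights.shape, flat_grads.shape)
```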

I’ve never built a GAN, so I am fuzzy on the details, but the basic
idea is as follows:

You have a generator network (Gen) that produces “fake” images
that look real and you have a discriminator network (Disc) whose
job it is to tell the fake images apart from real ones.

So Disc is an ordinary classifier (“fake” vs. “real”) and you can
train it with something like BCEWithLogitsLoss.
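
As a rough sketch of Disc as a plain binary classifier: the architecture, the flattened 8x8 image size, and the label convention (1 for real, 0 for fake) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical discriminator: flattened 8x8 "images" in, one logit out.
Disc = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(10, 64)             # placeholder batch
labels = torch.cat((torch.ones(5, 1),    # 1 = "real"
                    torch.zeros(5, 1)))  # 0 = "fake"

logits = Disc(images)                    # raw scores, no sigmoid applied
loss = criterion(logits, labels)         # applies the sigmoid internally
loss.backward()
```

Note that Disc outputs raw logits; BCEWithLogitsLoss combines the sigmoid and the binary cross entropy, so no sigmoid layer goes at the end of Disc.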

But the idea of a GAN is to also train Gen to generate fake images
that fool Disc into classifying them as real. The scheme is to train Disc so that the loss from Disc goes down but train Gen so that the
loss from Disc goes up.

You can do this as follows:

Feed a real image into Disc and calculate the classification loss.
Backpropagate it through Disc and update Disc's weights.

Now feed some random input into Gen. Gen acts sort of like a decoder
and “decodes” the random input into a fake image. The fake image output
by Gen depends on Gen's weights and has requires_grad = True.
Feed this fake image into Disc, calculate the classification loss and
backpropagate it. This also updates Disc's weights, continuing to train
it to distinguish fake from real.
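
The two Disc updates above might look roughly like this. The model sizes, the SGD optimizer, and the learning rate are assumptions. The detach() call is also my assumption: it keeps this sketch a Disc-only update, and it would be dropped when the gradient is meant to continue into Gen as described next.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny models: 4-d noise -> 8-d "image" -> one logit.
Gen = nn.Sequential(nn.Linear(4, 8), nn.Tanh())
Disc = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.BCEWithLogitsLoss()
opt_disc = torch.optim.SGD(Disc.parameters(), lr=0.1)

real = torch.randn(16, 8)            # placeholder for real images
before = Disc[0].weight.detach().clone()

# Step 1: real images, target label 1.
opt_disc.zero_grad()
loss_real = criterion(Disc(real), torch.ones(16, 1))
loss_real.backward()
opt_disc.step()

# Step 2: fake images from Gen, target label 0.
fake = Gen(torch.randn(16, 4))       # depends on Gen's weights
opt_disc.zero_grad()
# detach() cuts the graph here, so only Disc receives gradients.
loss_fake = criterion(Disc(fake.detach()), torch.zeros(16, 1))
loss_fake.backward()
opt_disc.step()
```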

The key point:

When we further backpropagate Disc's classification loss for the
fake image through Gen (which we can do because the input to Disc came
from Gen, depends on Gen's weights, and carries requires_grad = True),
we flip the sign of the gradient. This is because we want to penalize
Gen if Disc did well, and reward Gen if Disc did poorly when classifying
the fake image. That is, we train Gen and Disc at cross-purposes with
one another.

One convenient way to effect this gradient sign-flip is to interpose
a “sign-flip” Function between Gen and Disc. During the forward
pass, the sign-flip Function simply passes its input through unchanged.
(That is, we pass the fake image generated by Gen unchanged into Disc.)
But on the backward pass the sign-flipper takes the gradient
it’s given and flips its sign before sending it on to Gen for further
backpropagation.

I’m not aware that PyTorch offers a pre-packaged sign-flip Function,
but it’s easy enough to write one.
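
One way such a Function might be written, as a sketch rather than a tested GAN recipe: GradFlip is a made-up name, and the tiny Gen and Disc models are placeholders. The reference run without the flip is there only to show that the gradient reaching Gen is the exact negative.

```python
import torch
import torch.nn as nn

class GradFlip(torch.autograd.Function):
    """Identity on the forward pass, negates the gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

torch.manual_seed(0)
Gen = nn.Sequential(nn.Linear(4, 8), nn.Tanh())                    # hypothetical
Disc = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))  # hypothetical
criterion = nn.BCEWithLogitsLoss()

noise = torch.randn(16, 4)
fake_label = torch.zeros(16, 1)

# Reference run without the flip: plain gradient of Disc's loss w.r.t. Gen.
loss_plain = criterion(Disc(Gen(noise)), fake_label)
plain_grad = torch.autograd.grad(loss_plain, Gen[0].weight)[0]

# Run with the flip: Disc sees the same image, Gen receives the negated gradient.
fake = Gen(noise)
loss = criterion(Disc(GradFlip.apply(fake)), fake_label)
loss.backward()

print(torch.allclose(Gen[0].weight.grad, -plain_grad))
```

With this in place, a single backward() followed by optimizer steps on both networks moves Disc to lower its loss and Gen to raise it, because Disc's own gradients are untouched while everything upstream of GradFlip sees the sign reversed.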

Some additional discussion about flipping the gradient’s sign can be
found in this thread: