# Can I use loss.backward for Backpropagation with respect to the weights of another neural network?

I have 2 neural networks (N1 and N2) and 2 loss functions (L1 and L2). The loss functions of the two networks depend on each other, so each loss has to be backpropagated with respect to the weights of both networks (say w1 and w2).

Note: I cannot just backpropagate the two loss functions separately and record the gradients to form the Jacobian matrix.

D = Jacobian matrix =

```
partial derivative of L1 wrt weights of N1, partial derivative of L1 wrt weights of N2
partial derivative of L2 wrt weights of N1, partial derivative of L2 wrt weights of N2
```

Hi Jeet!

Yes, you can use `loss.backward()` to calculate gradients with respect
to the weights of multiple networks.

Autograd calculates gradients with respect to the leaf tensors of the
computation graph that have `requires_grad = True`. It doesn’t matter
whether such tensors are wrapped in `Parameter`s or belong to a `Module`
or do not all belong to the same network.
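As a minimal sketch of that point (the tensor names `w_plain` and `lin` are made up for illustration), autograd fills in `.grad` for any leaf tensor with `requires_grad = True`, whether or not it lives inside a `Module`:

```python
import torch

torch.manual_seed(0)

# A plain leaf tensor: not a Parameter, not part of any Module.
w_plain = torch.randn(3, requires_grad=True)

# Weights that do belong to a Module.
lin = torch.nn.Linear(3, 1)

x = torch.randn(4, 3)
loss = (lin(x * w_plain) ** 2).mean()
loss.backward()

# Both kinds of leaves received a gradient from the same backward call.
print(w_plain.grad.shape)     # torch.Size([3])
print(lin.weight.grad.shape)  # torch.Size([1, 3])
```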

So far so good.

I don’t understand specifically what you mean by this.

This can certainly be done. Consider a made-up example:

```python
import torch

pred1 = N1(input1)
# the output of N1 is fed into N2; torch.cat takes a sequence of tensors
pred2 = N2(torch.cat((input2, pred1), dim=1))
loss1 = L1(pred1, targ1)   # depends only on weights of N1
loss2 = L2(pred2, targ2)   # depends on the weights of both N1 and N2
loss = loss1 + loss2       # also depends on the weights of both N1 and N2
loss.backward()            # calculates the gradient of loss with respect to the weights of both N1 and N2
```

This will work fine. Again, autograd only knows and cares that the
weights of `N1` and `N2` are leaves of the computation graph it is
processing (with `requires_grad = True`). It doesn’t matter that
these leaves happen to belong to different networks.
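To make the made-up example concrete, here is a small runnable version. The layer sizes, the `nn.Linear` stand-ins for `N1` and `N2`, and the use of `MSELoss` for `L1` and `L2` are all assumptions chosen only so the snippet runs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for N1, N2, L1 and L2.
N1 = nn.Linear(4, 3)
N2 = nn.Linear(5 + 3, 2)   # takes input2 concatenated with N1's output
L1 = nn.MSELoss()
L2 = nn.MSELoss()

input1, targ1 = torch.randn(8, 4), torch.randn(8, 3)
input2, targ2 = torch.randn(8, 5), torch.randn(8, 2)

pred1 = N1(input1)
pred2 = N2(torch.cat((input2, pred1), dim=1))
loss = L1(pred1, targ1) + L2(pred2, targ2)
loss.backward()

# A single backward call populated .grad on the weights of both networks.
n1_has_grad = all(p.grad is not None for p in N1.parameters())
n2_has_grad = all(p.grad is not None for p in N2.parameters())
print(n1_has_grad, n2_has_grad)  # True True
```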

(As an aside, the pytorch api doesn’t really have the concept of a
network, per se. The closest thing would be a `Module`. What we
call a network would often be packaged as a `Module`, but it doesn’t
have to be. Also, there are plenty of `Module`s that we would not
think of as being networks and some commonly used `Module`s
contain no `Parameter`s or other trainable weights. So autograd
can neither know nor care whether certain weights are in this “network”
or that “network” because, at a technical level, a “network” isn’t really
a thing in pytorch.)
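If what you need is the four blocks of the Jacobian from your question kept separate, rather than the gradient of the summed loss, one possible approach is `torch.autograd.grad()`, which returns gradients instead of accumulating them into `.grad`. The sketch below is only an illustration: the small networks, the `flat_grad` helper, and the choice to flatten each block into a vector are my assumptions, not something from your setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small networks: N1 has 15 weights, N2 has 8.
N1 = nn.Linear(4, 3)
N2 = nn.Linear(3, 2)
x, targ1, targ2 = torch.randn(8, 4), torch.randn(8, 3), torch.randn(8, 2)

pred1 = N1(x)
pred2 = N2(pred1)
loss1 = nn.functional.mse_loss(pred1, targ1)  # depends only on N1
loss2 = nn.functional.mse_loss(pred2, targ2)  # depends on N1 and N2

w1, w2 = list(N1.parameters()), list(N2.parameters())

def flat_grad(loss, params):
    # allow_unused=True yields None for weights the loss does not depend on;
    # those entries are replaced by zeros of the matching shape.
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (torch.zeros_like(p) if g is None else g).reshape(-1)
        for g, p in zip(grads, params)
    ])

# Row i holds [dL_i/dw1, dL_i/dw2], flattened.
D = torch.stack([
    torch.cat([flat_grad(loss1, w1), flat_grad(loss1, w2)]),
    torch.cat([flat_grad(loss2, w1), flat_grad(loss2, w2)]),
])
print(D.shape)  # torch.Size([2, 23])
```

In this sketch the block for `loss1` with respect to `N2`'s weights is all zeros, because `loss1` never touches `N2`.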

Best.

K. Frank


Hi Frank, thanks a lot for the detailed explanation. I am actually interested in applying this concept to a generative adversarial network (GAN), where we have 2 loss functions and 2 neural networks, the generator and the discriminator. I am training the generator and the discriminator separately with `loss.backward()` and have also stored the weights of the two neural networks in a large vector. Can I just add both losses here as well and call `loss.backward()` to do the job? Or do I need to work on the loss functions separately and generate the partial derivatives with respect to the large vector containing the weights of both neural networks?

Hi Jeet!

Almost, but with an important detail. When you backpropagate from
the discriminator back up into the generator, you need to flip the sign
of the gradient.

(Also, I see no benefit to storing the network weights in a separate
“large vector.” Just leave them in the networks themselves.)

I’ve never built a GAN, so I am fuzzy on the details, but the basic
idea is as follows:

You have a generator network (`Gen`) that produces “fake” images
that look real and you have a discriminator network (`Disc`) whose
job it is to tell the fake images apart from real ones.

So `Disc` is an ordinary classifier – “fake” vs. “real” – and you can
train it with something like `BCEWithLogitsLoss`.

But the idea of a GAN is to also train `Gen` to generate fake images
that fool `Disc` into classifying them as real. The scheme is to train
`Disc` so that the loss from `Disc` goes down but train `Gen` so that the
loss from `Disc` goes up.

You can do this as follows:

Feed a real image into `Disc` and calculate the classification loss.
Backpropagate it through `Disc`, updating `Disc`'s weights.

Now feed some random input into `Gen`. `Gen` acts sort of like a decoder
and “decodes” the random input into a fake image. The fake image output
by `Gen` depends on `Gen`'s weights and has `requires_grad = True`.
Feed this fake image into `Disc`, calculate the classification loss and
backpropagate it. This also updates `Disc`'s weights, continuing to train
it to distinguish fake from real.
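The two `Disc` steps above can be sketched as follows. Everything concrete here is an assumption made so the snippet runs: the tiny `Gen` and `Disc` stand-ins, the random placeholder for real images, the label convention (1 for real, 0 for fake), and plain SGD as the optimizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny hypothetical stand-ins; a real GAN would use much larger networks.
Gen = nn.Sequential(nn.Linear(16, 64), nn.Tanh())  # noise -> flattened "image"
Disc = nn.Linear(64, 1)                            # image -> one logit
loss_fn = nn.BCEWithLogitsLoss()
opt_disc = torch.optim.SGD(Disc.parameters(), lr=0.1)

real_images = torch.rand(32, 64)   # placeholder for a batch of real data
real_label = torch.ones(32, 1)
fake_label = torch.zeros(32, 1)

# Step 1: real batch through Disc, then update Disc.
opt_disc.zero_grad()
loss_real = loss_fn(Disc(real_images), real_label)
loss_real.backward()
opt_disc.step()

# Step 2: random input through Gen, the resulting fake batch through Disc.
fake_images = Gen(torch.randn(32, 16))
print(fake_images.requires_grad)   # True

opt_disc.zero_grad()
loss_fake = loss_fn(Disc(fake_images), fake_label)
loss_fake.backward()   # gradients flow back into Gen's weights as well
opt_disc.step()        # but only Disc's weights are updated here
```

Note that after the second `backward()` call `Gen`'s weights also hold gradients, but with the sign that would help `Disc`, which is why they are not used to update `Gen` in this sketch.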

The key point:

When we further backpropagate `Disc`'s classification loss for the
fake image through `Gen` – which we can do because the input to
`Disc` came from `Gen`, depends on `Gen`'s weights, and carries
`requires_grad = True` – we flip the sign of the gradient. This
is because we want to penalize `Gen` if `Disc` did well, and reward
`Gen` if `Disc` did poorly when classifying the fake image. That is,
we train `Gen` and `Disc` at cross-purposes with one another.

One convenient way to effect this gradient sign-flip is to interpose
a “sign-flip” `Function` between `Gen` and `Disc`. During the forward
pass, the sign-flip `Function` simply passes its input through unchanged.
(That is, we pass the fake image generated by `Gen` unchanged into
`Disc`.) But on the backward pass the sign-flipper takes the gradient
it’s given and flips its sign before sending it on to `Gen` for further
backpropagation.

I’m not aware that pytorch offers a pre-packaged sign-flip `Function`,
but it’s easy enough to write one.
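A minimal sketch of such a sign-flipper, built on `torch.autograd.Function`, might look like this. The class name `SignFlip` and the small `Gen` and `Disc` stand-ins are hypothetical, and this has not been tested in a full GAN training loop:

```python
import torch
import torch.nn as nn

class SignFlip(torch.autograd.Function):
    """Identity on the forward pass, negated gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)    # pass the input through unchanged

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output    # flip the sign before passing the gradient on

# Quick check on a plain tensor.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = SignFlip.apply(x)
(y * 2).sum().backward()
print(y.detach())  # tensor([1., 2., 3.])
print(x.grad)      # tensor([-2., -2., -2.])

# Interposed between hypothetical Gen and Disc stand-ins.
torch.manual_seed(0)
Gen = nn.Linear(16, 64)
Disc = nn.Linear(64, 1)
loss_fn = nn.BCEWithLogitsLoss()
noise = torch.randn(32, 16)
fake_label = torch.zeros(32, 1)

fake = SignFlip.apply(Gen(noise))
loss_fake = loss_fn(Disc(fake), fake_label)
loss_fake.backward()
# Disc's gradients are the usual ones: a descent step lowers loss_fake.
# Gen's gradients are sign-flipped: a descent step on them raises loss_fake.
```

With this in place, a single `backward()` call leaves gradients on both networks that pull in opposite directions, so an ordinary optimizer step on each trains them at cross-purposes.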