I was training a CycleGAN. When I train the discriminators, I can do it in one of two ways:
case 1.

    loss_D_A.backward()
    loss_D_B.backward()

case 2.

    loss_D = loss_D_A + loss_D_B
    loss_D.backward()
Setting aside the retain_graph=True issue, and assuming this code doesn't raise any error, do both cases have the same effect?
My question is: for case 2, does each loss get applied twice (that is, loss A + loss B for discriminator A instead of just loss A, and likewise for discriminator B), so that the gradients are updated by twice as much as they should be? Or does each loss automatically affect only its related tensors, even when the losses are added together as in case 2?
Both cases will work properly and have the same effect. loss_D_A.backward() accumulates the gradient
of loss_D_A into whatever parameters loss_D_A depends on. loss_D_B.backward() then accumulates the gradient of loss_D_B
into its relevant parameters.
If loss_D_A and loss_D_B both depend on some of the same
parameters, those parameters will have the sum of the two gradients
accumulated into their .grad properties.
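For instance, here is a minimal toy sketch (hypothetical scalar tensors, not the CycleGAN code) showing how the gradients from two backward() calls accumulate into a shared parameter's .grad:

    import torch

    # one parameter w that both (toy) losses depend on
    w = torch.tensor(2.0, requires_grad=True)

    loss_A = 3.0 * w          # d(loss_A)/dw = 3
    loss_B = 5.0 * w          # d(loss_B)/dw = 5

    loss_A.backward()
    print(w.grad)             # tensor(3.)
    loss_B.backward()
    print(w.grad)             # tensor(8.)  -- gradients accumulate: 3 + 5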
But gradients are linear – that is, the gradient of the sum is the sum of
the gradients. So accumulating the gradient of loss_D_A + loss_D_B
into the relevant parameters does, indeed, accumulate the sum of the
two gradients into the parameters. (But by doing so in one .backward()
call, it will be somewhat cheaper.)
Just to be clear, in case 1, if loss_D_A and loss_D_B share
part of the same computation graph, you will have to use loss_D_A.backward(retain_graph = True) in order for loss_D_B.backward() to work.
(retain_graph = True won’t be needed for case 2 because there
is only one .backward() call.)
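As an illustration of the retain_graph point, here is a hypothetical sketch in which the two losses share an intermediate tensor, so the first backward() call has to keep the shared graph alive:

    import torch

    # toy example: both losses depend on the same intermediate tensor
    w = torch.tensor(2.0, requires_grad=True)
    features = w * w                  # shared part of the computation graph

    loss_D_A = 3.0 * features
    loss_D_B = 5.0 * features

    # case 1: the first backward must not free the shared graph
    loss_D_A.backward(retain_graph=True)
    loss_D_B.backward()               # raises a RuntimeError without retain_graph above

    print(w.grad)                     # tensor(32.)  -> (3 + 5) * d(w*w)/dw = 8 * 4

If the two losses came from completely separate forward passes (sharing only the leaf parameters), the retain_graph = True would not be needed.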
Hello Frank!
Thank you for your fast and kind reply!
So do you mean that in case 2 as well, the gradient of loss_D_A is accumulated only into its relevant parameters and the gradient of loss_D_B only into its relevant parameters, even though we call .backward() once on the sum of the two losses (loss_D_A + loss_D_B) rather than calling .backward() separately on each loss?
Yes, this is correct (with the proviso that you might need to use retain_graph = True in case 1).
You can easily test that cases 1 and 2 produce the same gradients.
Run case 1 and clone() the .grads of all of the parameters that affect
either loss_D_A or loss_D_B for future comparison. Then repeat the
loss computations from scratch (using the same data, of course) and
run case 2. You can now compare the .grads from case 2 with those
you saved from case 1 and you will see that they are equal (up to numerical
round-off error).
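A hypothetical version of that test, with a tiny stand-in discriminator in place of the real CycleGAN models, might look like this:

    import torch

    torch.manual_seed(0)
    D = torch.nn.Linear(4, 1)         # toy stand-in for a discriminator
    x_A = torch.randn(8, 4)
    x_B = torch.randn(8, 4)

    def compute_losses():
        # recompute both (toy) losses from scratch with the same data
        loss_D_A = D(x_A).mean()
        loss_D_B = D(x_B).mean()
        return loss_D_A, loss_D_B

    # case 1: two separate backward calls
    D.zero_grad()
    loss_D_A, loss_D_B = compute_losses()
    loss_D_A.backward()
    loss_D_B.backward()
    grads_case1 = [p.grad.clone() for p in D.parameters()]

    # case 2: one backward call on the summed loss
    D.zero_grad()
    loss_D_A, loss_D_B = compute_losses()
    (loss_D_A + loss_D_B).backward()
    grads_case2 = [p.grad.clone() for p in D.parameters()]

    # the two sets of gradients agree up to round-off error
    for g1, g2 in zip(grads_case1, grads_case2):
        print(torch.allclose(g1, g2))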