When training a GAN, why do we not need to zero_grad the discriminator?

In the DCGAN example that can be found here, while training the generator network after training the discriminator network, we do not perform netD.zero_grad() again. However, doesn't this accumulate the gradients with respect to the real data in netD (line 208), or the gradients with respect to the previous feeding of fake data (line 217)? Does the former not happen because the input tensor is different (real vs. fake), and does the latter not happen because we had performed a detach (line 215)?


In the update step of the discriminator (line 208), the generator never sees the real data, so the backward pass does not calculate any gradients for it.
In line 217 the input to the discriminator is detached, as you already observed, so the backward call on errD_fake also does not calculate gradients for the generator.

Before updating the generator (line 225 etc.), its gradients are zeroed, so it looks alright.
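
For reference, here is a runnable toy version mirroring the example's flow (tiny Linear stand-ins instead of the example's conv nets; the line references match the thread):

import torch
import torch.nn as nn

netG = nn.Linear(2, 4)                                # stand-in generator
netD = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())   # stand-in discriminator
criterion = nn.BCELoss()
optimizerD = torch.optim.Adam(netD.parameters(), lr=2e-4)
optimizerG = torch.optim.Adam(netG.parameters(), lr=2e-4)

real = torch.randn(8, 4)
noise = torch.randn(8, 2)
real_label = torch.ones(8, 1)
fake_label = torch.zeros(8, 1)

# (1) update D: zero D's grads, then accumulate grads from the real and fake batches
netD.zero_grad()
errD_real = criterion(netD(real), real_label)           # real batch (~line 208)
errD_real.backward()
fake = netG(noise)
errD_fake = criterion(netD(fake.detach()), fake_label)  # detach (~line 215) keeps grads out of netG
errD_fake.backward()                                    # fake batch (~line 217)
optimizerD.step()

# (2) update G: zero G's grads; fake is NOT detached here, so grads flow into netG
netG.zero_grad()
errG = criterion(netD(fake), real_label)                # ~line 225
errG.backward()    # also writes grads into netD, but netD.zero_grad() clears them next iteration
optimizerG.step()  # only steps netG's parameters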


Thank you very much, this answer helped me understand the workings of autograd better.

Just one more question regarding your answer: when optimizing the discriminator, the first call to the backward function should save gradients, which would then be accumulated with the ones calculated from the second call to backward when optimizing the generator … is this true?

While optimizing the discriminator you are using a real and a fake input.
Both inputs create a loss, and both losses accumulate gradients in the discriminator.
The fake input is detached from the generator (while updating the discriminator), so the generator won't see any gradients.


Yes, but what about the second call to the backward function, for optimizing the generator? My understanding is that there would still be gradients from the first call (when optimizing the discriminator), and these are now added to the gradients of the second call, i.e. the generator would be optimized by the gradients of the second backward call (the correct gradients) plus the gradients from the first call (when the generator was detached).

If that’s true, can we apply “zero_grad” to the discriminator before using it to optimize the generator?


The generator update does not have a second backward call. Could you point me to the line of code?

While optimizing the discriminator, you won’t compute any gradients in the generator.
Since the fake input was detached from the generator, no gradients will be created in the generator itself.
You can check it by calling print(netG.some_layer.weight.grad) after the discriminator was updated (in the first iteration, otherwise you might see the gradients from the previous run).
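
A minimal self-contained version of that check (stand-in Linear modules, not the example's nets):

import torch
import torch.nn as nn

netG = nn.Linear(2, 4)   # stand-in generator
netD = nn.Linear(4, 1)   # stand-in discriminator

fake = netG(torch.randn(8, 2))
netD(fake.detach()).sum().backward()  # discriminator-side backward on the detached fake
print(netG.weight.grad)               # None: nothing reached the generator
print(netD.weight.grad is not None)   # True: the discriminator did receive gradients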


I was referring to “errG.backward()” … Now it is totally clear, thanks so much!

There is no zero_grad between errD_real.backward() and errD_fake.backward(), which means the gradients of these two calls are accumulated (summed). If you look closer, this combined computation is the same as backpropagating errD (= errD_real + errD_fake), so the two backward calls (errD_real.backward() and errD_fake.backward()) are equivalent to a single errD.backward().
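
A quick self-contained toy check of this equivalence:

import torch

w = torch.randn(3, requires_grad=True)  # stands in for a discriminator weight
x_real, x_fake = torch.randn(3), torch.randn(3)

# two separate backward calls accumulate into w.grad ...
(w * x_real).sum().backward()
(w * x_fake).sum().backward()
grad_two_calls = w.grad.clone()

# ... matching a single backward on the summed loss
w.grad = None
((w * x_real).sum() + (w * x_fake).sum()).backward()
print(torch.allclose(grad_two_calls, w.grad))  # True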

However, when optimizing the generator, the gradients of the discriminator's parameters are not zeroed! They still hold the gradients from the discriminator update.

If you plot the computation graph for the discriminator, you will notice that the gradient of theta(D) does not affect the generator. It's true that theta(D).grad is being inaccurately accumulated, but what is passed back to the generator depends only on theta(D) itself, and theta(D) is not affected by theta(D).grad since we are not updating it.
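
In code terms, you can see this in how the example sets up its optimizers: each one only ever steps the parameters it was constructed with, so optimizerG.step() never reads netD's stale .grad buffers (a paraphrased sketch with stand-in modules):

import torch.nn as nn
import torch.optim as optim

netD, netG = nn.Linear(4, 1), nn.Linear(2, 4)  # stand-ins for the conv nets

# each optimizer only updates the parameters it was constructed with,
# so stale gradients sitting in netD are invisible to optimizerG.step()
optimizerD = optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))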


Hi ptrblck,

I hope you are well. I need to inspect the gradients of the discriminator and the generator in the DCGAN and see the trend of the gradients that update them. Could you please tell me how to visualize them in a graph?

You can inspect the gradients either by directly printing them after the backward() call:

# iterate over all parameters and inspect their gradients after backward()
for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad)
        print(name, param.grad.abs().max())
        ...

or by using hooks, e.g. via model.layer.weight.register_hook().
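
For the hook approach, a small self-contained sketch (the stand-in module and the printed statistic are illustrative):

import torch
import torch.nn as nn

netD = nn.Sequential(nn.Linear(4, 1))  # stand-in discriminator

def log_grad(grad):
    # called with the parameter's gradient on every backward()
    print("max |grad|:", grad.abs().max().item())

handle = netD[0].weight.register_hook(log_grad)
netD(torch.randn(8, 4)).sum().backward()  # triggers the hook
handle.remove()                           # remove the hook when no longer needed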


@ptrblck

Then, if we zero_grad the discriminator before updating the generator, it would not have any effect? But would it be more memory efficient?

What if I have a third network whose loss is based on the discriminator's intermediate layers:

# ------------------
# Update controller
# ------------------
G.eval()
D.eval()
C.train()
optims['C'].zero_grad()

c = C(sample_dys)
C_loss = D.lap1_loss(G(c), sample_dys)

C_loss.backward()
optims['C'].step()

where lap1 loss is:

    def lap1_loss(self, x: tr.Tensor, y: tr.Tensor):
        """
        Implements the Laplace loss for the discriminator layers.
        Input shape: (N, C, H, W)
        :return: the scalar loss value
        """
        assert x.shape == y.shape, "The shape of inputs must be equal."
        assert len(x.shape) == 4, "Input must be 4 dimensional."

        _, x_acts = self.forward(x)
        _, y_acts = self.forward(y)

        losses = [trf.l1_loss(x_l, y_l) * 2 ** (2 * l) for l, (x_l, y_l) in enumerate(zip(x_acts, y_acts))]
        # stack the per-layer losses and sum; wrapping the list in tr.tensor(...)
        # would detach them from the autograd graph and break backward()
        loss = tr.stack(losses).sum()

        return loss

Should we also zero_grad() the discriminator network here, or is the situation the same as above, i.e. it is not a problem that gradients are accumulated in the discriminator?

I wouldn't say "it's not a problem", as it depends on your use case.
If you want to update a specific model in the current code block, e.g. C in your use case, then you could either let PyTorch calculate the gradients for G and D as well and zero them out before updating G and D in their own "update blocks", or alternatively you could temporarily set the requires_grad attribute of all their parameters to False.
Both approaches would work, and the critical step is to make sure each optimizer only uses "valid" gradients to update its corresponding model.
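
For instance, the second approach could look roughly like this in your snippet (a hypothetical sketch using your names; note that eval() alone does not stop gradient computation):

# temporarily freeze G and D so that C_loss.backward() only populates C's grads;
# eval() only switches layer behavior (dropout/batchnorm), it does not stop autograd
for p in list(G.parameters()) + list(D.parameters()):
    p.requires_grad_(False)

optims['C'].zero_grad()
C_loss = D.lap1_loss(G(C(sample_dys)), sample_dys)
C_loss.backward()         # only C's parameters receive gradients
optims['C'].step()

for p in list(G.parameters()) + list(D.parameters()):
    p.requires_grad_(True)  # unfreeze again before G's and D's own update blocks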