How to turn off gradient during GAN training

I am going through the DCGAN tutorial.

One question I have is how to turn off gradient history tracking for the discriminator when training the generator. In the tutorial it is not turned off, as shown below.

...
# this part trains generator
netG.zero_grad()
label.fill_(real_label)  # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = netD(fake).view(-1)
# Calculate G's loss based on this output
errG = criterion(output, label)
# Calculate gradients for G
errG.backward()
...

I see that gradient tracking is turned off for the generator when training the discriminator, by calling detach() on the fake images, but not the other way around. Thanks in advance for your help :smiley:


Since the discriminator’s optimizer won’t be called for the generator updates, nothing bad will happen.
You could set the requires_grad attributes of the discriminator’s parameters to False and reset them to True after the generator updates, but this is not really necessary, as you are using different optimizers.
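If you do want to toggle it, a minimal sketch could look like this (reusing the tutorial’s netD, netG, fake, label, criterion, optimizerG and real_label names):

# optionally freeze D's parameters for the generator update
for p in netD.parameters():
    p.requires_grad_(False)

netG.zero_grad()
label.fill_(real_label)        # fake labels are real for the generator cost
output = netD(fake).view(-1)   # forward pass through the (frozen) discriminator
errG = criterion(output, label)
errG.backward()                # gradients still flow into netG through fake
optimizerG.step()

# unfreeze D's parameters before its next update
for p in netD.parameters():
    p.requires_grad_(True)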


Thanks for your reply. My understanding of how an optimizer works is that it takes .grad and the current weights and updates them. So even though the discriminator’s optimizer doesn’t get called, .grad still gets calculated for the discriminator’s weights when you call .backward(), and on the forward pass the activations are cached because requires_grad == True. Let me know if that is the case or whether I am misunderstanding something.

Your understanding is basically correct. Since netD.zero_grad() is called before updating the discriminator, these gradients will be cleared.

In other words, even though the discriminator’s optimizer won’t be called during the generator updates, you should still save some time and memory by turning requires_grad off, because weight.grad gets calculated whenever requires_grad is on, regardless of whether an optimizer step is ever taken on that weight.

I’m not sure about the computation and memory usage, since you need the gradients to backpropagate to the generator.
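A quick check (a rough sketch reusing the tutorial’s netG, netD, nz, batch_size, device, real_label and criterion names) shows that the backward pass still reaches the generator through fake, while no new gradient buffers are created for the frozen discriminator:

for p in netD.parameters():
    p.requires_grad_(False)   # freeze the discriminator

noise = torch.randn(batch_size, nz, 1, 1, device=device)
fake = netG(noise)
output = netD(fake).view(-1)  # output still has a grad_fn, since fake requires grad
errG = criterion(output, torch.full_like(output, real_label))
errG.backward()

print(next(netG.parameters()).grad is not None)  # True: the generator receives gradients
print(next(netD.parameters()).grad)              # None on a fresh run: no gradients are created for D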


I will benchmark it and report back.

There is a slight improvement in terms of time. I singled out the part of the code that trains the generator, as shown below.

label = torch.full((batch_size,), real_label, device=device)

def train():
    netG.zero_grad()
    noise = torch.randn(batch_size, nz, 1, 1, device=device)
    fake = netG(noise)
    output = netD(fake).view(-1)
    errG = criterion(output, label)
    errG.backward()
    optimizerG.step()
    
%timeit train()
# -> 44.3 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# freeze the discriminator's parameters before timing the generator step again
for p in netD.parameters():
    p.requires_grad_(False)


%timeit train()
# -> 41 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Thanks for the debugging!
This post might explain the benefits you are seeing.

Thanks for the reference and your help. Much appreciated :smiley:

Hi Ptrblck,

In these lines, which error is the update based on?

errD_penalty = errD_fake - errD_real + grad_penalty
errD = errD_fake - errD_real
optimizerD.step()

In the posted code snippet no backward() operation is called, so no gradients are calculated, and optimizerD.step() will use whatever gradients are already assigned to the passed parameters.

I guess you’ve forgotten to post the code containing the backward() operation.
In that case the loss.backward() call is what calculates the gradients.
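For completeness, the usual pattern is (a generic sketch using the names from your snippet; use whichever of the two losses you actually want to optimize):

optimizerD.zero_grad()
errD_penalty = errD_fake - errD_real + grad_penalty
errD_penalty.backward()   # this call computes the gradients of the chosen loss
optimizerD.step()         # step() then applies exactly these gradients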

That’s the sample code. Which error does self.d_optimizer.step() optimize, d_loss = d_loss_fake - d_loss_real? Does it need a backward() call?

# Train discriminator
# WGAN - Training discriminator more iterations than generator
# Train with real images
d_loss_real = self.D(images)
d_loss_real = d_loss_real.mean(0).view(1)
d_loss_real.backward(one)

# Train with fake images
if self.cuda:
    z = Variable(torch.randn(self.batch_size, 100, 1, 1)).cuda()
else:
    z = Variable(torch.randn(self.batch_size, 100, 1, 1))
fake_images = self.G(z)
d_loss_fake = self.D(fake_images)
d_loss_fake = d_loss_fake.mean(0).view(1)
d_loss_fake.backward(mone)
d_loss = d_loss_fake - d_loss_real
Wasserstein_D = d_loss_real - d_loss_fake
self.d_optimizer.step()

Assuming d_optimizer contains the parameters of D, d_optimizer.step() uses the gradients created by d_loss_real.backward(one) and d_loss_fake.backward(mone) to update the parameters.
Both losses will create gradients in D, and these gradients will be accumulated in all parameters of D that require gradients.

Wasserstein_D = d_loss_real - d_loss_fake is not used to create the gradients in this example, but based on the formula it could yield the same gradients, as seen here:

import torch
from torchvision import models

# first approach: two separate backward() calls (the fake pass scaled by -1)
torch.manual_seed(2809)
model = models.resnet18()

out_real = model(torch.randn(1, 3, 224, 224))
out_real = out_real.mean()
out_real.backward()

out_fake = model(torch.randn(1, 3, 224, 224))
out_fake = out_fake.mean()
out_fake.backward(torch.ones_like(out_fake) * -1)

print(model.conv1.weight.grad.abs().sum())

# second approach: a single backward() call on the combined loss
torch.manual_seed(2809)
model = models.resnet18()

out_real = model(torch.randn(1, 3, 224, 224))
out_real = out_real.mean()

out_fake = model(torch.randn(1, 3, 224, 224))
out_fake = out_fake.mean()

loss = out_real - out_fake
loss.backward()

print(model.conv1.weight.grad.abs().sum())

You could check the .grad attribute of some layer and compare the value from the current approach to the gradient created by Wasserstein_D.backward().
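Something along these lines could work (a sketch built on the snippet above; the two values are expected to match up to floating point precision):

torch.manual_seed(2809)
model = models.resnet18()

out_real = model(torch.randn(1, 3, 224, 224)).mean()
out_real.backward()
out_fake = model(torch.randn(1, 3, 224, 224)).mean()
out_fake.backward(torch.ones_like(out_fake) * -1)
grad_separate = model.conv1.weight.grad.clone()   # gradient from the two separate backward() calls

torch.manual_seed(2809)
model = models.resnet18()

out_real = model(torch.randn(1, 3, 224, 224)).mean()
out_fake = model(torch.randn(1, 3, 224, 224)).mean()
(out_real - out_fake).backward()                  # Wasserstein_D-style combined loss
grad_combined = model.conv1.weight.grad.clone()

print(torch.allclose(grad_separate, grad_combined))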

Great, I appreciate your help