One question I have is how you turn off gradient history tracking for the discriminator when training the generator. In the tutorial it is not turned off, as shown below.
...
# this part trains generator
netG.zero_grad()
label.fill_(real_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = netD(fake).view(-1)
# Calculate G's loss based on this output
errG = criterion(output, label)
# Calculate gradients for G
errG.backward()
...
I see that grad tracking is turned off for the generator when training the discriminator by calling detach() on the fake images, but not the other way around. Thanks in advance for your help.
Since the discriminator’s optimizer won’t be called for the generator updates, nothing bad will happen.
You could set the requires_grad attributes of the discriminator’s parameters to False and reset them to True after the generator updates, but this is not really necessary, as you are using different optimizers.
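If you did want that toggle, it could look like the following sketch. The tiny `netD` / `netG` modules are hypothetical stand-ins for the tutorial's networks, just to keep the example self-contained:

```python
import torch
import torch.nn as nn

# Hypothetical tiny stand-ins for the tutorial's netD / netG
netD = nn.Linear(4, 1)
netG = nn.Linear(2, 4)
criterion = nn.BCEWithLogitsLoss()

def set_requires_grad(net, flag):
    for p in net.parameters():
        p.requires_grad_(flag)

# --- generator update with D frozen ---
set_requires_grad(netD, False)       # D's parameters will store no gradients
fake = netG(torch.randn(8, 2))
output = netD(fake).view(-1)
label = torch.ones(8)                # fake labels are real for generator cost
errG = criterion(output, label)
errG.backward()                      # gradients still flow *through* D into G
set_requires_grad(netD, True)        # unfreeze D before its own update

print(netD.weight.grad)              # None: D accumulated nothing
print(netG.weight.grad is not None)  # True: G received its gradients
```

Note that freezing D's parameters does not block the backward pass through D's operations, so G still gets its gradients; only D's `.grad` buffers stay empty.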
Thanks for your reply. But my understanding of how an optimizer works is that it takes .grad and the current weights and then updates them. So even though the discriminator's optimizer does not get called, .grad gets calculated for the discriminator's weights when you call .backward(), and on the forward pass activations are cached because requires_grad == True. Let me know if that is the case or if I misunderstand something.
In other words, even though the discriminator's optimizer won't be called during the generator updates, you would still save some time and memory by turning requires_grad off. Regardless of whether the optimizer is ever stepped on a weight, weight.grad gets calculated as long as requires_grad is on.
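That claim is easy to check. In this sketch (again with tiny hypothetical stand-in networks), D's `.grad` buffers do get filled during a generator-style update even though D's optimizer is never stepped:

```python
import torch
import torch.nn as nn

# Tiny hypothetical D and G, just to inspect D's .grad after a generator-style update
netD = nn.Linear(4, 1)
netG = nn.Linear(2, 4)
criterion = nn.BCEWithLogitsLoss()

fake = netG(torch.randn(8, 2))
errG = criterion(netD(fake).view(-1), torch.ones(8))
errG.backward()

# D's optimizer is never stepped, yet its .grad buffer was filled
print(netD.weight.grad is not None)  # True
```

In practice the tutorial still trains correctly, because netD.zero_grad() is called before the discriminator's own backward pass, so these stale gradients never reach optimizerD.step(). The freeze only saves the cost of computing and storing them.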
In the posted code snippet no backward() operation is called, thus no gradients are calculated, and optimizerD.step() will use whatever gradients are already assigned to the passed parameters.
I guess you’ve forgotten to post the code containing the backward() operation.
In that case the loss.backward() is used to calculate the gradients.
Assuming d_optimizer contains the parameters of D, d_optimizer.step() uses the gradients created by d_loss_real.backward(one) and d_loss_fake.backward(mone) to update the parameters.
Both losses will create gradients in D and these gradients will be accumulated in all parameters of D, which require gradients.
Wasserstein_D = d_loss_real - d_loss_fake is not used to create the gradients in this example, but based on the formula it should yield the same gradients.
You could check the .grad attributes of some layer and compare the value to the current approach and to the gradient created by Wasserstein_D.backward().
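A minimal version of that comparison could look like this. The tiny linear `netD` is a hypothetical stand-in for the critic; the two approaches should leave identical gradients in its parameters:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
netD = nn.Linear(4, 1)              # hypothetical tiny critic D
real = torch.randn(8, 4)
fake = torch.randn(8, 4)

# Approach 1: two backward calls with gradient arguments one / mone
one = torch.tensor(1.0)
mone = -one
d_loss_real = netD(real).mean()
d_loss_fake = netD(fake).mean()
d_loss_real.backward(one)
d_loss_fake.backward(mone)          # accumulates the negated gradient
grad_split = netD.weight.grad.clone()

# Approach 2: a single backward on Wasserstein_D
netD.zero_grad()
Wasserstein_D = netD(real).mean() - netD(fake).mean()
Wasserstein_D.backward()
grad_joint = netD.weight.grad.clone()

print(torch.allclose(grad_split, grad_joint))  # True
```

Since gradients accumulate in `.grad`, calling backward(one) and then backward(mone) adds `grad(d_loss_real) - grad(d_loss_fake)`, which by linearity equals the gradient of `d_loss_real - d_loss_fake` from the single backward call.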