I have 2 neural networks, namely the generator and the discriminator. I feed the output of the generator to the discriminator and the output of the discriminator to the generator. In this context I have some questions. My code is as follows -

gen_fake = generator(z)
disc_fake = disc(gen_fake)
lossDreal = criterion(xxxx)  # Here a forward graph has been created with the parameters of the discriminator?
lossDfake = criterion(disc_fake, torch.ones())  # Here again a forward graph has been created, but with which parameters? The parameters of both the discriminator and the generator, since disc_fake, the output of the discriminator, is fed in?
lossD = lossDreal + lossDfake  # Which parameters are involved here?

Now when I do :-

lossD.backward()  # according to which parameters will the gradient be calculated?

and if I do lossD.backward(retain_graph=True), then later, if I call lossG (the loss of the generator):
lossG.backward()  # Will this also calculate the gradient with respect to the parameters of both the discriminator and the generator?

Note that autograd records every operation (unless explicitly turned off by some means), so all the tensors involved in calculating a certain tensor will be in its computation graph.

When backward is called on a certain tensor, the grad attribute of only the leaf tensors is populated by default. If you want to populate the grad attribute of intermediate tensors, call retain_grad() on those intermediate tensors.
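A minimal sketch of this leaf-vs-intermediate distinction (the variable names are illustrative, not from your GAN):

```python
import torch

# Only leaf tensors get .grad populated by default;
# call retain_grad() on an intermediate to keep its gradient too.
x = torch.tensor([2.0], requires_grad=True)  # leaf
y = x * 3                                    # intermediate (non-leaf)
y.retain_grad()                              # opt in to keeping y.grad
z = (y ** 2).sum()
z.backward()

print(x.is_leaf, y.is_leaf)  # True False
print(x.grad)                # dz/dx = 2*y*3 = 36
print(y.grad)                # dz/dy = 2*y = 12
```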

This shall help you find better answers to your questions. If I try to answer them -

Depends on what you pass in place of xxxx.

Parameters of both the discriminator and the generator should be the leaf tensors in the graph of lossDfake.

Parameters of both the discriminator and the generator should be the leaf tensors in the graph of lossD as well.

This is answered by what I explained above about leaf tensors.

How do you calculate lossG?
Assuming it uses gen_fake only, lossG.backward() will populate the grad attributes of the generator's parameters only.
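That claim can be checked with two tiny stand-in networks (the nn.Linear modules here are illustrative placeholders for your real generator and discriminator): a loss built from the generator's output alone never passes through the discriminator, so only the generator's parameters receive gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gen = nn.Linear(2, 2)   # stand-in generator
disc = nn.Linear(2, 1)  # stand-in discriminator

z = torch.randn(1, 2)
gen_fake = gen(z)
lossG = gen_fake.sum()  # uses gen_fake only, never touches disc
lossG.backward()

print(all(p.grad is not None for p in gen.parameters()))  # True
print(all(p.grad is None for p in disc.parameters()))     # True
```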

Thanks a lot for the detailed answer. It was really helpful, but it has also given birth to a few more questions that I would like to clarify. Please help me out if possible.

Question 1:- I am actually interested in the gradients of all the parameters involved in the discriminator and the generator, but I have not done retain_grad=True anywhere. Does that affect my results? What do you mean by leaf tensors? And do I get the gradients of all the parameters in the discriminator and the generator by doing the following after calling lossD.backward()?
x = torch.empty(0, 0)
x = torch.flatten(x)
for param in disc.parameters():
    flat = torch.flatten(param.grad)
    con = torch.cat((x, flat), 0)
    x = con

w = torch.empty(0, 0)
w = torch.flatten(w)
for param in gen.parameters():
    flat = torch.flatten(param.grad)
    con = torch.cat((w, flat), 0)
    w = con

If not, then where should I set retain_grad=True?

Question 2:- Since I am doing
lossD = lossDreal + lossDfake
lossG = criterion(disc_fake, torch.ones_like(disc_fake))  # tensors of the discriminator are also in this graph
lossD.backward(retain_graph=True)
lossG.backward()
How does retain_graph=True matter here? What impact will it have if I remove it, given that I could already get the gradients of all the parameters of the discriminator and the generator just by calling lossG.backward() after calling lossD.backward()?

It would be really very helpful if you could clarify the above two questions!

retain_grad isn’t a tensor attribute, so there’s nothing like setting it to True. It’s a method call, like so:

x = torch.tensor([1.0], requires_grad=True)
y = x**2
y.retain_grad()
print(y.requires_grad) # gives True

When a tensor is first initialized, it is a leaf node/tensor.
Basically, all inputs and weights of a neural network are leaf tensors and hence leaf nodes in the computational graph.

The result of an operation on a tensor is not a leaf node. In the above example, y is not a leaf as it is the result of an operation on x; x is a leaf.

This also answers your question :

Yes, assuming lossD is what you’ve described in your first post.

[To check whether a tensor is a leaf tensor or not, use is_leaf. y.is_leaf gives False]
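As a side note on the flattening loops from your question: since parameters are leaf tensors, their .grad is populated by backward() with no retain_grad needed, and the concatenation can be written in one line. A sketch (the nn.Linear is an illustrative stand-in for your discriminator):

```python
import torch
import torch.nn as nn

# Flatten every parameter gradient of a model into a single 1-D vector.
# Parameters are leaves, so backward() fills .grad directly.
torch.manual_seed(0)
disc = nn.Linear(3, 1)  # stand-in for the discriminator
disc(torch.randn(4, 3)).sum().backward()

flat_grads = torch.cat([p.grad.flatten() for p in disc.parameters()])
print(flat_grads.shape)  # 3 weights + 1 bias
```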

No need to use retain_graph=True if that runs error-free for you.
However, I anticipate it will not. retain_graph=True is required here: when you call lossG.backward() without it, you are trying to backward through the graph a second time after the saved tensors have already been freed, which raises an error.
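A minimal sketch of that failure mode with toy tensors (not the GAN itself): two losses share one graph, and the first backward must retain the graph for the second to succeed.

```python
import torch

# Two losses share the graph through y; the first backward must keep
# the saved tensors alive for the second one to succeed.
x = torch.tensor([1.0], requires_grad=True)
y = x ** 2
lossA = y.sum()
lossB = (y * 3).sum()
lossA.backward(retain_graph=True)  # keep saved tensors
lossB.backward()                   # fine: graph was retained
print(x.grad)                      # 2 + 6 = 8, gradients accumulate

# Without retain_graph=True the second backward raises a RuntimeError:
x2 = torch.tensor([1.0], requires_grad=True)
y2 = x2 ** 2
a, b = y2.sum(), (y2 * 3).sum()
a.backward()                       # frees the saved tensors
caught = False
try:
    b.backward()
except RuntimeError:
    caught = True
print(caught)                      # True
```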

Thanks! You are correct! retain_graph=True is required here. Do these saved tensors also include the gradients calculated during lossD.backward()? If those gradients are kept, then I need to call discriminator.zero_grad() and generator.zero_grad() before I call lossG.backward(). Otherwise the gradient for the generator will be accumulated on top of the gradient from lossD, which I would like to avoid, so that when I read param.grad after calling lossG.backward() I get the gradients calculated by lossG and not by lossD.
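The tensors saved by retain_graph are forward activations, not gradients; but the .grad attributes on parameters do persist and accumulate across backward() calls, so zeroing them between the two losses is indeed needed. A small sketch (the nn.Linear is an illustrative stand-in for the generator):

```python
import torch
import torch.nn as nn

# .grad accumulates across backward() calls; zero_grad() resets it.
torch.manual_seed(0)
gen = nn.Linear(2, 2)
out = gen(torch.ones(1, 2)).sum()

out.backward(retain_graph=True)
first = gen.weight.grad.clone()
out.backward(retain_graph=True)  # second pass: gradients add up
print(torch.equal(gen.weight.grad, 2 * first))  # True — accumulated

gen.zero_grad()                  # reset before the next loss
out.backward()
print(torch.equal(gen.weight.grad, first))      # True — fresh gradient
```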

Hi Shrishti, since I have no optimizer in my training (I am optimizing the network manually), how should I use zero_grad()?
If I have the Discriminator and the Generator as two neural networks, then can I do this?
disc = Discriminator(image_dim).to(device)
disc.zero_grad()
lossG.backward()

since I already have retain_graph=True while calling lossD.backward()

I am optimizing the parameters of the neural networks with the help of an optimizing equation. For this reason I need the first-order partial derivatives of the loss functions of the two networks (generator and discriminator) with respect to all the parameters (weights and biases). If the gradient is somehow calculated wrongly, or twice, my descent direction will be miscalculated.
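A self-contained sketch of such a manual update step without an optimizer (the model, data, and learning rate are illustrative stand-ins): zero the module's grads, backward, then update parameters under torch.no_grad().

```python
import torch
import torch.nn as nn

# Manual update without an optimizer: zero_grad() on the module itself,
# backward() to populate .grad, then a plain gradient-descent step.
torch.manual_seed(0)
disc = nn.Linear(2, 1)  # stand-in for the discriminator
lr = 0.1

disc.zero_grad()        # clear any stale gradients first
loss = disc(torch.randn(4, 2)).pow(2).mean()
loss.backward()

before = disc.weight.detach().clone()
with torch.no_grad():   # update parameters by hand
    for p in disc.parameters():
        p -= lr * p.grad
```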

Calling zero_grad on a model instance created using nn.Sequential works.

I’m not sure if that’s the case with instances of custom model classes as well.

Update: Apparently, it works just as well. Custom model classes should inherit from nn.Module, which has the method nn.Module.zero_grad(), and so it can be called on instances of subclasses.
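A quick sketch confirming this (the class below is an illustrative stub): any subclass of nn.Module inherits zero_grad().

```python
import torch
import torch.nn as nn

# zero_grad() is inherited from nn.Module, so it works on custom model
# classes exactly as it does on nn.Sequential instances.
class Discriminator(nn.Module):  # illustrative stub
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 1)

    def forward(self, x):
        return self.fc(x)

torch.manual_seed(0)
disc = Discriminator()
disc(torch.randn(3, 2)).sum().backward()
has_grad = disc.fc.weight.grad is not None   # grads were populated
disc.zero_grad()                             # inherited method, no error
cleared = disc.fc.weight.grad is None or bool((disc.fc.weight.grad == 0).all())
print(has_grad, cleared)                     # True True
```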