I am looking at the DDPG implementation in this repo, and the updates are done by:
```python
# Compute critic loss
critic_loss = F.mse_loss(current_Q, target_Q)

# Optimize the critic
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

# Compute actor loss
actor_loss = -self.critic(state, self.actor(state)).mean()

# Optimize the actor
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
```
My question is: it doesn't zero the gradients of the critic before actor_loss.backward(). This would mean gradients are accumulated for the critic's variables, right? So they end up being the sum of the gradients from critic_loss and from actor_loss.
If yes, why are the gradients for the actor's variables not affected by this when "applying the chain rule"?
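To make the setup concrete, here is a minimal sketch of the same update order with hypothetical stand-in networks and made-up data (not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the actor and critic (not the repo's models)
actor = nn.Linear(3, 1)
critic = nn.Linear(3 + 1, 1)  # takes concatenated (state, action)

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)

state = torch.randn(8, 3)
action = torch.randn(8, 1)
target_Q = torch.randn(8, 1)

# Critic update, same order as the snippet above
current_Q = critic(torch.cat([state, action], dim=1))
critic_loss = F.mse_loss(current_Q, target_Q)
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()

# Actor update: the critic's gradients are NOT zeroed before this backward()
actor_loss = -critic(torch.cat([state, actor(state)], dim=1)).mean()
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

# critic.weight.grad now holds the sum of contributions from both backward() calls
print(critic.weight.grad)
print(actor.weight.grad)  # populated only by actor_loss.backward()
```

In this sketch, critic.weight.grad ends up holding contributions from both backward() calls, which is exactly the accumulation I am asking about.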
My thinking was, if we had just one variable for each network, A (actor) and C (critic), then:

C.grad is populated with

d(critic_loss)/dC + d(actor_loss)/dC

A.grad is 0 before actor_loss.backward(), as it is not a part of the critic_loss graph, so after actor_loss.backward() it should become

C.grad x dC/dA = (d(critic_loss)/dC + d(actor_loss)/dC) x dC/dA
But apparently, it is not true. I guess it is because my understanding of autograd is wrong.
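For instance, with scalar stand-ins for the two "networks" (a toy check with made-up values, not the repo's code), A.grad comes out as just d(actor_loss)/dA, regardless of what is already sitting in C.grad:

```python
import torch

# One made-up parameter per "network"
A = torch.tensor(2.0, requires_grad=True)  # actor parameter
C = torch.tensor(3.0, requires_grad=True)  # critic parameter

# A loss that only involves C
critic_loss = (C - 1.0) ** 2
critic_loss.backward()
print(C.grad)  # 4.0 = d(critic_loss)/dC
print(A.grad)  # None: A is not part of this graph

# A loss involving both; C.grad is deliberately NOT zeroed first
actor_loss = -(C * A)
actor_loss.backward()
print(C.grad)  # 2.0 = 4.0 + d(actor_loss)/dC = 4.0 + (-A)
print(A.grad)  # -3.0 = d(actor_loss)/dA = -C; the stale C.grad plays no role
```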
Could you explain what I am missing here? Is it just that computing A.grad wouldn't look at C.grad at all?
Thanks in advance. Sorry for the long question.