Hi,

I am looking at the DDPG implementation in this repo, and the updates are done like this:

```python
# Compute critic loss
critic_loss = F.mse_loss(current_Q, target_Q)

# Optimize the critic
critic_loss.backward()
self.critic_optimizer.step()

# Compute actor loss
actor_loss = -self.critic(state, self.actor(state)).mean()

# Optimize the actor
actor_loss.backward()
self.actor_optimizer.step()
```

My question is: the code doesn't zero the critic's gradients before `actor_loss.backward()`. This means gradients accumulate in the critic's parameters, right? So they end up being the sum of the gradients from `critic_loss` and from `actor_loss`.
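
Just to spell out what I mean by accumulation, here is a tiny sketch I have in mind (a made-up scalar tensor, nothing from the repo), where calling `backward()` twice without `zero_grad()` sums into `.grad`:

```python
import torch

# Toy scalar "parameter", unrelated to the repo's networks
w = torch.tensor(2.0, requires_grad=True)

loss1 = 3.0 * w   # d(loss1)/dw = 3
loss1.backward()
print(w.grad)     # tensor(3.)

loss2 = 5.0 * w   # d(loss2)/dw = 5
loss2.backward()  # no zero_grad() in between
print(w.grad)     # tensor(8.) = 3 + 5: the gradients were summed
```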

If so, why aren't the gradients of the actor's parameters affected by this when the chain rule is applied?

My thinking was: suppose each network had just a single parameter, `A` for the actor and `C` for the critic.

After `critic_loss.backward()`:

• `C.grad` is populated with `d(critic_loss)/dC`
• `A.grad` is untouched (still `None`), as `A` is not part of the `critic_loss` computation.

And after `actor_loss.backward()`:

• `C.grad` is now `d(critic_loss)/dC + d(actor_loss)/dC`
• `A.grad` is `C.grad × dC/dA = (d(critic_loss)/dC + d(actor_loss)/dC) × dC/dA`

But apparently this is not true, so I guess my understanding of autograd is wrong.
Could you explain what I am missing here? Is it that computing `A.grad` doesn't look at `C.grad` at all, but uses `d(actor_loss)/dC` directly instead?
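
To make the question concrete, here is the kind of minimal experiment I have in mind (hypothetical one-parameter "networks" `A` and `C`, not the repo's code):

```python
import torch

# Hypothetical single-parameter "networks": A plays the actor, C the critic
A = torch.tensor(2.0, requires_grad=True)
C = torch.tensor(3.0, requires_grad=True)

critic_loss = C * C     # d(critic_loss)/dC = 2C = 6
critic_loss.backward()
print(C.grad, A.grad)   # tensor(6.) None

actor_loss = -(C * A)   # d(actor_loss)/dC = -A = -2, d(actor_loss)/dA = -C = -3
actor_loss.backward()   # no zero_grad() for C in between
print(C.grad)           # tensor(4.) = 6 + (-2): accumulated, as I expected
print(A.grad)           # tensor(-3.): exactly d(actor_loss)/dA
```

If my reading is right, `A.grad` comes out as just `d(actor_loss)/dA`, which would mean `backward()` uses the value of `C` saved in the graph rather than `C.grad`. But I would like to confirm that this is the correct explanation.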

Thanks in advance. Sorry for the long question.