I am looking at the DDPG implementation in this repo, and the updates are done by:
```python
# Compute critic loss
critic_loss = F.mse_loss(current_Q, target_Q)

# Optimize the critic
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

# Compute actor loss
actor_loss = -self.critic(state, self.actor(state)).mean()

# Optimize the actor
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
```
My question is: it doesn't zero the gradients of the critic before actor_loss.backward(). This would mean gradients are accumulated for the critic's variables, right? So they end up being the sum of the gradients from critic_loss and from actor_loss.
If yes, why are the gradients for the actor's variables not affected by this when "applying the chain rule"?
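To make the setup concrete, here is a minimal sketch of the same update order with hypothetical stand-in networks and made-up data (not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the actor and critic (not the repo's models)
actor = nn.Linear(3, 1)
critic = nn.Linear(3 + 1, 1)  # takes concatenated (state, action)

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)

state = torch.randn(8, 3)
action = torch.randn(8, 1)
target_Q = torch.randn(8, 1)

# Critic update, same order as the snippet above
current_Q = critic(torch.cat([state, action], dim=1))
critic_loss = F.mse_loss(current_Q, target_Q)
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()

# Actor update: the critic's gradients are NOT zeroed before this backward()
actor_loss = -critic(torch.cat([state, actor(state)], dim=1)).mean()
actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

# critic.weight.grad now holds the sum of contributions from both backward() calls
print(critic.weight.grad)
print(actor.weight.grad)  # populated only by actor_loss.backward()
```

In this sketch, critic.weight.grad ends up holding contributions from both backward() calls, which is exactly the accumulation I am asking about.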
My thinking was, if we had just one variable for each network, A (actor) and C (critic), then:

C.grad is populated with

d(critic_loss)/dC + d(actor_loss)/dC

A.grad is 0 before actor_loss.backward(), as it is not a part of the critic_loss graph, so after actor_loss.backward() it should become

C.grad x dC/dA = (d(critic_loss)/dC + d(actor_loss)/dC) x dC/dA
But apparently, it is not true. I guess it is because my understanding of autograd is wrong.
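For instance, with scalar stand-ins for the two "networks" (a toy check with made-up values, not the repo's code), A.grad comes out as just d(actor_loss)/dA, regardless of what is already sitting in C.grad:

```python
import torch

# One made-up parameter per "network"
A = torch.tensor(2.0, requires_grad=True)  # actor parameter
C = torch.tensor(3.0, requires_grad=True)  # critic parameter

# A loss that only involves C
critic_loss = (C - 1.0) ** 2
critic_loss.backward()
print(C.grad)  # 4.0 = d(critic_loss)/dC
print(A.grad)  # None: A is not part of this graph

# A loss involving both; C.grad is deliberately NOT zeroed first
actor_loss = -(C * A)
actor_loss.backward()
print(C.grad)  # 2.0 = 4.0 + d(actor_loss)/dC = 4.0 + (-A)
print(A.grad)  # -3.0 = d(actor_loss)/dA = -C; the stale C.grad plays no role
```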
Could you explain what I am missing here? Is it just that computing A.grad wouldn't look at C.grad at all?
Thanks in advance. Sorry for the long question.