Question about the use of zero_grad() in DDPG Implementation


I am looking at the DDPG implementation in this repo, where the updates are done like this:

        # Compute critic loss
        critic_loss = F.mse_loss(current_Q, target_Q)

        # Optimize the critic
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Compute actor loss
        actor_loss = -self.critic(state, self.actor(state)).mean()

        # Optimize the actor
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

My question is: the code doesn't zero the critic's gradients before actor_loss.backward(). This means gradients are accumulated in the critic's parameters, right? So they end up being the sum of the gradients from critic_loss and actor_loss.

If yes, why are the gradients of the actor's parameters not affected by this when applying the chain rule?

My thinking was: suppose each network had just one parameter, A for the actor and C for the critic.

After critic_loss.backward():

  • C.grad is populated with d(critic_loss)/dC
  • A.grad is untouched (still None), as A is not part of the critic_loss computation.

And after actor_loss.backward():

  • C.grad is now d(critic_loss)/dC + d(actor_loss)/dC
  • A.grad is C.grad × dC/dA = (d(critic_loss)/dC + d(actor_loss)/dC) × dC/dA
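To make my question concrete, here is a toy version with one scalar parameter per network (just an illustration, not the repo's code):

```python
import torch

# Toy stand-ins for the two networks: one scalar parameter each
A = torch.tensor(2.0, requires_grad=True)  # "actor" parameter
C = torch.tensor(3.0, requires_grad=True)  # "critic" parameter

# critic_loss depends only on C
critic_loss = C ** 2
critic_loss.backward()
print(C.grad)  # tensor(6.) = d(C^2)/dC
print(A.grad)  # None -- A was not in this graph

# actor_loss depends on both A and C
# (like -critic(state, actor(state)) in the repo)
actor_loss = C * A
actor_loss.backward()
print(C.grad)  # tensor(8.) = 6 + A -- gradients accumulate in C.grad
print(A.grad)  # tensor(3.) = C    -- only d(actor_loss)/dA, C.grad unused
```

So C.grad really does accumulate across the two backward passes, yet A.grad comes out as plain d(actor_loss)/dA, which is what confuses me.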

But apparently, this is not what happens, so I guess my understanding of autograd is wrong.
Could you explain what I am missing? Is it that computing A.grad never reads C.grad at all, and instead uses d(actor_loss)/dC directly when applying the chain rule?

Thanks in advance. Sorry for the long question.