Hi, I am trying to implement DDPG in PyTorch. I believe most of my implementation is right, but the policies don't converge, so I am unsure about the gradient part. The paper shows the gradient of Q(s,a) taken with respect to the action a. I am defining the loss as Q(s,a), but how do I know that the gradient will be taken with respect to a? Since loss = Q(s,a), I am setting loss.requires_grad = True and loss.volatile = False.
Q should never be the loss function. DDPG is a deep actor-critic algorithm, so you have two gradients: one for the actor (the parameters leading to the action, mu) and one for the critic, which estimates the value of a state-action pair (Q), as in our case, or sometimes the value of a state (V).
In DDPG, the critic loss is the squared temporal-difference error (as in classic deep Q-learning):
critic_loss = (R + gamma*Q(t+1) - Q(t))**2
Then the critic’s gradient is obtained by a simple backward of this loss.
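For reference, here is a minimal sketch of that critic update. The network sizes, the random batch and the names (Critic, actor_target, critic_target, done) are made up for illustration, and the target networks follow the DDPG paper rather than anything posted in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim, batch = 3, 2, 8  # hypothetical sizes


class Critic(nn.Module):
    """Toy Q(s, a): concatenates state and action, outputs one value."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 16), nn.ReLU(), nn.Linear(16, 1)
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=1))


# Target networks (assumed here, as in the DDPG paper).
actor_target = nn.Sequential(nn.Linear(state_dim, action_dim), nn.Tanh())
critic, critic_target = Critic(), Critic()
critic_target.load_state_dict(critic.state_dict())
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Fake replay-buffer batch, for illustration only.
s = torch.randn(batch, state_dim)
a = torch.randn(batch, action_dim)
r = torch.randn(batch, 1)
s_next = torch.randn(batch, state_dim)
done = torch.zeros(batch, 1)
gamma = 0.99

# TD target y = R + gamma * Q'(s', mu'(s')); no gradient flows through it.
with torch.no_grad():
    y = r + gamma * (1 - done) * critic_target(s_next, actor_target(s_next))

# Mean squared TD error, then a plain backward and optimizer step.
critic_loss = nn.functional.mse_loss(critic(s, a), y)
critic_optim.zero_grad()
critic_loss.backward()
critic_optim.step()
```

Computing the target under torch.no_grad() keeps it a constant, so only the critic's own parameters receive a gradient.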
For the actor gradient, things are more complex: it's an estimate of the policy gradient, given by:
actor_grad = Q_grad * mu_grad
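Here Q_grad is the gradient of Q with respect to the action, mu_grad is the gradient of mu with respect to the actor's parameters, and mu is the output of the actor network. In DDPG that output is a deterministic action rather than the mean of a Gaussian; exploration noise is added to it separately.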
Sorry, I should have said that I am trying to find the actor gradient. I am referencing this implementation, which does
policy_loss = -self.critic([to_tensor(state_batch), self.actor(to_tensor(state_batch))]).mean()
which simply defines the loss as -mean(critic(state, actor(state))).
My belief is that autograd will propagate the gradient through the action, which gives actor_grad = Q_grad * mu_grad by the chain rule.
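One way to check this belief numerically is to compare the gradient autograd gives for the loss with the explicit Q_grad * mu_grad product. This is only a sketch with tiny stand-in networks; the sizes and the helper Q are assumptions for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
actor = nn.Sequential(nn.Linear(3, 2), nn.Tanh())  # stand-in for mu(s)
critic = nn.Linear(3 + 2, 1)                       # stand-in for Q(s, a)
s = torch.randn(4, 3)
params = list(actor.parameters())


def Q(s, a):
    return critic(torch.cat([s, a], dim=1))


# Route 1: let autograd differentiate the loss wrt the actor parameters.
loss = -Q(s, actor(s)).mean()
auto_grads = torch.autograd.grad(loss, params)

# Route 2: explicit decomposition.
a = actor(s)
a_leaf = a.detach().requires_grad_(True)                     # action as a free variable
dQ_da = torch.autograd.grad(Q(s, a_leaf).mean(), a_leaf)[0]  # Q_grad
# Vector-Jacobian product: (-Q_grad) * mu_grad
manual_grads = torch.autograd.grad(a, params, grad_outputs=-dQ_da)

match = all(
    torch.allclose(g1, g2, atol=1e-5) for g1, g2 in zip(auto_grads, manual_grads)
)
print(match)
```

If the two routes agree, the one-line loss is doing the same thing as the decomposed gradient in the paper.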
Ok I see, it makes sense to directly differentiate Q(S, pi(S)) wrt pi's parameters.
In the paper, the gradient is wrt a because they decompose the derivative (hence the mu_grad in my equation above). If you directly take Q as a loss, you must differentiate it wrt the policy's parameters.
So, just doing
pi_loss = -Q(state, pi(state)).mean()
pi_optimizer.zero_grad()
pi_loss.backward()
pi_optimizer.step()
should be ok.
In the paper, look at equation 6: it's a matter of computing either the first line (Q differentiated directly wrt pi's parameters) or the second line (the decomposition, with a derivative wrt a).
So back to my question: how is the gradient wrt the action specified? Q(s,a) has both state and action as variables, and the gradient could be taken wrt the state if it's not specified. This is what's confusing me.
As I said, you don't want to differentiate wrt the action in your case, but wrt the parameters of your policy. The states must be detached from the graph, and the code in my post above should do what you want.
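To make the point about states concrete: a tensor sampled from a replay buffer has requires_grad=False by default, so backward() never produces a gradient for it. Gradients only land on leaf tensors that require grad, which here means the module parameters. A small sketch with hypothetical stand-in networks:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(3, 2)   # stand-in actor
critic = nn.Linear(5, 1)   # stand-in Q(s, a) on the concatenated input
s = torch.randn(4, 3)      # plain tensor: requires_grad is False by default

loss = -critic(torch.cat([s, policy(s)], dim=1)).mean()
loss.backward()

state_has_grad = s.grad is not None
policy_has_grad = all(p.grad is not None for p in policy.parameters())
critic_has_grad = all(p.grad is not None for p in critic.parameters())
print(state_has_grad, policy_has_grad, critic_has_grad)
```

Note that the critic's parameters also receive a .grad from this backward pass, since the loss goes through the critic. Whether they get updated is decided by which parameters the optimizer was given.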
Oh I see, so is there any way to actually see that the gradient is with respect to the policy parameters?
When you use an optimizer on some parameters, only those parameters will be affected by the gradient step. So, if you did something like
policy_optim = torch.optim.Adam(policy.parameters(), lr=lr)
above in your code, and then
policy_loss = -Q(s, policy(s)).mean()
policy_optim.zero_grad()
policy_loss.backward()
policy_optim.step()
… then only the parameters of the policy will be affected, with the gradient of your loss wrt those parameters. It's that simple!
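If you want to see this for yourself, one option is to snapshot the parameters before the step and compare afterwards. The sketch below uses toy stand-in networks and plain SGD, both chosen only for the example.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(3, 2), nn.Tanh())  # stand-in actor
critic = nn.Linear(5, 1)                            # stand-in Q(s, a)
policy_optim = torch.optim.SGD(policy.parameters(), lr=0.1)

# Snapshot both networks before the update.
policy_before = copy.deepcopy(policy.state_dict())
critic_before = copy.deepcopy(critic.state_dict())

s = torch.randn(4, 3)
policy_loss = -critic(torch.cat([s, policy(s)], dim=1)).mean()
policy_optim.zero_grad()
policy_loss.backward()
policy_optim.step()

policy_changed = any(
    not torch.equal(policy_before[k], v) for k, v in policy.state_dict().items()
)
critic_changed = any(
    not torch.equal(critic_before[k], v) for k, v in critic.state_dict().items()
)
print(policy_changed, critic_changed)

# The critic still accumulated gradients during backward(); clear them
# before its own update so they do not leak into the critic step.
critic.zero_grad()
```

Only the policy's weights should differ after the step, because the optimizer was built from policy.parameters() alone.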