DDPG gradient with respect to action

Q should never be the loss function itself. DDPG is a deep actor-critic algorithm, so you have two gradients: one for the actor (the parameters producing the action, mu) and one for the critic (which estimates the value of a state-action pair, Q, as in our case, or sometimes the value of a state, V).
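For concreteness, here is a minimal PyTorch sketch of the two networks (the layer sizes and class names are arbitrary choices for illustration, not part of DDPG itself):

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network: maps a state to an action mu(s)."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q-network: maps a (state, action) pair to a scalar value Q(s, a)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))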

In DDPG, the critic loss is the squared temporal-difference error (as in classic deep Q-learning):
critic_loss = (R + gamma*Q(t+1) - Q(t))**2
Then the critic’s gradient is obtained by a simple backward of this loss.
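A minimal sketch of that critic update, assuming target copies actor_target/critic_target, an optimizer critic_opt, a discount gamma, and a sampled batch of tensors (state, action, reward, next_state, done) — all names chosen here for illustration:

import torch
import torch.nn.functional as F

# TD target: R + gamma * Q(s', mu(s')), computed with the target networks
# and without tracking gradients.
with torch.no_grad():
    next_action = actor_target(next_state)
    td_target = reward + gamma * (1.0 - done) * critic_target(next_state, next_action)

# Squared TD error, averaged over the batch.
critic_loss = F.mse_loss(critic(state, action), td_target)

critic_opt.zero_grad()
critic_loss.backward()   # gradient of the TD error w.r.t. the critic parameters
critic_opt.step()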

For the actor gradient, things are more complex: it is an estimate of the deterministic policy gradient, given by the chain rule:
actor_grad = Q_grad * mu_grad
Where mu is the output of the actor network, i.e. the deterministic action chosen for the given state (exploration noise is added on top of it during training), Q_grad is the gradient of Q with respect to that action, and mu_grad is the gradient of mu with respect to the actor's parameters.
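In practice you rarely form this product by hand: with autograd you simply backpropagate -Q(s, mu(s)) through the critic into the actor, and the chain rule Q_grad * mu_grad is applied automatically. A minimal sketch, reusing the assumed names from the snippets above plus an actor optimizer actor_opt:

# Maximizing Q(s, mu(s)) is done by minimizing its negative; the gradient
# flows through the critic (dQ/da) and then through the actor (dmu/dtheta).
actor_loss = -critic(state, actor(state)).mean()

actor_opt.zero_grad()
actor_loss.backward()    # backprop through the critic into the actor parameters
actor_opt.step()         # only the actor's parameters are updated here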