Hey all! Say I’ve got a network like the following
Training the RL algorithm, DDPG, comprises 2 steps:
I see a number of implementations online, but most do not have a shared layer so I don’t have a baseline to compare against.
Lines 12 and 13:
This portion describes how to compute the loss and gradient update through the first head (Q->action-value)
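For reference, the critic step in the standard DDPG formulation is usually written as below. The symbols are the usual ones from the DDPG paper and are not defined in this post: $Q'$ and $\mu'$ are the target critic and target actor, $\gamma$ is the discount factor, and $N$ is the minibatch size.

```latex
y_i = r_i + \gamma \, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right),
\qquad
L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2
```

In the code further down, `y_critic` would correspond to $y_i$ and `y_hat_critic` to $Q(s_i, a_i \mid \theta^{Q})$, assuming the targets are computed from separate target networks.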
This portion describes how to compute the loss and gradient update through the second head, the policy. I’ve coded it so that every forward pass of the actor-critic returns two outputs: the policy and the Q-value.
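To make the setup concrete, here is a minimal sketch of what such a two-headed module might look like. It is not the actual network from this post: the class name `SharedActorCritic`, the attribute names `shared`, `policy_head` and `q_head`, and all layer sizes are assumptions for illustration. The one design point it tries to capture is that when no action is passed, the Q head is evaluated at the policy's own action.

```python
import torch
import torch.nn as nn


class SharedActorCritic(nn.Module):
    """Hypothetical two-headed network: shared trunk, policy head, Q head."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        # Layer shared by both heads (names and sizes are assumptions).
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.policy_head = nn.Sequential(nn.Linear(hidden_dim, action_dim), nn.Tanh())
        self.q_head = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action=None):
        features = self.shared(state)
        policy_action = self.policy_head(features)
        # With no action supplied, Q is evaluated at the policy's own action,
        # so the Q output stays differentiable w.r.t. the policy head.
        q_action = policy_action if action is None else action
        q_value = self.q_head(torch.cat([features, q_action], dim=-1))
        return policy_action, q_value


net = SharedActorCritic(state_dim=3, action_dim=2)
s_batch = torch.randn(5, 3)
a_batch = torch.randn(5, 2)
policy_out, q_out = net(s_batch, a_batch)   # critic-style call
policy_only, q_of_policy = net(s_batch)     # actor-style call
```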
I’ve separated this into 2 optimization steps:
    # Step 1: critic (Q-value head) update
    self.actor_critic.critic_optimizer.zero_grad()
    _, y_hat_critic = self.actor_critic(s_batch, a_batch)
    critic_loss = F.mse_loss(y_hat_critic, y_critic)
    critic_loss.backward()
    self.custom_gradient_scaling_for_shared(self.actor_critic.parameters())
    self.actor_critic.critic_optimizer.step()

    # Step 2: actor (policy head) update
    self.actor_critic.actor_optimizer.zero_grad()
    _, y_hat_policy = self.actor_critic(s_batch)
    policy_loss = T.mean(-y_hat_policy)
    policy_loss.backward()
    self.custom_gradient_scaling_for_shared(self.actor_critic.parameters())
    self.actor_critic.actor_optimizer.step()

P.S. I’ve verified that in the first case the policy head’s gradient is zero, and that in the second case a gradient is generated for the Q-value head but is not applied.
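A check along these lines can be sketched as follows. This is only an illustration on a toy model: the three `nn.Linear` layers standing in for the shared layer and the two heads, and the helper `grad_norms`, are hypothetical and not taken from the code above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the shared layer and the two heads.
shared = nn.Linear(3, 8)
policy_head = nn.Linear(8, 2)
q_head = nn.Linear(8 + 2, 1)
modules = {"shared": shared, "policy_head": policy_head, "q_head": q_head}


def grad_norms(modules):
    """Total gradient norm per module (0.0 if no gradient was produced)."""
    norms = {}
    for name, module in modules.items():
        total = 0.0
        for p in module.parameters():
            if p.grad is not None:
                total += p.grad.norm().item() ** 2
        norms[name] = total ** 0.5
    return norms


s = torch.randn(16, 3)
a = torch.randn(16, 2)
target = torch.randn(16, 1)

# Critic pass: the action comes from the batch, so the policy head
# is not part of the graph and receives no gradient.
features = torch.relu(shared(s))
q = q_head(torch.cat([features, a], dim=-1))
F.mse_loss(q, target).backward()
critic_pass = grad_norms(modules)

for m in modules.values():
    m.zero_grad()

# Actor pass: Q is evaluated at the policy's action, so the shared layer,
# the policy head and the Q head all receive a gradient.
features = torch.relu(shared(s))
pi = torch.tanh(policy_head(features))
q = q_head(torch.cat([features, pi], dim=-1))
(-q.mean()).backward()
actor_pass = grad_norms(modules)

print("critic pass:", critic_pass)
print("actor pass: ", actor_pass)
```

Whether the Q-head gradient from the actor pass is applied then depends only on which parameters the actor optimizer was constructed with.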
However, I’m having trouble determining whether the two-pass update could cause stability issues in the shared layer. I’ve got self.custom_gradient_scaling_for_shared(self.actor_critic.parameters()), which checks whether each parameter belongs to the shared layer and, if it does, scales its gradient.
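For readers unfamiliar with the idea, a gradient-scaling helper of this kind could look roughly like the sketch below. It is not the actual `custom_gradient_scaling_for_shared`: the function name `scale_shared_gradients`, the use of `named_parameters()` with a name prefix to identify the shared layer, and the factor `0.5` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def scale_shared_gradients(named_parameters, shared_prefix="shared", scale=0.5):
    """Scale, in place, the gradients of parameters whose name starts with shared_prefix."""
    for name, param in named_parameters:
        if name.startswith(shared_prefix) and param.grad is not None:
            param.grad.mul_(scale)


# Toy model: one shared layer and one head (hypothetical names and sizes).
model = nn.ModuleDict({"shared": nn.Linear(3, 4), "head": nn.Linear(4, 1)})
loss = model["head"](torch.relu(model["shared"](torch.randn(8, 3)))).mean()
loss.backward()

shared_before = model["shared"].weight.grad.clone()
head_before = model["head"].weight.grad.clone()

# Called between backward() and optimizer.step(), as in the two steps above.
scale_shared_gradients(model.named_parameters(), shared_prefix="shared", scale=0.5)
```

Scaling by 0.5 in both passes is one possible heuristic for a layer that is updated twice per iteration; it is an arbitrary choice here, not a recommendation.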
Does anyone have insights into this?