Hi, I have implemented an actor-critic method where both the actor and the critic are linear networks, and I am using a custom loss function: the clipped surrogate loss from PPO. The environment's reward function is separate from the loss function used to train the networks.
This is the loss calculation:
```python
dist = self.actor(states)
critic_value = self.critic(states)
critic_value = torch.squeeze(critic_value)

# PPO probability ratio between new and old policy
new_probs = dist.log_prob(actions)
prob_ratio = new_probs.exp() / old_probs.exp()

# clipped surrogate objective
weighted_probs = advantage[batch.item()] * prob_ratio
weighted_clipped_probs = torch.clamp(
    prob_ratio, 1 - self.policy_clip, 1 + self.policy_clip
) * advantage[batch.item()]
actor_loss = -torch.min(weighted_probs, weighted_clipped_probs).mean()

# critic regresses toward the returns
returns = advantage[batch.item()] + values[batch.item()]
critic_loss = (returns - critic_value) ** 2
critic_loss = critic_loss.mean()

total_loss = actor_loss + 0.5 * critic_loss
```
and then the model optimization:

```python
self.actor.optimizer.zero_grad()
self.critic.optimizer.zero_grad()
total_loss.backward()
self.actor.optimizer.step()
self.critic.optimizer.step()
```
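For reference, the update pattern I am using boils down to this minimal self-contained sketch (toy networks, simplified losses, and illustrative dimensions, not my actual code):

```python
import torch
import torch.nn as nn

# Toy stand-in networks (illustrative, not my real architecture)
actor = nn.Linear(4, 2)
critic = nn.Linear(4, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

states = torch.randn(8, 4)
dist = torch.distributions.Categorical(logits=actor(states))
actions = dist.sample()

# Simplified stand-ins for the PPO actor loss and the critic loss
actor_loss = -dist.log_prob(actions).mean()
critic_loss = critic(states).squeeze().pow(2).mean()
total_loss = actor_loss + 0.5 * critic_loss

actor_opt.zero_grad()
critic_opt.zero_grad()
total_loss.backward()  # single backward pass over the combined loss
actor_opt.step()       # each optimizer only updates its own parameters
critic_opt.step()
```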
Now my questions:
- Is it correct to call `backward()` this way, i.e. once on the combined loss?
- Does this single `backward()` compute gradients for both the actor and the critic networks, so that both networks are actually being optimized?
- If I add a feature extractor network in front of the actor and critic (i.e., pass the states through the extractor and feed the extracted features to both networks), will this `backward()` also work for the feature extractor network?
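To make the last question concrete, this is the kind of setup I mean (hypothetical extractor with illustrative dimensions; I am assuming the extractor would get its own optimizer, mirroring how my actor and critic each have one):

```python
import torch
import torch.nn as nn

# Hypothetical shared feature extractor feeding both heads
extractor = nn.Linear(8, 4)
actor = nn.Linear(4, 2)
critic = nn.Linear(4, 1)
extractor_opt = torch.optim.Adam(extractor.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

states = torch.randn(16, 8)
features = extractor(states)  # not detached, so the graph reaches the extractor
dist = torch.distributions.Categorical(logits=actor(features))
actions = dist.sample()

# Simplified stand-ins for my actual losses
actor_loss = -dist.log_prob(actions).mean()
critic_loss = critic(features).squeeze().pow(2).mean()
total_loss = actor_loss + 0.5 * critic_loss

extractor_opt.zero_grad()
actor_opt.zero_grad()
critic_opt.zero_grad()
total_loss.backward()  # should this also populate gradients in `extractor`?
extractor_opt.step()
actor_opt.step()
critic_opt.step()
```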