Hi, I have implemented an actor-critic method with linear networks for both the actor and the critic. I am using a custom loss function, namely the clipped loss used in PPO. The environment's reward function is separate from the loss function used to train the networks.
This is the loss calculation:
dist = self.actor(states)                       # action distribution from the actor
critic_value = self.critic(states)              # state-value estimates from the critic
critic_value = torch.squeeze(critic_value)
new_probs = dist.log_prob(actions)              # log-probs of the taken actions under the current policy
prob_ratio = new_probs.exp() / old_probs.exp()  # ratio pi_new / pi_old
# clipped surrogate objective (PPO actor loss)
weighted_probs = advantage[batch.item()] * prob_ratio
weighted_clipped_probs = torch.clamp(prob_ratio, 1 - self.policy_clip,
                                     1 + self.policy_clip) * advantage[batch.item()]
actor_loss = -torch.min(weighted_probs, weighted_clipped_probs).mean()
# MSE value loss between returns and the critic's estimates
returns = advantage[batch.item()] + values[batch.item()]
critic_loss = (returns - critic_value)**2
critic_loss = critic_loss.mean()
total_loss = actor_loss + 0.5 * critic_loss
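For reference, this is meant to be the standard PPO clipped surrogate objective plus an MSE value loss (with epsilon = self.policy_clip):

$$
L(\theta, \phi) = -\,\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right] + 0.5\,\mathbb{E}_t\!\left[\big(R_t - V_\phi(s_t)\big)^2\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}
$$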
and then the model optimization:
self.actor.optimizer.zero_grad()
self.critic.optimizer.zero_grad()
total_loss.backward()
self.actor.optimizer.step()
self.critic.optimizer.step()
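For context, the actor and the critic are separate networks, each with its own optimizer stored as an attribute. A simplified sketch of how they are set up (the layer sizes and learning rates here are placeholders, not my exact values):

import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    def __init__(self, input_dim, n_actions, lr=3e-4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, states):
        return Categorical(logits=self.net(states))   # action distribution

class Critic(nn.Module):
    def __init__(self, input_dim, lr=1e-3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, states):
        return self.net(states)                        # state-value estimate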
Now my questions:
- Is it correct to call backward() on the combined loss in this way?
- Does this backward() call produce gradients for both the actor and the critic networks, so that both networks are actually being optimized?
- If I add a feature extractor network in front of the actor and critic and feed the extracted features to both of them (roughly like the sketch below), will this backward() call also work for the feature extractor network?
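To make the third question concrete, something like this is what I have in mind (continuing the sketch above; the FeatureExtractor class, its feature size, and its optimizer are placeholder names):

class FeatureExtractor(nn.Module):
    def __init__(self, input_dim, feature_dim, lr=3e-4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, feature_dim), nn.ReLU())
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, states):
        return self.net(states)

# in the learning step, the extracted features feed both heads:
features = self.feature_extractor(states)
dist = self.actor(features)            # actor now takes the extracted features
critic_value = self.critic(features)   # critic takes the same features
# ... same loss calculation as above, then:
self.feature_extractor.optimizer.zero_grad()
self.actor.optimizer.zero_grad()
self.critic.optimizer.zero_grad()
total_loss.backward()
self.feature_extractor.optimizer.step()
self.actor.optimizer.step()
self.critic.optimizer.step()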