Loss in reinforcement learning (policy gradient)

Hello!
I have implemented the policy gradient algorithm with this loss:
loss = -torch.mean(log_prob*discounted_rewards)
where log_prob is a tensor with the log-probabilities of the actions that were taken, and discounted_rewards is a tensor with the corresponding discounted reward for each action.
Is this a correct implementation of the loss in the policy gradient algorithm? Can I use this approach instead of action.reinforce(r)?
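For reference, here is a minimal, self-contained sketch of what I mean (the tensor values are made-up placeholders, not from my actual environment):

```python
import torch

# Hypothetical example values: log-probabilities of the actions that
# were actually taken, and their corresponding discounted returns.
log_prob = torch.tensor([-0.5, -1.2, -0.8], requires_grad=True)
discounted_rewards = torch.tensor([1.0, 0.5, -0.2])

# REINFORCE-style surrogate loss: minimizing this performs gradient
# ascent on the expected return, since the gradient of
# -mean(log_prob * R) w.r.t. the policy parameters is the negative
# policy gradient estimate.
loss = -torch.mean(log_prob * discounted_rewards)
loss.backward()

# d(loss)/d(log_prob_i) = -discounted_rewards_i / N, so actions with
# positive return get their log-probability pushed up by the optimizer.
print(loss.item())
print(log_prob.grad)
```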

Thanks!


This should work…


Hello,

I also implemented the policy gradient algorithm by minimizing the loss above. However, I observe strange behavior: while the loss decreases, the reward also decreases quickly and log_prob increases. After spending a whole day on it, I still have no idea why. What is a possible reason for that?

Thanks!