Hello!

I have implemented the policy gradient algorithm with the following loss:

`loss = -torch.mean(log_prob*discounted_rewards)`

where `log_prob` is a tensor with the log-probabilities of the actions that were taken, and `discounted_rewards` is a tensor with the corresponding discounted reward for each action.
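To make the setup concrete, here is a minimal self-contained sketch of how I compute that loss (the shapes, logits, and reward values below are just toy placeholders, not my real network):

```python
import torch
from torch.distributions import Categorical

# Hypothetical episode: 4 timesteps, 3 discrete actions.
logits = torch.randn(4, 3, requires_grad=True)            # policy network outputs
actions = torch.tensor([0, 2, 1, 0])                      # actions actually taken
discounted_rewards = torch.tensor([1.0, 0.9, 0.81, 0.729])  # return G_t per step

dist = Categorical(logits=logits)
log_prob = dist.log_prob(actions)   # log pi(a_t | s_t), shape (4,)

# REINFORCE-style loss: negative mean of log-prob weighted by return.
loss = -torch.mean(log_prob * discounted_rewards)
loss.backward()                     # gradients flow back into `logits`
```

Minimizing this loss increases the probability of actions that received high discounted reward.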

Is this a correct implementation of the policy gradient loss? Can I use this approach instead of `action.reinforce(r)`?

Thanks!