I have implemented the policy gradient algorithm with the following loss:
loss = -torch.mean(log_prob*discounted_rewards)
log_prob is a tensor with the log-probabilities of the actions that were taken, and
discounted_rewards is a tensor with the corresponding discounted return for each action.
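
For reference, here is a minimal, self-contained sketch of where those two tensors could come from and how this loss would sit in a REINFORCE-style update, assuming a categorical policy over discrete actions; the toy network, optimizer, and fake rollout data are all hypothetical, only log_prob, discounted_rewards, and the loss line match my code:

import torch
from torch.distributions import Categorical

policy = torch.nn.Linear(4, 2)           # toy policy: 4-dim state -> 2 action logits (hypothetical)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

# Fake single-episode rollout: 5 states and the rewards received.
states = torch.randn(5, 4)
dist = Categorical(logits=policy(states))
actions = dist.sample()
log_prob = dist.log_prob(actions)        # log pi(a_t | s_t), shape (5,)

rewards = [1.0, 0.0, 1.0, 0.0, 1.0]
returns = []
g = 0.0
for r in reversed(rewards):              # discounted return G_t, computed backwards
    g = r + gamma * g
    returns.insert(0, g)
discounted_rewards = torch.tensor(returns)

loss = -torch.mean(log_prob * discounted_rewards)
optimizer.zero_grad()
loss.backward()
optimizer.step()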
Is this a correct implementation of the loss in the policy gradient algorithm? Can I use this approach instead of