Hello!

I have implemented the policy gradient algorithm with the following loss:

`loss = -torch.mean(log_prob*discounted_rewards)`

where `log_prob` is a tensor with the log-probabilities of the actions that were taken, and `discounted_rewards` is a tensor with the corresponding discounted reward for each action.
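To make the setup concrete, here is a minimal self-contained sketch of how I compute that loss (the shapes, logits, and reward values below are just toy placeholders, not my real network):

```python
import torch
from torch.distributions import Categorical

# Hypothetical episode: 4 timesteps, 3 discrete actions.
logits = torch.randn(4, 3, requires_grad=True)            # policy network outputs
actions = torch.tensor([0, 2, 1, 0])                      # actions actually taken
discounted_rewards = torch.tensor([1.0, 0.9, 0.81, 0.729])  # return G_t per step

dist = Categorical(logits=logits)
log_prob = dist.log_prob(actions)   # log pi(a_t | s_t), shape (4,)

# REINFORCE-style loss: negative mean of log-prob weighted by return.
loss = -torch.mean(log_prob * discounted_rewards)
loss.backward()                     # gradients flow back into `logits`
```

Minimizing this loss increases the probability of actions that received high discounted reward.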

Is this a correct implementation of the policy gradient loss? Can I use this approach instead of `action.reinforce(r)`?

Thanks!