What's the right way of implementing policy gradient?

Hmm, here is the full formula.
The policy is the weight on loss.grad, not the weight on the loss itself.
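
To make the distinction concrete, here is a minimal sketch in plain Python. It assumes the formula in question is the usual score-function (REINFORCE) form, grad J = sum_a pi(a) * R(a) * grad log pi(a), applied to a softmax policy on a one-step bandit. The function names and the toy setup are illustrative assumptions, not taken from the original post. It compares the exact sum, where pi(a) multiplies each gradient term, with a sampled estimate, where the pi(a) weighting comes only from drawing actions from the policy.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def exact_policy_gradient(logits, rewards):
    """Exact gradient: sum_a pi(a) * R(a) * grad log pi(a).

    Here pi(a) explicitly weights each gradient term.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, r) in enumerate(zip(probs, rewards)):
        g = grad_log_prob(probs, a)
        for i in range(len(grad)):
            grad[i] += p * r * g[i]
    return grad


def sampled_policy_gradient(logits, rewards, n_samples, rng):
    """Monte Carlo estimate: draw a ~ pi, then average R(a) * grad log pi(a).

    No explicit pi(a) factor appears; the sampling supplies the weighting.
    """
    probs = softmax(logits)
    actions = list(range(len(probs)))
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(actions, weights=probs)[0]
        g = grad_log_prob(probs, a)
        for i in range(len(grad)):
            grad[i] += rewards[a] * g[i] / n_samples
    return grad
```

With enough samples the two results should agree closely. Under these assumptions, when actions are sampled from the policy, the surrogate loss is just -R * log pi(a) for each sampled action, with no extra factor of pi(a) multiplied into the loss.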