What's the right way of implementing policy gradient?

I think you should look at this topic explaining the tensor.reinforce method:

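If the action was produced by sampling (for example, drawn from the policy's output distribution), calling action.reinforce(r) attaches the reward r to that sample, and the following backward pass then computes the policy gradient (REINFORCE) update.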
You can find an example implementation here:
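
Separately from the linked example, here is a rough sketch of what a REINFORCE-style update computes, in case it helps build intuition. It is plain Python and deliberately does not use PyTorch or tensor.reinforce; the two-armed bandit, the function names (softmax, sample_action, reinforce_step, train), and the learning rate are all assumptions made up for illustration. The idea is that, for a sampled action, the parameters are nudged along reward * grad log pi(action).

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw an action index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update: logits += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - probs[i].
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(steps=2000, seed=0):
    """Hypothetical 2-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        action = sample_action(softmax(logits), rng)  # the action is sampled
        reward = 1.0 if action == 1 else 0.0
        logits = reinforce_step(logits, action, reward)
    return softmax(logits)


if __name__ == "__main__":
    print(train())  # final action probabilities after training
```

Because only the sampled action's log-probability is scaled by the reward, actions that earn reward become more likely over time, which is the behaviour the reply above attributes to action.reinforce(r). A real setup would usually also subtract a baseline from the reward to reduce variance; that is left out here to keep the sketch short.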