alexis-jacq
(Alexis David Jacq)
I think you should look at this topic explaining the tensor.reinforce method:
If the action was obtained by sampling, calling action.reinforce(r), where r is the reward, acts as a policy gradient step.
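To make "acts as a policy gradient" a bit more concrete, here is a rough pure-Python sketch of the REINFORCE (score-function) update for a softmax policy over three actions. It is not PyTorch's implementation of reinforce, only an illustration of the idea: for a sampled action a with reward r, the gradient of r * log pi(a) with respect to the logits is r * (onehot(a) - probs). The helper names (softmax, sample_action, reinforce_grad), the toy reward and the learning rate are all assumptions made up for this example.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    # Draw one action index according to the policy probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_grad(logits, action, reward):
    # Gradient of reward * log pi(action) with respect to the logits:
    # reward * (onehot(action) - softmax(logits)).
    probs = softmax(logits)
    return [
        reward * ((1.0 if i == action else 0.0) - p)
        for i, p in enumerate(probs)
    ]


# Toy problem (hypothetical): only action 2 is rewarded.
rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
lr = 0.1  # assumed learning rate

for _ in range(500):
    a = sample_action(softmax(logits), rng)
    r = 1.0 if a == 2 else 0.0
    g = reinforce_grad(logits, a, r)
    # Gradient ascent on the expected reward.
    logits = [x + lr * gi for x, gi in zip(logits, g)]

final_probs = softmax(logits)
print(final_probs)
```

After the loop, most of the probability mass should have moved to the rewarded action, because only samples of that action produce a non-zero update.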
You can find an example implementation here: