What is action.reinforce(r) actually doing?

I think I see what you’re asking now. The gradient estimates from all the examples in the batch are summed in the stochastic node, with no implicit division by the batch size. So if you want the gradient averaged over the batch rather than summed, you need to divide your rewards by the batch size yourself.
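To make the sum-versus-mean point concrete, here is a small sketch in plain Python. It does not use PyTorch and is not how the stochastic node is implemented internally; it assumes a toy Bernoulli policy with a single logit `theta`, and the function name `reinforce_grad_sum` is made up for illustration. It only shows that summing per-example estimates makes the gradient grow with the batch size, and that pre-dividing the rewards turns the sum into a mean.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reinforce_grad_sum(theta, actions, rewards):
    """Sum (not mean) of per-example score-function estimates.

    Hypothetical helper: for a Bernoulli policy with logit theta,
    d/dtheta log p(a) = a - sigmoid(theta), so each example
    contributes r * (a - p) to the gradient estimate.
    """
    p = sigmoid(theta)
    return sum(r * (a - p) for a, r in zip(actions, rewards))

theta = 0.0
actions = [1, 0, 1, 1]            # sampled actions, one per example
rewards = [1.0, 0.5, 2.0, 1.0]    # one reward per example
batch_size = len(actions)

# Summed estimate: scales with the number of examples in the batch.
summed = reinforce_grad_sum(theta, actions, rewards)

# Dividing the rewards by the batch size beforehand gives the batch mean.
scaled = reinforce_grad_sum(theta, actions, [r / batch_size for r in rewards])

print(summed, scaled)
```

Whether you want the sum or the mean depends on how your learning rate was tuned. With the summed form, doubling the batch size roughly doubles the gradient magnitude.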
