I’m trying to reconcile the implementation of REINFORCE with the math. To begin, the REINFORCE algorithm attempts to maximize the expected reward:
$$\max_\theta \; \mathbb{E}_{x \sim p_\theta}\left[ r(x) \right]$$
In code, it’s commonly implemented for neural networks by taking the gradient of reward times logprob:

    loss = reward * logprob
    loss.backward()
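In other words,
$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[ r(x) \right] = \mathbb{E}_{x \sim p_\theta}\left[ r(x) \, \nabla_\theta \log p_\theta(x) \right]$$
where theta are the parameters of the neural network. This makes sense because we’re using the logprob trick to transform the gradient of our expected reward.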
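For reference, the logprob trick can be written out step by step. This is a sketch under assumptions that are mine rather than stated above: x is discrete (for continuous x the sums become integrals, assuming gradient and integral can be exchanged), and the reward r(x) does not itself depend on theta.

$$
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}\left[ r(x) \right]
&= \nabla_\theta \sum_x p_\theta(x) \, r(x) \\
&= \sum_x r(x) \, \nabla_\theta p_\theta(x) \\
&= \sum_x r(x) \, p_\theta(x) \, \nabla_\theta \log p_\theta(x) \\
&= \mathbb{E}_{x \sim p_\theta}\left[ r(x) \, \nabla_\theta \log p_\theta(x) \right]
\end{aligned}
$$

The third line uses the identity $\nabla_\theta p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$. Note that the gradient ends up inside the expectation, acting only on the log-probability.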
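What I’m not comfortable with is interpreting reward*logprob as a “loss”, because treating r(x) * log p_θ(x) as a Monte Carlo sample of a loss inevitably suggests taking N samples of x:
$$\frac{1}{N} \sum_{i=1}^{N} r(x_i) \log p_\theta(x_i), \qquad x_i \sim p_\theta$$
This is an approximation to:
$$\mathbb{E}_{x \sim p_\theta}\left[ r(x) \log p_\theta(x) \right]$$
which is emphatically not what we started out with:
$$\mathbb{E}_{x \sim p_\theta}\left[ r(x) \right]$$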
I believe there’s something subtle going on with taking the gradient through the sampling process, but somehow the sampling term gets lost in the math when you make Monte Carlo estimates.
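To make the discrepancy concrete, here is a minimal numerical sketch on a toy problem. Everything in it is an assumption chosen for illustration and not part of any library: a Bernoulli policy with p = sigmoid(theta), the made-up rewards in `REWARD`, and all function names. It compares three quantities: the gradient of the expected reward, the gradient of the expected reward*logprob (with theta also inside the sampling distribution), and the gradient of the sampled reward*logprob "loss" with the samples held fixed, which is roughly what `backward()` would compute.

```python
import math
import random

# Hypothetical rewards r(x) for the two outcomes; chosen only for illustration.
REWARD = {0: 1.0, 1: 3.0}

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def expected_reward(theta):
    """E_{x ~ p_theta}[r(x)] for x ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    return (1.0 - p) * REWARD[0] + p * REWARD[1]

def expected_reward_logprob(theta):
    """E_{x ~ p_theta}[r(x) log p_theta(x)], theta also drives the sampling."""
    p = sigmoid(theta)
    return (1.0 - p) * REWARD[0] * math.log(1.0 - p) + p * REWARD[1] * math.log(p)

def score(theta, x):
    """d/dtheta of log p_theta(x) for the Bernoulli-sigmoid model."""
    p = sigmoid(theta)
    return (1.0 - p) if x == 1 else -p

def finite_difference(f, theta, eps=1e-5):
    """Central finite-difference approximation of df/dtheta."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

def surrogate_gradient(theta, n_samples, rng):
    """Gradient of (1/N) sum_i r(x_i) log p_theta(x_i), samples x_i held fixed."""
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < p else 0      # sample x_i ~ p_theta
        total += REWARD[x] * score(theta, x)  # differentiate only the log-prob
    return total / n_samples

theta = 0.3
rng = random.Random(0)

grad_expected_reward = finite_difference(expected_reward, theta)
grad_expected_reward_logprob = finite_difference(expected_reward_logprob, theta)
grad_surrogate = surrogate_gradient(theta, 200_000, rng)

print("grad of E[r(x)]:            ", grad_expected_reward)
print("grad of E[r(x) log p(x)]:   ", grad_expected_reward_logprob)
print("grad of sampled 'loss':     ", grad_surrogate)
```

In this toy setting the gradient of the sampled "loss" lands close to the gradient of the expected reward, not to the gradient of the expected reward*logprob. That is at least consistent with the suspicion above: differentiating with the samples held fixed drops the dependence of the sampling distribution on theta, so reward*logprob seems to behave as a surrogate whose gradient is right rather than as a loss whose value is meaningful.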