Understanding REINFORCE implementation

I’m trying to reconcile the implementation of REINFORCE with the math. To begin, the REINFORCE algorithm attempts to maximize the expected reward

$$J(\theta) = \mathbb{E}_{x \sim P_\theta}\big[r(x)\big],$$

where $x$ is a sampled action or trajectory, $r(x)$ is its reward, and $P_\theta$ is the distribution induced by the policy.

In code, this is commonly implemented for neural networks by defining a “loss” as reward times log-probability and backpropagating through it:

loss = reward * logprob   # scalar reward times log P_theta(x) of the sampled action
loss.backward()           # backpropagate to get gradients w.r.t. theta

In other words,

$$\nabla_\theta \, \mathbb{E}_{x \sim P_\theta}\big[r(x)\big] = \mathbb{E}_{x \sim P_\theta}\big[r(x)\, \nabla_\theta \log P_\theta(x)\big],$$

where $\theta$ are the parameters of the neural network. This makes sense because we’re using the log-prob trick to transform the gradient of our expected reward.
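A minimal sketch of how that identity is usually turned into code (assuming a PyTorch categorical policy; the linear network, the dummy state, and the constant reward are illustrative placeholders, not anything from the thread):

import torch
import torch.nn as nn

# Hypothetical policy: maps an 8-dim observation to logits over 4 actions.
policy = nn.Linear(8, 4)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(8)                                  # placeholder observation
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                                  # x ~ P_theta
logprob = dist.log_prob(action)                         # log P_theta(x)
reward = 1.0                                            # placeholder r(x) from the environment

# Optimizers minimize, so the surrogate "loss" is negated: its gradient is
# -r(x) * grad_theta log P_theta(x), i.e. a step of gradient ascent on E[r].
loss = -reward * logprob
optimizer.zero_grad()
loss.backward()
optimizer.step()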

What I’m not comfortable with is interpreting reward*logprob as a “loss”, because treating $r(x)\,\log P_\theta(x)$ as a Monte Carlo sample of a loss inevitably suggests taking $N$ samples of $x$:

$$L(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} r(x_i)\, \log P_\theta(x_i), \qquad x_i \sim P_\theta .$$

This is an approximation to

$$\mathbb{E}_{x \sim P_\theta}\big[r(x)\, \log P_\theta(x)\big],$$

which is emphatically not what we started out with:

$$\mathbb{E}_{x \sim P_\theta}\big[r(x)\big].$$

I believe there’s something subtle going on with taking the gradient through the sampling process, but somehow the sampling term gets lost in the math when you make Monte Carlo estimates.
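A small numerical check illustrates the subtlety (a sketch only, with a made-up 3-way categorical “policy” and made-up rewards): the value of the surrogate is nowhere near $\mathbb{E}[r(x)]$, but its gradient, with the sampled $x_i$ treated as constants, matches $\nabla_\theta \mathbb{E}[r(x)]$.

import torch

# Made-up 3-way categorical "policy" parameterized directly by logits theta.
theta = torch.zeros(3, requires_grad=True)
rewards = torch.tensor([1.0, 2.0, 3.0])      # made-up r(x) for the three outcomes

# Exact objective E[r] = sum_x P_theta(x) r(x) and its exact gradient.
probs = torch.softmax(theta, dim=0)
exact = (probs * rewards).sum()
exact_grad = torch.autograd.grad(exact, theta)[0]

# Monte Carlo surrogate (1/N) sum_i r(x_i) log P_theta(x_i), with x_i held fixed.
N = 200_000
dist = torch.distributions.Categorical(logits=theta)
samples = dist.sample((N,))                  # integer samples; no gradient flows through sampling
surrogate = (rewards[samples] * dist.log_prob(samples)).mean()
surrogate_grad = torch.autograd.grad(surrogate, theta)[0]

print(exact.item())          # 2.0 (uniform probs, so E[r] = 2)
print(surrogate.item())      # about -2.2, not E[r]: the surrogate's value is not the objective
print(exact_grad)            # analytic policy gradient: [-1/3, 0, 1/3]
print(surrogate_grad)        # close to exact_grad: only the gradient is being estimated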

Probably not helping, but when you say reward * log_prob, I think you mean return * log_prob. They are different. The reward is what is given at each step of the episode; the return is the cumulative (discounted) reward over the rest of the episode from time t onward. In Sutton’s book the return is represented by ‘G’ in the pseudocode on page 271.
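To make the distinction concrete, here’s a sketch of computing returns from per-step rewards (the reward list and gamma are made up):

# Per-step rewards r_0, ..., r_{T-1} from one episode (made-up numbers).
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.99

# Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
# computed by sweeping backwards through the episode.
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()

# In REINFORCE, the log_prob at step t is weighted by returns[t], not rewards[t].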

In addition to what @keithmgould said, keep in mind that you are calculating the expectation of your return with respect to the trajectory distribution, and this expectation is the objective function to be maximized.

I recommend studying the first few slides of Sergey Levine’s Policy Gradient lecture notes to understand how the gradient of the objective is derived.
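For reference, the standard log-derivative-trick derivation goes roughly like this (with $\tau$ a sampled trajectory and $R(\tau)$ its return):

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \int P_\theta(\tau)\, R(\tau)\, d\tau
 = \int \nabla_\theta P_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)\, R(\tau)\, d\tau
 = \mathbb{E}_{\tau \sim P_\theta}\big[R(\tau)\, \nabla_\theta \log P_\theta(\tau)\big] \\
&\approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)\, \nabla_\theta \log P_\theta(\tau_i).
\end{aligned}
$$

The Monte Carlo approximation is applied only to the last expectation, after the $\theta$-dependence of the sampling distribution has been absorbed into the $\nabla_\theta \log P_\theta(\tau)$ term. That is why differentiating the fixed-sample surrogate $\frac{1}{N}\sum_i R(\tau_i)\log P_\theta(\tau_i)$ gives an unbiased estimate of $\nabla_\theta J(\theta)$, even though the surrogate’s value is not an estimate of $J(\theta)$.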