I’m trying to wrap my head around REINFORCE and in particular this line. This would appear to result in a loss of 0 if log_prob is 0 or if the reward (return) is 0. In the latter case, a very high return with 100% probability would look just like no return with any probability. I’m not sure how it could learn under such confounded signals. What am I missing?
The underlying logic is that you want to maximize the expected sum of rewards under the policy, which you can do by gradient ascent on that objective (equivalently, gradient descent on its negative). And the gradient of this objective is the expectation of the sum of R*derivative(log(policy)).
That’s the policy-gradient theorem, and it’s easy to prove (using the fact that d(pi) = d(log(pi))*pi )
Now, a less formal way to interpret the idea: you want to increase the log-likelihood of actions whose rewards are high and decrease it when rewards are small. Of course, there is a singularity when the policy is deterministic (it picks Ai in state Si with probability 1), but such a case is never reached from a sub-optimal start if the policy is not initialized in that singular way. Usually the policy is initialized with small normal noise around a uniform distribution.
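To make the objective concrete, here is a minimal sketch of the REINFORCE surrogate loss in plain Python (the episode’s log-probabilities and returns are made-up numbers, not from the actual example code):

```python
import math

# Hypothetical episode: the log-probability the current policy assigned
# to each action taken, and the return observed for each step.
log_probs = [math.log(0.2), math.log(0.7), math.log(0.5)]
returns = [1.0, 0.0, -1.0]

# REINFORCE surrogate loss: minimizing -sum(R * log pi) is gradient
# ascent on sum(R * log pi), whose gradient matches the policy-gradient
# theorem, E[sum R * d(log pi)].
loss = -sum(r * lp for lp, r in zip(log_probs, returns))
print(round(loss, 3))
```

Note the high-return, low-probability first step dominates the loss, which is exactly the pressure that pushes its probability up.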
Thanks for the explanation, @alexis-jacq. However, it is still unclear to me.
If the reward function can take zero or negative values, e.g. reward = [-1, 0, +1], how does the math work out in this line? When the reward is zero, the loss becomes zero, which means no error? But when the reward is +1 (the desired one), there is an error?
Or maybe the reward cannot be negative or zero in this code?
@confused, I believe that I can help explain the math a bit.
Notice that in this line, the rewards from an episode are standardized: shifted and scaled so that they have mean 0 and variance 1. This serves as a measure of the “goodness” of an action relative to the others in the trajectory. A standardized reward > 0 means that we want to increase the probability of taking that action given the state, and the reverse for rewards < 0.
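The standardization step can be sketched like this (plain Python, with made-up returns; the actual example does the equivalent with tensor operations):

```python
# Hypothetical per-step returns from one episode.
returns = [-1.0, 0.0, 1.0, 2.0]

# Standardize: subtract the mean, divide by the standard deviation,
# so the batch has mean 0 and (population) variance 1.
mean = sum(returns) / len(returns)
var = sum((r - mean) ** 2 for r in returns) / len(returns)
std = var ** 0.5
standardized = [(r - mean) / std for r in returns]
```

After this, below-average returns come out negative and above-average ones positive, which is what makes the “push probability up vs. down” interpretation work.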
Let’s look at the case with the highest loss: a very low probability of taking an action that led to a very high reward. Plug that into the (-log_prob * reward) term and you’ll see that it results in a very high loss, which we want to minimize.
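A quick numeric check of that claim (the probabilities and the reward of 2.0 are hypothetical):

```python
import math

# Loss term -log_prob * reward for a low-probability action with a
# high standardized reward, vs. a high-probability action with the
# same reward.
low_prob_loss = -math.log(0.01) * 2.0   # unlikely action, big reward
high_prob_loss = -math.log(0.99) * 2.0  # likely action, same reward
print(round(low_prob_loss, 2), round(high_prob_loss, 2))
```

The unlikely-but-rewarding action contributes a loss hundreds of times larger, so gradient descent works hardest on raising its probability.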
I hope I was able to help!
Thanks @kphng. But what happens if the reward is zero (in the above example rewards can be -1, 0, or +1)? Then loss = (-log_prob * 0) = 0, so no error will propagate?
@confused, yes, no error will propagate. This is because the goodness of that action, after standardization, is neither better nor worse than “average”, so we would not want to adjust the probability of taking it.
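You can see this directly: whatever the log-probability is (0.3 here is arbitrary), a zero reward zeroes out the term, and a zero term contributes zero gradient:

```python
import math

# A zero standardized reward wipes out the loss term, and with it
# the gradient with respect to the policy parameters.
loss_term = -math.log(0.3) * 0.0
print(loss_term)
```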
Thanks again, @kphng. Another question: since the goal is to minimize the error, a reward of -1 gives a smaller loss than a reward of +1 in (-log_prob * reward). Example:
a) reward = -1, prob = 0.9, loss = -log(0.9) * (-1) = -0.105
b) reward = +1, prob = 0.9, loss = -log(0.9) * (+1) = 0.105
In this scenario, case b has a higher error than case a. Am I missing something here?
@confused, I think the key is that instead of comparing the loss values between your two cases, you should think about minimizing each policy loss value by itself.
Let’s look at case b): increasing the probability of that action lowers its loss value. For case a), taking a step to minimize the loss means decreasing the probability of that action.
I wouldn’t view the theory as comparing the losses of the separate cases (a, b, etc…); instead, think of the loss as a scalar generated by summing the loss for every single case in a trajectory. By minimizing the loss for each case by itself, the overall loss decreases.
To make it more concrete:
With your example, the total loss for cases a and b would be -0.105 + 0.105 = 0. Now suppose we updated the weights with a gradient step, so that:
a) reward = -1, prob = 0.85, loss = -log(0.85) * (-1) = -0.162
b) reward = +1, prob = 0.95, loss = -log(0.95) * (+1) = 0.051
Now the total loss (with our new and improved weights) would be -0.162 + 0.051 = -0.111, which is a lower loss value than the 0 above.
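For completeness, recomputing those totals in plain Python (probabilities 0.9/0.9 before the hypothetical step, 0.85/0.95 after, as in the cases above; the second total comes out to about -0.111):

```python
import math

# Two-case totals: reward -1 at prob p_a, reward +1 at prob p_b,
# summed as in the trajectory loss.
before = -math.log(0.9) * (-1) + -math.log(0.95 - 0.05) * (+1)
after = -math.log(0.85) * (-1) + -math.log(0.95) * (+1)
print(round(before, 3), round(after, 3))
```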
Thanks @kphng, it makes more sense now.
@confused, glad I could help! I updated my above comment to give a concrete example of how the policy update would lower the loss, with the example cases that you provided above.