I took the actor-critic example from the examples and turned it into a tutorial with no Gym dependencies; the simulations run directly in the notebook. I'd like to know if I explained anything poorly, incorrectly, or not thoroughly enough, especially the parts about policy gradients.
Nice work! Looks pretty good at first glance. One suggestion that might help readability: you have a lot of nicely, clearly worded variables and then quite a few one-letter variables. Renaming some of those with more meaningful names might help with clarity for the reader.
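For example (hypothetical names here, not necessarily the ones in your notebook), the same return calculation reads very differently with descriptive names:

```python
rewards = [0.0, 0.0, 1.0]          # example episode: reward only at the end
gamma = discount_factor = 0.9

# Before: terse one-letter names
R = 0.0
for r in rewards[::-1]:
    R = r + gamma * R

# After: names that read like the math they implement
discounted_return = 0.0
for reward in reversed(rewards):
    discounted_return = reward + discount_factor * discounted_return

print(R, discounted_return)        # both compute G_0 = 0.9**2 * 1.0, about 0.81
```

Both loops are identical in behavior; the second one just doesn't make the reader keep a mental glossary.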
Sorry, after taking a second look the actor-critic part seems off to me. Is this actually performing well? Why a discount factor of 0.9 rather than the traditional 0.99? Also, do you have your rewards matched up incorrectly, offset by a step relative to the actions and values?
According to the Sutton book, this might be better described as “REINFORCE with baseline” (page 342) rather than actor-critic:
Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor-critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating a state from the estimated values of subsequent states), but only as a baseline for the state being updated.
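The distinction shows up in the update target. A minimal sketch of one transition (toy numbers and hypothetical variable names, not code from the notebook):

```python
# Toy numbers for a single transition s_t -> s_{t+1}, purely illustrative
gamma = 0.9
r_t = 0.0                              # immediate reward on this step
G_t = 0.81                             # full Monte Carlo return from s_t onward
V = {"s_t": 0.5, "s_next": 0.7}        # learned state-value estimates

# REINFORCE with baseline: the target is the complete return G_t.
# V(s_t) only re-centers it; no *other* state's value appears, so there
# is no bootstrapping.
advantage = G_t - V["s_t"]

# One-step actor-critic: the target bootstraps from V(s_{t+1}), so the
# estimate of the next state feeds back into the update.
td_error = r_t + gamma * V["s_next"] - V["s_t"]

print(advantage, td_error)             # roughly 0.31 and 0.13
```

Both quantities would multiply the policy's log-probability gradient; the difference is purely whether the value network's own predictions for subsequent states enter the target.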
But it does perform pretty well. The rewards are given only at the end of the episode, and the discount factor was lowered because episodes are relatively short (~50 steps).
Oh OK, I see how that can work, though it doesn't seem like it would be very robust as general-purpose algorithms go. As for the discount factor, in my understanding and use of it, a lower discount factor leads the agent to grab rewards as soon as it can (weighting immediate rewards highly), while a higher discount factor puts more importance on later rewards. Gridworld is an end-goal objective game, I believe (sorry, never played it lol), so it would benefit from a higher discount factor in my opinion. The number of steps should not be an issue.
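As a concrete check on how γ interacts with a ~50-step episode whose only reward arrives at the end (just the arithmetic, no claim about the notebook's behavior):

```python
steps = 50
terminal_reward = 1.0

# With the reward only at the terminal step, the return seen from the
# start state is gamma**steps * terminal_reward.
for gamma in (0.9, 0.99):
    print(gamma, gamma**steps * terminal_reward)
# gamma=0.9 leaves roughly 0.005 of the signal at the start state,
# gamma=0.99 leaves roughly 0.605.
```

So over 50 steps the choice of γ changes the magnitude of the terminal-reward signal at early states by about two orders of magnitude, which is presumably what both sides of this discussion are weighing.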