[announcement] Gridworld reinforcement learning tutorial

spro · June 7, 2017, 10:18pm

I took the actor-critic example from the PyTorch examples and turned it into a tutorial with no gym dependencies; the simulations run directly in the notebook. I'd like to know if I explained anything poorly, incorrectly, or in too little detail, especially the parts about policy gradients.

github.com

spro/practical-pytorch/blob/master/reinforce-gridworld/reinforce-gridworld.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://i.imgur.com/eBRPvWB.png)\n",
    "\n",
    "# Practical PyTorch: Playing GridWorld with Reinforcement Learning (Policy Gradients with REINFORCE)\n",
    "\n",
    "In this project we'll teach a neural network to navigate through a dangerous grid world.\n",
    "\n",
    "![](http://i.imgur.com/XNGB7sr.gif)\n",
    "\n",
    "Training uses [policy gradients](http://www.scholarpedia.org/article/Policy_gradient_methods) via the REINFORCE algorithm and a simplified Actor-Critic method. A single network calculates both a policy to choose the next action (the actor) and an estimated value of the current state (the critic). Rewards are propagated through the graph with PyTorch's [`reinforce` method](http://pytorch.org/docs/autograd.html?highlight=reinforce#torch.autograd.Variable.reinforce)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},

This file has been truncated.

dgriff · June 9, 2017, 7:06am

Nice work! Looks pretty good at first glance. One suggestion that might make it simpler for readers: you have a lot of nice, clearly worded variable names and then quite a few one-letter variables. Renaming some of those variables with more meaningful names might help with clarity.

dgriff · June 11, 2017, 8:29pm

Sorry, after taking a second look, the actor-critic part looks off to me. Is this performing well? Why a discount factor of 0.9 rather than the traditional 0.99? Also, do you have your rewards matched up incorrectly, running in a different step direction from the actions and values?

spro · June 11, 2017, 9:07pm

According to the Sutton book this might be better described as “REINFORCE with baseline” (page 342) rather than actor-critic:

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor-critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating a state from the estimated values of subsequent states), but only as a baseline for the state being updated.

But it does perform pretty well. The rewards are given only at the end of the episode, and the discount factor was lowered because episodes are relatively short (~50 steps).
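To make the quoted distinction concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the function names, the three-step episode, and the value estimates are all made up for illustration. It compares the learning signal of REINFORCE with baseline (full return minus the value of the current state) with a one-step bootstrapped actor-critic signal (which also uses the estimated value of the next state), for an episode that is rewarded only at the end.

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return G_t for every step, accumulated backwards from the end."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # returns[t] now lines up with the action taken at step t
    return returns


def baseline_advantages(rewards, values, gamma):
    """REINFORCE with baseline: G_t - V(s_t).

    Only the value of the state being updated is used; no bootstrapping.
    """
    returns = monte_carlo_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


def one_step_td_advantages(rewards, values, gamma):
    """One-step actor-critic: r_t + gamma * V(s_{t+1}) - V(s_t).

    Bootstraps from the next state's estimate; the terminal value is taken as 0.
    """
    next_values = values[1:] + [0.0]
    return [r + gamma * nv - v for r, nv, v in zip(rewards, next_values, values)]


# Hypothetical episode: reward arrives only on the final step.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.9]  # made-up critic estimates V(s_t)
gamma = 0.9               # the discount factor mentioned in this thread

print("returns:           ", monte_carlo_returns(rewards, gamma))
print("baseline advantage:", baseline_advantages(rewards, values, gamma))
print("one-step TD signal:", one_step_td_advantages(rewards, values, gamma))
```

In this toy episode the two signals agree on the last step, where there is no next state to bootstrap from, and differ on the earlier steps, where the bootstrapped version depends on the value estimate of the following state.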

dgriff · June 11, 2017, 10:04pm

Oh OK, I see how that can work, though it doesn't seem like it would be very robust as general-purpose algorithms go. As for the discount factor, in my understanding and experience a lower discount factor just leads the agent to grab rewards as soon as it can (favoring high immediate rewards), while a higher discount factor puts more weight on later rewards. Gridworld is an end-goal game, I believe (sorry, never played it, lol), so in my opinion it would benefit from a higher discount factor. The number of steps should not be an issue.
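The discount-factor trade-off discussed above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the two-option scenario (a reward of 1 right away versus a reward of 5 after 20 steps) are hypothetical, and the 50-step episode length is taken from the estimate given earlier in the thread.

```python
def discounted_value(reward, delay, gamma):
    """Present value of a reward received `delay` steps in the future."""
    return reward * gamma ** delay


def preferred_option(gamma):
    """Hypothetical choice: reward 1 now, or reward 5 after 20 steps."""
    now = discounted_value(1.0, 0, gamma)
    later = discounted_value(5.0, 20, gamma)
    return "now" if now > later else "later"


for gamma in (0.9, 0.99):
    # Weight that a reward on the last step of a 50-step episode
    # carries back at the first step (49 steps earlier).
    end_weight = discounted_value(1.0, 49, gamma)
    print(f"gamma={gamma}: end-of-episode weight at step 0 = {end_weight:.4f}, "
          f"prefers reward {preferred_option(gamma)}")
```

With a discount factor of 0.9, a reward 49 steps away keeps under 1% of its value at the first step and the small immediate reward wins; with 0.99 it keeps roughly 61% and the larger delayed reward wins. Whether that difference matters in practice for this gridworld is something the thread leaves open.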