Network always predicts a single move

I have a game with a square board in which the player moves around. They cannot move onto the last square that they were previously on, or out of bounds, but they can move around anywhere inside the square.

The game state is represented as a matrix. We mark the edge of the play area, the player, and the player's previous space each with a distinct magic number. There are a few other magic numbers for items in the play area that are not used (yet). The player starts at a random location.

We scale all of these between 0 and 1, then flatten the matrix as input to the agent. The reward is 1 for making a proper step, and -1 for stepping out of bounds or onto its previous location (the game resets in this case). There are four outputs corresponding to the four move directions. The agent uses vanilla policy gradients based on the example in the PyTorch repository, but we've expanded the network to two hidden layers of 128 nodes.
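For reference, the network roughly looks like this (the class name and board size below are my own placeholders; the thread doesn't give the actual dimensions):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Sketch of the described policy network: flattened board in,
    four action probabilities out, two hidden layers of 128 units.
    board_size is an assumed value, not taken from the thread."""

    def __init__(self, board_size=8):
        super().__init__()
        n_inputs = board_size * board_size  # flattened grid
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 4),  # one logit per move direction
        )

    def forward(self, x):
        # Softmax turns the logits into a distribution over moves
        return torch.softmax(self.net(x), dim=-1)
```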

In nearly all cases, the agent very quickly converges on moving in only one or two directions until it goes out of bounds. I think this is a local minimum: by moving only along a single direction or toward a corner, it never accidentally steps onto its previous space. But this is clearly not optimal, since it never learns to avoid going out of bounds.

I've tried convolutional layers, various network sizes, various learning parameters, verifying the input data, an actor-critic model, encouraging exploration by randomly forcing a move with some probability, different rewards, slightly different game mechanics, and a range of other things, but I must be missing something! Everything results in the (believed) local minimum, except forcing a random move, which instead causes an exploding gradient and a crash as the agent nears the local minimum (the log probability of the forced move blows up as its probability approaches zero). Any advice for debugging this?

What are you using for the state of your environment? It's possible you have already accounted for this, but the agent needs some way to know where it was in the previous state of the environment. Otherwise, the only way it can guarantee it never returns to its previous location is to always go in a single direction. If you have not accounted for this, try stacking the current observation with the previous observation for your state. Or, for a less computationally expensive state, subtract some fraction of the previous observation from the current one.
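Both options are a couple of lines with NumPy; a minimal sketch, assuming the observations are arrays of the same shape (function names and the decay value are mine):

```python
import numpy as np

def stacked_state(current_obs, previous_obs):
    """Stack the current and previous observations into one flat state."""
    return np.concatenate([current_obs.ravel(), previous_obs.ravel()])

def decayed_state(current_obs, previous_obs, decay=0.5):
    """Cheaper alternative: subtract a fraction of the previous
    observation from the current one. decay=0.5 is an assumed value."""
    return (current_obs - decay * previous_obs).ravel()
```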

If you have a sufficient state and the problem persists, it's probably a bug in your code, though there is a chance your hyperparameters need tuning. For something as simple as your game, good rules of thumb are 0.99 for the reward discount factor and 0.001 to 0.0001 for the learning rate (this depends on whether you are averaging or summing the gradient terms). Also, a good piece of advice (from John Schulman) is to try increasing the amount of data you use per update. You probably won't need anything more than 20,000 frames per update(?), but honestly this is one of those black magic things that is problem and implementation dependent.
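Just to be explicit about what the discount factor does: each step's return is its reward plus 0.99 times the next step's return. A plain-Python sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for each step of an episode,
    working backwards from the final reward."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns
```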

Thanks for the response Satchel! We create a 2D matrix representing the square board, bordered by values of 0.8; the agent is marked with 0.1 and the agent's previous location with 0.2. All other cells are 0. There are some other values I'd like to include in the grid eventually, but right now I'm trying to get a basic sample working. The 2D matrix is flattened prior to input.
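In case it's useful, this is roughly how we build that state (the function name and the particular board size are just for illustration):

```python
import numpy as np

def make_state(size, player, previous):
    """Build the board encoding described above: 0.8 on the border,
    0.1 at the player's cell, 0.2 at the previous cell, 0 elsewhere,
    then flatten. `size` includes the border ring; positions are
    (row, col) tuples."""
    grid = np.zeros((size, size), dtype=np.float32)
    grid[0, :] = grid[-1, :] = 0.8   # top and bottom border
    grid[:, 0] = grid[:, -1] = 0.8   # left and right border
    grid[previous] = 0.2             # previous location
    grid[player] = 0.1               # current location (overwrites if equal)
    return grid.ravel()
```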

I've tried learning rates ranging from 0.01 to 0.0001, and currently use 0.99 for the discount factor. I can try experimenting with other values.

However, I think your suggestion about the amount of data per update is interesting. Since the agent dies a lot at first, most episodes are only 1-4 moves long. Currently, the update is done at the end of every episode. It sounds like that might be far too frequent, so I'll try increasing it.

Nice, your state sounds great.

Yeah, try accumulating the gradients from 100-200 of those episodes and averaging them for the update (the normalizing factor can be folded into your learning rate). It'd be great if you let me know what you end up doing, and how it goes!
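In PyTorch that accumulation falls out naturally, since `backward()` adds into `.grad` until you zero it. A sketch, where `run_episode` is a hypothetical function returning the summed policy-gradient loss for one episode:

```python
import torch

def update_every_n_episodes(policy, optimizer, run_episode, n_episodes=100):
    """Accumulate gradients over many episodes, then take one step.
    `policy` is an nn.Module; `run_episode` is assumed to play one
    episode and return its policy-gradient loss as a scalar tensor."""
    optimizer.zero_grad()
    for _ in range(n_episodes):
        loss = run_episode(policy)
        # Dividing by n_episodes makes the accumulated gradient an
        # average; equivalently, fold this factor into the learning rate.
        (loss / n_episodes).backward()
    optimizer.step()
```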

Hey Satchel,

Sorry for the long delay! I wanted to be sure to follow up on this. We implemented gradient averaging for the update, as well as a couple of smaller tweaks like adding an entropy-based loss component, and it did the trick! The agent moved out of its local minimum and started exploring actual strategies that led to much longer episodes.
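For anyone finding this later, the entropy term we added looks roughly like this; it penalizes the policy for collapsing onto a single action (the coefficient 0.01 here is just an illustrative value, not necessarily what we used):

```python
import torch

def policy_loss_with_entropy(log_probs, returns, probs, beta=0.01):
    """Policy-gradient loss minus an entropy bonus. `log_probs` are the
    log probabilities of the actions taken, `returns` the discounted
    returns, and `probs` the full action distributions per step."""
    pg_loss = -(log_probs * returns).sum()
    # Mean entropy of the action distributions; subtracting it from the
    # loss rewards the agent for keeping its policy spread out.
    entropy = -(probs * probs.log()).sum(dim=-1).mean()
    return pg_loss - beta * entropy
```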

Thank you for your help!