I wanted to add my two cents.
One of the problems is related to what is known as “catastrophic forgetting” in supervised learning (the description that follows is forgetting, not the vanishing gradient problem). Bottom line: your agent learns a good policy and then stops visiting the sub-optimal areas of the state space. At some point, all your agent sees are the best states and actions, because nothing but samples from a near-optimal policy lands in your replay buffer. From then on, every update to your network comes from the same handful of near-optimal states and actions.
So, due to catastrophic forgetting, your agent forgets how to get to the best, straight-up pole position. It knows how to stay there, but not how to get there, because there are no more samples of that left in your replay buffer. As soon as the initial state is minimally different, chaos…
BTW, this happens in virtually every DQN/Cart-Pole example I’ve tested, even the ones using the continuous state variables instead of images. Yes, this includes OpenAI Baselines! Just change the code so it keeps training indefinitely and you’ll see the same divergence issues.
The way I got it to perform better was to increase the replay buffer size to 100,000 or 1,000,000 (you may want a solid implementation; see OpenAI Baselines: https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py) and to increase the batch size to ~64 or ~128. Reducing the learning rate should help as well, though it will also slow down learning, of course. I suppose this only postpones the issues, but at least I got 1,000 episodes of perfect performance, which works for me.
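To make the knobs concrete, here is a minimal sketch of a large FIFO replay buffer with uniform minibatch sampling. The class name and API are my own for illustration; they are not the Baselines implementation linked above, which is more efficient and battle-tested.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal sketch: fixed-capacity FIFO buffer with uniform sampling."""

    def __init__(self, capacity=100_000):
        # deque(maxlen=...) silently evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement over stored transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

With a capacity of 100,000+ the buffer keeps old, sub-optimal transitions around much longer, so minibatches still contain samples of how to recover, not just how to stay balanced.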
Finally, I found it interesting to review the basics described by Sutton in his book: http://incompleteideas.net/book/the-book-2nd.html
From the book, take a look at Example 3.4: Pole-Balancing.
Example 3.4: Pole-Balancing. The objective in this task is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over. A failure is said to occur if the pole falls past a given angle from vertical or if the cart runs off the track. The pole is reset to vertical after each failure. This task could be treated as episodic, where the natural episodes are the repeated attempts to balance the pole. The reward in this case could be +1 for every time step on which failure did not occur, so that the return at each time would be the number of steps until failure. In this case, successful balancing forever would mean a return of infinity. Alternatively, we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be −1 on each failure and zero at all other times. The return at each time would then be related to −γ^K, where K is the number of time steps before failure. In either case, the return is maximized by keeping the pole balanced for as long as possible.
Since the Cart-Pole example is set up as an episodic task in OpenAI Gym (+1 for every time step on which failure did not occur), gamma should then be set to 1.
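The two return formulations from the book can be checked numerically. This is a small sketch (function names are mine), where K is the number of time steps before failure:

```python
def episodic_return(K, gamma=1.0):
    # Episodic formulation: +1 per surviving step; with gamma == 1
    # the return is simply K, the number of steps until failure.
    return sum(gamma ** t * 1.0 for t in range(K))

def continuing_return(K, gamma=0.99):
    # Continuing formulation: reward is -1 at the failure step and 0
    # elsewhere, so the return is -gamma**K.
    return (gamma ** K) * -1.0
```

Both quantities grow with K, which is the book’s point: either way, the return is maximized by balancing as long as possible.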