I wanted to add my two cents.
One of the problems is related to what is known as “catastrophic forgetting” in supervised learning (the description that follows is forgetting, not the vanishing gradient problem). Bottom line: your agent learns a good policy and then stops visiting the sub-optimal areas of the state space. At some point, all your agent sees are the best states and actions, because nothing but samples from a near-optimal policy lands in your replay buffer. From then on, every update to your network comes from the same handful of near-optimal states and actions.
So, due to catastrophic forgetting, your agent forgets how to get to the best, straight-up pole position. It knows how to stay there, but not how to get there, because there are no more samples of that left in your replay buffer. As soon as the initial state is minimally different, chaos…
BTW, this happens in virtually every DQN/Cart-Pole example I’ve tested, even the ones using the continuous state variables instead of images. Yes, this includes OpenAI Baselines! Just change the code so it keeps training indefinitely and you’ll see the same divergence issues.
The way I got it to perform better was to increase the replay buffer size to 100,000 or 1,000,000 (you may want a solid implementation; see OpenAI Baselines: https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py) and to increase the batch size to ~64 or ~128. Reducing the learning rate should help as well, though it will also slow down learning, of course. I suppose this only postpones the issues, but at least I got 1,000 episodes of perfect performance, which works for me.
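To make the knobs concrete, here is a minimal sketch of a large FIFO replay buffer with uniform minibatch sampling. The class name and API are my own for illustration; they are not the Baselines implementation linked above, which is more efficient and battle-tested.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal sketch: fixed-capacity FIFO buffer with uniform sampling."""

    def __init__(self, capacity=100_000):
        # deque(maxlen=...) silently evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement over stored transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

With a capacity of 100,000+ the buffer keeps old, sub-optimal transitions around much longer, so minibatches still contain samples of how to recover, not just how to stay balanced.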
Finally, I found it interesting to review the basics described by Sutton in his book: http://incompleteideas.net/book/the-book-2nd.html
From the book, take a look at Example 3.4: Pole-Balancing.
Example 3.4: Pole-Balancing. The objective in this task is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over. A failure is said to occur if the pole falls past a given angle from vertical or if the cart runs off the track. The pole is reset to vertical after each failure. This task could be treated as episodic, where the natural episodes are the repeated attempts to balance the pole. The reward in this case could be +1 for every time step on which failure did not occur, so that the return at each time would be the number of steps until failure. In this case, successful balancing forever would mean a return of infinity. Alternatively, we could treat pole-balancing as a continuing task, using discounting. In this case the reward would be −1 on each failure and zero at all other times. The return at each time would then be related to −γ^K, where K is the number of time steps before failure. In either case, the return is maximized by keeping the pole balanced for as long as possible.
Since the Cart-Pole example is set up as an episodic task in OpenAI Gym (+1 for every time step on which failure did not occur), gamma should then be set to 1.
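The two return formulations from the book can be checked numerically. This is a small sketch (function names are mine), where K is the number of time steps before failure:

```python
def episodic_return(K, gamma=1.0):
    # Episodic formulation: +1 per surviving step; with gamma == 1
    # the return is simply K, the number of steps until failure.
    return sum(gamma ** t * 1.0 for t in range(K))

def continuing_return(K, gamma=0.99):
    # Continuing formulation: reward is -1 at the failure step and 0
    # elsewhere, so the return is -gamma**K.
    return (gamma ** K) * -1.0
```

Both quantities grow with K, which is the book’s point: either way, the return is maximized by balancing as long as possible.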