PvPZT_ETRON
(Baptiste Juran)
November 28, 2022, 8:50am
1
Hi! I tried to implement my first DQN agent for the gym CartPole environment, but it doesn't seem to learn: the score at the end of training is worse than random play.
I tried a few things:
changing some hyperparameters: learning rate, epsilon-greedy parameters, discount rate
making the network architecture much bigger
removing the target network
None of these helped, and I am very confused about what I'm doing wrong.
Thanks in advance for your help!
import gym
import math, random as rd, numpy as np, copy, matplotlib.pyplot as plt
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
env = gym.make('CartPole-v1', render_mode='human')
obs, info = env.reset()   # recent gym versions return an (observation, info) tuple
gamma = 0.95              # discount factor
lr = 0.0001               # learning rate
epsilon, epmax, epmin, epdecay = 1, 1, 0.1, 0.005   # epsilon-greedy schedule
N_episodes = 3000
n_input, n_hidden, n_out = 4, 5, 2   # CartPole: 4-dimensional observation, 2 actions
(The rest of the file has been truncated.)
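For reference, the epmax/epmin/epdecay values above are usually plugged into an exponentially decaying epsilon-greedy policy; the sketch below is only an illustration of that pattern (not the truncated code), with q_net standing in for the Q-network:
def select_action(q_net, state, episode):
    # decay epsilon from epmax towards epmin as episodes go by
    eps = epmin + (epmax - epmin) * math.exp(-epdecay * episode)
    if rd.random() < eps:
        return env.action_space.sample()          # explore: random action
    with T.no_grad():
        state_t = T.as_tensor(state, dtype=T.float32).unsqueeze(0)
        return int(q_net(state_t).argmax(dim=1))  # exploit: greedy w.r.t. current Q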
J_Johnson
(J Johnson)
December 4, 2022, 5:39pm
2
Do you have a chart of the training progress? What I've found is that DQNs often get better up to a point and then get much worse if you keep training them, so it's good to set milestones at which you save the model.
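As a minimal sketch of that idea (eval_score and q_net are placeholder names, not taken from the code above), you could keep a checkpoint of the best-scoring weights:
import torch

best_score = float('-inf')
# after evaluating the agent at a milestone:
if eval_score > best_score:
    best_score = eval_score
    torch.save(q_net.state_dict(), 'dqn_best.pt')  # keep the best weights seen so far
# when training ends, roll back to the best checkpoint instead of the final (possibly degraded) weights:
q_net.load_state_dict(torch.load('dqn_best.pt'))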
J_Johnson
(J Johnson)
December 4, 2022, 5:50pm
3
Getting the rewards and the Bellman target right can often be a weak point and may need some tweaking. This developer was having a similar issue (albeit in Keras): DQN debugging using Open AI gym Cartpole - ADG Efficiency
So you might need to review and tweak accordingly. DQNs are an ongoing area of research.
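For reference, the Bellman target for a batch of transitions is usually computed roughly as in the sketch below; q_net, target_net and the batch tensors (states, actions, rewards, next_states, dones) are assumed names rather than anything from the original code, and gamma is the discount factor:
import torch
import torch.nn.functional as F

# states: [B, 4], actions: [B] long, rewards: [B], next_states: [B, 4], dones: [B] in {0., 1.}
q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
with torch.no_grad():
    q_next = target_net(next_states).max(dim=1).values              # max over a' of Q_target(s', a')
    target = rewards + gamma * q_next * (1 - dones)                  # no bootstrapping past terminal states
loss = F.smooth_l1_loss(q_pred, target)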
J_Johnson
(J Johnson)
December 4, 2022, 5:54pm
4
Last comment: PyTorch has a tutorial with code you could give a try. When I tried it, the agent did improve over time.
https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
vmoens
(Vincent Moens)
December 8, 2022, 2:35pm
5
Minor note here:
We're working on improving the DQN tutorial; you can check it out here:
GitHub pull request: pytorch:master ← SiftingSands:DQN_revise_training (opened 03:53AM - 07 Sep 22 UTC)
Following up on the discussion from https://github.com/pytorch/tutorials/pull/2026
…
I still need to do multiple runs to get a semblance of the statistics of # episodes vs duration for both the original and my changes. The slight increase in model capacity still only uses ~1.5 GB of VRAM, so it should be pretty accessible and training is still relatively quick.
Here's the reward history for one run of these tweaks when I was doing a bunch of trial and error (spent an embarrassing amount of time tweaking hyperparameters and rewards).

@vmoens Feel free to change (or completely discard) anything based on your findings. I haven't tried tweaking anything else in the training pipeline.