A section to discuss RL implementations, research, problems
The actor-critic.py example does not really converge to a
running_reward > 200 for me. Did anyone get it to work? I found that the running reward quite often reached 199 and then started to decrease. Has anyone had a similar experience?
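One thing that can make the decay after ~199 less painful (a sketch, not something the example itself does): snapshot the best policy while the running reward is improving, so a later collapse doesn't lose the good parameters. The EMA update below is a common way such examples track running reward; `save_checkpoint` is a hypothetical helper.

```python
# Sketch: exponential moving average of episode reward, plus a
# best-so-far checkpoint so a later policy collapse is recoverable.
# Assumes the script tracks running reward as an EMA; the exact
# coefficients here are illustrative.

def update_running_reward(running_reward, ep_reward, alpha=0.05):
    """EMA of episode rewards: new = alpha * ep + (1 - alpha) * old."""
    return alpha * ep_reward + (1.0 - alpha) * running_reward

best = float("-inf")
running_reward = 10.0  # optimistic initial value
for ep_reward in [150, 180, 199, 210, 160]:  # stand-in episode returns
    running_reward = update_running_reward(running_reward, ep_reward)
    if running_reward > best:
        best = running_reward
        # save_checkpoint(model)  # hypothetical: snapshot parameters here
```

If the running reward then drops, you can reload the best checkpoint instead of retraining from scratch.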
I had the same problem: it only reached 199 in my environment, then bounced back and forth…
I got it to work the first few times I ran it, but later, without any changes, the same thing happened as you mentioned. Very weird; I thought it was using the same random seed throughout.
I’ve been trying to replicate DeepMind’s original results (http://arxiv.org/pdf/1312.5602v1.pdf) for the particular case of Atari Pong, but I am not succeeding…
One interesting thing that I am observing is that, after a few training iterations (one match is enough), my Q-network starts outputting zeros regardless of the input state! Initially, I thought there was a bug in my code, but now I think that it somehow makes sense. In Pong, the obtained reward is almost always zero (except in the frames where we score or concede a goal) and the Bellman equation is:
Q(s,a) = reward + GAMMA * max_a' Q(s',a')
so, every time we get a zero reward, the Bellman equation is easily satisfied if Q(s,a) = max_a' Q(s',a') = 0. That’s why I think my Q-network is basically learning to output zeros regardless of the input… Any hints on how I can overcome this issue?
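To make the fixed-point argument concrete, here is a minimal sketch (my own illustration, not from the paper) of the DQN target with terminal transitions masked out. With rewards that are zero almost everywhere, a network stuck at Q ≡ 0 satisfies the target exactly on every zero-reward transition; the rare scoring frames are the only signal that breaks that fixed point, which is one reason replay sampling and correct terminal handling matter so much:

```python
import numpy as np

GAMMA = 0.99  # discount factor, as in the DQN paper

def dqn_targets(rewards, q_next, terminal):
    """Bellman targets y = r + GAMMA * max_a' Q_target(s', a').
    rewards: (B,) floats; q_next: (B, n_actions) from the target net;
    terminal: (B,) bool. The bootstrap term is zeroed on terminals."""
    max_q = q_next.max(axis=1)
    return rewards + GAMMA * max_q * (~terminal)

rewards = np.array([0.0, 0.0, 1.0])      # sparse, Pong-like rewards
q_next = np.zeros((3, 6))                # a network stuck at zero output
terminal = np.array([False, False, True])
targets = dqn_targets(rewards, q_next, terminal)
# Only the scoring transition produces a nonzero target; the
# zero-reward rows give target 0, exactly matching Q(s,a) = 0.
```

In practice, using a separate (periodically synced) target network for `q_next` and making sure scoring transitions actually appear in your replay batches are the standard ways to push the network off the all-zeros solution.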
I am following the exact same methodology as in DeepMind’s paper, including the network architecture and the preprocessing.
Btw, I am not sure if this is the right place to ask this question; anyway, I would be very grateful if any of you could help…