I’ve been trying to replicate DeepMind’s original results (http://arxiv.org/pdf/1312.5602v1.pdf) for the particular case of Atari Pong, but I am not succeeding…
One interesting thing that I am observing is that, after a few training iterations (one match is enough), my Q-network starts outputting zeros regardless of the input state! Initially, I thought there was a bug in my code, but now I think it somehow makes sense. In Pong, the reward is almost always zero (except in the frames where we score or concede a point) and the Bellman equation is:
Q(s,a) = reward + GAMMA * max_a’ (Q(s’,a’))
so, every time we get a zero reward, the Bellman equation is easily satisfied if Q(s,a) = max_a’ (Q(s’,a’)) = 0. That’s why I think my Q-network is basically learning to output zeros regardless of the input… Any hints on how I can overcome this issue?
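To make the failure mode concrete, here is a minimal NumPy sketch of the one-step target from the equation above (function and variable names are my own, not from the paper). When the reward is zero and the network already outputs all zeros, the target is also zero, so the all-zero output is self-consistent and the network gets no learning signal:

```python
import numpy as np

GAMMA = 0.99  # discount factor

def bellman_targets(rewards, next_q_values, terminals):
    """One-step TD targets: y = r + GAMMA * max_a' Q(s', a').

    rewards:       shape (batch,); almost always 0 in Pong
    next_q_values: shape (batch, n_actions); Q(s', .) from the network
    terminals:     shape (batch,) booleans; no bootstrapping past episode end
    """
    max_next_q = next_q_values.max(axis=1)
    return rewards + GAMMA * max_next_q * (~terminals)

# Degenerate case: zero rewards + an all-zero network output...
r = np.zeros(4)
q_next = np.zeros((4, 6))        # 6 actions in Atari Pong
done = np.array([False] * 4)
print(bellman_targets(r, q_next, done))  # ...yields all-zero targets
```

This is why the rare non-zero rewards (the scoring frames) are the only thing that can pull the network away from the trivial all-zero fixed point, and why it can take a lot of experience before they do.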
I am following the exact same methodology as in DeepMind’s paper, including the network architecture and the preprocessing.
Btw, I am not sure if this is the right place to ask this question; in any case, I would be very grateful if any of you could help…