Model not training properly on GPU

Ausizio · May 31, 2020, 8:36am

Hi guys,

I have been trying to get into reinforcement learning through a Udemy course, which worked perfectly fine until it came to implementing a deep Q-learning algorithm.
My model works fine on the CPU (the only issue being that it is significantly slower) and shows the expected learning curve. The agent seems to learn how to play Pong after 220 games and reaches an average score of 16 (over the last 100 games) after 500 games.

If I run it on my GPU (GTX 980M) to improve performance, though, the learning curve falls off a cliff at some point: the average score peaks at about -18 and then drops back to -21. The learning curve for the GPU runs looks as follows:
[Figure: PongNoFrameskip-v4_DQN_pyTorch_50000Mem_300Games_cu_results]

I have since reverted to simply cloning the repository from the course (philtabor/Deep-Q-Learning-Paper-To-Code on GitHub, DQN directory on master), but the issue persists and I have not yet found out why. I recently reinstalled PyTorch and CUDA without success (Python 3.6.8, PyTorch 1.5.0, CUDA 10.1).

Any ideas as to why I am getting such odd behaviour?

Any input would be greatly appreciated.

Thanks in advance,
Alex

ptrblck · May 31, 2020, 10:51am

How reproducible is this behavior? I.e. for 10 runs with different seeds, how many times does the CPU model converge and the GPU model diverge?
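
A minimal sketch of how such a seed comparison could be set up, assuming the training script can be called once per seed and device. The helper names `seed_everything` and `count_converged`, and the score threshold, are hypothetical and not part of the course repository:

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default generator
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)  # seeds every visible GPU
    except ImportError:
        pass


def count_converged(final_scores, threshold):
    """Count the runs whose final average score reached the threshold."""
    return sum(score >= threshold for score in final_scores)
```

Calling `seed_everything(seed)` before each run and then comparing `count_converged(cpu_scores, threshold)` against `count_converged(gpu_scores, threshold)` would give the "how many out of 10" figure for each device. Note that seeding alone does not guarantee bit-identical results between CPU and GPU.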

Ausizio · May 31, 2020, 11:04am

On the same system I have been able to reproduce the behavior 9 out of 10 times. I have not been able to reproduce it on a different system and hardware though.

ptrblck · June 1, 2020, 7:24am

Were you using the same setup on the different systems, i.e. the same OS, PyTorch version, CUDA, cuDNN, etc.? If not, what were the differences?
Did you see any hardware issues in the first system before and could you run a stress test on the GPU?
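
One way to gather the details asked about here is a short report like the sketch below, run on both machines and compared. The function name `environment_report` is made up for illustration; PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete summary.

```python
import platform
import sys


def environment_report():
    """Collect the OS, Python, PyTorch, CUDA and cuDNN versions in one dict."""
    report = {"os": platform.platform(), "python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
        return report
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None on CPU-only builds
    report["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    if torch.cuda.is_available():
        report["gpu"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```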

Ausizio · June 1, 2020, 10:37am

The second system was running the same version of Python, but other than that I am not sure.
I also had the same issue on the first system with earlier versions of PyTorch and CUDA, but I do not remember exactly which ones, as I have since reinstalled and updated them.
I did not find any hardware issues while running the stress test.
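
In addition to a stress test, a quick numerical check is to run the same computation on the CPU and the GPU and compare the results. The sketch below is only an illustration under that assumption; `max_cpu_gpu_difference` is a hypothetical helper, and a small difference from floating-point rounding is expected, while large or erratic differences would be suspicious.

```python
def max_cpu_gpu_difference(trials=5, size=256, seed=0):
    """Return the largest absolute CPU/GPU matmul difference.

    Returns None when PyTorch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.manual_seed(seed)
    worst = 0.0
    for _ in range(trials):
        a = torch.randn(size, size)
        b = torch.randn(size, size)
        cpu_out = a @ b                        # reference result on the CPU
        gpu_out = (a.cuda() @ b.cuda()).cpu()  # same product computed on the GPU
        worst = max(worst, (cpu_out - gpu_out).abs().max().item())
    return worst


if __name__ == "__main__":
    print("largest CPU/GPU difference:", max_cpu_gpu_difference())
```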