Model not training properly on GPU

Hi guys,

I have been getting into reinforcement learning through a Udemy course, which went perfectly fine until it came to an implementation of a deep Q-learning (DQN) algorithm.
My model trains fine on the CPU (the only issue being that it is significantly slower) and shows the expected learning curve. The agent seems to learn how to play Pong after 220 games and reaches an average score of 16 (over the last 100 games) after 500 games.

If I run it on my GPU (GTX 980M) to improve performance, though, the learning curve falls off a cliff at some point: the average score seemingly peaks at -18 and then drops back to -21. The learning curve for the GPU runs looks as follows:
[Image: PongNoFrameskip-v4_DQN_pyTorch_50000Mem_300Games_cu_results]

I have since reverted to simply cloning the repository from the course ( https://github.com/philtabor/Deep-Q-Learning-Paper-To-Code/tree/master/DQN ), but the issue persists and I have not yet found out why. I also recently reinstalled PyTorch and CUDA, without success (Python 3.6.8, PyTorch 1.5.0, CUDA 10.1).

Any ideas as to why I am getting such an odd behaviour?

Any input would be greatly appreciated.

Thanks in advance,
Alex

How reproducible is this behavior? I.e. for 10 runs with different seeds, how many times does the CPU model converge and the GPU model diverge?
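To make the seed comparison concrete, here is a minimal sketch of how the runs could be seeded and tallied. `seed_everything` and `count_converged` are hypothetical helper names, not part of the course repository, and the threshold of 0 for "converged" is an arbitrary assumption.

```python
import random


def seed_everything(seed):
    """Seed the RNGs a DQN run typically touches (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # default CPU generator
        torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed, nothing to seed


def count_converged(final_avg_scores, threshold=0.0):
    """Return (converged, diverged) counts from a {seed: final average score} dict."""
    converged = sum(1 for score in final_avg_scores.values() if score >= threshold)
    return converged, len(final_avg_scores) - converged
```

Calling `seed_everything(seed)` at the start of each run and collecting the final 100-game average per seed, once for `cpu` and once for `cuda`, gives two tallies that can be compared directly. The environment has its own random state and would need seeding separately, and even with fixed seeds GPU runs are not guaranteed to be bitwise identical.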

On the same system I have been able to reproduce the behavior 9 out of 10 times. I have not been able to reproduce it on a different system with different hardware, though.

Were you using the same setup on the different systems, i.e. the same OS, PyTorch version, CUDA, cuDNN, etc.? If not, what were the differences?
Did you see any hardware issues on the first system before, and could you run a stress test on the GPU?
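To compare the two setups, something like the following could be run on both machines. This is only a sketch; `environment_report` is a hypothetical helper name, and it only covers the versions mentioned above.

```python
import platform
import sys


def environment_report():
    """Collect version info worth comparing between two machines."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this interpreter
        return info
    info["torch"] = torch.__version__
    info["cuda_runtime"] = torch.version.cuda       # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report including the driver version.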

The second system was running the same version of Python, but other than that I am not sure.
I also had the same issue on the first system earlier with different versions of PyTorch and CUDA, but I do not remember exactly which versions I used, as I have since reinstalled and updated them.
I did not find any hardware issues while running the stress test.