Can someone debug my implementation of Policy Gradients (REINFORCE) for playing Atari Breakout?

I’m learning about RL, and I’m struggling to get the simple REINFORCE algorithm working for playing Atari Breakout.

Self-contained code here:

My problem is that the policy collapses onto a single action: one action is chosen on nearly every step (not literally every step, since actions are sampled from the policy's distribution). As a result, the agent never learns to play properly.

I must have an implementation error somewhere, because as I understand it, actions that lead to negative reward should become relatively less likely, but they don't.
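For reference, here is the behaviour I'd expect, written out as a minimal NumPy sketch (toy rewards, not my actual code): the discounted returns are computed backwards over the episode, and standardizing them gives some steps a negative "advantage". In Breakout the raw rewards are never negative, so this standardization step is what would actually push probability away from bad actions.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy episode: a single reward at t = 2, nothing afterwards.
rewards = [0.0, 0.0, 1.0, 0.0]
g = discounted_returns(rewards, gamma=0.99)

# Standardize: steps with below-average return get a negative weight,
# which is what discourages the actions taken on those steps.
adv = (g - g.mean()) / (g.std() + 1e-8)
```

If your returns are all positive and you skip this normalization (or a baseline), every sampled action gets reinforced, which can look exactly like collapse onto one action.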

Is my implementation of the loss correct? My understanding is that you only backprop through the output neuron corresponding to the action you actually selected, but that the softmax then distributes the gradient to all the other output neurons. Is that right?
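To make the question concrete, here is my understanding of the math as a small NumPy sketch (toy logits, no network, not my actual code). For the single-step REINFORCE loss `L = -G * log pi(a|s)` with a softmax policy, the gradient with respect to the logits is `G * (pi - onehot(a))`, so every logit receives a gradient, not just the chosen action's neuron. The sketch checks that analytic form against a numerical gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

logits = np.array([0.5, -0.2, 1.3, 0.0])  # toy logits for 4 actions
action = 2                                # action actually sampled
G = 1.5                                   # return for this step

probs = softmax(logits)
onehot = np.eye(len(logits))[action]

# Analytic gradient of L = -G * log(probs[action]) w.r.t. the logits.
grad_logits = G * (probs - onehot)

# Numerical check via forward differences on each logit.
eps = 1e-6
num_grad = np.zeros_like(logits)
base_loss = -G * np.log(probs[action])
for i in range(len(logits)):
    bumped = logits.copy()
    bumped[i] += eps
    num_grad[i] = (-G * np.log(softmax(bumped)[action]) - base_loss) / eps
```

Note the signs: for a positive return the chosen action's logit gets a negative gradient (its probability goes up under gradient descent on `L`) and all other logits get positive gradients. If that pattern is inverted in my code, that alone would explain the collapse.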

If someone can figure out what I did wrong, I'd be very grateful. It would also be great to know how you went about finding it.