Can someone debug my implementation of Policy Gradients (REINFORCE) for playing Atari Breakout?

I’m learning about RL, and I’m struggling to get the simple REINFORCE algorithm working for playing Atari Breakout.

Self-contained code here:

My problem is that the policy collapses onto a single action: one action is chosen on nearly every step (not literally every step, since actions are sampled from the policy's distribution). As a result, the agent never learns to play properly.

I must have an implementation error somewhere, because as I understand it, actions that lead to negative reward should become relatively less likely, but they don't.
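For reference, here is the behaviour I'd expect, written out as a minimal NumPy sketch (toy rewards, not my actual code): the discounted returns are computed backwards over the episode, and standardizing them gives some steps a negative "advantage". In Breakout the raw rewards are never negative, so this standardization step is what would actually push probability away from bad actions.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy episode: a single reward at t = 2, nothing afterwards.
rewards = [0.0, 0.0, 1.0, 0.0]
g = discounted_returns(rewards, gamma=0.99)

# Standardize: steps with below-average return get a negative weight,
# which is what discourages the actions taken on those steps.
adv = (g - g.mean()) / (g.std() + 1e-8)
```

If your returns are all positive and you skip this normalization (or a baseline), every sampled action gets reinforced, which can look exactly like collapse onto one action.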

Is my implementation of the loss correct? My understanding is that you only backprop through the output neuron corresponding to the action you actually selected, but that the softmax then distributes the gradient to all the other output neurons. Is that right?
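To make the question concrete, here is my understanding of the math as a small NumPy sketch (toy logits, no network, not my actual code). For the single-step REINFORCE loss `L = -G * log pi(a|s)` with a softmax policy, the gradient with respect to the logits is `G * (pi - onehot(a))`, so every logit receives a gradient, not just the chosen action's neuron. The sketch checks that analytic form against a numerical gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

logits = np.array([0.5, -0.2, 1.3, 0.0])  # toy logits for 4 actions
action = 2                                # action actually sampled
G = 1.5                                   # return for this step

probs = softmax(logits)
onehot = np.eye(len(logits))[action]

# Analytic gradient of L = -G * log(probs[action]) w.r.t. the logits.
grad_logits = G * (probs - onehot)

# Numerical check via forward differences on each logit.
eps = 1e-6
num_grad = np.zeros_like(logits)
base_loss = -G * np.log(probs[action])
for i in range(len(logits)):
    bumped = logits.copy()
    bumped[i] += eps
    num_grad[i] = (-G * np.log(softmax(bumped)[action]) - base_loss) / eps
```

Note the signs: for a positive return the chosen action's logit gets a negative gradient (its probability goes up under gradient descent on `L`) and all other logits get positive gradients. If that pattern is inverted in my code, that alone would explain the collapse.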

If someone can figure out what I did wrong, I'd be very grateful. It would also be great to know how you went about finding it.