What was the final consensus? I’ve tried most of the suggestions here with no improvements:
switched from pixel input to the Gym state observations
tried MSE loss
tuned the learning rate (0.001, 0.0001)
changed the target network update period (10, 100, 1000)
None of these worked, and the average duration stays around 20 timesteps.
Even with the original PyTorch implementation, the average duration tops out at around 50 timesteps.
In my case, L1 loss required a much longer target-network synchronization interval for training to succeed.
In the CartPole-v0 environment with the numerical state representation (not the image-based one), L2 loss works well when the target network is synchronized every 100 frames, but L1 loss needed a synchronization interval of at least 5000 frames.
I never tried Huber loss, but I would expect it to behave essentially like L1 loss, since it is linear for large errors.
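To make the comparison concrete, here is a minimal sketch (not taken from the tutorial; the tiny linear networks and the `TARGET_SYNC` value are illustrative assumptions) showing the three loss choices discussed above and the target-network sync pattern. Note that for large TD errors MSE grows quadratically while L1 and Huber grow linearly, which is why Huber is expected to behave like L1 in that regime:

```python
import torch
import torch.nn as nn

# Hypothetical tiny Q-networks for CartPole's 4-dim state, 2 actions
policy_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(policy_net.state_dict())

TARGET_SYNC = 5000  # frames between syncs; illustrative, per the comment L1 needed >= 5000

# the loss choices discussed in the thread
l2_loss = nn.MSELoss()       # L2 / "mse loss"
l1_loss = nn.L1Loss()        # L1
huber = nn.SmoothL1Loss()    # Huber: quadratic near 0, linear for large errors

# example Q-values vs targets with one large error (10 vs 0)
q = torch.tensor([1.0, 2.0, 10.0])
target = torch.tensor([1.5, 2.0, 0.0])
print(l2_loss(q, target).item())  # large error dominates quadratically
print(l1_loss(q, target).item())  # linear in the error
print(huber(q, target).item())    # close to L1 here, slightly smaller

# sync pattern inside the training loop (sketch)
for frame in range(1, 2 * TARGET_SYNC + 1):
    # ... sample a batch, compute the chosen loss, optimizer step ...
    if frame % TARGET_SYNC == 0:
        target_net.load_state_dict(policy_net.state_dict())
```

With these numbers the L2 loss is roughly an order of magnitude larger than L1 or Huber, so the same learning rate effectively takes much bigger steps, which is one plausible reason the two losses want different sync intervals.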