I’m pretty new to PyTorch and I’m having trouble getting my agent to learn once I introduce a second network as the target network in a DQN. I originally had a simple DQN with just a policy net and a replay buffer, and it learned pretty well. Per convention, and for better stability, I want to introduce a target net. As a first step, I have the target net updating every iteration so I can verify that I get similar results to the simple one-net DQN.
Okay, now for the issue I’m facing. I added the target net by creating a second network object from the same class I used for my policy net. During training, I sample from my replay memory, which stores (state, action, reward, new_state, done_flag) tuples. I send the state batch forward through the policy net to get my predictions, send the new_state batch forward through the target net to get my target values, and calculate my loss from those. Since the target net is synced every iteration, the losses come out the same whether I compute the targets with the policy net or the target net. I call loss.backward(); the gradient isn’t 0, nor does it explode, but the network only ever learns when I use the policy net to calculate the targets. The weights do change, so the gradient is being applied, just not in a way that improves the policy.
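For reference, here is a minimal sketch of the training step I’m describing. The network class, dimensions, hyperparameters, and the fake replay batch are all placeholders, not my actual code:

```python
import torch
import torch.nn as nn

# Placeholder network class -- stands in for the class I use for both nets
class QNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, x):
        return self.net(x)

policy_net = QNet()
target_net = QNet()
# Sync every iteration for now, so both nets hold identical weights
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

# Fake replay batch of (state, action, reward, new_state, done_flag)
batch = 32
state = torch.randn(batch, 4)
action = torch.randint(0, 2, (batch, 1))
reward = torch.randn(batch, 1)
new_state = torch.randn(batch, 4)
done = torch.zeros(batch, 1)

# Predictions: Q(s, a) from the policy net for the actions actually taken
q_pred = policy_net(state).gather(1, action)

# Targets: bootstrapped value from the target net, with no gradient
# flowing back into it
with torch.no_grad():
    q_next = target_net(new_state).max(dim=1, keepdim=True).values
target = reward + gamma * q_next * (1 - done)

loss = nn.functional.smooth_l1_loss(q_pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Swapping `target_net` for `policy_net` in the `q_next` line is the only change between the version that learns and the version that doesn’t.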
In short, I’m only able to get this to learn when I use the policy net to compute the target_reward.