Why is my DQN (Deep Q Network) not learning?

I am training a DQN (Deep Q Network) on the CartPole problem from OpenAI’s gym, but the total score per episode decreases instead of increasing. I don’t know if it helps, but I noticed that the agent prefers one action over the other and refuses to do anything else (unless it is forced to by the epsilon-greedy strategy), at least for some time. I tried my best, but I just can’t figure out what is going on.

Here is my code:


import torch as t
import torch.nn as nn
import torch.nn.functional as f

import random as r

class QNet:
    def predict(self, x: t.Tensor) -> t.Tensor:
        raise NotImplementedError  # implemented by subclasses

    def copy_weights(origin: [], target: []):
        for origin_layer, target_layer in zip(origin, target):
            target_layer.weight = nn.Parameter(origin_layer.weight.clone())

class Memory:
    def __init__(self, state: t.Tensor, next_state: t.Tensor, action: int, reward: float):
        self.state = state
        self.next_state = next_state
        self.action = action
        self.reward = reward

class ReplayMemory:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.memories = []

    def add_memory(self, memory: Memory):
        self.memories.append(memory)

        if len(self.memories) > self.capacity:
            self.memories.pop(0)  # drop the oldest memory

    def get_batch(self, size: int):
        if len(self.memories) < size:
            raise Exception("There are not enough memories to make a batch.")

        start_index = r.randint(0, len(self.memories) - size)
        end_index = start_index + size
        return self.memories[start_index:end_index]

class QLearning:
    def __init__(self, net: QNet, target_net: QNet, optimizer, gamma: float):
        self.net = net
        self.target_net = target_net
        self.optimizer = optimizer
        self.gamma = gamma

    def train(self, batch: [Memory]):
        batched_pred = []
        batched_opt_pred = []
        for sample in batch:
            pred = self.net.predict(sample.state)

            opt_pred = pred.clone()
            opt_pred[sample.action] = sample.reward
            if sample.next_state is not None:
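                opt_pred[sample.action] += t.max(self.target_net.predict(sample.next_state)) * self.gamma

            batched_pred.append(pred)
            batched_opt_pred.append(opt_pred)

        self.optimizer.zero_grad()
        loss = f.mse_loss(t.stack(batched_pred), t.stack(batched_opt_pred))
        loss.backward()
        self.optimizer.step()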


import gym
import torch.optim as optim

from qlearning import *

env = gym.make("CartPole-v1")
state = t.tensor(env.reset(), dtype=t.float)

class Agent(nn.Module, QNet):
    def __init__(self):
        super().__init__()  # required before registering nn.Linear layers

        self.l1 = nn.Linear(4, 32)
        self.l2 = nn.Linear(32, 16)
        self.l3 = nn.Linear(16, 8)
        self.l4 = nn.Linear(8, 4)
        self.l5 = nn.Linear(4, 2)

    def predict(self, x):
        y = f.relu(self.l1(x))
        y = f.relu(self.l2(y))
        y = f.relu(self.l3(y))
        y = f.relu(self.l4(y))
        return self.l5(y)

agent = Agent()
target_agent = Agent()
q = QLearning(agent, target_agent, optim.Adam(agent.parameters(), lr=0.001), 0.9)
replay_memory = ReplayMemory(100000)
epsilon = 1
epsilon_dec = 1 / 1000
total_reward = 0
for i in range(1000):

    action = 0
    if r.random() > epsilon:
        action = t.argmax(agent.predict(state)).item()
    else:
        action = env.action_space.sample()

    epsilon -= epsilon_dec

    next_state, reward, done, info = env.step(action)
    next_state = t.tensor(next_state, dtype=t.float)
    if done:
        reward = -1
        replay_memory.add_memory(Memory(state, None, action, reward))
    else:
        replay_memory.add_memory(Memory(state, next_state, action, reward))

    total_reward += reward

    if done:
        state = t.tensor(env.reset(), dtype=t.float)

        # print(int(total_reward))
        total_reward = 0

    if len(replay_memory.memories) >= 10:
        q.train(replay_memory.get_batch(10))

    if i % 10:
        QNet.copy_weights([agent.l1, agent.l2, agent.l3, agent.l4, agent.l5],
                          [target_agent.l1, target_agent.l2, target_agent.l3, target_agent.l4, target_agent.l5])

    state = next_state
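
A side note on the replay buffer above: get_batch returns one contiguous slice, so every batch is made of consecutive, strongly correlated steps. DQN implementations usually draw the batch uniformly at random instead. A minimal sketch, assuming the same list-backed buffer; the class name UniformReplayMemory is made up for illustration and integers stand in for Memory objects:

```python
import random


class UniformReplayMemory:
    """List-backed replay buffer that samples transitions uniformly at random."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.memories = []

    def add_memory(self, memory):
        self.memories.append(memory)
        if len(self.memories) > self.capacity:
            self.memories.pop(0)  # evict the oldest transition

    def get_batch(self, size: int):
        if len(self.memories) < size:
            raise ValueError("There are not enough memories to make a batch.")
        # random.sample draws `size` distinct entries from anywhere in the buffer.
        return random.sample(self.memories, size)


memory = UniformReplayMemory(capacity=5)
for step in range(8):
    memory.add_memory(step)  # integers stand in for Memory objects
batch = memory.get_batch(3)
```

Only get_batch differs from the original ReplayMemory; the eviction logic is the same.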

You have to copy the learning network to the target network. The target network is not updating.

import copy

def update_Network_Parameters(self, tau=None):
    # Soft update of the target network; a factor of 1 amounts to a hard copy.
    if tau is None:
        tau_param = self.tau_param
        tau_q = self.tau_q
    else:
        tau_param = 1
        tau_q = 1
    # Materialize the generators as dicts so they can be indexed by name.
    q_net_dict = dict(self.q_net_dict.named_parameters())
    target_q_net_dict = dict(self.target_q_net_dict.named_parameters())

    for name in q_net_dict:
        target_q_net_dict[name] = tau_param * copy.deepcopy(q_net_dict[name]) + (1 - tau_param) * copy.deepcopy(target_q_net_dict[name])
    self.target_q_net_dict.load_state_dict(target_q_net_dict)

I think I do. Right here:

QNet.copy_weights([agent.l1, agent.l2, agent.l3, agent.l4, agent.l5],
                  [target_agent.l1, target_agent.l2, target_agent.l3, target_agent.l4, agent.l5])

(Here is the function used)

def copy_weights(origin: [], target: []):
    for origin_layer, target_layer in zip(origin, target):
        target_layer.weight = nn.Parameter(origin_layer.weight.clone())

I even made an if statement to be sure:

weights = q.target_net.l1.weight
QNet.copy_weights([agent.l1, agent.l2, agent.l3, agent.l4, agent.l5],
                  [target_agent.l1, target_agent.l2, target_agent.l3, target_agent.l4, agent.l5])
if not t.all(t.eq(q.target_net.l1.weight, weights)).item():
    print("works fine")

But I have to say that in the previous code, I didn’t copy the last (fifth) layer. So your statement was technically true.
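
Worth noting: copy_weights as quoted above assigns only each layer’s weight, so the biases of the target network are never updated. Since Agent subclasses nn.Module, a full hard copy can probably be done through the state dict instead. A minimal sketch, assuming both networks share the same architecture; hard_copy is a hypothetical helper and the nn.Sequential models are stand-ins for Agent:

```python
import torch
import torch.nn as nn


def hard_copy(origin: nn.Module, target: nn.Module) -> None:
    # state_dict() holds every parameter and buffer (weights and biases);
    # load_state_dict copies the values into the target's own tensors.
    target.load_state_dict(origin.state_dict())


origin = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
target = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
hard_copy(origin, target)
```

With the classes from the question, the equivalent call would be target_agent.load_state_dict(agent.state_dict()).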

Try not to hard copy them. Q-learning tends to diverge quickly if the target moves too fast.
Also, you have a really deep neural network. Try a shallower one with only 2-3 layers. As mentioned in the OFENET paper, deep RL networks also tend to diverge.

Besides that, does it work now?
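
On how fast the target moves: in the training loop from the question, the copy is guarded by "if i % 10:", which is truthy whenever i is not a multiple of 10, so the target network is overwritten on nine out of every ten steps. If the intent was to sync on every tenth step, the check would need "== 0". A quick count over the 1000 steps of that loop (SYNC_EVERY is an illustrative name):

```python
TOTAL_STEPS = 1000
SYNC_EVERY = 10  # hypothetical name for the update period

# `i % 10` is truthy for every step that is NOT a multiple of 10.
copies_with_truthy_check = sum(1 for i in range(TOTAL_STEPS) if i % SYNC_EVERY)

# `i % 10 == 0` is true only on every 10th step.
copies_with_equality_check = sum(1 for i in range(TOTAL_STEPS) if i % SYNC_EVERY == 0)

print(copies_with_truthy_check, copies_with_equality_check)  # prints "900 100"
```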

Unfortunately, no (even after your improvements). But I think the problem is somewhere in the train function.

Also, you don’t want to train the target network.
So you have to compute its prediction inside "with torch.no_grad():"

if sample.next_state is not None:
    with t.no_grad():
        next_q = t.max(self.target_net.predict(sample.next_state))
    opt_pred[sample.action] += next_q * self.gamma

Thanks, but it still doesn’t work. I should point out that when I print the loss, most of the values are fine (somewhere under 1), but under some specific conditions the loss can be as high as about 60.

(Screenshot of my PyCharm console, 2021-07-12)

Also, I noticed that most people gather the state-action values from the prediction and the target and compute the loss from those. I use the whole tensor instead, so I wonder whether that is the problem.

(Code from PyTorch tutorial on DQN)

state_action_values = policy_net(state_batch).gather(1, action_batch)
next_state_values = torch.zeros(BATCH_SIZE, device=device)
next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
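
For reference, gather(1, action_batch) picks, for each row of Q-values, the entry at the index of the action that was taken, so the loss only involves the actions actually played. A tiny example with made-up numbers:

```python
import torch

# One row of Q-values per sample, one column per action.
q_values = torch.tensor([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])
# Index of the action taken in each sample, shaped (batch, 1) as gather expects.
actions = torch.tensor([[1], [0], [1]])

taken = q_values.gather(1, actions)
print(taken.tolist())  # prints [[2.0], [3.0], [6.0]]
```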

Isn’t this for a policy network, as in actor-critic methods?

DQN doesn’t have a policy network, just a value network.

No, it’s not. Or at least I don’t want it to be a policy network.

Look, DQN has one and only one value network.
Actor-critic methods have one value network and one policy network.

The policy is evaluated by the value network. That’s why you gather the max actions of the policy network and assign values to them through the value network.

Yeah, but that code was from the PyTorch tutorial on DQNs. Here’s the link: Reinforcement Learning (DQN) Tutorial — PyTorch Tutorials 1.9.0+cu102 documentation

And this is their training code:

    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()

Oh yeah, you are right!
You have to calculate the state-action value for the chosen action, which you messed up a little bit:

opt_pred[sample.action] = opt_pred[sample.action] + sample.reward

So that code you wrote is the solution?

Yeah, you were setting the Q-value of the action to just the reward and not to the Q-value of the action plus the reward. That was the mistake, I guess. Change it, try it out, and tell me if it worked.

No wait, that’s totally messed up. Why do you even clone the prediction for the opt prediction?

pred = self.net.predict(sample.state)[sample.action]

opt_pred = t.tensor(sample.reward, dtype=t.float)  # a tensor, so terminal samples can be stacked too
if sample.next_state is not None:
    opt_pred += t.max(self.target_net.predict(sample.next_state)) * self.gamma
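
The same per-sample target can also be computed for a whole batch at once, in the style of the tutorial code quoted earlier. This is only a sketch under assumptions: td_loss and its arguments are hypothetical, terminal transitions are marked with a done flag instead of next_state being None, and two nn.Linear layers stand in for the real networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def td_loss(net, target_net, states, actions, rewards, next_states, done, gamma):
    # Q(s, a) for the actions actually taken, shape (batch,).
    q_taken = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # no gradients flow into the target network
        next_q = target_net(next_states).max(dim=1).values
        # Terminal transitions (done == 1) contribute only their reward.
        target = rewards + gamma * next_q * (1.0 - done)
    return F.mse_loss(q_taken, target)


net = nn.Linear(4, 2)         # stand-in for the online Q-network
target_net = nn.Linear(4, 2)  # stand-in for the target network
states = torch.randn(3, 4)
next_states = torch.randn(3, 4)  # rows of terminal samples are ignored via `done`
actions = torch.tensor([0, 1, 1])
rewards = torch.tensor([1.0, 1.0, -1.0])
done = torch.tensor([0.0, 0.0, 1.0])

loss = td_loss(net, target_net, states, actions, rewards, next_states, done, gamma=0.9)
loss.backward()
```

Only the Q-value of the chosen action enters the loss, which matches the per-sample version above.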



HOLY SHIT!!! NO FUCKING WAY!!! It works!!! Thank you sooooooo much

I was working on this for a very long time and I really appreciate the effort you put into helping me. So thank you one more time.


Sooooooo, ehmmmmmm… I don’t want to bother you with this issue anymore but it looks like it worked only that one time, and now without any changes it’s still messed up. So, I would be glad if you helped me out again, but I totally understand if you don’t want to work on this anymore.

Oops, never mind, I just didn’t train it long enough :sweat_smile: :sweat_smile:
