Why is my DQN (Deep Q Network) not learning?

I am training a DQN (Deep Q Network) on the CartPole problem from OpenAI's gym, but the total score per episode decreases instead of increasing. I don't know if it is helpful, but I noticed that the agent prefers one action over the other and refuses to do anything else (unless it is forced to by the epsilon-greedy strategy), at least for some time. I tried my best, but I just can't figure out what is going on.

Here is my code:

(qlearning.py)

import torch as t
import torch.nn as nn
import torch.nn.functional as f

import random as r


class QNet:
    def predict(self, x: t.Tensor) -> t.Tensor:
        pass

    @staticmethod
    def copy_weights(origin: [], target: []):
        for origin_layer, target_layer in zip(origin, target):
            target_layer.weight = nn.Parameter(origin_layer.weight.clone())


class Memory:
    def __init__(self, state: t.Tensor, next_state: t.Tensor, action: int, reward: float):
        self.state = state
        self.next_state = next_state
        self.action = action
        self.reward = reward


class ReplayMemory:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.memories = []

    def add_memory(self, memory: Memory):
        self.memories.append(memory)

        if len(self.memories) > self.capacity:
            self.memories.pop(0)

    def get_batch(self, size: int):
        if len(self.memories) < size:
            raise Exception("There are not enough memories to make a batch.")

        # pick a random contiguous window of `size` consecutive memories
        start_index = r.randint(0, len(self.memories) - size)
        end_index = start_index + size
        return self.memories[start_index:end_index]


class QLearning:
    def __init__(self, net: QNet, target_net: QNet, optimizer, gamma: float):
        self.net = net
        self.target_net = target_net
        self.optimizer = optimizer
        self.gamma = gamma

    def train(self, batch: [Memory]):
        batched_pred = []
        batched_opt_pred = []
        for sample in batch:
            pred = self.net.predict(sample.state)

            opt_pred = pred.clone()
            opt_pred[sample.action] = sample.reward
            if sample.next_state is not None:
                opt_pred[sample.action] += t.max(self.target_net.predict(sample.next_state)) * self.gamma

            batched_pred.append(pred)
            batched_opt_pred.append(opt_pred)

        loss = f.mse_loss(t.stack(batched_pred), t.stack(batched_opt_pred))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

(main.py)

import gym
import torch.optim as optim

from qlearning import *

env = gym.make("CartPole-v1")
state = t.tensor(env.reset(), dtype=t.float)


class Agent(nn.Module, QNet):
    def __init__(self):
        super().__init__()

        self.l1 = nn.Linear(4, 32)
        self.l2 = nn.Linear(32, 16)
        self.l3 = nn.Linear(16, 8)
        self.l4 = nn.Linear(8, 4)
        self.l5 = nn.Linear(4, 2)

    def predict(self, x):
        y = f.relu(self.l1(x))
        y = f.relu(self.l2(y))
        y = f.relu(self.l3(y))
        y = f.relu(self.l4(y))
        return self.l5(y)


agent = Agent()
target_agent = Agent()
q = QLearning(agent, target_agent, optim.Adam(agent.parameters(), lr=0.001), 0.9)
replay_memory = ReplayMemory(100000)
epsilon = 1
epsilon_dec = 1 / 1000
total_reward = 0
for i in range(1000):
    env.render()

    action = 0
    if r.random() > epsilon:
        action = t.argmax(agent.predict(state)).item()
    else:
        action = env.action_space.sample()

    epsilon -= epsilon_dec

    next_state, reward, done, info = env.step(action)
    next_state = t.tensor(next_state, dtype=t.float)
    if done:
        reward = -1
        replay_memory.add_memory(Memory(state, None, action, reward))
    else:
        replay_memory.add_memory(Memory(state, next_state, action, reward))

    total_reward += reward

    if done:
        state = t.tensor(env.reset(), dtype=t.float)

        # print(int(total_reward))
        total_reward = 0

    if len(replay_memory.memories) >= 10:
        q.train(replay_memory.get_batch(10))

    if i % 10:
        QNet.copy_weights([agent.l1, agent.l2, agent.l3, agent.l4, agent.l5],
                          [target_agent.l1, target_agent.l2, target_agent.l3, target_agent.l4, target_agent.l5])

    state = next_state
env.close()

You have to copy the learning network to the target network. The target network is not updating.
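Something like the soft-update helper below would do the copying; tau_param is a Polyak averaging factor, and passing an explicit tau does a hard copy instead: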



def update_network_parameters(self, tau=None):
    # Polyak/soft update: target = tau * online + (1 - tau) * target.
    # Passing an explicit tau forces a hard copy (tau = 1).
    if tau is None:
        tau = self.tau_param
    else:
        tau = 1

    q_net_params = dict(self.q_net_dict.named_parameters())
    target_params = dict(self.target_q_net_dict.named_parameters())

    for name in q_net_params:
        target_params[name] = tau * copy.deepcopy(q_net_params[name]) + \
                              (1 - tau) * copy.deepcopy(target_params[name])

    self.target_q_net_dict.load_state_dict(target_params)
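The idea is to call this once per learning step with a small tau (something like 0.005 or 0.01, the exact value needs tuning), so the target network trails the online network smoothly instead of jumping every time you hard-copy.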

@TheUnnamed22
I think I do. Right here:

QNet.copy_weights([agent.l1, agent.l2, agent.l3, agent.l4, agent.l5],
                          [target_agent.l1, target_agent.l2, target_agent.l3, target_agent.l4, agent.l5])

(Here is the function being used)

@staticmethod
def copy_weights(origin: [], target: []):
    for origin_layer, target_layer in zip(origin, target):
        target_layer.weight = nn.Parameter(origin_layer.weight.clone())

I even made an if statement to be sure:

weights = q.target_net.l1.weight
QNet.copy_weights([agent.l1, agent.l2, agent.l3, agent.l4, agent.l5],
                          [target_agent.l1, target_agent.l2, target_agent.l3, target_agent.l4, agent.l5])
if not t.all(t.eq(q.target_net.l1.weight, weights)).item():
    print("works fine")

But I have to say that in the previous code, I didn’t copy the last (fifth) layer. So your statement was technically true.

Try not to hard copy them. Q-learning tends to diverge fast if the target is moving too quickly.
Also, you have a really deep neural network. Try a shallower one with only 2-3 layers; as mentioned in the OFENet paper, deep RL networks also tend to diverge.
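
For example, a shallower net for CartPole could look something like this (just a sketch, the exact layer sizes are up to you):

class Agent(nn.Module, QNet):
    def __init__(self):
        super().__init__()
        # two hidden layers are usually plenty for CartPole's 4-dimensional state
        self.l1 = nn.Linear(4, 64)
        self.l2 = nn.Linear(64, 64)
        self.l3 = nn.Linear(64, 2)

    def predict(self, x):
        y = f.relu(self.l1(x))
        y = f.relu(self.l2(y))
        return self.l3(y)  # raw Q-values, no activation on the output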

Besides, does it work now?

Unfortunately, no (even after your improvements). But I think the problem is somewhere in the train function.

Also, you don't want to train the target network, so you have to wrap the target prediction in "with torch.no_grad():":

with t.no_grad():
    if sample.next_state is not None:
        opt_pred[sample.action] += t.max(self.target_net.predict(sample.next_state)) * self.gamma
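
(Equivalently, you could detach the target network's output so no gradients flow back into it, something along these lines:)

if sample.next_state is not None:
    # detach() keeps the target value out of the computation graph
    target_q = self.target_net.predict(sample.next_state).detach()
    opt_pred[sample.action] += t.max(target_q) * self.gamma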

Thanks, but it still doesn't work. I think I should point out that when I print the loss, most of the values are fine (somewhere under 1), but under some specific conditions the loss can be as high as around 60.

(Screenshot of my PyCharm console showing the printed loss values.)

Also, I noticed that most people gather only the state-action values of the chosen actions from the prediction and the target and compute the loss on those, whereas I use the whole tensor, so I wonder whether that is the problem.

(Code from PyTorch tutorial on DQN)

state_action_values = policy_net(state_batch).gather(1, action_batch)
next_state_values = torch.zeros(BATCH_SIZE, device=device)
next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()

This is in the case of a policy network in actor-critic methods?

DQN doesn't have a policy network, just a value network.

No, it’s not. Or at least I don’t want it to be a policy network.

Look, DQN has one and only one value network.
Actor-critic methods have one value network and one policy network.

The policy is evaluated by the value network. That's why you gather the actions taken by the policy network and assign values to them through the value network.
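
(As a small illustration with made-up numbers, gather just picks out the Q-value of the action that was actually taken for each state in the batch:)

q_values = t.tensor([[0.2, 0.7],
                     [0.9, 0.1]])   # batch of 2 states, 2 possible actions
actions = t.tensor([[1], [0]])      # actions that were actually taken
print(q_values.gather(1, actions))  # selects 0.7 for the first state, 0.9 for the second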

Yeah, but that code was from the PyTorch tutorial on DQNs. Here's the link: Reinforcement Learning (DQN) Tutorial — PyTorch Tutorials 1.9.0+cu102 documentation

And this is their training code:

    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()

Oh yeah, you are right!
You have to calculate the state-action value for the chosen action, which you messed up a little bit:

opt_pred[sample.action] = opt_pred[sample.action]+ sample.reward

So that code you wrote is the solution?

Yeah, you were setting the Q-value of the action to just the reward and not to the Q-value of the action plus the reward. That was the mistake, I guess. Change it, try it out and tell me if it worked.

No, wait, that's totally messed up. Why do you even clone the prediction for the opt prediction?

pred = self.net.predict(sample.state)[sample.action]

# the target is just a scalar: reward (+ discounted max Q of the next state)
opt_pred = t.tensor(sample.reward, dtype=t.float)
if sample.next_state is not None:
    opt_pred = opt_pred + t.max(self.target_net.predict(sample.next_state)) * self.gamma

batched_pred.append(pred)
batched_opt_pred.append(opt_pred)
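
Putting it together, your train method would look roughly like this (just a sketch; the t.tensor conversion and the no_grad block are the details mentioned above):

def train(self, batch: [Memory]):
    batched_pred = []
    batched_opt_pred = []
    for sample in batch:
        # Q-value the online net currently assigns to the action that was taken
        pred = self.net.predict(sample.state)[sample.action]

        # TD target: reward (+ discounted max Q of the next state from the target net)
        opt_pred = t.tensor(sample.reward, dtype=t.float)
        if sample.next_state is not None:
            with t.no_grad():
                opt_pred = opt_pred + t.max(self.target_net.predict(sample.next_state)) * self.gamma

        batched_pred.append(pred)
        batched_opt_pred.append(opt_pred)

    loss = f.mse_loss(t.stack(batched_pred), t.stack(batched_opt_pred))
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()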


HOLY SHIT!!! NO FUCKING WAY!!! It works!!! Thank you sooooooo much

I was working on this for a very long time and I really appreciate the effort you put into helping me. So thank you one more time.


Sooooooo, ehmmmmmm… I don't want to bother you with this issue anymore, but it looks like it only worked that one time, and now, without any changes, it's messed up again. So I would be glad if you helped me out again, but I totally understand if you don't want to work on this anymore.

Oops, never mind, I just didn’t train it long enough :sweat_smile: :sweat_smile:
