How to use GRU/LSTM in RL?

Hi all,
I want to add a memory cell/layer to my model to improve its performance on Atari games, but I’m not sure I’m doing it right!
If I understood recurrent networks correctly, they take a sequence of observations from the environment. This violates the i.i.d. assumption, since the observations in the batch become highly correlated, but that should be fine because memory cells are built for exactly that.

That said, what I’m struggling with is how to use the hidden state output in my training. Should I:
1- Save it after each step and feed it back to the network at step t+1 until the episode ends?
2- OR ignore it and just initialize a zero tensor at each step?

Secondly:
Assuming the GRU feeds into dense layers below it, which output should I pass to the FC layer (output or h_n)?

Hi Ayman,

If you don’t feed the hidden state from step t back into step t+1, you basically have a fancy version of a linear layer with shared weights across all time steps, but you lose the recurrent property of the model (I think this approach is called a time-distributed layer, which is not recurrent). Whenever you have a time series, the states of a sequence are highly correlated, so I don’t see a problem with that; since this is an inherent property of all series, there is nothing RL-specific about it. Or am I getting your question wrong?

Also, you always feed the output to the next layer, not the hidden state.
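
To make that concrete, here is a minimal sketch (toy sizes and a dummy rollout loop, not your actual model) of carrying the hidden state from one step to the next and feeding the GRU output to a linear head:

import torch
import torch.nn as nn

# Minimal sketch: toy sizes, not an actual RL model.
gru = nn.GRU(input_size=4, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)

hx = None                                # reset to None (i.e. zeros) at the start of each episode
for t in range(10):                      # one episode rollout
    obs = torch.randn(1, 1, 4)           # (batch=1, seq=1, features) stands in for an observation
    out, hx = gru(obs, hx)               # hx from step t is fed into step t+1
    logits = head(out[:, -1, :])         # feed the output to the next layer, not hx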

Hey,

That’s exactly the question I needed confirmation on. However, I tried initializing h0 as zeros at the beginning of each episode and having the agent pass it from one step to the next until the episode ends, but the agent didn’t learn. Not even after a couple million frames on any of the Atari games.
I must be doing something wrong!

On the second point, the output is nothing but the hidden states of all time steps stacked together, i.e. the last value of output equals h_n (of the last layer).
If that is the case, why not pass h_n to the dense layer instead of output? It’s computationally cheaper, and backpropagation will go through the earlier hidden states by default since h_n was calculated from them. What do you think?
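
For example (toy shapes, just to illustrate what I mean):

import torch
import torch.nn as nn

# Toy check: the last time step of `output` equals the final hidden state of the last layer.
gru = nn.GRU(input_size=4, hidden_size=8, num_layers=2, batch_first=True)
x = torch.randn(3, 5, 4)                          # (batch, seq, features)
output, h_n = gru(x)                              # output: (3, 5, 8), h_n: (2, 3, 8)
print(torch.allclose(output[:, -1, :], h_n[-1]))  # True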

(Ok, first of all a disclaimer: I’m for sure no expert, but I have worked with several RL algorithms in Deep Learning)
So, having this out of the way :smiley: I see what you mean, interesting question… I guess the difference is that in one case the linear layer only knows about the last step, while in the other it has access to the entire sequence. I think both ways are valid, but in the context of RL with unknown sequence length you will probably have to go with the hidden state. But I also assume you won’t need any non-recurrent linear layer, so you could probably just use the hidden state of the last layer directly as the output.
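
Something like this is what I have in mind (a rough sketch with toy sizes and made-up layer names):

import torch
import torch.nn as nn

gru = nn.GRU(input_size=4, hidden_size=8, num_layers=2, batch_first=True)
x = torch.randn(3, 5, 4)                 # (batch, seq=5, features)
out, h_n = gru(x)

# (a) the head sees the whole sequence -> requires a fixed/padded sequence length
head_full = nn.Linear(8 * 5, 2)
logits_a = head_full(out.flatten(1))

# (b) the head sees only the last layer's final hidden state -> works for any sequence length
head_last = nn.Linear(8, 2)
logits_b = head_last(h_n[-1])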

Regarding the first question: what algorithm are you using? DQN, DDQN, or a policy gradient? Which environment are you trying to solve, and what architecture are you using?

If you are interested, I’m also happy to discuss this with you outside of this thread!

You know what, I’m going to try both approaches on a simple environment and see which one learns faster! I think the sequence length can be fixed with pad_sequence.

I’m using A3C for this. I created the “fancy linear version” by adding a GRU cell without feeding back the hidden states from the last step :). It gave much better results than A3C alone, but still not at the level I want!

Here’s the model I’m upgrading:

Sounds good, let me know what you find! It is for sure possible, but I think it is computationally less efficient, since you then have to process the entire sequence length even though most of it is empty?

I’m sorry, I didn’t mean it as an insult! :sweat_smile:

Looks interesting, I will have a look at it! :slight_smile:

Lol… none taken, man! It must be my dark sense of humor :smiley:

As for the recurrent model: I experimented yesterday with the OpenAI Gym environment ‘CartPole-v0’ and a basic version of policy gradient, to keep things simple until I understand the dynamics of recurrent networks in RL and how they’re trained!

I’m going to benchmark it against this linear model, which solves the environment in 7 seconds.

class PGNet(nn.Module):
    r"""Plain vanilla policy gradient network."""

    def __init__(self, obs_size, act_size, hid_size=128):
        super().__init__()
        self.fc1 = nn.Linear(obs_size, hid_size)
        self.output = nn.Linear(hid_size, act_size, bias=True)

    def forward(self, x):
        """Feed forward."""
        y = F.relu(self.fc1(x))
        return self.output(y)


Benchmark code: DeepRL/05_PolicyGradient at main · ayjabri/DeepRL · GitHub

I’ll post the logic and code later today…tbc

This is the recurrent policy gradient model I’m gonna use. It has two GRU layers and one dense layer for the policy output (just like the one above).

The network takes a sequence of 5 observations (i.e. input size (batch, 5, features)) and pads it to the desired sequence length. It also takes episode end into account.

For example:
At the beginning of the episode the input is only one observation of shape (batch, 1, features). The function adds zeros before the observation, but at episode end it adds them after the last observation.

class RNNPG(nn.Module):
    r"""Recurrent plain vanilla policy gradient model."""

    def __init__(self, obs_size, act_size, sequence=5, hid_size=128, num_layers=2):
        super().__init__()
        self.sequence = sequence

        self.gru = nn.GRU(obs_size, hid_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hid_size*sequence, act_size)

    @staticmethod
    def pad_sequence(states, episode_end=False, sequence=5, padding_value=0.0):
        r"""Pad a sequence of observations up to `sequence` length by adding `padding_value` before or after the input.

        It pads before the states by default; if `episode_end` is `True` it pads after them instead (see example).

        Args:
            states: tensor of size (batch, sequence_0, features)
            episode_end: bool
            sequence: int, desired sequence length (default 5)
            padding_value: float, value to pad the tensor with (default 0.0)

        Example:
            >>> a = torch.ones(1, 2, 4)
            >>> print(a)
            tensor([[[1., 1., 1., 1.],
                     [1., 1., 1., 1.]]])
            >>> a = pad_sequence(a, episode_end=False, sequence=5)
            >>> print(a)
            tensor([[[0., 0., 0., 0.],
                     [0., 0., 0., 0.],
                     [0., 0., 0., 0.],
                     [1., 1., 1., 1.],
                     [1., 1., 1., 1.]]])
        """
        # assuming batch is always first
        out_dims = (len(states), sequence, states.size(2))
        out_tensor = states[0].data.new(*out_dims).fill_(padding_value)
        length = states.size(1)
        if episode_end:
            out_tensor[:, :length, :] = states
        else:
            out_tensor[:, -length:, :] = states
        return out_tensor

    def forward(self, x, hx=None, episode_end=False):
        r"""Return the output and hidden state, padding the input if the observation sequence is shorter than `self.sequence`."""
        if x.size(1) < self.sequence:
            x = self.pad_sequence(x, episode_end, self.sequence)
        out, hx = self.gru(x, hx)
        out = F.relu(out.flatten(1))
        return self.fc(out), hx

Unfortunately, the GRU (or LSTM) layer model is not learning! I’ve tried many combinations and feeding techniques, but it didn’t learn at all :frowning:
However, when I replace the GRU layer with a GRUCell, the model works fine and solves the environment (in 6 seconds).

**My question is: what’s the difference between GRUCell and the GRU layer, and why is this happening?**
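
For context, here is how I’m calling the two in isolation (toy shapes only, not the training code below):

import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=4, hidden_size=8)
h = cell(torch.randn(3, 4))               # one time step: (batch, features) -> (batch, hidden)

gru = nn.GRU(input_size=4, hidden_size=8, num_layers=2, batch_first=True)
out, h_n = gru(torch.randn(3, 5, 4))      # whole sequence: (batch, seq, features)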

=====================================================
Here’s the logic I’m using:

REINFORCE algorithm:
1- Initialize the model with random weights
2- Play N full episodes, gathering (state, action, reward) tuples
3- For each episode, calculate the discounted reward for every subsequent step:

Q(k,t) = Σ_i γ^i * r_i

4- Calculate the loss function and add entropy to improve exploration:

L = − Σ Q(k,t) * log(π(s(k,t), a(k,t)))

5- Perform an SGD update of the weights, minimizing the loss
6- Repeat until solved

When using the GRU layer:
a. hn = None at the beginning of each episode (t=0)
b. hn from step t-1 is then fed to the network at step t

=========================================================

Here’s the code. To run it in GRUCell mode, just add the --cell argument.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Jun 26 14:39:33 2021

@author: Ayman Al Jabri
"""

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import argparse
import numpy as np
from time import time
from collections import namedtuple
from itertools import count
from datetime import datetime, timedelta


Experience = namedtuple('Experience',['state','action','reward'])


class GRULayer(nn.Module):
    r"""GRU plain vanilla policy gradient model."""

    def __init__(self, obs_size, act_size, hid_size=128, num_layers=2):
        super().__init__()
        self.hid_size = hid_size
        self.num_layers = num_layers
        
        self.input = nn.Linear(obs_size, 32)
        self.gru = nn.GRU(32, hid_size, num_layers, batch_first=True)
        self.output = nn.Linear(hid_size, act_size)

    def forward(self, x, hx=None):
        r"""Return the action logits and the next hidden state."""
        batch_size = x.size(0)
        y = self.input(x)
        y = y.view(batch_size, 1, -1)  # treat each observation as a sequence of length 1
        if hx is None:
            hx = torch.zeros((self.num_layers, batch_size, self.hid_size))
        y, hx = self.gru(y, hx)
        y = self.output(F.relu(y.flatten(1)))
        return y, hx
    
    
class GRUCell(nn.Module):
    r"""Policy gradient network with a GRUCell in place of the first linear layer."""

    def __init__(self, obs_size, act_size, hid_size=128):
        super().__init__()
        self.fc1 = nn.GRUCell(obs_size, hid_size)
        self.output = nn.Linear(hid_size, act_size, bias=True)

    def forward(self, x):
        """Feed forward."""
        y = F.relu(self.fc1(x))
        return self.output(y)
    

def discount_rewards(rewards, gamma):
    r"""
    Summary: calculate the discounted future rewards.

    Takes in a list of rewards and a discount rate.
    Returns the accumulated future values of these rewards.

    Example:
        >>> r = [1, 1, 1, 1, 1, 1]
        >>> gamma = 0.9
        >>> discount_rewards(r, gamma)
        [4.68559, 4.0951, 3.439, 2.71, 1.9, 1.0]
    """
    sum_r = 0.0
    res = []
    for r in reversed(rewards):
        sum_r *= gamma
        sum_r += r
        res.append(sum_r)
    return list(reversed(res))


@torch.no_grad()
def generate_episodes(env, gamma, cell, n=3):
    r"""Yield batches of n episodes as (states, actions, discounted_rewards, mean_total_reward, frame)."""
    episode = 0
    batch_total_rewards = 0
    hc = None
    act_size = env.action_space.n
    states,actions,rewards = [],[],[]
    dis_r = []
    state = env.reset()
    for frame in count():
        state_v = torch.FloatTensor([state])
        # if np.random.random() > 0.5: # placeholder to injects noise
        #     state_v.fill_(0.0)
        #     pass
        if cell:
            prob = net(state_v) ##Linear
        else:
            prob, hc = net(state_v, hc) ##GRU
        prob = torch.softmax(prob,dim=1).data.cpu().numpy()[0]
        action = np.random.choice(act_size, p=prob)
        last_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        if done:
            dis_r.extend(discount_rewards(rewards, gamma))
            batch_total_rewards += sum(rewards)
            rewards.clear()
            episode += 1
            state = env.reset()
            hc = None
            if episode == n:
                yield (np.array(states, copy=False),
                       np.array(actions),
                       np.array(dis_r),
                       batch_total_rewards / n,
                       frame)
                states.clear()
                actions.clear()
                dis_r.clear()
                batch_total_rewards = 0
                episode = 0
        else:
            state = last_state  # only advance the state if the episode is still running

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--cell', action='store_true', help='Use GRUCell instead of GRU layer')
    args = parser.parse_args()
    
    ENTROPY_BETA = 0.02
    GAMMA = 0.99
    HID_SIZE = 64
    NUM_LAYERS = 2
    SOLVE = 195
    LR = 0.01
    N_EPS = 1
    SEED = 155

    env = gym.make('CartPole-v0') 
    env.seed(SEED)
    torch.manual_seed(SEED)
    np.random.seed(SEED)
    obs_size = env.observation_space.shape[0]
    act_size = env.action_space.n
    
    if args.cell:
        net = GRUCell(obs_size, act_size, HID_SIZE)  # GRUCell model
    else:
        net = GRULayer(obs_size, act_size, HID_SIZE, NUM_LAYERS)  # GRU layer model
        
    print(net)
    
    optimizer = torch.optim.Adam(net.parameters(), lr=LR)

    total_rewards = []
    print_time = time()
    start = datetime.now()
    frame = 0
    prev_frame = 0
    mean = None
    for episode, batch in enumerate(generate_episodes(env, GAMMA, args.cell, n=N_EPS)):
        states, actions, rewards, batch_total_rewards,frame = batch
        total_rewards.append(batch_total_rewards)
        mean_reward = np.mean(total_rewards[-100:])
        if time() - print_time > 1:
            speed = (frame - prev_frame)/(time()-print_time)
            prev_frame = frame
            print(
                f"{frame:,}: done {episode} episodes, mean reward {mean_reward:6.3f}, speed:{speed:.0f} fps", flush=True)
            print_time = time()

        if mean_reward > SOLVE:
            duration = timedelta(seconds=(datetime.now()-start).seconds)
            print(f'Solved in {duration}')
            break
        
        ### training ###
        states_v = torch.FloatTensor(states)
        batch_scale_v = torch.FloatTensor(rewards)  # rewards from the generator are already discounted
        actions_v = torch.LongTensor(actions)
        optimizer.zero_grad()
    
        # policy loss
        if args.cell:
            logit = net(states_v)        # GRUCell model
        else:
            logit, hn = net(states_v)    # GRU layer model
        log_p = F.log_softmax(logit, dim=1)

        # Gather probabilities with taken actions
        log_p_a = batch_scale_v * log_p[range(len(actions)), actions]
        policy_loss = - log_p_a.mean()

        # entropy loss
        probs_v = F.softmax(logit, dim=1)
        entropy = - (probs_v * log_p).sum(dim=1).mean()
        entropy_loss = - ENTROPY_BETA * entropy
        
        #total loss
        loss = policy_loss + entropy_loss
        loss.backward()
        optimizer.step()