How deep a neural network is required for 12 inputs ranging from -5000 to 5000 in A3C reinforcement learning?

granth_jain · November 13, 2020, 1:04pm

I am trying to use A3C with an LSTM for an environment where each state has 12 inputs ranging from -5000 to 5000. I am using an LSTM layer of size 12, followed by 2 fully connected hidden layers of size 256, then one fully connected layer for the 3 action outputs and one for the single value-function output. The reward is in the range (-1, 1).

However, during initial training I am unable to get good results.

My question is: is this neural network good enough for this kind of environment? Or is the poor initial performance due to the LSTM?

Below is the code for the actor-critic model and its parameters:

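import os

import torch
import torch.nn as nn

# Note: init_weights (used below) is defined elsewhere and is not shown in this post.

class ActorCritic(torch.nn.Module):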

    def __init__(self, params):
        super(ActorCritic, self).__init__()

        self.state_dim = params.state_dim
        self.action_space = params.action_dim
        self.hidden_size = params.hidden_size
        state_dim = params.state_dim
        self.lstm = nn.LSTMCell(state_dim, state_dim)
        self.lstm.bias_ih.data.fill_(0)
        self.lstm.bias_hh.data.fill_(0)
        lst = [state_dim]
        for i in range(params.layers):
            lst.append(params.hidden_size)
        
        self.hidden = nn.ModuleList()
        for k in range(len(lst)-1):
            self.hidden.append(nn.Linear(lst[k], lst[k+1]))
        for layer in self.hidden:
            layer.apply(init_weights)

        self.critic_linear = nn.Linear(params.hidden_size, 1)
        self.critic_linear.apply(init_weights)
        self.actor_linear = nn.Linear(params.hidden_size, self.action_space)
        self.actor_linear.apply(init_weights)
        self.train()

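    def forward(self, inputs):
        # inputs is a tuple: (state, (hx, cx)), where hx and cx are the
        # LSTM hidden and cell states carried over from the previous step.
        inputs, (hx, cx) = inputs
        inputs = inputs.reshape(1, -1)  # treat the state as a batch of one
        hx, cx = self.lstm(inputs, (hx, cx))
        x = hx
        for layer in self.hidden:
            x = torch.tanh(layer(x))
        # Returns: critic value, actor output (no softmax applied here),
        # and the updated LSTM state.
        return self.critic_linear(x), self.actor_linear(x), (hx, cx)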

class Params():
    def __init__(self):
        self.lr = 0.0001
        self.gamma = 0.99
        self.tau = 1.
        self.num_processes = os.cpu_count()
        self.state_dim = 12
        self.action_dim = 3
        self.hidden_size = 256
        self.layers = 2
        self.lstm_layers = 1
        self.lstm_size = self.state_dim
        self.num_steps = 20
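
A minimal sketch of how a model with this interface might be stepped through a sequence of states, assuming the hidden and cell states start at zero and are carried from one step to the next. `TinyActorCritic` and `rollout` are hypothetical stand-in names, not part of the posted code, and the random tensors are placeholders for real environment observations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyActorCritic(nn.Module):
    """Hypothetical stand-in with the same forward interface as the posted model."""

    def __init__(self, state_dim=12, hidden_size=256, action_dim=3):
        super().__init__()
        self.lstm = nn.LSTMCell(state_dim, state_dim)
        self.fc = nn.Linear(state_dim, hidden_size)
        self.critic_linear = nn.Linear(hidden_size, 1)
        self.actor_linear = nn.Linear(hidden_size, action_dim)

    def forward(self, inputs):
        state, (hx, cx) = inputs
        hx, cx = self.lstm(state.reshape(1, -1), (hx, cx))
        x = torch.tanh(self.fc(hx))
        return self.critic_linear(x), self.actor_linear(x), (hx, cx)


def rollout(model, states):
    """Run the model over a sequence of states, carrying the LSTM state forward."""
    size = model.lstm.hidden_size
    hx = torch.zeros(1, size)  # assumed zero initial hidden state
    cx = torch.zeros(1, size)  # assumed zero initial cell state
    values, actions = [], []
    for state in states:
        value, logits, (hx, cx) = model((state, (hx, cx)))
        probs = F.softmax(logits, dim=-1)
        action = torch.multinomial(probs, num_samples=1)  # sample from the policy
        values.append(value)
        actions.append(action.item())
    return values, actions


torch.manual_seed(0)
model = TinyActorCritic()
# Placeholder observations in [-1, 1]; a real environment would supply these.
states = [torch.rand(12) * 2 - 1 for _ in range(20)]
values, actions = rollout(model, states)
print(len(values), actions)
```

The point of the sketch is only that `(hx, cx)` returned by one call is passed into the next; resetting it to zeros at the start of each episode is a common convention, assumed here.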

Henry_Chibueze · November 13, 2020, 1:10pm

An LSTM is a type of recurrent neural network used for detecting and recognizing sequential patterns over a number of time steps, for example in time series data.

So unless the rewards of your RL environment are obtained in a sequential manner during exploration/exploitation, I suggest you use a different architecture.

Anyway, even without all these details, I'd still suggest you change your architecture, just to check whether the problem really comes from the network or from something else you did or failed to do.

Just saying

granth_jain · November 13, 2020, 1:40pm

Hi,
Thanks for the reply.

Actually, I want my model to remember some information about the past, and that is why I am using an LSTM.
However, I am not sure whether this precision of states can be handled by the neural network.
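
On the input range: a common preprocessing step, offered here as a sketch rather than a diagnosis of this particular model, is to rescale each input to roughly [-1, 1] before it reaches the LSTM, since sigmoid and tanh units saturate on inputs of large magnitude. The bound of 5000 comes from the question; `scale_state` is a hypothetical helper name.

```python
import torch

STATE_BOUND = 5000.0  # inputs are stated to lie in [-5000, 5000]


def scale_state(state: torch.Tensor, bound: float = STATE_BOUND) -> torch.Tensor:
    """Linearly map values from [-bound, bound] to [-1, 1], clamping outliers."""
    return torch.clamp(state / bound, min=-1.0, max=1.0)


# The last value is deliberately out of range to show the clamping.
raw = torch.tensor([-5000.0, -2500.0, 0.0, 1250.0, 5000.0, 7000.0])
scaled = scale_state(raw)
print(scaled)
```

If the true bounds are not known in advance, a running mean and standard deviation is another commonly used option; the fixed bound is assumed here only because the question states one.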

Henry_Chibueze · November 13, 2020, 1:48pm

Yes, it can be handled by a neural network.
I don't really see any other practical way of doing this without a neural network. Unless, for some strange reason, you have access to petabytes of RAM and a processor that can process petabytes of data in seconds, in which case the standard (tabular) Q-learning algorithm would suffice. But you and I know that's not really possible lol😅
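
To put a rough number on why a tabular approach is impractical here, the sketch below counts the entries of a Q-table under the assumption (not stated in the thread) that each of the 12 inputs takes only integer values in [-5000, 5000]. Continuous inputs would make the table unbounded. The 4 bytes per entry is likewise an assumption.

```python
VALUES_PER_INPUT = 5000 - (-5000) + 1  # 10001 integer values (assumption)
NUM_INPUTS = 12
NUM_ACTIONS = 3
BYTES_PER_ENTRY = 4  # assumed 32-bit float per Q-value

num_states = VALUES_PER_INPUT ** NUM_INPUTS
table_entries = num_states * NUM_ACTIONS
table_bytes = table_entries * BYTES_PER_ENTRY
petabytes = table_bytes / 10 ** 15

print(f"states:  {num_states:.3e}")
print(f"entries: {table_entries:.3e}")
print(f"size:    {petabytes:.3e} PB")
```

Under these assumptions the table has on the order of 10^48 states, far beyond petabyte scale, which is why a function approximator such as a neural network is used instead.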

So yeah.

Then again, can you give me a brief rundown of the environment you are using?

Henry_Chibueze · November 13, 2020, 1:52pm

Also, when you said you want your model to remember some information about the past:

What exactly is it that you want the model to remember?
Is it the previous states and the actions it took when it was in them?

granth_jain · November 13, 2020, 1:59pm

Yes, I want it to remember the state information from the past.

Henry_Chibueze · November 13, 2020, 2:27pm

Well, you can still use an LSTM with some dense layers, as you initially did.
There's really no rule of thumb here or anything like that.

Hmmmm🤔

I'm kind of curious though. How would it affect the decision it takes next?
Because, as far as I know, given a state, the action selected is the one with the maximum approximate Q-value output by the network.
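
The selection rule described here is the greedy one used with Q-value networks. In actor-critic methods such as A3C, the actor head is usually treated as producing the logits of a policy, and during training the action is sampled from the resulting distribution. A small sketch of both rules on made-up numbers (the `outputs` tensor is illustrative, not taken from the model in this thread):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up output of a 3-action head for a single state (illustrative values).
outputs = torch.tensor([[0.2, 1.5, -0.3]])

# Greedy rule: treat the outputs as Q-values and take the largest.
greedy_action = outputs.argmax(dim=-1).item()

# Stochastic policy: treat the outputs as logits, convert to probabilities, sample.
probs = F.softmax(outputs, dim=-1)
sampled_actions = torch.multinomial(probs, num_samples=1000, replacement=True)
counts = torch.bincount(sampled_actions.flatten(), minlength=3)

print("greedy action:", greedy_action)
print("policy probabilities:", probs)
print("sample counts per action:", counts)
```

In either case, because the LSTM state feeds into these outputs, the same current state can yield different outputs depending on what came before, which is how remembered information can influence the next decision.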

Or is there something else you wish to do with this?