Why do multilayer perceptrons outperform RNNs in CartPole?

Recently I compared two models for a DQN on the CartPole-v0 environment. One of them is a multilayer perceptron with 3 layers and the other is an RNN built from an LSTM and 1 fully connected layer. I have an experience replay buffer of size 200000 and the training doesn't start until it is filled up. Although the MLP solved the problem in a reasonable number of training steps (meaning it achieved a mean reward of 195 over the last 100 episodes), the RNN model could not converge as quickly, and its maximum mean reward did not even reach 195!
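For reference, my replay buffer is essentially this (a simplified sketch; the real one stores full transitions):

import random
from collections import deque

# Simplified sketch of my replay buffer: transitions are plain tuples and
# training only starts once the buffer is full.
class ReplayBuffer:
    def __init__(self, capacity=200000):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def is_full(self):
        return len(self.buffer) == self.capacity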

I have already tried increasing the batch size, adding more neurons to the LSTM's hidden state, increasing the RNN's sequence length, and making the fully connected layer more complex - but every attempt failed: I saw enormous fluctuations in the mean reward, so the model hardly converged at all. Could these be the signs of early overfitting?

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_input, output_size, n_hidden, n_layers, dropout=0.3):
        super(DQN, self).__init__()

        self.n_layers = n_layers
        self.n_hidden = n_hidden

        # LSTM over the observation sequence; batch_first=True means the
        # input is shaped (batch, seq_len, n_input).
        self.lstm = nn.LSTM(input_size=n_input,
            hidden_size=n_hidden,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True)

        self.dropout = nn.Dropout(dropout)

        # Maps the LSTM's hidden state to one Q-value per action.
        self.fully_connected = nn.Linear(n_hidden, output_size)

    def forward(self, x, hidden_parameters):
        batch_size = x.size(0)

        output, hidden_state = self.lstm(x.float(), hidden_parameters)

        seq_length = output.shape[1]

        # Flatten the time dimension so the linear layer is applied to every
        # timestep at once, then restore (batch, seq_len, n_actions).
        output = output.contiguous().view(-1, self.n_hidden)
        output = self.dropout(output)
        output = self.fully_connected(output)

        output = output.view(batch_size, seq_length, -1)
        # Keep only the Q-values predicted at the last timestep.
        output = output[:, -1]

        return output.float(), hidden_state

    def init_hidden(self, batch_size, device):
        # Zero-initialized (h_0, c_0), same dtype as the model weights.
        weight = next(self.parameters()).data

        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device),
            weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device))

        return hidden
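For completeness, this is roughly how I run a forward pass (all sizes here are just example values):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Example values only: 4 observation features, 2 actions, 2 LSTM layers.
model = DQN(n_input=4, output_size=2, n_hidden=64, n_layers=2).to(device)

batch_size, seq_length = 32, 8
x = torch.randn(batch_size, seq_length, 4, device=device)  # (batch, seq, features)

hidden = model.init_hidden(batch_size, device)
q_values, hidden = model(x, hidden)
print(q_values.shape)  # torch.Size([32, 2]): one Q-value per action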

Contrary to what I expected, the simpler model gave much better results than the other, even though RNNs are supposed to be better at processing time-series data.

Can anybody tell me the reason for this?

Also, I should state that I applied no feature engineering; both DQNs worked with raw data. Could the RNN outperform the MLP if I used normalized features? (I mean feeding both models with normalized data.)
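By normalization I mean something like this (just a sketch; the scale values are my guesses, based on CartPole's termination bounds where they exist):

import numpy as np

# Scale each observation component so the features are roughly in [-1, 1].
# Position/angle bounds come from CartPole's termination conditions; the
# velocity scales are guesses, because those components are unbounded.
OBS_SCALE = np.array([2.4, 3.0, 0.21, 3.0])  # pos, vel, angle (rad), ang. vel

def normalize(obs):
    return np.asarray(obs) / OBS_SCALE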

Is there anything you can recommend to improve training efficiency for RNNs and achieve the best results?

CartPole is actually a very simple problem. Just by using a simple random search you can get more than 200 reward; take a look at this.
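Just to illustrate the point, a rough sketch of random search over linear policies (the hyperparameters are arbitrary and the classic gym API is assumed):

import gym
import numpy as np

def evaluate(env, weights, n_episodes=5):
    # Average reward of a linear policy: action = 1 if w . obs > 0, else 0.
    total = 0.0
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, _ = env.step(int(np.dot(weights, obs) > 0))
            total += reward
    return total / n_episodes

env = gym.make("CartPole-v0")
best_score, best_weights = -np.inf, None
for _ in range(100):  # sample 100 random policies, keep the best one
    weights = np.random.uniform(-1.0, 1.0, size=4)
    score = evaluate(env, weights)
    if score > best_score:
        best_score, best_weights = score, weights
print(best_score)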

RNNs should outperform MLPs when the temporal/sequential information is really relevant. In the case of CartPole, you don't actually care that much about what happened 2 iterations before; the current state is much more relevant for deciding what to do than the previous ones. This is not the case for a lot of applications where RNNs shine, like NLP, where, for instance, the last word of a sentence generally isn't enough for you to interpret the whole sentence, and moreover, the order of the words matters a lot in language.

What do you feed as input to the neural network (x)?

@LeviViana
Thank you very much for your response and the resource you provided!

It's the last N observations from the environment, and they are raw values.

Well, this explains why the MLP outperforms the LSTM. Conceptually, to solve CartPole, all you need are the instantaneous positions, speeds and accelerations. If you feed them directly (or even if you feed 3 consecutive positions, which allows these quantities to be computed), you don't capture any additional useful information using an LSTM.
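Concretely, a single CartPole observation already contains the velocity terms (classic gym API assumed):

import gym

env = gym.make("CartPole-v0")
obs = env.reset()

# The 4-dimensional observation:
# [cart position, cart velocity, pole angle, pole angular velocity]
print(obs)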

Maybe using relative data, e.g. the differences between consecutive speeds, accelerations, etc., would improve the performance?

I am quite sure position, speed and acceleration are sufficient to describe the full dynamics of the system (a representation of the mass would be learnt at some point). Why would you want to consider higher moments?

I think there was a misunderstanding. I just want the model to learn faster, and I meant using the deltas of speed/position, etc., instead of the absolute values of speed/position.
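Something like this rough sketch (the function name is just for illustration):

import numpy as np

# Feed the network the change between consecutive observations
# instead of their absolute values.
def to_deltas(observations):
    obs = np.asarray(observations)  # shape: (n_steps, n_features)
    return obs[1:] - obs[:-1]       # one fewer row than the input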