Is using multiple Linear layers with ReLU after an LSTM layer good practice?

Should I use multiple Linear layers with ReLU after an LSTM layer?

The example below is a model class I wrote that has a stateful LSTM implementation. Everything else seems to be working well, except that, on average, using multiple Linear layers with ReLU after the LSTM layer gives me worse performance than using a single LSTM and Linear layer without ReLU.

Please guide me. You can answer purely in theory since I know the code, but I include it here so you can get a quick overview:



import torch
import torch.nn as nn

class MYLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, batch_size, num_layers=4):
        super(MYLSTM, self).__init__()
        # variables
        self.input_size = input_size # no. of features
        self.hidden_size = hidden_size
        self.batch_size = batch_size # the no. of previous steps in x for future y
        self.num_layers = num_layers
        # model properties
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,dropout=0.2)
        self.fir = nn.Linear(hidden_size, 32) # first projection: hidden_size -> 32
        self.sec = nn.Linear(32, 4) # second projection: 32 -> 4
        self.last = nn.Linear(4, 1) # 1 is single number as output

        # activation func
        self.relu = nn.ReLU()

    def forward(self, input_data, h_0, c_0):
        # we cannot feed input_data directly to lstm without reshaping acc. to pytorch documentation
        sequence_length = len(input_data) # sequence length aka L, basically no. of rows
        if h_0 is None and c_0 is None: # runs only when h_0, c_0 are None, i.e. at the beginning of each epoch
            h_0 = torch.zeros(self.num_layers, self.batch_size, self.hidden_size, device=input_data.device) # num_layers, N, hidden size aka Hcell
            c_0 = torch.zeros(self.num_layers, self.batch_size, self.hidden_size, device=input_data.device) # num_layers, N, hidden size aka Hcell
        # to keep state values (h_0,c_0), but to detach them from the previous calculations
        h_0 = h_0.detach()
        c_0 = c_0.detach()
        input_lstm = input_data.view(sequence_length, self.batch_size, self.input_size) # L,N, input size aka Hin
        
        # we give our defined inputs in the lstm model as inputs and get lstm_out
        # lstm_out is treated further to get output for feeding to Linear layer
        lstm_out, (h_n, c_n) = self.lstm(input_lstm, (h_0, c_0))
        output = lstm_out.view(sequence_length, self.batch_size, self.hidden_size) # L,N,Hcell
        output = torch.mean(output, dim=1, keepdim=True) # average across the batch dimension (dim 1) so the N results combine into one
        input_data = self.fir(output[:,-1,:])
        input_data = self.relu(input_data)
        input_data = self.sec(input_data)
        input_data = self.relu(input_data)
        input_data = self.last(input_data)

        return input_data, (h_n, c_n)

model = MYLSTM(1,64,batch_size).to(device)
model 
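
For context, a minimal sketch of how such a stateful loop could be driven is below; num_epochs, train_loader, device and the loss/optimizer choices are just placeholders, not my exact script:

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    h_n, c_n = None, None  # reset the state at the beginning of each epoch
    for x_batch, y_batch in train_loader:  # x_batch viewable as (L, batch_size, input_size)
        optimizer.zero_grad()
        y_pred, (h_n, c_n) = model(x_batch.to(device), h_n, c_n)  # state is detached inside forward
        loss = criterion(y_pred, y_batch.to(device))
        loss.backward()
        optimizer.step()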

After reviewing your code, I assume you are working on a regression problem. In general, having a deeper output layer (Linear → ReLU → Linear → ReLU …) is not required (but it is fine), since the LSTM already compresses information across time steps. However, simply stating that the “performance is worse” is not sufficient to draw a conclusion.

Did you establish a baseline? Are you referring to test performance or training performance? Did you try other non-linear activations, such as LeakyReLU?

Including these additional details would be helpful.
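
To make the comparison concrete, a minimal baseline along the lines you describe (LSTM followed by a single Linear, no ReLU) could look like the sketch below. The class name and the choice to predict from the last time step are just illustrative assumptions, not taken from your code:

import torch
import torch.nn as nn

class BaselineLSTM(nn.Module):
    # LSTM followed by a single Linear head, no intermediate ReLU layers
    def __init__(self, input_size, hidden_size, num_layers=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, dropout=0.2)
        self.head = nn.Linear(hidden_size, 1)  # map Hcell directly to a single output

    def forward(self, x, state=None):
        # x: (L, N, input_size); state: optional (h_0, c_0) tuple
        lstm_out, state = self.lstm(x, state)
        return self.head(lstm_out[-1]), state  # predict from the last time step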


Thank you for the reply, Arun!

Well, I am not interested in the performance specific to my application; I wanted a general understanding of whether Linear layers with an activation function (such as ReLU) are good practice after an LSTM layer. I think your reply implies that it can indeed be done, depending on the situation. However, you gave a very deep architectural insight, which is:
“since the LSTM already compresses information across time steps”

One more thing: should I use LSTM layers after an LSTM layer? For example:
LSTM → Linear → ReLU → LSTM → Linear → ReLU → Linear
Does it make any sense to do this? Please give some architectural insight for this example.
What I am trying to understand in this post are these insights and implications, rather than judging performance through hard loss values.

Thanks again, and awaiting your reply!

In principle, you can.

It depends on the specific task you are tackling. People have used it in the past for various problems. Sometimes, exploring the structure in the input (for example, using a bidirectional LSTM for text) may reduce the need for multiple stacked layers. However, you should be mindful that stacking more LSTM layers makes the model more prone to issues such as vanishing or exploding gradients and overfitting on small datasets.
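
If you do want to try that exact chain, a rough sketch of how it could be wired is below. The hidden sizes and the single-layer LSTMs are arbitrary choices for illustration, not a recommendation:

import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    # rough sketch of LSTM -> Linear -> ReLU -> LSTM -> Linear -> ReLU -> Linear
    def __init__(self, input_size, hidden_size=64):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size, hidden_size)
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.lstm2 = nn.LSTM(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 32)
        self.out = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (L, N, input_size)
        h1, _ = self.lstm1(x)              # (L, N, hidden_size)
        h1 = self.relu(self.fc1(h1))       # Linear + ReLU applied per time step
        h2, _ = self.lstm2(h1)             # second LSTM consumes the transformed sequence
        h2 = self.relu(self.fc2(h2[-1]))   # take the last time step before the head
        return self.out(h2)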

I would also recommend considering transformer-based architectures, which overcome many of these challenges and also offer parallelization of computation (like CNNs).
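
As a very rough illustration only (positional encoding is omitted, and d_model, nhead and the projections are arbitrary choices), an encoder for the same (L, N, input_size) layout could look like:

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)  # expects (L, N, d_model) by default
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
proj_in = nn.Linear(1, 64)    # lift the single input feature to d_model
proj_out = nn.Linear(64, 1)   # map back to a single regression output

x = torch.randn(30, 5, 1)                 # (L, N, input_size), same layout as the LSTM input
y = proj_out(encoder(proj_in(x))[-1])     # predict from the last position; y has shape (N, 1)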

In general, more layers means more parameters, which in turn typically means that you have to train longer. It also depends on the kind of data and task.

Apart from that, when you say “worse performance”, are you referring to the training performance or test performance?

I was referring to the training, validation, and test performance being, on average, worse than when I used LSTM + Linear instead of LSTM + multiple Linear and ReLU layers. The dataset was a quite small synthetic one, so I think I would need a larger dataset to get better performance, especially since LSTMs work better with larger datasets.

If even the training performance drops when using more linear layers, then I would bet the larger model simply needs more training. A more complex model should always be able to overfit better than a smaller model on the same dataset.
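
A quick way to test that hypothesis is the usual overfitting sanity check, sketched below; x_batch, y_batch, device and the loss/optimizer choices are placeholders for whatever your script uses. Fix one small batch and confirm that the deeper variant can drive the loss close to zero on it:

# overfitting sanity check (sketch): keep one fixed batch and train on it repeatedly
x_fixed, y_fixed = x_batch.to(device), y_batch.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(500):
    optimizer.zero_grad()
    y_pred, _ = model(x_fixed, None, None)   # fresh state each step, for simplicity
    loss = criterion(y_pred, y_fixed)
    loss.backward()
    optimizer.step()
print(loss.item())  # the deeper variant should also be able to drive this towards 0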