How to create LSTM network with different hidden sizes in each layer

I am currently working on a network for speech sentiment analysis. I want to use an LSTM architecture-based model.

My data is of the shape (10039, 20, 68). So I have 10039 samples, and each sample has 20 timesteps with 68 features for each timestep.

From what I have seen here:

LSTM Cell refers to this:

I believe my implementation below is correct. At each epoch in training, I will reinitialize my hidden states and retrieve from my whole dataset (10039 samples) a batch_size portion of, for example, 32 samples. These 32 samples will go into the network, and for each sample I will go timestep by timestep up to 20, feeding the network the 68 features and keeping the hidden state (h_t2) from the last layer in the network. I will do this for each sample in my batch. Then I will concatenate the outputs for the batch and return them.

However, I am confused because when I researched how to approach this problem, I haven’t seen anyone doing what I am doing. I see everyone else feeding the LSTM cells with inputs of shape=(timesteps, num_features). Also, by doing what I do, my hidden states have shape=(1, num_features), which seems odd. Still, if I feed the LSTMCell as everyone else does, will it not be completely missing the point?

I mean, if you don’t feed the cells timestep by timestep and just give them the whole thing (timesteps, num_features), aren’t you losing the information you gradually obtain over time in the LSTMCells? Or, if you give them (timesteps, num_features), do they already handle this internally in the PyTorch implementation?
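One way to check this question empirically is to compare the two approaches directly. The sketch below assumes a single layer and toy sizes (4 timesteps, 68 features, hidden size 50); it copies the weights of an nn.LSTM into an nn.LSTMCell and compares the full-sequence output against a manual timestep loop. The variable names are illustrative only and do not come from the post.

```python
import torch
from torch import nn

torch.manual_seed(0)

timesteps, num_features, hidden_size = 4, 68, 50
x = torch.randn(1, timesteps, num_features)  # (batch, timesteps, features)

lstm = nn.LSTM(num_features, hidden_size, batch_first=True)
cell = nn.LSTMCell(num_features, hidden_size)

# Copy the single-layer LSTM weights into the cell so both compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

# Whole sequence at once: nn.LSTM iterates over the timesteps internally.
seq_out, (h_n, c_n) = lstm(x)

# Manual loop: one timestep at a time, carrying (h, c) forward.
h = torch.zeros(1, hidden_size)
c = torch.zeros(1, hidden_size)
step_outs = []
for t in range(timesteps):
    h, c = cell(x[:, t, :], (h, c))
    step_outs.append(h)
manual_out = torch.stack(step_outs, dim=1)  # (batch, timesteps, hidden_size)

# True when both paths agree up to floating-point tolerance.
same = torch.allclose(seq_out, manual_out, atol=1e-5)
print(same)
```

If the two outputs agree, passing the whole (timesteps, num_features) sequence to nn.LSTM loses nothing compared with stepping an LSTMCell by hand; the recurrence simply happens inside the module.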

Basically, if I want to implement a 3-layer LSTM with hidden sizes=(50, 25, 25), is this the correct way to do it in PyTorch?

import torch
from torch import nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
class AudioLSTM(nn.Module):
  def __init__(self, batch_size, timesteps, feature_size, hidden_size, dropout):
    super(AudioLSTM, self).__init__()
    self.hidden_size = hidden_size
    self.timesteps = timesteps
    self.batch_size = batch_size
    self.feature_size = feature_size

    self.batch_norm = nn.BatchNorm1d(self.timesteps)
    self.lstm1 = nn.LSTMCell(feature_size, hidden_size[0])
    self.lstm2 = nn.LSTMCell(hidden_size[0], hidden_size[1])
    self.lstm3 = nn.LSTMCell(hidden_size[1], hidden_size[2])
    self.dropout = nn.Dropout(dropout)

  def init_hidden(self):
    h_t0 = torch.zeros(1, self.hidden_size[0], dtype=torch.float32)
    c_t0 = torch.zeros(1, self.hidden_size[0], dtype=torch.float32)
    h_t1 = torch.zeros(1, self.hidden_size[1], dtype=torch.float32)
    c_t1 = torch.zeros(1, self.hidden_size[1], dtype=torch.float32)
    h_t2 = torch.zeros(1, self.hidden_size[2], dtype=torch.float32)
    c_t2 = torch.zeros(1, self.hidden_size[2], dtype=torch.float32)
    return [(h_t0, c_t0), (h_t1, c_t1), (h_t2, c_t2)]

  def forward(self, input, hidden):    
    input = self.batch_norm(input)
    outputs = []
    for b in range(self.batch_size):
      outputs_t = []

      for t in range(self.timesteps):
        input_t = input[b][t]
        input_t = input_t.view(1, self.feature_size)

        #hidden[0] = (h_t0, c_t0)
        hidden[0] = self.lstm1(input_t, hidden[0]) 

        hidden[1] = self.lstm2(hidden[0][0], hidden[1])

        hidden[2] = self.lstm3(hidden[1][0], hidden[2])
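        output = hidden[2][0]
        outputs_t.append(output)  # collect the last layer's hidden state at this timestep

      outputs_t = torch.cat(outputs_t, dim=1)  # (1, timesteps * hidden_size[2])
      outputs_t = outputs_t.reshape(1, -1).squeeze() #to "flatten"?
      outputs.append(outputs_t)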

    outputs = torch.stack(outputs)
    outputs = self.dropout(outputs)

    return outputs 
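batch_size = 32
model = AudioLSTM(batch_size=batch_size, timesteps=4, feature_size=68, hidden_size=(50, 25, 25), dropout=0.3)

# toy example training
a = torch.arange(10039*4*68).reshape(10039, 4, 68).type(torch.FloatTensor)
for epoch in range(10):
    # split a into batches of batch_size along the sample dimension
    for batch in torch.split(a, batch_size):
        if batch.size(0) != batch_size:
            continue  # forward() loops over exactly self.batch_size samples
        hidden = model.init_hidden()
        out = model(batch, hidden)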

Keras implementation I am trying to emulate:

# speech network
    input_speech = Input(shape=(20, 68), name='speech_input')
    net_speech = BatchNormalization()(input_speech)
    net_speech = LSTM(50, return_sequences=True)(net_speech) 
    net_speech = LSTM(25, return_sequences=True)(net_speech) 
    net_speech = LSTM(25, return_sequences=True)(net_speech) 

    net_speech = Flatten()(net_speech)
    model_speech = Dropout(0.3)(net_speech)

How about using the LSTM with projection? In PyTorch:

rnn0 = nn.LSTM(input_dim, 50, proj_size=25)
rnn1 = nn.LSTM(25, 25)
rnn2 = nn.LSTM(25, 25)

Thank you so much for your answer. I have read the docs but I can’t understand what proj_size is really doing. Could you give some explanation?

Just as the official doc:
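As a small shape check (a sketch that is not taken from the docs, using an assumed batch of 32 sequences with 20 timesteps and 68 features), the effect of proj_size can be seen by comparing a plain LSTM with a projected one. With proj_size set, the hidden state is projected down to that size, while the cell state keeps the full hidden_size.

```python
import torch
from torch import nn

x = torch.randn(32, 20, 68)  # (batch, timesteps, features), assumed toy sizes

plain = nn.LSTM(68, 50, batch_first=True)
projected = nn.LSTM(68, 50, proj_size=25, batch_first=True)

out_plain, (h_plain, c_plain) = plain(x)
out_proj, (h_proj, c_proj) = projected(x)

# Without projection: output and hidden state both have hidden_size features.
print(out_plain.shape, h_plain.shape, c_plain.shape)
# (32, 20, 50), (1, 32, 50), (1, 32, 50)

# With projection: output and hidden state shrink to proj_size,
# while the cell state keeps hidden_size features.
print(out_proj.shape, h_proj.shape, c_proj.shape)
# (32, 20, 25), (1, 32, 25), (1, 32, 50)
```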

But wouldn’t that be the same as doing this?

lstm1 = nn.LSTM(input_dim, 50)
lstm2 = nn.LSTM(50, 25)
lstm3 = nn.LSTM(25, 25)

Yes, I think this is closer to the Keras implementation you posted. My version with projections came from misunderstanding your question.
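For reference, the three stacked nn.LSTM layers could be wrapped into one module roughly like this. It is a minimal sketch, assuming batch_first inputs of shape (batch, 20, 68); the class name StackedLSTM and its arguments are made up for illustration, and the batch normalization from the Keras model is left out for brevity.

```python
import torch
from torch import nn


class StackedLSTM(nn.Module):
    """Hypothetical wrapper: one nn.LSTM per layer so each can have its own hidden size."""

    def __init__(self, feature_size=68, hidden_sizes=(50, 25, 25), dropout=0.3):
        super().__init__()
        sizes = (feature_size, *hidden_sizes)
        # Each layer consumes the previous layer's per-timestep output.
        self.layers = nn.ModuleList(
            [nn.LSTM(n_in, n_out, batch_first=True)
             for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, timesteps, features); hidden states default to zeros per call.
        for lstm in self.layers:
            x, _ = lstm(x)  # keep the full sequence, like return_sequences=True
        # Flatten (timesteps, hidden) into one vector per sample, like Keras Flatten().
        return self.dropout(x.flatten(start_dim=1))


model = StackedLSTM()
out = model(torch.randn(32, 20, 68))
print(out.shape)  # torch.Size([32, 500])
```

Because each nn.LSTM call handles the whole batch and all timesteps, no explicit loop over samples or timesteps is needed, and no hidden state has to be passed in unless state should carry over between calls.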