LSTM input size, hidden size and sequence length

I have a sequence of shape [Batch=2, SeqLength=128, InputFeatures=4].
I was reading about LSTM, but I am confused.
According to the documentation, there are two main parameters:

  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h

Given an input, the LSTM outputs a vector h_n containing the final hidden state for each element in the sequence; however, its size is [1, Batch, hidden_size], whereas I expected something like [SeqLength, Batch, hidden_size].

I also noticed that the LSTM does not depend on SeqLength, so can someone explain to me how the LSTM handles variable length?

Can an LSTM be thought of as a kind of data fusion, because the input features are reduced to a hidden_size vector?

I am sorry, perhaps I am misunderstanding, but I would like to know in detail how LSTM works in order to use it properly.


I am currently developing a model using LSTM. For a good understanding of how it works in detail, I advise you to see this page.

If any doubts persist, feel free to contact me or ask here in the forum.

Hi Andre, I read the referred post.

First, the plots are unclear: which operation is performed between the input data x_t and the previous hidden state? Addition? Dot product?

Second, the author claims:

This property enables LSTMs to process entire sequences of data (e.g. time series) without treating each point in the sequence independently, but rather, retaining useful information about previous data in the sequence to help with the processing of new data points…

According to this, if I have a sequence vector of 128 points, x_i = [x_i,1, …, x_i,128], it is processed as a whole (“process entire sequences… without treating each point independently”). Is that correct?

I have read many papers saying that an LSTM is able to learn temporal features, so I imagine that it really passes over each point in the sequence.

Another doubt: what is the meaning of “previous data in the sequence”? Does it refer to other sequences? For instance, I have a dataset of samples x_i, i = 1…N, where each sample x_i contains a sequence of 128 points.

In my test, I changed the length of x_i to 256, and the LSTM seems invariant to the time length, but I cannot figure out how each sequence is being processed.

In addition, the classic PyTorch LSTM example uses a zero initial hidden state and cell state (h0, c0) in the forward method, so it seems that the LSTM internally traverses the individual sample points of the given sequence.

You are supposed to get something like that. Using your original example:

Let’s try a short code snippet:

import torch
from torch import nn

batch_size = 2
sequence_length = 128
input_features = 4
output_features = 16

# produce random data
x = torch.randn(batch_size, sequence_length, input_features)
# torch.Size([2, 128, 4])

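# batch_first=True because x is laid out as (batch, sequence, features)
lstm_layer = nn.LSTM(
    input_size=input_features,
    hidden_size=output_features,
    batch_first=True,
)

expected_output_shape = (batch_size, sequence_length, output_features)

# the second return value is the tuple (h_n, c_n) of final states
x_out, _ = lstm_layer(x)

print(x_out.shape == expected_output_shape)
# True

print(x_out.shape)
# torch.Size([2, 128, 16])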

If your data is in the form (batch, sequence, features), you have to create the LSTM with the batch_first=True argument (it is False by default). Otherwise, you have to pass your data to the LSTM in the form (sequence, batch, features) instead, and you will similarly get your output as (sequence, batch, features_out).
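To make the two layouts concrete, here is a minimal sketch that runs the same random data through both conventions. The variable names (lstm_seq_first, lstm_batch_first and so on) are only illustrative, and the two layers have independent random weights, so only the shapes are comparable, not the values.

```python
import torch
from torch import nn

batch_size, sequence_length, input_features, hidden_size = 2, 128, 4, 16

# Default layout (batch_first=False): input is (sequence, batch, features).
lstm_seq_first = nn.LSTM(input_size=input_features, hidden_size=hidden_size)
x_seq_first = torch.randn(sequence_length, batch_size, input_features)
out_seq_first, _ = lstm_seq_first(x_seq_first)
print(out_seq_first.shape)
# torch.Size([128, 2, 16])

# batch_first=True: input is (batch, sequence, features).
lstm_batch_first = nn.LSTM(
    input_size=input_features, hidden_size=hidden_size, batch_first=True
)
x_batch_first = x_seq_first.transpose(0, 1).contiguous()  # swap the first two axes
out_batch_first, _ = lstm_batch_first(x_batch_first)
print(out_batch_first.shape)
# torch.Size([2, 128, 16])
```

In both cases the last dimension of the output is hidden_size; only the order of the first two axes changes.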

This article would be a good read for you. Look at the following image, of an “unrolled” LSTM:

On the left, we see that there is a “loop” connecting the LSTM A to itself. This is explained on the right side, and I will try to break it down into very simple steps:

  • Your input, X, is a sequence of length T. (Let us consider batch = 1 for simplicity.) This means the sequence X is composed of [x_0, x_1, ... , x_{T-1}]
  • x_0 is first passed to the LSTM cell
    • You can consider x_0 (and any x_i) to have the size (1, 1, input_features). It is just a single item inside the sequence, and our batch is 1.
  • From x_0, the corresponding output h_0 is computed
    • h_0 will have the size (1, 1, output_features), where output_features is the hidden_size of the LSTM
  • The cell state of A is updated. Some old information has been retained, some old information has been forgotten, and some new information has been gained.
  • Now, x_1 is passed to A. Similarly, h_1 is computed, and A is again updated
  • This process will occur T times. Each time, an item in the sequence will pass through A, and A’s cell state will be updated
  • Thus, for X = [x_0, x_1, ... , x_{T-1}], you will get the output [h_0, h_1, ... , h_{T-1}]
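The steps above can be sketched by feeding the sequence to nn.LSTM one item at a time and carrying the (hidden, cell) state forward by hand. This is only an illustration of the unrolling, not how you would normally call the layer; the names state, step_outputs and stepped_out are made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

input_features, hidden_size, seq_len = 4, 16, 128
lstm = nn.LSTM(input_size=input_features, hidden_size=hidden_size, batch_first=True)
x = torch.randn(1, seq_len, input_features)  # batch = 1

with torch.no_grad():
    # One call over the whole sequence.
    full_out, (h_n, c_n) = lstm(x)

    # The same computation, one item of the sequence at a time.
    state = None  # None means zero initial hidden and cell state
    step_outputs = []
    for t in range(seq_len):
        x_t = x[:, t:t + 1, :]           # shape (1, 1, input_features)
        h_t, state = lstm(x_t, state)    # h_t has shape (1, 1, hidden_size)
        step_outputs.append(h_t)
    stepped_out = torch.cat(step_outputs, dim=1)  # (1, seq_len, hidden_size)

# Both checks are expected to print True (up to floating-point tolerance).
print(torch.allclose(full_out, stepped_out, atol=1e-5))
print(torch.allclose(full_out[:, -1, :], h_n[0], atol=1e-5))
```

The second check shows that h_n is just the hidden state after the last item, which is why its shape has no sequence dimension.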

So naturally, LSTMs are invariant to sequence_length (the time dimension): the same cell, with the same weights, is applied at every step. Whatever length of sequence you give, the output will also have the same sequence length. For more reading, continue here.
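As a quick check of this, the sketch below passes sequences of length 128 and 256 through one and the same layer and records the shapes of the per-step output and of the final hidden state h_n. The shapes dictionary is just a helper for the example.

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=4, hidden_size=16, batch_first=True)

shapes = {}
for seq_len in (128, 256):
    x = torch.randn(2, seq_len, 4)  # (batch, sequence, features)
    output, (h_n, c_n) = lstm(x)
    shapes[seq_len] = (tuple(output.shape), tuple(h_n.shape))
    print(seq_len, shapes[seq_len])
# 128 ((2, 128, 16), (1, 2, 16))
# 256 ((2, 256, 16), (1, 2, 16))

# The number of learnable parameters does not depend on the sequence length.
print(sum(p.numel() for p in lstm.parameters()))
# 1408
```

The output follows the sequence length, while h_n keeps the shape (num_layers, batch, hidden_size) because it only holds the state after the last step. Note that h_n is laid out this way even when batch_first=True.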


@ID56, first, I am grateful to you. Your code works fine and clarifies both the input and the output of the PyTorch LSTM.

Now I understand the invariance property. I drew a plot for:

output_features = 16

Please feel free to correct me if I am still wrong.

Thank you for your time and shared links.


Yes, it looks great! But I would put the (1, 1, input_features) vectors below A, because their current position may be a bit confusing. Maybe this would illustrate it better:

I’m drawing for a smaller hidden_size=6, so that it is easier to show the result.

This last picture is great for better understanding. Thank you.