How to retrieve hidden states for all time steps in LSTM or BiLSTM?

According to the docs of nn.LSTM outputs:

output (seq_len, batch, hidden_size * num_directions): tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
h_n (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t=seq_len
c_n (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t=seq_len

If I want the hidden states for all t, that is, t = 1, 2, …, seq_len, how can I do that? One approach is to loop an LSTM cell over all the words of a sentence and collect the hidden state, cell state and output at each step. I am doing a language modeling task using an LSTM where I need the hidden state representations of all the words of a sentence.

Any help would be appreciated!


To get the individual hidden states, you do indeed have to loop over each timestep and collect the hidden states.


Isn’t output containing all the hidden states (at least h)?

Thanks smth, I was also thinking about that. Just one more thing to confirm: if I loop over the individual timesteps, will it be inefficient in terms of running time, and will it change the loss computation for the entire network? I think it will not, though.

No, the output only contains output of the last time step.


It won’t change loss computation, but it will likely be much slower.


No, output contains the hidden states for each time step, but only for the last layer in a stacked model, or the only layer in a single-layer model.
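A minimal sketch of the point above, assuming a current PyTorch install and a unidirectional 2-layer nn.LSTM; the sizes are made up for illustration and are not from this thread. It checks that output carries h_t of the last layer for every t, and that its last row equals the last layer's entry in h_n.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the thread).
seq_len, batch, input_size, hidden_size, num_layers = 7, 3, 4, 5, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers)
x = torch.randn(seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)

# output holds h_t of the LAST layer for every timestep t.
print(output.shape)  # torch.Size([7, 3, 5])
# h_n holds the FINAL hidden state of every layer.
print(h_n.shape)     # torch.Size([2, 3, 5])

# For a unidirectional LSTM, the last row of output is the final
# hidden state of the last layer.
last_step_matches = torch.allclose(output[-1], h_n[-1])
print(last_step_matches)
```

So for h of the last layer no loop is needed; a loop only becomes necessary for per-timestep cell states or for the hidden states of lower layers.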


Hey @wasiahmad, @smth, can you share the code snippet showing how you got all the input states?

I’m providing my code. Can you suggest how I should proceed?

Although it is in Torch, I guess the idea should be similar.

lstm=nn.LSTM(inputSize, hiddenSize, rho)
input=torch.randn(10,5) -- I am providing a 10*5 input tensor
print(lstm:getHiddenState(1)) -- prints the cell state and hidden state for the first time step

Hi @Abhishek_Arya,

nn.LSTM(inputSize, hiddenSize,rho) can return 2 variables instead of 1
lstm, recent_hidden=nn.LSTM(inputSize, hiddenSize,rho)

lstm will contain the whole list of hidden states,
while recent_hidden will give you the last hidden state.

print(lstm:getHiddenState(1)): I am not sure what function this is, but if you print out lstm and recent_hidden, the values in the last row of lstm should be the same as recent_hidden.

Hope it helps


RNNs are inherently sequential. They are auto-regressive, meaning the input for timestep t contains the output for timestep (t-1), so you first have to calculate the output for timestep (t-1).

This is one reason why ‘attention is all you need’ is quite interesting.

That is assuming that the LSTM has only one layer. It isn’t the case for a stacked LSTM.


I experimented with using a for loop to collect every hidden state. The loss doesn’t change, but the runtime is ~10x longer (with a sequence length of 30).

        embedding = self.drop(self.emb(inputs))  # (seq_len, batch, emb_dim)
        seq_len = embedding.size(0)
        outputs = []
        for i in range(seq_len):
            # Feed one timestep at a time, keeping the leading time dimension.
            cur_emb = embedding[i:i+1]
            o, hidden = self.gru_1(cur_emb, hidden)
            outputs.append(o)
        # Concatenate along time: (seq_len, batch, hidden_size)
        outputs = torch.cat(outputs, dim=0)

Agree with you. I tried looping through each time step in order to get the h and c of all time steps, and the computation is much much much slower than simply using h, _ = self.lstm(x).

It would be good to have the output of nn.LSTM contain both h and c for all timesteps.
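For reference, here is a hedged sketch of the step-by-step approach discussed above: an nn.LSTMCell loop that collects both h and c at every timestep. To show that it computes the same thing as nn.LSTM, the cell's weights are copied from a single-layer, unidirectional nn.LSTM. The sizes and variable names (all_h, all_c) are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions).
seq_len, batch, input_size, hidden_size = 6, 2, 4, 5

lstm = nn.LSTM(input_size, hidden_size)      # single layer, unidirectional
cell = nn.LSTMCell(input_size, hidden_size)

# Copy the layer-0 parameters so the cell reproduces the LSTM.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(seq_len, batch, input_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)

all_h, all_c = [], []
for t in range(seq_len):
    # One recurrence step; keep both states for this timestep.
    h, c = cell(x[t], (h, c))
    all_h.append(h)
    all_c.append(c)
all_h = torch.stack(all_h)  # (seq_len, batch, hidden_size)
all_c = torch.stack(all_c)  # (seq_len, batch, hidden_size)

# Compare against the fused nn.LSTM call.
output, (h_n, c_n) = lstm(x)
h_matches = torch.allclose(all_h, output, atol=1e-5)
c_last_matches = torch.allclose(all_c[-1], c_n[0], atol=1e-5)
print(h_matches, c_last_matches)
```

The loop gives access to every c_t, at the cost of the slowdown people report in this thread, since each step is a separate Python-level call.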


Why not try a GRU? Outputting all the cell states of an LSTM will cost you a huge amount of time with a step-by-step LSTMCell.


GRU and LSTM are essentially the same here. If you want to collect GRU hidden states, you have to loop through it as well. The only difference is that a GRU has 3/4 of the parameters of an LSTM.
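The 3/4 figure can be checked by counting parameters, since an LSTM layer has four gate blocks and a GRU layer has three. A small sketch, with count_params as a hypothetical helper and arbitrary illustrative sizes:

```python
import torch.nn as nn


def count_params(module):
    """Hypothetical helper: total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


# Illustrative sizes (assumptions).
input_size, hidden_size = 8, 16

lstm_params = count_params(nn.LSTM(input_size, hidden_size))
gru_params = count_params(nn.GRU(input_size, hidden_size))

print(lstm_params, gru_params)   # 1664 1248
print(gru_params / lstm_params)  # 0.75
```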

For a fixed-length sequence, say X=[x1,x2,…,x1000], if you use
lstm_output, lstm_states = nn.LSTM(X)
it may cost 1 second, while if you use LSTMCell and a for loop, it will cost you about 10 seconds, almost 10x slower than nn.LSTM.
Here comes the problem: if you want to save the cell state at every step of the sequence, you have to use nn.LSTMCell and push each cell state into a list (nn.LSTM only returns the last cell state).
If you use a GRU, the hidden states of the whole sequence are just the output of nn.GRU:
gru_outputs,_ = nn.GRU(X)
where gru_outputs is just what you want to retrieve.
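A short sketch of the GRU case, assuming a single-layer, unidirectional nn.GRU with made-up sizes. A GRU has no separate cell state, so the per-timestep output is the full recurrent state of that layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions).
seq_len, batch, input_size, hidden_size = 10, 2, 4, 5

gru = nn.GRU(input_size, hidden_size)
x = torch.randn(seq_len, batch, input_size)

# gru_outputs[t] is the hidden state after timestep t;
# h_n is the hidden state after the final timestep only.
gru_outputs, h_n = gru(x)

print(gru_outputs.shape)  # torch.Size([10, 2, 5])
print(h_n.shape)          # torch.Size([1, 2, 5])

final_matches = torch.allclose(gru_outputs[-1], h_n[0])
print(final_matches)
```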


Isn’t the last time step the last hidden state too? Sorry I’m a noob who likes to ask a lot of questions.


As you can see in the definition of the LSTM module, the output will give you the hidden states of the last layer at all timesteps.

output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t.

This is not entirely true. The output of the network is different from the hidden state and cell state. The output does have the shape (seq_len, batch_num, num_directions x output_sz) when batch_first=False.

Unfortunately, the hidden and cell states returned are the final states of each layer in the stack, for each sample in the batch: (num_directions x num_layers, batch_sz, output_sz). E.g., during inference on a batch of 2 input samples with a 5-layer bidirectional LSTM, the hidden/cell state tensors would each have a shape of (10, 2, output_dim).
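The shapes in that example can be reproduced with a small sketch: a 5-layer bidirectional nn.LSTM and a batch of 2. The input size, hidden size and sequence length are illustrative assumptions. It also shows one way to separate layers and directions with view, and where each direction's final state sits inside output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Batch of 2 and 5 bidirectional layers as in the post; other sizes assumed.
seq_len, batch, input_size, hidden_size = 6, 2, 3, 4
num_layers, num_directions = 5, 2

lstm = nn.LSTM(input_size, hidden_size,
               num_layers=num_layers, bidirectional=True)
x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([6, 2, 8])
print(h_n.shape)     # torch.Size([10, 2, 4])

# Separate layers/directions in h_n and directions in output.
h_n_split = h_n.view(num_layers, num_directions, batch, hidden_size)
out_split = output.view(seq_len, batch, num_directions, hidden_size)

# The forward direction finishes at the last timestep,
# the backward direction finishes at the first timestep.
fwd_matches = torch.allclose(out_split[-1, :, 0], h_n_split[-1, 0])
bwd_matches = torch.allclose(out_split[0, :, 1], h_n_split[-1, 1])
print(fwd_matches, bwd_matches)
```

Note that output[-1] of a bidirectional model is therefore not the same thing as the final hidden state: its backward half has only seen the last input.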