output (seq_len, batch, hidden_size * num_directions): tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
h_n (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t=seq_len
c_n (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t=seq_len
If I want to get the hidden states for all t, i.e. t = 1, 2, …, seq_len, how can I do that? One approach is to loop through an LSTM cell for all the words of a sentence and collect the hidden state, cell state, and output at each step. I am doing a language-modeling task with an LSTM, where I need the hidden-state representations of all the words in a sentence.
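The per-timestep loop described above might look like this (a minimal sketch, assuming a single-layer unidirectional model built from nn.LSTMCell; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 7, 3, 5, 4
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(seq_len, batch, input_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)

# collect the hidden and cell state at every timestep
all_h, all_c = [], []
for t in range(seq_len):
    h, c = cell(x[t], (h, c))
    all_h.append(h)
    all_c.append(c)

all_h = torch.stack(all_h)  # (seq_len, batch, hidden_size)
all_c = torch.stack(all_c)  # (seq_len, batch, hidden_size)
```

torch.stack turns the Python lists into tensors with the same (seq_len, batch, hidden_size) layout that nn.LSTM's output uses.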
Thanks @smth, I was also thinking about that. Just one more thing to confirm: if I loop over the individual timesteps, will it be inefficient in terms of time complexity, and will it change the loss computation for the entire network? I think it will not.
Hey @wasiahmad, @smth, can you share a code snippet showing how you got all the hidden states?
I’m providing my code. Can you suggest how I should go on?
Although it is in Torch, I guess the idea should be similar.
inputSize = 5
hiddenSize = 5
rho = 5
lstm = nn.LSTM(inputSize, hiddenSize, rho)
input = torch.randn(10, 5) -- a 10x5 input tensor
print(lstm:getHiddenState(1)) -- prints the cell state and hidden state for the first time step
nn.LSTM(inputSize, hiddenSize, rho) can return 2 variables instead of 1:
lstm, recent_hidden = nn.LSTM(inputSize, hiddenSize, rho)
lstm will contain the whole list of hidden states,
while recent_hidden will give you the last hidden state.
print(lstm:getHiddenState(1)) -- not sure which function this is, but if you print out lstm and recent_hidden, the values on the last row of lstm should be the same as recent_hidden.
RNNs are inherently sequential. They are auto-regressive, meaning the input for timestep t contains the output from timestep (t-1), so you first have to calculate the output for timestep (t-1).
This is one reason why ‘Attention Is All You Need’ is quite interesting.
Agreed. I tried looping through each time step to get the h and c of all time steps, and the computation is much, much slower than simply using h, _ = self.lstm(x).
It would be good to have output of nn.LSTM containing both h and c of all timesteps.
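One way to check that the loop gives the same result as the fused call is to copy a single-layer nn.LSTM's weights into an nn.LSTMCell and compare (a sketch with arbitrary sizes; the parameter names weight_ih_l0 etc. are the standard nn.LSTM attribute names for layer 0):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 6, 2, 3, 4

lstm = nn.LSTM(input_size, hidden_size)       # single layer, unidirectional
cell = nn.LSTMCell(input_size, hidden_size)

# copy the layer's parameters into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)                     # fused: one call, all timesteps

h = x.new_zeros(batch, hidden_size)
c = x.new_zeros(batch, hidden_size)
hs = []
for t in range(seq_len):                      # manual: one LSTMCell step per t
    h, c = cell(x[t], (h, c))
    hs.append(h)
hs = torch.stack(hs)                          # (seq_len, batch, hidden_size)
```

The loss is unaffected because both paths compute identical values; only wall-clock time differs, since the fused kernel avoids per-step Python and launch overhead.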
GRU and LSTM are essentially the same here. If you want to collect GRU hidden states, you have to loop through it as well. The only difference is that a GRU has about 3/4 of an LSTM's parameters.
For a fixed-length sequence, say X = [x1, x2, …, x1000]:
if you use lstm_output, lstm_states = nn.LSTM(X), it may cost 1 second,
while if you use LSTMCell and a for loop, it will cost you about 10 seconds, almost 10x slower than nn.LSTM.
Here comes the problem: if you want to save the cell states across the whole sequence, you have to use nn.LSTMCell and push each cell state into a list (nn.LSTM only returns the last cell state).
If you use a GRU, the hidden states of the whole sequence are just the outputs of nn.GRU:
say gru_outputs, _ = nn.GRU(X),
in which gru_outputs = [h1, h2, ..., h1000];
gru_outputs is just what you want to retrieve.
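For a GRU the output and the hidden states coincide, which can be checked directly (a sketch, assuming a single-layer unidirectional GRU with arbitrary sizes):

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 8, 2, 3, 4
gru = nn.GRU(input_size, hidden_size)  # single layer, unidirectional

x = torch.randn(seq_len, batch, input_size)
# gru_outputs holds h_t for every t; h_n holds only the final h_t
gru_outputs, h_n = gru(x)              # gru_outputs: (seq_len, batch, hidden_size)
```

For this single-layer unidirectional case, the last row of gru_outputs equals h_n, so no per-step loop is needed.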
As you can see in the definition of the LSTM module here, output gives you the hidden states of the last layer for all timesteps.
output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t.
This is not entirely true. The output of the network is different from the hidden state and cell state. The output does have the shape (seq_len, batch, num_directions * hidden_size) when batch_first=False.
Unfortunately, the hidden and cell states returned are the final states from each layer in the stack, for each sample in the batch: (num_layers * num_directions, batch, hidden_size). E.g., during inference on a batch of 2 input samples with a 5-layer bidirectional LSTM, the hidden/cell state tensors would have shape (10, 2, hidden_size).
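The shape difference is easy to verify (a sketch with arbitrary sizes, matching the 5-layer bidirectional example above):

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 7, 2, 3, 4
lstm = nn.LSTM(input_size, hidden_size, num_layers=5, bidirectional=True)

x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

# output: last layer's hidden states for every t,
#         both directions concatenated -> (seq_len, batch, 2 * hidden_size)
# h_n, c_n: final states from each of num_layers * num_directions = 10
#         layer-direction pairs -> (10, batch, hidden_size)
```

So to recover per-timestep states for the intermediate layers (rather than just the last layer), you still need a manual loop or per-layer calls; output only exposes the top of the stack.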