BiDirectional 3 Layer LSTM Hidden Output

Can I just confirm, since I don’t think the docs say this explicitly: if I have a bidirectional 3-layer LSTM and it gives me a hidden output of shape (6, <batch size>, <hidden_state_size>), then [0,:,:] is the 1st layer forward, [1,:,:] is the 1st layer backward, etc.?

Bonus question, if I feed my LSTM a PackedSequence object, I don’t need to unpack it if I am using the hidden state, not the output for the rest of the network?


The docs say: h_n of shape (num_layers * num_directions, batch, hidden_size); the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size). So in the bidirectional case, you have 2 tensors with hidden_size for each direction. I think most people just concatenate them before the next step, e.g., pushing through a Linear(2*hidden_size, output_size). I think I also saw both tensors being averaged or summed up element-wise.

Regarding your bonus question: I think you’re right. At least I got essentially the same result when I compared using PackedSequence with variable sequence lengths and padding vs. sequences with the same length and no padding.

Thanks for your response! I realize I mis-worded (and have since fixed) my question a bit. I said I would get an output tensor of size (3, <batch size>, <hidden_state_size>) but I meant (6, <batch size>, <hidden_state_size>) for a bidirectional 3 layer LSTM.

The question is: Are [0,:,:] and [1,:,:] of the hidden state output the first layer’s hidden forward and backward directions, respectively?

In other words, does the hidden state output (in order): Layer 1 Forward, Layer 1 Backward, Layer 2 Forward, … ?

Because I could see instead the output being Layer 1 Forward, Layer 2 Forward, …, Layer 1 Backward, Layer 2 Backward … It seems unlikely it would be like this but I wanted to confirm it was not!
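One way to sanity-check this empirically (a small sketch; the sizes are arbitrary): since the docs say `h_n.view(num_layers, num_directions, batch, hidden_size)` separates layers and directions, the flat dim-0 order must be layer-major, i.e. index `2*layer + direction`:

```python
import torch
import torch.nn as nn

num_layers, batch, hidden = 3, 4, 8
lstm = nn.LSTM(5, hidden, num_layers=num_layers, bidirectional=True)
_, (h_n, _) = lstm(torch.randn(7, batch, 5))

# view() splits dim 0 as (layer, direction) -> flat index is 2*layer + direction,
# i.e. Layer 1 Forward, Layer 1 Backward, Layer 2 Forward, ...
h = h_n.view(num_layers, 2, batch, hidden)
for layer in range(num_layers):
    for direction in range(2):
        assert torch.equal(h_n[2 * layer + direction], h[layer, direction])
```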


Please see below the forward() function of my GRU/LSTM classifier. In this code, h_1 and h_2 represent the last hidden states for the forward and backward pass in case of a bidirectional RNN. I’m reasonably sure that I’ve read the docs correctly; it definitely works just fine. The code has a bunch of comments, but let me know if you have further questions:

def forward(self, X_sorted, X_length_sorted):
    batch_size = X_sorted.shape[0]
    # Push through embedding layer
    X = self.word_embeddings(X_sorted)
    # Transpose (batch_size, seq_len, dim) to (seq_len, batch_size, dim)
    X = torch.transpose(X, 0, 1)
    # Pack padded sequence
    X = nn.utils.rnn.pack_padded_sequence(X, X_length_sorted)
    # Push through RNN layer
    X, hidden = self.rnn(X, self.hidden)
    # Unpack packed sequence (not needed anymore, since hidden state is used)
    #X, output_lengths = nn.utils.rnn.pad_packed_sequence(X)

    if self.rnn_type == 'gru':
        final_state = hidden.view(self.num_layers, self.directions_count, batch_size, self.rnn_hidden_dim)[-1]
    elif self.rnn_type == 'lstm':
        # hidden is a (h_n, c_n) tuple for an LSTM; use h_n
        final_state = hidden[0].view(self.num_layers, self.directions_count, batch_size, self.rnn_hidden_dim)[-1]
    else:
        raise Exception('Unknown rnn_type. Valid options: "gru", "lstm"')

    if self.directions_count == 1:
        X = final_state.squeeze()
    elif self.directions_count == 2:
        h_1, h_2 = final_state[0], final_state[1]
        #X = h_1 + h_2                # Add both states (needs different input size for first linear layer)
        X =, h_2), 1)  # Concatenate both states

    # Push through series of linear layers (incl. nonlinearity & dropout)
    for l in self.linears:
        X = l(X)
    # Calculate and return normalized log probabilities
    log_probs = F.log_softmax(X, dim=1)
    return log_probs

Thanks for sharing. I guess if I want to be sure I am adding the right things then I should use .view(...) to separate out the layers and directions first before adding them together.

Quick question: why do you call [-1] at the end of the .view() for both the LSTM and the GRU?

Yeah, in case of multiple layers and two directions I would first call view(num_layers, num_directions, batch, hidden_size) to separate the hidden state cleanly.

The [-1] is simply to get the hidden state(s) of the last layer, given that the shape is (num_layers, num_directions, batch, hidden_size); i.e., the shape of final_state after that is (num_directions, batch, hidden_size). If num_directions=1, the shape is (1, batch, hidden_size) and I only need the squeeze(). If num_directions=2, the shape is (2, batch, hidden_size) and I can access the two directions with [0] and [1].

So instead of final_state, a better name would probably be last_layer_hidden_state or something.


Hi Chris

This concatenation and dense network action works great if you only want to use the final layer’s hidden state. But what about when you need to reuse all layers’ hidden states (both h and c), e.g., when feeding a 3-layer bidirectional encoder LSTM’s h_n into a 3-layer unidirectional decoder LSTM?

I assume I would then have to reshape with h_n.view() to group the bidirectional hidden states per layer, then, iterating over each layer, concatenate the forward and backward states and push each concatenated layer through a dense layer to get the final hidden state variable. Another question is whether this is also necessary for the c state of the LSTM.