Multi-Layer Bidirectional LSTM/GRU merge modes

I am trying to replicate my code from Keras into PyTorch to compare the performance of multi-layer bidirectional LSTM/GRU models on CPUs and GPUs. I would like to look into different merge modes such as ‘concat’ (which is the default mode in PyTorch), sum, mul, average. Merge mode defines how the output from the forward and backward direction will be passed on to the next layer.

In Keras, the merge mode of a multi-layer bidirectional LSTM/GRU model is changed with a single argument. Does something similar exist in PyTorch? One option is to apply the merge operation manually after every layer and pass the result to the next layer, but since I want to study performance, I would like to know whether there is a more efficient way.
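The manual option described above could be sketched roughly as follows. This is only an illustration, not a built-in PyTorch feature: the class name `MergedBiLSTM` and its `merge_mode` argument are hypothetical, and the sketch assumes `batch_first=True` inputs and LSTM layers (an `nn.GRU` should slot in the same way). It stacks single-layer bidirectional `nn.LSTM` modules and merges the two directions after each one.

```python
import torch
import torch.nn as nn


class MergedBiLSTM(nn.Module):
    """Hypothetical helper: single-layer bidirectional LSTMs with a merge after every layer."""

    def __init__(self, input_size, hidden_size, num_layers, merge_mode="concat"):
        super().__init__()
        if merge_mode not in {"concat", "sum", "mul", "ave"}:
            raise ValueError(f"unsupported merge_mode: {merge_mode}")
        self.merge_mode = merge_mode
        self.hidden_size = hidden_size
        # 'concat' hands 2 * hidden_size features to the next layer; the other modes hand hidden_size.
        merged_size = 2 * hidden_size if merge_mode == "concat" else hidden_size
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            in_size = input_size if i == 0 else merged_size
            self.layers.append(
                nn.LSTM(in_size, hidden_size, batch_first=True, bidirectional=True)
            )

    def _merge(self, out):
        # out: (batch, seq_len, 2 * hidden_size); forward half first, backward half second.
        if self.merge_mode == "concat":
            return out
        fwd = out[..., :self.hidden_size]
        bwd = out[..., self.hidden_size:]
        if self.merge_mode == "sum":
            return fwd + bwd
        if self.merge_mode == "mul":
            return fwd * bwd
        return (fwd + bwd) / 2  # 'ave'

    def forward(self, x):
        for lstm in self.layers:
            out, _ = lstm(x)      # discard the per-layer hidden state in this sketch
            x = self._merge(out)  # merged output becomes the next layer's input
        return x


x = torch.randn(4, 7, 10)  # (batch, seq_len, features)
model = MergedBiLSTM(input_size=10, hidden_size=16, num_layers=3, merge_mode="sum")
print(model(x).shape)  # torch.Size([4, 7, 16])
```

Because each layer is a separate module call, this version cannot use the fused multi-layer kernel that a single `nn.LSTM(num_layers=...)` call may use, so it may well be slower; how much would need to be measured.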


From the Keras Docs:

merge_mode: Mode by which outputs of the forward and backward RNNs will be combined. One of {‘sum’, ‘mul’, ‘concat’, ‘ave’, None}. If None, the outputs will not be combined, they will be returned as a list. Default value is ‘concat’.

If I understand this correctly, this concerns only the merging of the last(!) layer of a Bi-RNN and is therefore independent of the number of layers. This seems intuitive, since merging the directions between the intermediate layers does not seem that meaningful to me.

So when it comes to merging the forward and backward pass, you can do the following:

# Push through RNN layer
rnn_output, self.hidden = self.rnn(X, self.hidden)

# Extract last hidden state depending on the RNN type
if self.params.rnn_type == RnnType.GRU:
    final_state = self.hidden.view(self.params.num_layers, self.num_directions, batch_size, self.params.rnn_hidden_dim)[-1]
elif self.params.rnn_type == RnnType.LSTM:
    final_state = self.hidden[0].view(self.params.num_layers, self.num_directions, batch_size, self.params.rnn_hidden_dim)[-1]

# Handle directions
final_hidden_state = None
if self.num_directions == 1:    # RNN is unidirectional
    final_hidden_state = final_state.squeeze(0)
elif self.num_directions == 2:  # RNN is bidirectional
    h_1, h_2 = final_state[0], final_state[1]
    # final_hidden_state = h_1 + h_2               # Add both states (requires changes to the input size of first linear layer + attention layer)
    final_hidden_state = torch.cat((h_1, h_2), 1)  # Concatenate both states

(taken from my own code)
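As a sanity check on the reshaping used above, here is a small standalone sketch (the dimensions are arbitrary and the variable names are chosen for illustration only). It assumes a bidirectional `nn.GRU` with `batch_first=True` and confirms that the last layer's forward state matches the final time step of the output, while the backward state matches the first time step. It then applies the other merge modes to the two states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_layers, batch_size, seq_len, input_dim, hidden_dim = 2, 3, 5, 8, 6

gru = nn.GRU(input_dim, hidden_dim, num_layers=num_layers,
             batch_first=True, bidirectional=True)
X = torch.randn(batch_size, seq_len, input_dim)
rnn_output, hidden = gru(X)

# hidden: (num_layers * 2, batch, hidden_dim) -> separate layers and directions, keep the last layer
final_state = hidden.view(num_layers, 2, batch_size, hidden_dim)[-1]
h_fwd, h_bwd = final_state[0], final_state[1]

# The forward direction finishes at the last time step, the backward one at the first.
assert torch.allclose(h_fwd, rnn_output[:, -1, :hidden_dim])
assert torch.allclose(h_bwd, rnn_output[:, 0, hidden_dim:])

# The merge modes applied to the two final states
merged = {
    "concat": torch.cat((h_fwd, h_bwd), dim=1),  # (batch, 2 * hidden_dim)
    "sum": h_fwd + h_bwd,                        # (batch, hidden_dim)
    "mul": h_fwd * h_bwd,                        # (batch, hidden_dim)
    "ave": (h_fwd + h_bwd) / 2,                  # (batch, hidden_dim)
}
for name, tensor in merged.items():
    print(name, tuple(tensor.shape))
```

Note that every mode except concatenation halves the feature size, which is why the layers that consume the state need a different input size.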

Dear Chris,

Thank you for the reply.
To the best of my understanding from debugging the Keras code, the merge operation is performed after every layer in a multi-layer bidirectional LSTM/GRU model in Keras, and I am trying to replicate the same behavior in PyTorch.

Also, thank you for the code snippet. PyTorch only supports ‘concat’ as the merge mode (the default). Wouldn’t it be good to make the merge mode configurable, so the programmer could pick any of {‘sum’, ‘mul’, ‘concat’, ‘ave’, None}? My concern is not how to implement these merge modes, but rather their execution time.
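One way to put numbers on the execution-time question is a rough timing harness like the sketch below. The helper `time_forward`, the layer sizes, and the iteration count are all arbitrary choices for illustration. It compares the built-in two-layer bidirectional `nn.LSTM` (concatenation between layers) against two single-layer LSTMs with a manual sum in between. It times the forward pass only and makes no claim about which variant is faster on a given machine.

```python
import time

import torch
import torch.nn as nn


def time_forward(fn, n_iters=20):
    """Average wall-clock seconds per call of fn(), measured after one warm-up call."""
    with torch.no_grad():
        fn()  # warm-up
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for queued GPU work before starting the clock
        start = time.perf_counter()
        for _ in range(n_iters):
            fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters


device = "cuda" if torch.cuda.is_available() else "cpu"
hidden = 128
x = torch.randn(32, 50, 64, device=device)  # (batch, seq_len, features)

# Built-in: two bidirectional layers, directions concatenated between layers
builtin = nn.LSTM(64, hidden, num_layers=2, batch_first=True, bidirectional=True).to(device)

# Manual: two single-layer bidirectional LSTMs with a 'sum' merge after each
layer1 = nn.LSTM(64, hidden, batch_first=True, bidirectional=True).to(device)
layer2 = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True).to(device)


def manual_sum():
    out, _ = layer1(x)
    out = out[..., :hidden] + out[..., hidden:]
    out, _ = layer2(out)
    return out[..., :hidden] + out[..., hidden:]


t_builtin = time_forward(lambda: builtin(x))
t_manual = time_forward(manual_sum)
print(f"built-in concat: {t_builtin * 1e3:.2f} ms, manual sum: {t_manual * 1e3:.2f} ms")
```

The two models do not have the same parameter count (the second layer's input is half as wide under the sum merge), so the comparison mixes the cost of the merge itself with the change in layer size.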