Concatenation of the hidden states produced by a Bidirectional LSTM

You’re not mistaken – the shape of the output tensor returned by all of PyTorch’s recurrent units (assuming batch_first=True), when run bidirectionally, is:
(num_examples, seq_len, 2 * hidden_dim)
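
Here’s a minimal, runnable sketch to check that shape (the sizes are made up purely for illustration):

import torch
import torch.nn as nn

# Illustrative sizes -- none of these values come from the original post
batch_size, seq_len, input_dim, hidden_dim = 4, 10, 8, 16

lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
x = torch.randn(batch_size, seq_len, input_dim)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 10, 32]) == (num_examples, seq_len, 2 * hidden_dim)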

The forward and backward outputs are actually concatenated on the last axis, so to merge them (here, by summing), we usually do something like this:
output = output[:, :, :self.hidden_dim] + output[:, :, self.hidden_dim:]  # forward half + backward half

You might also try averaging them instead, i.e., dividing the summed output by 2.
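
In code, reusing the same slicing as above:

output = (output[:, :, :self.hidden_dim] + output[:, :, self.hidden_dim:]) / 2  # element-wise mean of the two directions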

Alternatively (and this is quite common), you can just use the concatenated hidden state as is; I assume you’re using it as a context vector to condition a decoder?
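
If that’s the case, one detail worth knowing: the final hidden state h_n stacks the two directions on its first axis rather than concatenating them on the last one. A minimal sketch for a single-layer bidirectional LSTM (using h_n as returned by the LSTM above):

# h_n has shape (num_layers * 2, num_examples, hidden_dim);
# for one layer, h_n[0] is the last forward state and h_n[1] the last backward state
context = torch.cat([h_n[0], h_n[1]], dim=1)  # (num_examples, 2 * hidden_dim)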

This discussion is handy, and helped me out when I couldn’t quite figure out the documentation on PyTorch’s bidirectional RNNs.

Lastly, if you want a fixed-length summary of the whole sequence of hidden states, you can apply L2 pooling over the time axis. I believe this is the component-wise root-mean-square of all the hidden states: component-wise square the hidden state at each time step, average them all together (i.e., sum over time and divide by the sequence length), and take the component-wise square root of the result.
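
A minimal sketch of that, assuming output is the full (num_examples, seq_len, 2 * hidden_dim) tensor from above:

# L2 pooling over time: square, mean over the seq_len axis, then square root
pooled = output.pow(2).mean(dim=1).sqrt()  # (num_examples, 2 * hidden_dim)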