I would have expected the values in h5[1,:,:] to differ from the values in h10[1,:5,:]. My understanding of the bidirectional LSTM is that the output of the forward LSTM is fed into the backward LSTM in reverse order. If the sequence that generated h10 is longer than the sequence that generated h5, then the first value read by the backward pass for h10 is not the first value read by the backward pass for h5, so the results should not match. Can somebody explain what I am missing?

Okay, I figured out what I was doing wrong. By default the first dimension is the sequence length and the second dimension is the batch size. I had them reversed, so I read the output as the hidden state at each of 10 timesteps when it was actually the hidden state for 10 different batch elements.
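For anyone who hits the same confusion, here is a minimal sketch of the shapes involved. The sizes (10 timesteps, batch of 3, and so on) are made up for illustration and are not from my original code; the only thing it relies on is `nn.LSTM` with its default `batch_first=False`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not taken from the original post).
seq_len, batch_size, input_size, hidden_size = 10, 3, 4, 6

# batch_first defaults to False, so the input layout is (seq_len, batch, input_size).
lstm = nn.LSTM(input_size, hidden_size, bidirectional=True)
x = torch.randn(seq_len, batch_size, input_size)
output, (h_n, c_n) = lstm(x)

# output holds one vector per timestep: (seq_len, batch, 2 * hidden_size)
print(output.shape)
# h_n holds only the final state per direction: (2, batch, hidden_size)
print(h_n.shape)

# With batch_first=True the batch dimension comes first in input and output.
lstm_bf = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=True)
output_bf, _ = lstm_bf(x.transpose(0, 1))
print(output_bf.shape)  # (batch, seq_len, 2 * hidden_size)
```

If you prefer batch-major tensors, passing `batch_first=True` changes the layout of the input and `output`, but `h_n` and `c_n` keep the batch in the second dimension either way.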

Can anyone explain the motivation for putting the sequence length in the first dimension? I find it counter-intuitive.

And thank you to @wasiahmad for your question, here:

Hey. A bidirectional LSTM doesn’t feed the output of the forward layer into the backward layer. Rather, the two directions traverse the sequence independently, one from start to end and the other from end to start, and each comes up with its own representation.
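One way to check this yourself is the rough sketch below. It assumes a single-layer `nn.LSTM` with made-up sizes, and the variable names (`x10`, `x5`, `bwd_only`) are just for illustration. It copies the reverse-direction weights into a plain unidirectional LSTM, runs that on the flipped input, and compares the result with the backward half of the bidirectional output. It also compares a length-5 prefix against the full length-10 sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 4, 6  # illustrative sizes, not from the thread
lstm = nn.LSTM(input_size, hidden_size, bidirectional=True)

x10 = torch.randn(10, 1, input_size)  # (seq_len, batch, input_size)
x5 = x10[:5]                          # the first 5 timesteps of the same sequence

# A unidirectional LSTM that reuses the weights of the reverse direction.
bwd_only = nn.LSTM(input_size, hidden_size)

with torch.no_grad():
    out10, _ = lstm(x10)
    out5, _ = lstm(x5)

    bwd_only.weight_ih_l0.copy_(lstm.weight_ih_l0_reverse)
    bwd_only.weight_hh_l0.copy_(lstm.weight_hh_l0_reverse)
    bwd_only.bias_ih_l0.copy_(lstm.bias_ih_l0_reverse)
    bwd_only.bias_hh_l0.copy_(lstm.bias_hh_l0_reverse)
    # Feed the raw input in reversed time order, not the forward output.
    out_rev, _ = bwd_only(x10.flip(0))

# The last dimension is [forward | backward], each of width hidden_size.
fwd10, bwd10 = out10[..., :hidden_size], out10[..., hidden_size:]
fwd5, bwd5 = out5[..., :hidden_size], out5[..., hidden_size:]

# The backward half equals an ordinary LSTM run over the reversed raw input.
backward_is_independent = torch.allclose(out_rev.flip(0), bwd10, atol=1e-6)

# The forward states on the shared prefix agree; the backward states do not,
# because the backward pass over x10 has already read timesteps 9..5.
forward_prefix_matches = torch.allclose(fwd10[:5], fwd5, atol=1e-6)
backward_prefix_matches = torch.allclose(bwd10[:5], bwd5, atol=1e-6)

print(backward_is_independent, forward_prefix_matches, backward_prefix_matches)
```

So when the time dimension is indexed correctly, only the forward half of the output matches between a sequence and its prefix. The backward half differs because it starts from a different end of the sequence, not because it reads anything from the forward layer.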