Trying to understand behavior of Bidirectional LSTM

Okay, I figured out what I was doing wrong. The first dimension is the sequence length and the second is the batch size. I had them reversed, so what I interpreted as the hidden state at each timestep was actually the hidden state for the 10 different sequences in the batch.
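For anyone hitting the same confusion, here is a minimal sketch (with made-up sizes) showing `nn.LSTM`'s default seq-first layout and the `batch_first=True` alternative:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration
seq_len, batch_size, input_size, hidden_size = 5, 10, 8, 16

# By default nn.LSTM expects input shaped (seq_len, batch, input_size)
lstm = nn.LSTM(input_size, hidden_size, bidirectional=True)
x = torch.randn(seq_len, batch_size, input_size)

output, (h_n, c_n) = lstm(x)

# output: hidden state at every timestep, both directions concatenated
print(output.shape)  # torch.Size([5, 10, 32]) -> (seq_len, batch, 2 * hidden_size)

# h_n: final hidden state for each direction
print(h_n.shape)     # torch.Size([2, 10, 16]) -> (num_directions, batch, hidden_size)

# If you prefer (batch, seq_len, input_size), pass batch_first=True
lstm_bf = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=True)
output_bf, _ = lstm_bf(x.transpose(0, 1))
print(output_bf.shape)  # torch.Size([10, 5, 32]) -> (batch, seq_len, 2 * hidden_size)
```

Note that `h_n` keeps the shape `(num_layers * num_directions, batch, hidden_size)` even with `batch_first=True`; only the input and `output` tensors change layout.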

Can anyone explain the motivation for putting the sequence length in the first dimension? I find it counter-intuitive.

And thank you, @wasiahmad, for your question here: