Why are the default input dimensions of LSTM [sequence_length, batch_size, feature_size]?

I think it would be more natural to use the data in the shape [batch_size, sequence_length, feature_size].

Before I comment on the principle, if your input_data is of shape [batch_size, sequence_length, feature_size], then input_data.permute(1, 0, 2) will transform it into shape [sequence_length, batch_size, feature_size].

I believe permute doesn’t copy the data; it just alters the strides used for the underlying array, so it is very cheap.
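
A minimal sketch illustrating this (the tensor sizes are made up for the example): permute returns a view over the same storage with rearranged strides, so no data is copied, but the result is no longer contiguous.

```python
import torch

# Hypothetical batch-first input: [batch_size, sequence_length, feature_size]
x = torch.randn(32, 100, 8)

# permute returns a view with rearranged strides; no data is copied
y = x.permute(1, 0, 2)                # [sequence_length, batch_size, feature_size]

print(x.stride())                     # (800, 8, 1)
print(y.stride())                     # (8, 800, 1) -- same storage, different strides
print(y.data_ptr() == x.data_ptr())   # True: both tensors share the same memory
print(y.is_contiguous())              # False: the view is not laid out contiguously
```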

I am running some analyses on very long time-series data and wanted to create sequential batches. I found that if my data was of shape [batch_size, sequence_length, feature_size], selecting slices of the form [:, start:end, :] gave me non-contiguous tensors that the model couldn’t use directly. So, to avoid having to copy the tensor just to make it contiguous, I first made sure my data was of shape [sequence_length, batch_size, feature_size], and then it all worked; see the sketch below.
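
A small sketch of the contiguity difference, using made-up sizes: slicing a time window along dim 1 of a batch-first tensor breaks contiguity, while the same window on a sequence-first tensor is a slice along dim 0 and stays contiguous.

```python
import torch

seq_len, batch_size, feat = 10_000, 16, 4   # hypothetical sizes for a long series

# Batch-first layout: a time window is a slice along dim 1 -> non-contiguous
x_bf = torch.randn(batch_size, seq_len, feat)
window_bf = x_bf[:, 0:200, :]
print(window_bf.is_contiguous())             # False

# Sequence-first layout: the same window is a slice along dim 0 -> still contiguous
x_sf = torch.randn(seq_len, batch_size, feat)
window_sf = x_sf[0:200, :, :]
print(window_sf.is_contiguous())             # True
```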

I also saw this with nn.MultiheadAttention. It’s still not clear to me why you would not have the first dimension be the batch size like for nn.Linear.

The layout is chosen for performance reasons, as also mentioned here.
Also, for RNNs you can pass batch_first=True to use batch-first shapes instead.
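
For example, a minimal sketch with batch_first=True (the sizes are arbitrary): the LSTM then accepts and returns batch-first tensors, while the hidden and cell states keep their [num_layers, batch, hidden] layout.

```python
import torch
import torch.nn as nn

# With batch_first=True the LSTM expects [batch_size, sequence_length, feature_size]
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(32, 100, 8)          # [batch, seq, feature]
output, (h_n, c_n) = lstm(x)

print(output.shape)                  # torch.Size([32, 100, 16]) -- also batch-first
print(h_n.shape)                     # torch.Size([1, 32, 16])   -- states stay [layers, batch, hidden]
```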