Inconsistent dimension ordering for 1D networks - NCL vs NLC vs LNC

Dimension ordering seems to be inconsistent for 1D networks (used in Natural Language Processing and some other signal processing tasks). Combining embeddings, convolutions, and recurrent networks requires multiple dimension permutation operations throughout the code. This makes the code less readable and more error-prone, and it prevents using Sequential to combine layers.


That is, for:

  • N - batch size / sample size
  • L - sequence length
  • C - the number of features / channels / filters

we get:

(N, C, L)

  • Conv1d, MaxPool1d, BatchNorm1d, etc

(N, L, C)

  • LSTM, GRU with batch_first=True
  • Embedding (output)
  • Linear (assuming we typically mix channels; vide 1x1 convolution)

(L, N, C)

  • LSTM, GRU with default options

(N, *)

  • DataLoader
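To make the issue concrete, here is a minimal sketch of an Embedding → Conv1d → LSTM → Linear stack with the permutes it needs. All sizes and variable names are hypothetical, chosen only for illustration, and the LSTM uses batch_first=True to keep the number of reorderings down.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
N, L = 4, 10
vocab_size, emb_dim, conv_channels, hidden = 100, 16, 32, 8

embedding = nn.Embedding(vocab_size, emb_dim)
conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
linear = nn.Linear(hidden, 2)

tokens = torch.randint(0, vocab_size, (N, L))  # (N, L) integer ids
x = embedding(tokens)                          # (N, L, C)
x = x.permute(0, 2, 1)                         # (N, C, L), as Conv1d expects
x = conv(x)                                    # (N, C', L)
x = x.permute(0, 2, 1)                         # (N, L, C'), as the LSTM expects
out, (h_n, c_n) = lstm(x)                      # out: (N, L, H)
logits = linear(out)                           # (N, L, 2), Linear acts on the last dim

print(x.shape, out.shape, logits.shape)
```

The two permute calls are exactly the kind of glue that does not fit into a plain Sequential without wrapping them in small helper modules.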


  • Why do these layers use different orderings?
  • At least, is there some canonical PyTorch dimension ordering?
  • Which permute operations affect performance?

With batch_first=False, regardless of whether we use the output or the hidden units, we need to run x = x.transpose(0, 1).contiguous() to pass it to linear operators. Does the speed-up from using these options outweigh the slowdown from reordering dimensions?
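A small sketch of what that reordering costs, with hypothetical sizes: transpose only returns a view, while contiguous() is the step that copies data. Since nn.Linear operates on the last dimension, one possible alternative (an assumption about the use case, not a general rule) is to apply it to the (L, N, C) tensor first and reorder afterwards.

```python
import torch
import torch.nn as nn

L, N, C = 10, 4, 8                 # hypothetical sizes
x = torch.randn(L, N, C)           # e.g. an RNN output with batch_first=False

xt = x.transpose(0, 1)             # (N, L, C) view, shares storage with x
xc = xt.contiguous()               # (N, L, C) copy with a fresh memory layout

print(xt.is_contiguous())          # the transposed view is not contiguous
print(xt.data_ptr() == x.data_ptr())

linear = nn.Linear(C, 3)
y_apply_then_reorder = linear(x).transpose(0, 1)  # Linear on (L, N, C) directly
y_reorder_then_apply = linear(xc)                 # Linear on the copied (N, L, C)
same = torch.allclose(y_apply_then_reorder, y_reorder_then_apply, atol=1e-5)
print(same)
```

Whether the copy matters in practice depends on tensor sizes and hardware, so it is worth measuring rather than assuming.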


The canonical PyTorch dimension ordering is (N, C, **), where ** stands for the shape dimensions. For a sequence that gives (N, C, L); for an image, (N, C, H, W); and so on.

LSTM and GRU are a little special because it's much more efficient for them to run when the batch dimension isn't first.
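As a rough illustration of the two LSTM layouts (sizes and variable names are hypothetical), the sketch below copies the weights from a default LSTM into a batch_first=True one and compares the results. It only checks shapes and values; it says nothing about speed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, C, H = 4, 10, 8, 6           # hypothetical sizes

lstm_seq_first = nn.LSTM(C, H)                       # expects (L, N, C)
lstm_batch_first = nn.LSTM(C, H, batch_first=True)   # expects (N, L, C)
# Share weights so that both modules compute the same function.
lstm_batch_first.load_state_dict(lstm_seq_first.state_dict())

x = torch.randn(N, L, C)
with torch.no_grad():
    out_sf, (h_sf, _) = lstm_seq_first(x.transpose(0, 1).contiguous())
    out_bf, (h_bf, _) = lstm_batch_first(x)

print(out_sf.shape)   # (L, N, H)
print(out_bf.shape)   # (N, L, H)
print(h_sf.shape, h_bf.shape)  # both (num_layers, N, H); batch_first does not affect h_n
outputs_match = torch.allclose(out_sf.transpose(0, 1), out_bf, atol=1e-5)
```

Note that the final hidden state keeps the same layout in both cases, so h_n[-1] is already (N, H) and can go to a Linear layer without a transpose.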

Vide: Tensor Considered Harmful - a proposal for named tensor dimensions.