Dimension ordering seems to be inconsistent for 1D networks (for Natural Language Processing and other 1D signal processing). When combining embeddings, convolutions, and recurrent networks, this requires multiple dimension-permutation operations throughout the code, which makes it less readable, more error-prone, and precludes using `Sequential` to combine layers.
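To illustrate the problem, here is a minimal sketch of a typical `Embedding` → `Conv1d` → `LSTM` pipeline (all sizes are arbitrary, chosen only for illustration). Two permutes are needed just to glue three standard layers together, which is exactly what prevents wrapping them in `Sequential`:

```python
import torch
import torch.nn as nn

# Arbitrary sizes for illustration only.
N, L, V, C = 4, 10, 100, 8  # batch, sequence length, vocab size, channels

emb = nn.Embedding(V, C)                            # outputs (N, L, C)
conv = nn.Conv1d(C, C, kernel_size=3, padding=1)    # expects (N, C, L)
rnn = nn.LSTM(C, C, batch_first=True)               # expects (N, L, C)

x = torch.randint(V, (N, L))
h = emb(x)               # (N, L, C)
h = h.permute(0, 2, 1)   # (N, C, L) for Conv1d
h = conv(h)              # (N, C, L)
h = h.permute(0, 2, 1)   # back to (N, L, C) for the batch_first LSTM
out, _ = rnn(h)          # (N, L, C)
print(out.shape)         # torch.Size([4, 10, 8])
```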
## Examples
That is, for:
- N - batch size / sample size
- L - sequence length
- C - the number of features / channels / filters
we get:
`(N, C, L)`
- `Conv1d`, `MaxPool1d`, `BatchNorm1d`, etc.
`(N, L, C)`
- `LSTM`, `GRU` with `batch_first=True`
- `Embedding` (output)
- `Linear` (assuming we typically mix channels; vide 1x1 convolution)
`(L, N, C)`
- `LSTM`, `GRU` with default options
`(N, *)`
- `DataLoader`
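The orderings listed above can be verified directly with shape checks (a quick sketch; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

N, L, C = 4, 10, 8  # arbitrary batch size, sequence length, channels

# (N, C, L): convolution / pooling / normalization layers
assert nn.Conv1d(C, C, kernel_size=3, padding=1)(torch.randn(N, C, L)).shape == (N, C, L)
assert nn.BatchNorm1d(C)(torch.randn(N, C, L)).shape == (N, C, L)

# (N, L, C): recurrent layers with batch_first=True; Linear acts on the last dim
out, _ = nn.LSTM(C, C, batch_first=True)(torch.randn(N, L, C))
assert out.shape == (N, L, C)
assert nn.Linear(C, C)(torch.randn(N, L, C)).shape == (N, L, C)

# (L, N, C): recurrent layers with default options
out, _ = nn.LSTM(C, C)(torch.randn(L, N, C))
assert out.shape == (L, N, C)
```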
## Questions
- Why the different orders?
- At least, is there a canonical PyTorch dimension ordering?
- Which permute operations affect performance?
With `batch_first=False`, regardless of whether we use the output or the hidden units, we need to run `x = x.transpose(0, 1).contiguous()` to pass the result to linear operators. Does the speed-up from using these options outweigh the slowdown from reordering dimensions?
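On the cost side, note that `transpose`/`permute` only create a view; the copy (and most of the cost) happens at `.contiguous()`. A minimal sketch of the pattern above, with arbitrary sizes:

```python
import torch
import torch.nn as nn

L, N, C = 10, 4, 8               # arbitrary sequence length, batch, channels
rnn = nn.LSTM(C, C)              # default: batch_first=False, i.e. (L, N, C)
x = torch.randn(L, N, C)

out, _ = rnn(x)                  # (L, N, C)
y = out.transpose(0, 1)          # (N, L, C) -- a view, no data copied yet
assert y.data_ptr() == out.data_ptr() and not y.is_contiguous()

y = y.contiguous()               # the actual copy (and cost) happens here
head = nn.Linear(C, 2)           # applies to the last dimension
print(head(y).shape)             # torch.Size([4, 10, 2])
```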