Dimension ordering seems to be inconsistent for 1D networks (for Natural Language Processing and some other signal processing). Combining embeddings, convolutions, and recurrent networks requires multiple dimension permutation operations throughout the code. This makes code less readable, more error-prone, and precludes using `Sequential` to combine layers.
That is, for:
- N - batch size / sample size
- L - sequence length
- C - the number of features / channels / filters
we have:
- (N, C, L) for Conv1d
- (N, L, C) for Linear (assuming we typically mix channels; vide 1x1 convolution)
- (L, N, C) for GRU with default options
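For concreteness, here is a minimal sketch of the permutations such a pipeline forces on us (layer sizes are arbitrary, chosen just for illustration):

```python
import torch
import torch.nn as nn

N, L, C = 32, 50, 16  # batch size, sequence length, channels

emb = nn.Embedding(num_embeddings=100, embedding_dim=C)
conv = nn.Conv1d(in_channels=C, out_channels=C, kernel_size=3, padding=1)
gru = nn.GRU(input_size=C, hidden_size=C)  # batch_first=False by default
linear = nn.Linear(C, 10)

tokens = torch.randint(0, 100, (N, L))
x = emb(tokens)                   # (N, L, C) - Embedding output
x = conv(x.transpose(1, 2))       # Conv1d wants (N, C, L)
x = x.permute(2, 0, 1)            # GRU (default) wants (L, N, C)
out, h = gru(x)                   # out: (L, N, C)
y = linear(out.transpose(0, 1))   # Linear mixes the last dim; back to (N, L, 10)
print(y.shape)  # torch.Size([32, 50, 10])
```

Three different orderings, so three reshuffles for four layers, and none of it expressible as a plain `Sequential`.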
- Why different order?
- At least, is there some canonical PyTorch dimension ordering?
- Which permute operations affect performance?
With `batch_first=False`, regardless of whether we use the output or the hidden state, we need to run `x = x.transpose(0, 1).contiguous()` to pass it to linear operators. Does the speed-up from using these default options outweigh the slowdown from reordering dimensions?