To replicate a historic approach (Kim, 2014), I am experimenting with `nn.Conv1d` for a sentence-classification task. The embedding size for each word is 300 and the sequence length (i.e. the number of words in a sentence) is 512, so my input sentence is [512, 300]. (Let us omit the batch size for this discussion.) This way I can refer to each word in my input sentence as input[word_index]. A conv layer needs to convolve over the sequence length (say, to get bi-grams or tri-grams), which means the embedding dimension has to be treated as `in_channels`.
Now, after digging through SO answers, it turns out `nn.Conv1d` expects my input to have channels first, i.e. a sentence needs to be [300, 512].
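To make the mismatch concrete, here is a minimal sketch (with made-up filter count and kernel size) showing that a row-major [seq_len, embed_dim] sentence has to be transposed before `nn.Conv1d` will accept it:

```python
import torch
import torch.nn as nn

# A sentence stored row-major, one row per word: [seq_len, embed_dim]
sentence = torch.randn(512, 300)

# Conv1d convolves over the LAST dimension and treats the dimension
# before it as channels, so it wants [in_channels, seq_len] = [300, 512].
conv = nn.Conv1d(in_channels=300, out_channels=100, kernel_size=3)

# Transpose to channels-first and add a batch dim -> [1, 300, 512]
out = conv(sentence.t().unsqueeze(0))
print(out.shape)  # torch.Size([1, 100, 510])
```

Passing `sentence.unsqueeze(0)` directly (i.e. [1, 512, 300]) raises a shape error, because Conv1d would then look for 512 input channels.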
This breaks the row-major convention that is the historic default in C, NumPy, etc.
For the output, as the diagram in Kim’s paper suggests, one might expect sequence length to be the first index and channels (corresponding to your multiple filters) to be the second. Instead, `nn.Conv1d` produces output in [out_channels, sequence_length] format, with the length adjusted for padding and stride.
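For illustration, a quick check of the output layout with a hypothetical bi-gram filter (the length follows the documented formula L_out = (L_in + 2*padding - kernel_size) // stride + 1, here with padding=0, stride=1):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 300, 512)  # [batch, in_channels, seq_len]

conv = nn.Conv1d(in_channels=300, out_channels=100, kernel_size=2)  # bi-gram filter
y = conv(x)

# L_out = 512 - 2 + 1 = 511; channels come first, sequence second.
print(y.shape)  # torch.Size([1, 100, 511])
```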
Why does PyTorch favor channels first?
If it’s a matter of convention, shouldn’t there be an easy way to choose your convention?
If it’s because cuDNN prefers NCHW, can PyTorch not convert the input internally?
The `to` and `contiguous` approach outlined here is convoluted. Why would you want devs to explicitly convert both their input and their models? Could we not have a parameter in the model declaration that decides channels_first or channels_last, with the input processed accordingly? To maintain backwards compatibility, the parameter could default to channels_first.
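For reference, this is roughly what the explicit conversion amounts to today (a sketch, assuming a channels_last input of [batch, seq_len, embed_dim]):

```python
import torch
import torch.nn as nn

# Hypothetical channels_last batch: [batch, seq_len, embed_dim]
x = torch.randn(4, 512, 300)

conv = nn.Conv1d(in_channels=300, out_channels=100, kernel_size=3)

# Permute to channels-first, force a contiguous memory layout,
# convolve, then permute back if the rest of the model is channels_last.
y = conv(x.permute(0, 2, 1).contiguous())  # [4, 100, 510]
y = y.permute(0, 2, 1)                     # [4, 510, 100]
```

Two permutes (plus a copy from `.contiguous()`) around every conv layer is exactly the boilerplate a layout parameter could hide.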
P.S. The conv1d documentation would greatly benefit from an explanation of this below the example, and possibly from a note on how to convert a channels_last input while keeping a contiguous memory allocation. If this only needs to be added to the conv1d page, I can raise a PR; if it needs explanation across multiple pages, a longer discussion might be helpful.