Conv1d layer performance

Hi, I was working with the Conv1d layer and noticed a weird inference speed degradation when comparing two ways of propagating the input through this layer. Let's say we have:

import numpy as np
import torch
import torch.nn as nn

# conv_1: single input channel, kernel spans the 300-element last dimension of a 4-D input
conv_1 = nn.Conv1d(in_channels=1, out_channels=20, kernel_size=(1, 300))
conv_1.weight.data.fill_(0.01)
conv_1.bias.data.fill_(0.01)

# conv_2: 300 input channels, pointwise kernel over the length dimension
conv_2 = nn.Conv1d(in_channels=300, out_channels=20, kernel_size=1)
conv_2.weight.data.fill_(0.01)
conv_2.bias.data.fill_(0.01)

x1 = torch.FloatTensor(np.ones((10, 1, 100000, 300)))
out1 = conv_1(x1).squeeze(3)

x2 = torch.FloatTensor(np.ones((10, 300, 100000)))
out2 = conv_2(x2)

torch.allclose(out1, out2, atol=1e-6)

>>> True

Then I tried to measure the inference speed of conv_1 and conv_2 and got the following results:



Can someone please explain this almost 2x performance degradation, and is this issue reproducible for others?
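
For reference, a minimal sketch of the kind of timing loop involved (the exact measurement code isn't shown in the post; this simply times repeated forward passes with time.perf_counter and is only an approximation of whatever measurement was actually used):

import time

import numpy as np
import torch
import torch.nn as nn

def bench(conv, x, n_iters=10):
    # one warm-up pass so one-time initialization doesn't skew the measurement
    with torch.no_grad():
        conv(x)
        start = time.perf_counter()
        for _ in range(n_iters):
            conv(x)
    return (time.perf_counter() - start) / n_iters

conv_1 = nn.Conv1d(in_channels=1, out_channels=20, kernel_size=(1, 300))
conv_2 = nn.Conv1d(in_channels=300, out_channels=20, kernel_size=1)

x1 = torch.FloatTensor(np.ones((10, 1, 100000, 300)))
x2 = torch.FloatTensor(np.ones((10, 300, 100000)))

print("conv_1:", bench(conv_1, x1), "s/iter")
print("conv_2:", bench(conv_2, x2), "s/iter")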

Config:
PyTorch==1.6.0 via pip
Operating System: Ubuntu 18.04.5 LTS
Kernel: Linux 4.15.0-123-generic
CPU: Intel(R) Core™ i5-7200U @ 2.50GHz

Your input tensors are laid out differently in memory (the 300-element vectors are either contiguous or scattered), so different strategies may be used to obtain the result: in the first case the mkldnn library does the inner loop, while in the second case AVX may be unusable.
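
To make the layout point concrete, a quick way to see the difference is to inspect the strides of the two inputs (illustration only; this doesn't show which convolution kernel actually gets selected):

import numpy as np
import torch

x1 = torch.FloatTensor(np.ones((10, 1, 100000, 300)))
x2 = torch.FloatTensor(np.ones((10, 300, 100000)))

# In x1 the 300 values combined into one output element are adjacent in memory.
print(x1.stride())  # (30000000, 30000000, 300, 1)

# In x2 the 300 channel values for a given position are 100000 floats apart.
print(x2.stride())  # (30000000, 100000, 1)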

I still don't understand; this behavior seems weird. The way Conv1d is "supposed" to be used is the second one, where we process multichannel 1-D inputs; that is how the documentation proposes to use Conv1d. Until recently I didn't even know that Conv1d could handle 4-D inputs. So why is the "correct" way 2 times slower, or is it not the "correct" way and I'm missing something?

  1. You shouldn't see such a difference on CUDA.
  2. conv_2 is faster for me (1.8.0a0 with OMP/MKL threading disabled). You may also see a different picture if you change 100000 -> 100 (see the sketch after this list).
  3. I've just seen a related PR: https://github.com/pytorch/pytorch/pull/48885
  4. In general, performance and the best approach may vary a lot depending on shapes.
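
A rough way to check points 2 and 4 locally (just a sketch; torch.set_num_threads(1) is used as an approximation of "OMP/MKL threading disabled", and the conv_1 form with a 2-D kernel_size relied on the behavior of the PyTorch version in this thread, so newer releases may reject it):

import timeit

import torch
import torch.nn as nn

torch.set_num_threads(1)  # approximate single-threaded execution

for length in (100, 100000):
    conv_1 = nn.Conv1d(in_channels=1, out_channels=20, kernel_size=(1, 300))
    conv_2 = nn.Conv1d(in_channels=300, out_channels=20, kernel_size=1)
    x1 = torch.ones(10, 1, length, 300)
    x2 = torch.ones(10, 300, length)
    with torch.no_grad():
        t1 = timeit.timeit(lambda: conv_1(x1), number=10) / 10
        t2 = timeit.timeit(lambda: conv_2(x2), number=10) / 10
    print(f"length={length}: conv_1 {t1:.4f}s/iter, conv_2 {t2:.4f}s/iter")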

Thanks for the answers, your comments are helpful.