Depthwise separable convolutions and pointwise convolutions

I am trying to implement MobileNets in PyTorch, and I am wondering how efficiently they run in this framework.

I found multiple discussions on this site and issues on GitHub about optimisations for depthwise separable convolutions. It seems that if I set `groups` equal to the number of input channels in `nn.Conv2d`, PyTorch will use optimized code to run the depthwise part.
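For reference, this is roughly the block I have in mind. It is only a minimal sketch: the class name `DepthwiseSeparableConv` is my own, and I have left out the BatchNorm and ReLU layers that the paper places after each convolution.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels gives every input channel its own 3x3 filter.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=stride,
            padding=1, groups=in_channels, bias=False,
        )
        # The 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(
            in_channels, out_channels, kernel_size=1, bias=False,
        )

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


block = DepthwiseSeparableConv(32, 64)
x = torch.randn(2, 32, 56, 56)  # (batch, channels, height, width)
y = block(x)
print(y.shape)
```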

Is there also optimized code for the 1x1 pointwise convolutions discussed in the paper?
The authors claim:

Our model structure puts nearly all of the computation into dense 1 × 1 convolutions. This can be implemented with highly optimized general matrix multiply (GEMM) functions. Often convolutions are implemented by a GEMM but require an initial reordering in memory called im2col in order to map it to a GEMM. For instance, this approach is used in the popular Caffe package [15]. 1×1 convolutions do not require this reordering in memory and can be implemented directly with GEMM which is one of the most optimized numerical linear algebra algorithms.
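As I understand the claim, a 1x1 convolution is just a matrix multiply over the channel dimension, so no im2col step is needed. Here is a small sketch I used to check the equivalence numerically. It only shows that the two computations agree mathematically; it says nothing about which kernel PyTorch actually dispatches to internally. The tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, c_in, c_out, h, w = 2, 8, 16, 5, 5
x = torch.randn(n, c_in, h, w)
weight = torch.randn(c_out, c_in, 1, 1)  # 1x1 convolution kernel

# Reference result from the regular convolution routine.
conv_out = F.conv2d(x, weight)

# Same computation as a plain matrix multiply:
# (c_out, c_in) @ (n, c_in, h*w) -> (n, c_out, h*w), broadcast over the batch.
gemm_out = torch.matmul(
    weight.view(c_out, c_in), x.view(n, c_in, h * w)
).view(n, c_out, h, w)

max_diff = (conv_out - gemm_out).abs().max().item()
print(max_diff)
```

The input only needs to be viewed with the spatial dimensions flattened, not copied or reordered, which is the point the paper makes about skipping im2col.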

Does PyTorch also use these GEMM functions and if so, do you skip the memory reordering for 1x1 convolutions? I am trying to judge the added benefits of using these depthwise separable convolutions.
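To make the "added benefits" concrete, here is the theoretical multiply-accumulate count I am comparing against, following the cost formulas in the paper. It is a back-of-the-envelope sketch: the function names are my own, the layer sizes are just an example, and it ignores memory traffic and kernel efficiency, which is exactly the part I am asking about.

```python
def standard_conv_macs(k, c_in, c_out, f):
    """MACs for a k x k standard convolution on an f x f feature map."""
    return k * k * c_in * c_out * f * f


def separable_conv_macs(k, c_in, c_out, f):
    """MACs for a k x k depthwise convolution plus a 1x1 pointwise convolution."""
    depthwise = k * k * c_in * f * f
    pointwise = c_in * c_out * f * f
    return depthwise + pointwise


# Example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
k, c_in, c_out, f = 3, 512, 512, 14
standard = standard_conv_macs(k, c_in, c_out, f)
separable = separable_conv_macs(k, c_in, c_out, f)

# The ratio simplifies to 1/c_out + 1/k**2.
ratio = separable / standard
# Fraction of the separable block's work that sits in the 1x1 convolution.
pointwise_share = (c_in * c_out * f * f) / separable
print(standard, separable, ratio, pointwise_share)
```

On paper almost all of the remaining work is in the pointwise step, so the real-world speedup depends heavily on how well the 1x1 convolutions are handled.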

Sorry if this is an obvious or stupid question, but I have no experience with the PyTorch backend whatsoever!
