I am trying to implement MobileNets in PyTorch and was wondering about its efficiency in this framework.
I found several discussions on this site and issues on GitHub about optimizations for depthwise separable convolutions, and it seems that if I set
groups = num_input_channels, PyTorch will use optimized code to run them.
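For concreteness, here is a minimal sketch of what I mean by the depthwise + pointwise pair (the class and layer names are my own, not from the paper; I'm assuming 3x3 depthwise kernels as in MobileNets):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One MobileNet-style block: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels gives each input channel its own 3x3 filter,
        # which is what should trigger the optimized depthwise path
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # 1x1 convolution mixes the channels (the "dense" GEMM part)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(1, 32, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```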
I was wondering whether there is also optimized code for the 1x1 pointwise convolutions discussed in the paper.
In the paper they claim:
Our model structure puts nearly all of the computation into dense 1 × 1 convolutions. This can be implemented with highly optimized general matrix multiply (GEMM) functions. Often convolutions are implemented by a GEMM but require an initial reordering in memory called im2col in order to map it to a GEMM. For instance, this approach is used in the popular Caffe package. 1 × 1 convolutions do not require this reordering in memory and can be implemented directly with GEMM which is one of the most optimized numerical linear algebra algorithms.
Does PyTorch also use these GEMM functions, and if so, does it skip the memory reordering for 1x1 convolutions? I am trying to judge the added benefit of using these depthwise separable convolutions.
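To make the benefit concrete, the paper's multiply-add cost formulas (Sec. 3) can be evaluated directly; the function names below are my own, but the formulas follow the paper's notation:

```python
# Multiply-add counts per layer, following the MobileNets paper.
# d_k: kernel size, m: input channels, n: output channels,
# d_f: spatial size of the (square) output feature map.

def standard_conv_macs(d_k, m, n, d_f):
    return d_k * d_k * m * n * d_f * d_f

def separable_conv_macs(d_k, m, n, d_f):
    depthwise = d_k * d_k * m * d_f * d_f   # one d_k x d_k filter per channel
    pointwise = m * n * d_f * d_f           # dense 1x1 GEMM
    return depthwise + pointwise

# Example layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map
std = standard_conv_macs(3, 512, 512, 14)
sep = separable_conv_macs(3, 512, 512, 14)
print(sep / std)  # 1/512 + 1/9, roughly 0.113, i.e. ~8-9x fewer multiply-adds
```

This matches the paper's reduction factor of 1/N + 1/D_K^2, so with 3x3 kernels the saving is roughly 8-9x regardless of how fast the backend runs each part.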
Sorry if this is an obvious or stupid question, but I have no experience with the PyTorch backend whatsoever!