Quantized depthwise separable convolution with large kernels is extremely slow

Hi everyone!

I’m currently trying to apply static quantization to several fairly modern vision architectures. It all went reasonably smoothly for EfficientNetV2. However, I hit a brick wall with ConvNeXt, getting results like these:

```
             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
  model_inference         0.54%       4.274ms       100.00%     785.942ms     785.942ms             1
     aten::conv2d         0.01%     110.000us        13.33%     104.731ms       4.761ms            22
```
for fp32 inference vs

```
              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
   model_inference         0.10%      10.904ms       100.00%       10.848s       10.848s             1
 quantized::conv2d        96.42%       10.460s        96.43%       10.460s     475.474ms            22
```
for quantized.

I’m running it on an i7-10875H in single-core mode, because that’s our target mode. The latest PyTorch version I tried was the 1.12.0.dev20220404 nightly.

As far as I’ve seen, depthwise separable conv2d slows down significantly after quantization regardless of the kernel size. However, ConvNeXt uses 7x7 convolutions, which send inference time through the roof. Am I cooking it wrong? Could anybody please point me to how I can fix that?

Could you share the shapes, especially the padding? Is the padding size 3 (so-called "same" padding, which gives the output the same spatial dimensions as the input)? This would help us optimize.
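
For reference, the "same" padding condition can be checked with the standard conv2d output-size formula. This is only a sketch: `conv_out_size` is a hypothetical helper, and stride 1 and dilation 1 are assumptions, since the thread does not state them.

```python
def conv_out_size(in_size, kernel, padding, stride=1, dilation=1):
    """Output length along one spatial dimension (standard conv2d formula)."""
    return (in_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Assumed stride 1 and dilation 1: a 7x7 kernel with padding 3
# keeps the spatial size unchanged, i.e. "same" padding.
for size in (56, 28, 14, 7):  # illustrative spatial sizes
    assert conv_out_size(size, kernel=7, padding=3) == size

# Without padding, the same kernel shrinks each dimension by kernel - 1.
print(conv_out_size(56, kernel=7, padding=0))
```

More generally, with stride 1 and an odd kernel size k, padding (k - 1) / 2 preserves the spatial size.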

Yes, it’s kernel size 7 with padding size 3, and groups equal to the number of channels. If I run the profiler with shapes, it yields the following:


                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                          Input Shapes  

              model_inference         0.09%       9.663ms       100.00%       10.782s       10.782s             1                                                    []  
            quantized::conv2d        41.00%        4.421s        41.00%        4.421s        1.105s             4                         [[5, 96, 56, 56], [], [], []]  
            quantized::conv2d        29.97%        3.231s        29.97%        3.231s     323.097ms            10                        [[5, 384, 14, 14], [], [], []]  
            quantized::conv2d        20.52%        2.213s        20.52%        2.213s     553.156ms             4                        [[5, 192, 28, 28], [], [], []]  
            quantized::conv2d         4.92%     530.464ms         4.92%     530.493ms     176.831ms             3                          [[5, 768, 7, 7], [], [], []]  

Self CPU time total: 10.782s
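
To put these numbers in perspective, here is a rough arithmetic sketch of the cost per multiply-accumulate (MAC). It assumes every call in a row is a 7x7 depthwise convolution with stride 1 and "same" padding. The profile does not confirm that, since other convolutions with the same input shape would land in the same row, so treat the result as an order-of-magnitude estimate. The helper names are made up for illustration.

```python
KERNEL = 7  # 7x7 depthwise kernel, as described in the thread

# (input shape [N, C, H, W], average seconds per call), copied from the profile above.
ROWS = [
    ((5, 96, 56, 56), 1.105),
    ((5, 384, 14, 14), 0.323097),
    ((5, 192, 28, 28), 0.553156),
    ((5, 768, 7, 7), 0.176831),
]


def depthwise_macs(shape, kernel=KERNEL):
    """MACs for one depthwise conv: one kernel x kernel filter per channel.

    Assumes stride 1 and "same" padding, so output H x W equals input H x W.
    """
    n, c, h, w = shape
    return n * c * h * w * kernel * kernel


def ns_per_mac(shape, seconds, kernel=KERNEL):
    """Average nanoseconds spent per MAC under the assumptions above."""
    return seconds * 1e9 / depthwise_macs(shape, kernel)


for shape, seconds in ROWS:
    print(shape, depthwise_macs(shape), round(ns_per_mac(shape, seconds), 1))
```

Under these assumptions, every row works out to roughly 15 to 19 ns per MAC, so the time appears to scale with the amount of arithmetic rather than with a fixed per-call overhead.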

@Jongsoo_Park landed a fix in fbgemm recently: "add 7x7 depthwise" by jspark1105 (Pull Request #1049, pytorch/FBGEMM on GitHub). Assuming your padding is 3, you can wait a bit until the fbgemm module is updated in pytorch and then try building from source again.

Sure, will do! Thank you, @Jongsoo_Park