Strange latency behavior

Hi,
I am trying to use QAT to speed up a segmentation model on CPU.
The preparation, training, and conversion to a quantized model all seem to work fine: negligible drop in accuracy and a ~4x reduction in model size.
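
For context, my QAT flow is essentially the standard eager-mode recipe. Here is a minimal sketch, assuming a model wrapper that already defines the QuantStub/DeQuantStub placement and a fuse_model() method (as the torchvision quantizable models do); the training loop is elided:

```python
import torch
import torch.quantization as tq

model.train()
model.fuse_model()                                     # fuse Conv+BN(+ReLU) modules
model.qconfig = tq.get_default_qat_qconfig('fbgemm')   # x86 backend
tq.prepare_qat(model, inplace=True)

# ... fine-tune the model with fake-quant observers enabled ...

model.eval()
quantized_model = tq.convert(model)
```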

However, I am getting some strange latency measurements with the quantized model: for larger images, inference takes more time than with the original model.
Here are a few numbers for a MobileNetV3 large with dilation and reduced tail (see Everything you need to know about TorchVision’s MobileNetV3 implementation | PyTorch) with the LR-ASPP head on top for segmentation:

  • Fused model CPU latency:
    • 256x256: 76 ms
    • 512x512: 206 ms
    • 1024x1024: 706 ms
  • Quantized model CPU latency:
    • 256x256: 53 ms
    • 512x512: 211 ms
    • 1024x1024: 849 ms

These numbers were obtained with torch.set_num_threads(4) on a Ryzen 7 3700X.
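
For reference, this is roughly how I measure latency (a minimal sketch: batch size 1, random input, average over a few runs after warm-up; `fused_model` and `quantized_model` are the models described above):

```python
import time
import torch

torch.set_num_threads(4)

def measure_latency(model, size, warmup=5, runs=20):
    """Average wall-clock time (ms) of a forward pass on a single random image."""
    model.eval()
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        for _ in range(warmup):
            model(x)                      # warm-up runs are discarded
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000

for size in (256, 512, 1024):
    print(size, measure_latency(fused_model, size), measure_latency(quantized_model, size))
```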

For some reason, at higher resolutions, the model is slower with quantization. I am also using torchvision’s implementation of quantizable MobileNetV3 (vision/mobilenetv3.py at master · pytorch/vision · GitHub).

Any idea where this could come from?

After some investigation, it seems that the culprit here is dilation.

When removing dilation from MobileNetV3 (it is used in the last 3 blocks), the latency drops significantly. Here are the latency measurements:

  • Fused model CPU latency:
    • 256x256: 62 ms
    • 512x512: 148 ms
    • 1024x1024: 494 ms
  • Quantized model CPU latency:
    • 256x256: 5 ms
    • 512x512: 16 ms
    • 1024x1024: 59 ms

Evaluating a simple Conv(3, 64, kernel_size=5, stride=2) → BN → ReLU on 512x512 inputs, we get the following profiles (a sketch of the profiling setup follows the tables):

  • Fused model without dilation:
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                    aten::conv2d         0.10%       8.000us        70.95%       5.660ms       5.660ms             1  
               aten::convolution         0.15%      12.000us        70.85%       5.652ms       5.652ms             1  
              aten::_convolution         0.15%      12.000us        70.70%       5.640ms       5.640ms             1  
        aten::mkldnn_convolution        70.40%       5.616ms        70.55%       5.628ms       5.628ms             1  
                aten::batch_norm         0.13%      10.000us        23.15%       1.847ms       1.847ms             1  
    aten::_batch_norm_impl_index         0.11%       9.000us        23.03%       1.837ms       1.837ms             1  
         aten::native_batch_norm        22.74%       1.814ms        22.90%       1.827ms       1.827ms             1  
                     aten::relu_         0.20%      16.000us         5.89%     470.000us     470.000us             1  
                aten::threshold_         5.69%     454.000us         5.69%     454.000us     454.000us             1  
                     aten::empty         0.19%      15.000us         0.19%      15.000us       3.000us             5  
                aten::empty_like         0.11%       9.000us         0.16%      13.000us       4.333us             3  
               aten::as_strided_         0.03%       2.000us         0.03%       2.000us       2.000us             1  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 7.977ms

  • Quantized model without dilation:
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          forward         0.66%      24.000us       100.00%       3.658ms       3.658ms             1  
           quantized::conv2d_relu        62.41%       2.283ms        76.85%       2.811ms       2.811ms             1  
                 aten::dequantize        18.92%     692.000us        18.94%     693.000us     693.000us             1  
                 aten::contiguous         0.16%       6.000us        14.27%     522.000us     522.000us             1  
                      aten::copy_        13.42%     491.000us        13.48%     493.000us     493.000us             1  
        aten::quantize_per_tensor         3.14%     115.000us         3.14%     115.000us     115.000us             1  
                 aten::empty_like         0.33%      12.000us         0.63%      23.000us      23.000us             1  
                       aten::item         0.19%       7.000us         0.41%      15.000us       7.500us             2  
        aten::_local_scalar_dense         0.22%       8.000us         0.22%       8.000us       4.000us             2  
                    aten::qscheme         0.16%       6.000us         0.16%       6.000us       2.000us             3  
    aten::_empty_affine_quantized         0.14%       5.000us         0.14%       5.000us       2.500us             2  
                    aten::q_scale         0.11%       4.000us         0.11%       4.000us       2.000us             2  
               aten::q_zero_point         0.08%       3.000us         0.08%       3.000us       1.500us             2  
                      aten::empty         0.05%       2.000us         0.05%       2.000us       1.000us             2  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.658ms

  • Fused model with dilation:
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                    aten::conv2d         0.08%       9.000us        76.87%       8.417ms       8.417ms             1  
               aten::convolution         0.07%       8.000us        76.79%       8.408ms       8.408ms             1  
              aten::_convolution         0.11%      12.000us        76.72%       8.400ms       8.400ms             1  
        aten::mkldnn_convolution        76.53%       8.379ms        76.61%       8.388ms       8.388ms             1  
                aten::batch_norm         0.07%       8.000us        16.21%       1.775ms       1.775ms             1  
    aten::_batch_norm_impl_index         0.08%       9.000us        16.14%       1.767ms       1.767ms             1  
         aten::native_batch_norm        15.94%       1.745ms        16.04%       1.756ms       1.756ms             1  
                     aten::relu_         0.16%      18.000us         6.91%     757.000us     757.000us             1  
                aten::threshold_         6.75%     739.000us         6.75%     739.000us     739.000us             1  
                     aten::empty         0.11%      12.000us         0.11%      12.000us       2.400us             5  
                aten::empty_like         0.07%       8.000us         0.10%      11.000us       3.667us             3  
               aten::as_strided_         0.02%       2.000us         0.02%       2.000us       2.000us             1  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 10.949ms
  • Quantized model with dilation:
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                          forward         0.24%      24.000us       100.00%       9.854ms       9.854ms             1  
           quantized::conv2d_relu        79.02%       7.787ms        86.20%       8.494ms       8.494ms             1  
                 aten::dequantize        12.05%       1.187ms        12.17%       1.199ms       1.199ms             1  
                 aten::contiguous         0.07%       7.000us         7.10%     700.000us     700.000us             1  
                      aten::copy_         6.80%     670.000us         6.80%     670.000us     670.000us             1  
        aten::quantize_per_tensor         1.26%     124.000us         1.26%     124.000us     124.000us             1  
                 aten::empty_like         0.13%      13.000us         0.23%      23.000us      23.000us             1  
                       aten::item         0.06%       6.000us         0.13%      13.000us       6.500us             2  
                      aten::empty         0.13%      13.000us         0.13%      13.000us       6.500us             2  
        aten::_local_scalar_dense         0.07%       7.000us         0.07%       7.000us       3.500us             2  
                    aten::qscheme         0.04%       4.000us         0.04%       4.000us       1.333us             3  
               aten::q_zero_point         0.04%       4.000us         0.04%       4.000us       2.000us             2  
                    aten::q_scale         0.04%       4.000us         0.04%       4.000us       2.000us             2  
    aten::_empty_affine_quantized         0.04%       4.000us         0.04%       4.000us       2.000us             2  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 9.854ms
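
For reference, the float profiles above were collected with something along these lines (a minimal sketch: the dilation and padding values here are placeholders, and the quantized variants come from wrapping the same block with quant/dequant stubs, fusing it, and converting it with the flow from my first post):

```python
import torch
import torch.nn as nn

torch.set_num_threads(4)

def make_block(dilation):
    # padding keeps the output size comparable between the two settings
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2 * dilation, dilation=dilation),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )

def profile_block(block, size=512):
    block.eval()
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        block(x)                                        # warm-up
        with torch.autograd.profiler.profile() as prof:
            block(x)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

profile_block(make_block(dilation=1))   # "without dilation"
profile_block(make_block(dilation=2))   # "with dilation"
```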

From this data we can observe two things:

  • Convolutions with 5x5 kernels are much slower on CPU when dilation is used
  • Quantized convolutions with 5x5 dilated kernels take an even larger performance hit.

All of this was tested with PyTorch 1.8.1.

Hi,
I’m facing the exact same problem. It seems that it is the combination of depthwise convolution and dilation that makes the quantized model much slower; a normal convolution with dilation is fine.

I tested on a single convolution block with in_channels = out_channels = 96, kernel size = 5, and dilation = 5 (a sketch of the comparison follows the numbers):

  • Convolution block before quantization: ~358 ms
  • Convolution block after quantization: ~64 ms
  • Depthwise separable convolution block before quantization: ~154 ms
  • Depthwise separable convolution block after quantization: ~447 ms
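
This is roughly the kind of comparison I ran, as a sketch using post-training static quantization; the padding, input resolution, and thread count here are assumptions, and groups=96 gives the depthwise case:

```python
import time
import torch
import torch.nn as nn
import torch.quantization as tq

torch.set_num_threads(4)

class ConvBlock(nn.Module):
    """A single conv wrapped with quant/dequant stubs so it can be statically quantized."""
    def __init__(self, groups):
        super().__init__()
        self.quant = tq.QuantStub()
        # kernel 5 with dilation 5 -> effective kernel 21, so padding 10 keeps the spatial size
        self.conv = nn.Conv2d(96, 96, kernel_size=5, dilation=5, padding=10, groups=groups)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

def quantize(block, example):
    block.eval()
    block.qconfig = tq.get_default_qconfig('fbgemm')
    tq.prepare(block, inplace=True)
    block(example)                      # calibration pass
    return tq.convert(block)

def timeit(block, example, runs=10):
    with torch.no_grad():
        block(example)                  # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            block(example)
    return (time.perf_counter() - start) / runs * 1000   # ms

x = torch.randn(1, 96, 512, 512)
for groups in (1, 96):                  # normal vs. depthwise convolution
    float_block = ConvBlock(groups).eval()
    quant_block = quantize(ConvBlock(groups), x)
    print(groups, timeit(float_block, x), timeit(quant_block, x))
```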

Indeed, it seems it’s worse when the number of groups is larger than 1.
I have opened an issue on GitHub: Quantized conv2d with dilation and groups much slower than float32 · Issue #59730 · pytorch/pytorch · GitHub