Hi,
I am trying to use QAT to speed up a segmentation model on CPU.
The preparation, training, and conversion to a quantized model all seem to work fine: negligible drop in accuracy and a ~4x reduction in model size.
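For reference, the eager-mode QAT flow I am using looks roughly like this (a sketch on a toy module, not the actual segmentation model; the module, shapes, and single training step are placeholders):

```python
import torch
import torch.nn as nn
from torch.quantization import (QuantStub, DeQuantStub,
                                get_default_qat_qconfig, prepare_qat, convert)

class TinyModel(nn.Module):
    """Toy stand-in for the real network, wrapped with quant/dequant stubs."""
    def __init__(self):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

# Pick the quantized engine available on this machine (fbgemm on x86).
engine = "fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines else "qnnpack"
torch.backends.quantized.engine = engine

model = TinyModel().train()
model.qconfig = get_default_qat_qconfig(engine)
prepare_qat(model, inplace=True)        # insert fake-quant observers

# Fine-tune with fake quantization active (a single dummy step here).
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = model(torch.randn(2, 3, 32, 32)).sum()
loss.backward()
opt.step()

model.eval()
qmodel = convert(model)                 # real int8 model for CPU inference
```
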
However, I am getting some strange latency measurements with the quantized model, where larger images take more time for inference than with the original model.
Here are a few numbers for a MobileNetV3-Large with dilation and reduced tail (see Everything you need to know about TorchVision’s MobileNetV3 implementation | PyTorch) with the LR-ASPP head on top for segmentation:
- Fused model CPU latency:
  - 256x256: 76 ms
  - 512x512: 206 ms
  - 1024x1024: 706 ms
- Quantized model CPU latency:
  - 256x256: 53 ms
  - 512x512: 211 ms
  - 1024x1024: 849 ms
These numbers were obtained with torch.set_num_threads(4) on a Ryzen 7 3700X.
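The latency numbers were collected along these lines (a sketch; the warm-up and repetition counts here are assumptions, not the exact benchmarking code):

```python
import time
import torch

torch.set_num_threads(4)  # match the measurement setup above

def measure_latency(model, size, warmup=5, runs=20):
    """Median wall-clock latency in ms for one forward pass on CPU."""
    model.eval()
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        for _ in range(warmup):           # discard first runs (lazy init, caches)
            model(x)
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            model(x)
            times.append((time.perf_counter() - start) * 1e3)
    return sorted(times)[len(times) // 2]
```

Each number above would then correspond to a call like `measure_latency(quantized_model, 512)`.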
For some reason, at higher resolutions, the model is slower with quantization. I am also using torchvision’s implementation of quantizable MobileNetV3 (vision/mobilenetv3.py at master · pytorch/vision · GitHub).
Any idea where this could come from?
After some investigation, it seems that the culprit here is dilation.
When removing dilation from MobileNetV3 (it is used in the last 3 blocks), the latency drops significantly. Here are the latency measurements:
- Fused model CPU latency:
  - 256x256: 62 ms
  - 512x512: 148 ms
  - 1024x1024: 494 ms
- Quantized model CPU latency:
  - 256x256: 5 ms
  - 512x512: 16 ms
  - 1024x1024: 59 ms
Evaluating a simple Conv(3, 64, kernel_size=5, stride=2) → BN → ReLU on 512x512 inputs, we get the following profiles:
- Fused model without dilation:
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::conv2d 0.10% 8.000us 70.95% 5.660ms 5.660ms 1
aten::convolution 0.15% 12.000us 70.85% 5.652ms 5.652ms 1
aten::_convolution 0.15% 12.000us 70.70% 5.640ms 5.640ms 1
aten::mkldnn_convolution 70.40% 5.616ms 70.55% 5.628ms 5.628ms 1
aten::batch_norm 0.13% 10.000us 23.15% 1.847ms 1.847ms 1
aten::_batch_norm_impl_index 0.11% 9.000us 23.03% 1.837ms 1.837ms 1
aten::native_batch_norm 22.74% 1.814ms 22.90% 1.827ms 1.827ms 1
aten::relu_ 0.20% 16.000us 5.89% 470.000us 470.000us 1
aten::threshold_ 5.69% 454.000us 5.69% 454.000us 454.000us 1
aten::empty 0.19% 15.000us 0.19% 15.000us 3.000us 5
aten::empty_like 0.11% 9.000us 0.16% 13.000us 4.333us 3
aten::as_strided_ 0.03% 2.000us 0.03% 2.000us 2.000us 1
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 7.977ms
- Quantized model without dilation:
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
forward 0.66% 24.000us 100.00% 3.658ms 3.658ms 1
quantized::conv2d_relu 62.41% 2.283ms 76.85% 2.811ms 2.811ms 1
aten::dequantize 18.92% 692.000us 18.94% 693.000us 693.000us 1
aten::contiguous 0.16% 6.000us 14.27% 522.000us 522.000us 1
aten::copy_ 13.42% 491.000us 13.48% 493.000us 493.000us 1
aten::quantize_per_tensor 3.14% 115.000us 3.14% 115.000us 115.000us 1
aten::empty_like 0.33% 12.000us 0.63% 23.000us 23.000us 1
aten::item 0.19% 7.000us 0.41% 15.000us 7.500us 2
aten::_local_scalar_dense 0.22% 8.000us 0.22% 8.000us 4.000us 2
aten::qscheme 0.16% 6.000us 0.16% 6.000us 2.000us 3
aten::_empty_affine_quantized 0.14% 5.000us 0.14% 5.000us 2.500us 2
aten::q_scale 0.11% 4.000us 0.11% 4.000us 2.000us 2
aten::q_zero_point 0.08% 3.000us 0.08% 3.000us 1.500us 2
aten::empty 0.05% 2.000us 0.05% 2.000us 1.000us 2
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 3.658ms
- Fused model with dilation:
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::conv2d 0.08% 9.000us 76.87% 8.417ms 8.417ms 1
aten::convolution 0.07% 8.000us 76.79% 8.408ms 8.408ms 1
aten::_convolution 0.11% 12.000us 76.72% 8.400ms 8.400ms 1
aten::mkldnn_convolution 76.53% 8.379ms 76.61% 8.388ms 8.388ms 1
aten::batch_norm 0.07% 8.000us 16.21% 1.775ms 1.775ms 1
aten::_batch_norm_impl_index 0.08% 9.000us 16.14% 1.767ms 1.767ms 1
aten::native_batch_norm 15.94% 1.745ms 16.04% 1.756ms 1.756ms 1
aten::relu_ 0.16% 18.000us 6.91% 757.000us 757.000us 1
aten::threshold_ 6.75% 739.000us 6.75% 739.000us 739.000us 1
aten::empty 0.11% 12.000us 0.11% 12.000us 2.400us 5
aten::empty_like 0.07% 8.000us 0.10% 11.000us 3.667us 3
aten::as_strided_ 0.02% 2.000us 0.02% 2.000us 2.000us 1
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 10.949ms
- Quantized model with dilation:
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
forward 0.24% 24.000us 100.00% 9.854ms 9.854ms 1
quantized::conv2d_relu 79.02% 7.787ms 86.20% 8.494ms 8.494ms 1
aten::dequantize 12.05% 1.187ms 12.17% 1.199ms 1.199ms 1
aten::contiguous 0.07% 7.000us 7.10% 700.000us 700.000us 1
aten::copy_ 6.80% 670.000us 6.80% 670.000us 670.000us 1
aten::quantize_per_tensor 1.26% 124.000us 1.26% 124.000us 124.000us 1
aten::empty_like 0.13% 13.000us 0.23% 23.000us 23.000us 1
aten::item 0.06% 6.000us 0.13% 13.000us 6.500us 2
aten::empty 0.13% 13.000us 0.13% 13.000us 6.500us 2
aten::_local_scalar_dense 0.07% 7.000us 0.07% 7.000us 3.500us 2
aten::qscheme 0.04% 4.000us 0.04% 4.000us 1.333us 3
aten::q_zero_point 0.04% 4.000us 0.04% 4.000us 2.000us 2
aten::q_scale 0.04% 4.000us 0.04% 4.000us 2.000us 2
aten::_empty_affine_quantized 0.04% 4.000us 0.04% 4.000us 2.000us 2
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 9.854ms
From this data we can observe two things:
- 5x5 convolutions are much slower with dilation on CPU
- Quantized convolutions with dilated 5x5 kernels take an even larger performance hit.
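The profiles above can be reproduced with a standalone script along these lines (a sketch; the quant/dequant wrapper, the calibration input, and the dilation value of 2, as in dilated MobileNetV3, are assumptions):

```python
import torch
import torch.nn as nn
from torch.quantization import (QuantStub, DeQuantStub, fuse_modules,
                                get_default_qconfig, prepare, convert)

class ConvBNReLU(nn.Module):
    def __init__(self, dilation=1):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        # Conv(3, 64, kernel_size=5, stride=2) as in the profiles above;
        # padding keeps the "same" spatial size before the stride.
        self.conv = nn.Conv2d(3, 64, kernel_size=5, stride=2,
                              padding=2 * dilation, dilation=dilation, bias=False)
        self.bn = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.dequant(self.relu(self.bn(self.conv(self.quant(x)))))

# Pick the quantized engine available on this machine (fbgemm on x86).
engine = "fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines else "qnnpack"
torch.backends.quantized.engine = engine

def quantize(dilation):
    m = ConvBNReLU(dilation).eval()
    fuse_modules(m, [["conv", "bn", "relu"]], inplace=True)
    m.qconfig = get_default_qconfig(engine)
    prepare(m, inplace=True)
    m(torch.randn(1, 3, 512, 512))      # calibration pass for the observers
    return convert(m)

x = torch.randn(1, 3, 512, 512)
for dilation in (1, 2):
    qm = quantize(dilation)
    with torch.autograd.profiler.profile() as prof:
        qm(x)
    print(f"dilation={dilation}")
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```
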
All of this was tested with PyTorch 1.8.1.
Hi,
I’m facing the exact same problem. It seems that it is depthwise convolution combined with dilation that makes quantization much slower; a normal convolution with dilation is fine.
I tested on a single convolution block with in_channels = out_channels = 96, kernel_size = 5, and dilation = 5.
Convolution block before quantization: takes ~358ms;
Convolution block after quantization: takes ~64ms;
Depthwise separable convolution block before quantization: takes ~154ms;
Depthwise separable convolution block after quantization: takes ~447ms;
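This comparison can be reproduced along these lines (a sketch; the input resolution and the eager-mode quantization wrapper are assumptions, while the kernel parameters follow the numbers above: 96 channels, kernel_size = 5, dilation = 5):

```python
import time
import torch
import torch.nn as nn
from torch.quantization import (QuantStub, DeQuantStub,
                                get_default_qconfig, prepare, convert)

class Block(nn.Module):
    """Single conv with quant/dequant stubs; groups=96 makes it depthwise."""
    def __init__(self, groups):
        super().__init__()
        self.quant, self.dequant = QuantStub(), DeQuantStub()
        self.conv = nn.Conv2d(96, 96, kernel_size=5, dilation=5,
                              padding=10, groups=groups)

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))

# Pick the quantized engine available on this machine (fbgemm on x86).
engine = "fbgemm" if "fbgemm" in torch.backends.quantized.supported_engines else "qnnpack"
torch.backends.quantized.engine = engine

def bench(groups, size=128):
    m = Block(groups).eval()
    m.qconfig = get_default_qconfig(engine)
    prepare(m, inplace=True)
    x = torch.randn(1, 96, size, size)
    m(x)                                 # calibration pass for the observers
    qm = convert(m)
    qm(x)                                # warm-up
    start = time.perf_counter()
    qm(x)
    return (time.perf_counter() - start) * 1e3

print(f"normal conv:    {bench(1):6.1f} ms")
print(f"depthwise conv: {bench(96):6.1f} ms")
```
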