Slow inference on quantized MobileNetV3

Hi, I have quantized a MobileNetV3-like network with ‘qnnpack’ for use in an Android app. However, the quantized model is even slower than the original one.
All layers seem to be quantized correctly and the model file size decreased to 1/4 of the original size.

The model has ~2M parameters and the input resolution is 224x224.
Here are some inference time numbers:
Model (without quantization) on Ryzen 3700x: ~50ms
Model (without quantization) on RTX 2070: ~6ms
Model (without quantization) on Huawei Mate 10 lite: ~1s
Model (with quantization) on Huawei Mate 10 lite: ~1.5s

I did not expect that inference would take ~1s on such a model, even without quantization. Is this expected?
Why would a quantized model be slower? Are there any operations/layers/architecture conventions that should be absolutely avoided?

Also, the output of the quantized model is extremely noisy. What could be causing this?
Here is an output example before and after model quantization:

cc @raghuramank100 @supriyar

@singularity thanks for sharing. Is the entire network quantized, or are there some layers running in float? If you can reproduce the behavior on a server (using qnnpack), then you can use the autograd profiler to get an op-level breakdown and see which ops are causing the most slowdown.
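For reference, a minimal sketch of what such a profiling run could look like. The small `nn.Sequential` model is a hypothetical stand-in, not the network from this thread; swap in the real float or quantized model. The engine selection only matters once the profiled model is the quantized one, and it is guarded because not every server build ships qnnpack.

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Hypothetical stand-in for the real network; substitute the actual model.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

# Use the same quantized backend as on mobile, if this build supports it.
if "qnnpack" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "qnnpack"

x = torch.randn(1, 3, 224, 224)  # same input resolution as in the question

with torch.no_grad():
    model(x)  # warm-up run so one-time setup cost is not profiled
    with profiler.profile(record_shapes=True) as prof:
        model(x)

# Average repeated ops and sort by the time spent in each op itself.
report = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15)
print(report)
```

Running this once with the float model and once with the quantized model gives two tables that can be compared op by op.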

It might also be easier to debug the accuracy issue on the server, in case the quantization noise is reproducible there.
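To put a number on the noise while debugging, one option is the signal-to-quantization-noise ratio (SQNR) between the float output and the quantized output. The sketch below is plain Python and only an illustration: `fake_quantize_uint8` is a hypothetical helper that imitates per-tensor affine uint8 quantization, not a PyTorch function, and the activation values are made up. It shows one thing worth checking, namely that a single outlier widens the quantization range and makes every other value coarser.

```python
import math

def fake_quantize_uint8(values):
    """Imitate per-tensor affine uint8 quantization, then dequantize."""
    lo = min(min(values), 0.0)  # the range must contain zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = min(255, max(0, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out

def sqnr_db(reference, approx):
    """SQNR in dB; higher means the approximation is closer to the reference."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)

# Made-up activations in [-1, 1], quantized with and without one outlier.
activations = [i / 100 for i in range(-100, 101)]
clean = sqnr_db(activations, fake_quantize_uint8(activations))
with_outlier = fake_quantize_uint8(activations + [50.0])[:-1]
noisy = sqnr_db(activations, with_outlier)
print(f"SQNR without outlier: {clean:.1f} dB, with outlier: {noisy:.1f} dB")
```

With real models, the same `sqnr_db` function can be applied to flattened outputs (for example via `tensor.flatten().tolist()`) of the float and quantized networks, layer by layer, to see where the error first becomes large.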

Here is the output from the autograd profiler before and after quantization:

Before

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
conv2d                   0.67%            906.899us        0.67%            906.899us        906.899us        1                []                                   
convolution              0.67%            902.559us        0.67%            902.559us        902.559us        1                []                                   
_convolution             0.67%            900.649us        0.67%            900.649us        900.649us        1                []                                   
contiguous               0.00%            1.060us          0.00%            1.060us          1.060us          1                []                                   
contiguous               0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
contiguous               0.00%            0.120us          0.00%            0.120us          0.120us          1                []                                   
mkldnn_convolution       0.66%            885.439us        0.66%            885.439us        885.439us        1                []                                   
conv2d                   0.53%            711.017us        0.53%            711.017us        711.017us        1                []                                   
convolution              0.53%            710.247us        0.53%            710.247us        710.247us        1                []                                   
_convolution             0.52%            704.057us        0.52%            704.057us        704.057us        1                []                                   
contiguous               0.00%            0.290us          0.00%            0.290us          0.290us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
mkldnn_convolution       0.52%            698.567us        0.52%            698.567us        698.567us        1                []                                   
conv2d                   0.29%            389.754us        0.29%            389.754us        389.754us        1                []                                   
convolution              0.29%            389.274us        0.29%            389.274us        389.274us        1                []                                   
_convolution             0.29%            388.564us        0.29%            388.564us        388.564us        1                []                                   
contiguous               0.00%            0.340us          0.00%            0.340us          0.340us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
mkldnn_convolution       0.29%            384.324us        0.29%            384.324us        384.324us        1                []                                   
relu_                    0.02%            29.550us         0.02%            29.550us         29.550us         1                []                                   
conv2d                   0.34%            454.195us        0.34%            454.195us        454.195us        1                []                                   
convolution              0.34%            453.735us        0.34%            453.735us        453.735us        1                []                                   
_convolution             0.34%            453.145us        0.34%            453.145us        453.145us        1                []                                   
contiguous               0.00%            0.240us          0.00%            0.240us          0.240us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
mkldnn_convolution       0.33%            448.975us        0.33%            448.975us        448.975us        1                []                                   
relu_                    0.02%            21.830us         0.02%            21.830us         21.830us         1                []                                   
conv2d                   0.22%            291.363us        0.22%            291.363us        291.363us        1                []                                   
convolution              0.22%            290.863us        0.22%            290.863us        290.863us        1                []                                   
_convolution             0.22%            290.223us        0.22%            290.223us        290.223us        1                []                                   
contiguous               0.00%            0.220us          0.00%            0.220us          0.220us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
mkldnn_convolution       0.21%            280.402us        0.21%            280.402us        280.402us        1                []                                   
adaptive_avg_pool2d      0.04%            60.060us         0.04%            60.060us         60.060us         1                []                                   
contiguous               0.00%            0.250us          0.00%            0.250us          0.250us          1                []                                   
view                     0.00%            5.270us          0.00%            5.270us          5.270us          1                []                                   
mean                     0.03%            44.300us         0.03%            44.300us         44.300us         1                []                                   
view                     0.00%            1.870us          0.00%            1.870us          1.870us          1                []                                   
view                     0.00%            1.690us          0.00%            1.690us          1.690us          1                []                                   
unsigned short           0.00%            5.701us          0.00%            5.701us          5.701us          1                []                                   
matmul                   0.03%            40.580us         0.03%            40.580us         40.580us         1                []                                   
mm                       0.03%            34.970us         0.03%            34.970us         34.970us         1                []                                   
relu_                    0.00%            3.950us          0.00%            3.950us          3.950us          1                []                                   
unsigned short           0.00%            2.340us          0.00%            2.340us          2.340us          1                []                                   
matmul                   0.00%            4.830us          0.00%            4.830us          4.830us          1                []                                   
mm                       0.00%            4.360us          0.00%            4.360us          4.360us          1                []                                   
sigmoid                  0.01%            13.000us         0.01%            13.000us         13.000us         1                []                                   
view                     0.00%            2.561us          0.00%            2.561us          2.561us          1                []                                   
expand_as                0.00%            4.660us          0.00%            4.660us          4.660us          1                []                                   
expand                   0.00%            3.220us          0.00%            3.220us          3.220us          1                []                                   
mul                      0.02%            21.070us         0.02%            21.070us         21.070us         1                []                                   
relu_                    0.01%            8.960us          0.01%            8.960us          8.960us          1                []                                   
conv2d                   0.21%            286.703us        0.21%            286.703us        286.703us        1                []                                   
convolution              0.21%            286.053us        0.21%            286.053us        286.053us        1                []                                   
_convolution             0.21%            285.113us        0.21%            285.113us        285.113us        1                []                                   
contiguous               0.00%            0.230us          0.00%            0.230us          0.230us          1                []                                   
contiguous               0.01%            17.500us         0.01%            17.500us         17.500us         1                []                                   
contiguous               0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
mkldnn_convolution       0.20%            263.112us        0.20%            263.112us        263.112us        1                []                                   
conv2d                   0.33%            443.374us        0.33%            443.374us        443.374us        1                []                                   
convolution              0.33%            442.864us        0.33%            442.864us        442.864us        1                []                                   
_convolution             0.33%            442.304us        0.33%            442.304us        442.304us        1                []                                   
contiguous               0.00%            0.260us          0.00%            0.260us          0.260us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
mkldnn_convolution       0.33%            438.134us        0.33%            438.134us        438.134us        1                []                                   
relu_                    0.02%            27.920us         0.02%            27.920us         27.920us         1                []                                   
conv2d                   0.23%            310.863us        0.23%            310.863us        310.863us        1                []                                   
convolution              0.23%            310.383us        0.23%            310.383us        310.383us        1                []                                   
_convolution             0.23%            309.743us        0.23%            309.743us        309.743us        1                []                                   
contiguous               0.00%            0.230us          0.00%            0.230us          0.230us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
contiguous               0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
mkldnn_convolution       0.23%            305.503us        0.23%            305.503us        305.503us        1                []                                   
relu_                    0.01%            14.660us         0.01%            14.660us         14.660us         1                []                                   
conv2d                   0.19%            261.423us        0.19%            261.423us        261.423us        1                []                                   
convolution              0.19%            260.713us        0.19%            260.713us        260.713us        1                []                                   
_convolution             0.19%            255.493us        0.19%            255.493us        255.493us        1                []                                   
contiguous               0.00%            0.260us          0.00%            0.260us          0.260us          1                []                                   
contiguous               0.00%            0.120us          0.00%            0.120us          0.120us          1                []                                   
contiguous               0.00%            0.120us          0.00%            0.120us          0.120us          1                []                                   
mkldnn_convolution       0.19%            250.603us        0.19%            250.603us        250.603us        1                []                                   
conv2d                   0.20%            263.663us        0.20%            263.663us        263.663us        1                []                                   
convolution              0.20%            263.183us        0.20%            263.183us        263.183us        1                []                                   
_convolution             0.20%            262.683us        0.20%            262.683us        262.683us        1                []                                   
contiguous               0.00%            0.280us          0.00%            0.280us          0.280us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
mkldnn_convolution       0.19%            258.533us        0.19%            258.533us        258.533us        1                []                                   
relu_                    0.01%            13.400us         0.01%            13.400us         13.400us         1                []                                   
conv2d                   0.15%            196.892us        0.15%            196.892us        196.892us        1                []                                   
convolution              0.15%            196.412us        0.15%            196.412us        196.412us        1                []                                   
_convolution             0.15%            195.812us        0.15%            195.812us        195.812us        1                []                                   
contiguous               0.00%            0.240us          0.00%            0.240us          0.240us          1                []                                   
contiguous               0.00%            0.110us          0.00%            0.110us          0.110us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 134.621ms
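A table this long is easier to compare once the rows are summed per op name. The following is a rough sketch that parses the pasted text with the standard library only; the function names are made up for illustration, and the sample rows are copied from the table above. Note that `conv2d`, `convolution`, `_convolution` and `mkldnn_convolution` are nested calls, and in this output "Self CPU total" equals "CPU total" in every row, so adding up all names would count the same convolution several times. Comparing the innermost kernels (for example `mkldnn_convolution` against `quantized::conv2d`) avoids that.

```python
import re
from collections import defaultdict

UNIT_TO_US = {"us": 1.0, "ms": 1_000.0, "s": 1_000_000.0}

def parse_time_us(text):
    """Convert a profiler time such as '906.899us' or '1.183ms' to microseconds."""
    match = re.fullmatch(r"([\d.]+)(us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return float(match.group(1)) * UNIT_TO_US[match.group(2)]

def total_time_by_op(table_text):
    """Sum the 'CPU total' column per op name, skipping header and rule lines."""
    totals = defaultdict(float)
    for line in table_text.splitlines():
        cols = re.split(r"\s{2,}", line.strip())
        # Data rows: name, self %, self time, total %, total time, ...
        if len(cols) < 5 or not re.fullmatch(r"[\d.]+%", cols[1]):
            continue
        totals[cols[0]] += parse_time_us(cols[4])
    return dict(totals)

# A few rows copied from the "Before" table.
SAMPLE = """
Name                 Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  Number of Calls
mkldnn_convolution   0.66%             885.439us       0.66%        885.439us   885.439us     1
relu_                0.02%             29.550us        0.02%        29.550us    29.550us      1
mkldnn_convolution   0.52%             698.567us       0.52%        698.567us   698.567us     1
relu_                0.02%             21.830us        0.02%        21.830us    21.830us      1
"""

totals = total_time_by_op(SAMPLE)
for name, microseconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {microseconds:10.3f} us")
```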

(Output truncated; I hit the character limit…)

After

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
item                         0.01%            7.420us          0.01%            7.420us          7.420us          1                []                                   
_local_scalar_dense          0.01%            4.550us          0.01%            4.550us          4.550us          1                []                                   
aten::Int                    0.00%            1.720us          0.00%            1.720us          1.720us          1                []                                   
item                         0.00%            0.470us          0.00%            0.470us          0.470us          1                []                                   
_local_scalar_dense          0.00%            0.240us          0.00%            0.240us          0.240us          1                []                                   
quantize_per_tensor          0.09%            69.811us         0.09%            69.811us         69.811us         1                []                                   
quantized::conv2d            1.53%            1.183ms          1.53%            1.183ms          1.183ms          1                []                                   
contiguous                   0.22%            167.211us        0.22%            167.211us        167.211us        1                []                                   
empty_like                   0.01%            9.810us          0.01%            9.810us          9.810us          1                []                                   
qscheme                      0.00%            0.920us          0.00%            0.920us          0.920us          1                []                                   
q_zero_point                 0.00%            0.710us          0.00%            0.710us          0.710us          1                []                                   
q_scale                      0.00%            0.740us          0.00%            0.740us          0.740us          1                []                                   
_empty_affine_quantized      0.00%            2.930us          0.00%            2.930us          2.930us          1                []                                   
q_scale                      0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
contiguous                   0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
_empty_affine_quantized      0.00%            1.440us          0.00%            1.440us          1.440us          1                []                                   
quantize_per_tensor          0.01%            5.550us          0.01%            5.550us          5.550us          1                []                                   
_empty_affine_quantized      0.00%            1.320us          0.00%            1.320us          1.320us          1                []                                   
q_zero_point                 0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.130us          0.00%            0.130us          0.130us          1                []                                   
quantized::conv2d            0.34%            266.002us        0.34%            266.002us        266.002us        1                []                                   
contiguous                   0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
contiguous                   0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
quantize_per_tensor          0.01%            4.290us          0.01%            4.290us          4.290us          1                []                                   
_empty_affine_quantized      0.00%            1.180us          0.00%            1.180us          1.180us          1                []                                   
q_zero_point                 0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
q_scale                      0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
quantized::conv2d_relu       1.11%            856.897us        1.11%            856.897us        856.897us        1                []                                   
contiguous                   0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
contiguous                   0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
quantize_per_tensor          0.01%            4.370us          0.01%            4.370us          4.370us          1                []                                   
_empty_affine_quantized      0.00%            1.270us          0.00%            1.270us          1.270us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.130us          0.00%            0.130us          0.130us          1                []                                   
quantized::conv2d_relu       0.49%            378.753us        0.49%            378.753us        378.753us        1                []                                   
contiguous                   0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
q_scale                      0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
contiguous                   0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
_empty_affine_quantized      0.00%            1.290us          0.00%            1.290us          1.290us          1                []                                   
quantize_per_tensor          0.01%            4.700us          0.01%            4.700us          4.700us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
q_scale                      0.00%            0.130us          0.00%            0.130us          0.130us          1                []                                   
quantized::conv2d            0.18%            140.401us        0.18%            140.401us        140.401us        1                []                                   
contiguous                   0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
contiguous                   0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
_empty_affine_quantized      0.00%            1.060us          0.00%            1.060us          1.060us          1                []                                   
quantize_per_tensor          0.01%            3.920us          0.01%            3.920us          3.920us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
size                         0.00%            0.970us          0.00%            0.970us          0.970us          1                []                                   
size                         0.00%            0.190us          0.00%            0.190us          0.190us          1                []                                   
adaptive_avg_pool2d          0.02%            16.650us         0.02%            16.650us         16.650us         1                []                                   
_adaptive_avg_pool2d         0.02%            14.410us         0.02%            14.410us         14.410us         1                []                                   
view                         0.01%            4.641us          0.01%            4.641us          4.641us          1                []                                   
quantized::linear            0.02%            14.480us         0.02%            14.480us         14.480us         1                []                                   
contiguous                   0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
q_scale                      0.00%            0.340us          0.00%            0.340us          0.340us          1                []                                   
_empty_affine_quantized      0.00%            1.010us          0.00%            1.010us          1.010us          1                []                                   
quantize_per_tensor          0.01%            4.510us          0.01%            4.510us          4.510us          1                []                                   
_empty_affine_quantized      0.00%            0.930us          0.00%            0.930us          0.930us          1                []                                   
q_scale                      0.00%            0.220us          0.00%            0.220us          0.220us          1                []                                   
q_zero_point                 0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
relu_                        0.00%            3.630us          0.00%            3.630us          3.630us          1                []                                   
quantized::linear            0.01%            10.470us         0.01%            10.470us         10.470us         1                []                                   
contiguous                   0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
_empty_affine_quantized      0.00%            0.850us          0.00%            0.850us          0.850us          1                []                                   
quantize_per_tensor          0.01%            4.310us          0.01%            4.310us          4.310us          1                []                                   
_empty_affine_quantized      0.00%            0.810us          0.00%            0.810us          0.810us          1                []                                   
q_scale                      0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
sigmoid                      0.01%            7.770us          0.01%            7.770us          7.770us          1                []                                   
view                         0.00%            1.290us          0.00%            1.290us          1.290us          1                []                                   
expand_as                    0.01%            5.330us          0.01%            5.330us          5.330us          1                []                                   
expand                       0.01%            4.190us          0.01%            4.190us          4.190us          1                []                                   
quantized::mul               0.20%            152.521us        0.20%            152.521us        152.521us        1                []                                   
qscheme                      0.00%            0.290us          0.00%            0.290us          0.290us          1                []                                   
qscheme                      0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
qscheme                      0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
_empty_affine_quantized      0.00%            1.380us          0.00%            1.380us          1.380us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
q_zero_point                 0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 77.276ms

Most of the time is spent in the Conv2d/ReLU operations, and those are quantized. So quantization does seem to be working: desktop CPU time drops from 134ms to 77ms.
However, when I run the quantized model on my mobile device (Huawei Mate 10 lite), there are no performance gains. Any ideas?
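
For anyone comparing the two profiler dumps, it can help to sum self CPU time per op name instead of reading the rows one by one. The following is only a rough stdlib sketch: the helper names (`parse_time_us`, `self_time_by_op`) are made up for illustration, and `ROWS` holds just the first three columns of a few rows copied from the quantized table above, not the full output.

```python
from collections import defaultdict

# First three columns (name, self CPU %, self CPU time) of a few rows
# copied from the quantized profiler table above.
ROWS = """\
adaptive_avg_pool2d          0.02%            16.650us
quantized::linear            0.02%            14.480us
quantized::linear            0.01%            10.470us
sigmoid                      0.01%            7.770us
quantized::mul               0.20%            152.521us
"""

UNIT_TO_US = {"us": 1.0, "ms": 1e3, "s": 1e6}


def parse_time_us(text):
    """Convert a profiler time string such as '14.480us' to microseconds."""
    # Check 'us' and 'ms' before the bare 's' suffix.
    for unit in ("us", "ms", "s"):
        if text.endswith(unit):
            return float(text[: -len(unit)]) * UNIT_TO_US[unit]
    raise ValueError(f"unrecognized time value: {text!r}")


def self_time_by_op(table_text):
    """Sum self CPU time per op name, largest first."""
    totals = defaultdict(float)
    for line in table_text.splitlines():
        fields = line.split()
        # Skip separators, headers and anything that is not a data row.
        if len(fields) < 3 or not fields[1].endswith("%"):
            continue
        totals[fields[0]] += parse_time_us(fields[2])
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


totals = self_time_by_op(ROWS)
for name, micros in totals.items():
    print(f"{name:24s} {micros:10.3f} us")
```

Running the same aggregation over the full "before" and "after" tables would show per op where the quantized model gains or loses time.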

Also, I found that the noise is likely caused by this qnnpack bug: https://github.com/pytorch/pytorch/issues/36253. When I train my model for only a few epochs, I can quantize it without any errors. However, when I fully train the model, errors such as “output scale: convolution scale 4.636909 is greater or equal to 1.0” are thrown during quantization, and the quantized model is extremely noisy. Is there a fix for this yet?
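
As far as I understand it, the "convolution scale" in that message is the requantization scale, input_scale * weight_scale / output_scale, which the QNNPACK kernel expects to be below 1.0. A minimal sketch of that check follows. The function names and all scale values are hypothetical and chosen only for illustration; they are not taken from the model in this thread.

```python
def requantization_scale(input_scale, weight_scale, output_scale):
    """Scale that maps the int32 accumulator back to the quantized output."""
    return input_scale * weight_scale / output_scale


def check_conv_scale(input_scale, weight_scale, output_scale):
    """Return (scale, ok), where ok means the scale is below 1.0."""
    scale = requantization_scale(input_scale, weight_scale, output_scale)
    return scale, scale < 1.0


# Hypothetical layer with small weights: the check passes.
small_scale, small_ok = check_conv_scale(0.02, 0.005, 0.05)

# Hypothetical layer with a very wide weight range: the scale reaches
# 1.0 or more, which is the situation the error message describes.
large_scale, large_ok = check_conv_scale(0.02, 12.0, 0.05)

print(f"small weights: scale={small_scale:.4f} ok={small_ok}")
print(f"large weights: scale={large_scale:.4f} ok={large_ok}")
```

If this reading is right, a layer whose weight range grows during full training (or whose output range is very narrow) could trip the check even though the same architecture quantized cleanly after a few epochs.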

Regarding the performance, could you set the number of threads to 1 and see if it is still slower?

Regarding the noise: could you try the PyTorch nightly build? There was a fix for the scale issue, as mentioned here: https://github.com/pytorch/pytorch/issues/33466#issuecomment-627660191

Upgrading to PyTorch Nightly fixed the errors and the output is looking much better now!

Edit: I spoke a bit too soon… there is also something wrong with the calibration. It seems that calibration is actually making the output worse. Here is the output of the quantized model after 1, 10, 100, and 1000 calibration images:


This is what it should look like (output before quantization):
[Figure_1]
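
One possible, unconfirmed explanation for the output degrading as more calibration images are used is a min/max-style observer: every outlier activation widens the observed range, so the quantization step gets coarser for the typical values. The sketch below uses plain Python and made-up activation values. `minmax_qparams` and `fake_quantize` are illustrative helpers, not PyTorch APIs.

```python
def minmax_qparams(values, qmin=0, qmax=255):
    """Affine uint8 parameters from the min/max of the observed values."""
    lo = min(min(values), 0.0)  # the range must contain zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate case: all observed values are zero
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize x to an integer level, clamp it, and dequantize again."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


# Made-up activations: a typical range, then the same range plus one outlier
# that a larger calibration set might contain.
typical = [-1.0, -0.5, 0.1, 0.7, 1.0]
with_outlier = typical + [40.0]

scale_small, zp_small = minmax_qparams(typical)
scale_big, zp_big = minmax_qparams(with_outlier)

x = 0.1
error_small = abs(fake_quantize(x, scale_small, zp_small) - x)
error_big = abs(fake_quantize(x, scale_big, zp_big) - x)

print(f"step without outlier: {scale_small:.5f}, error at x={x}: {error_small:.5f}")
print(f"step with outlier:    {scale_big:.5f}, error at x={x}: {error_big:.5f}")
```

Whether this is what happens here depends on which observers the qconfig uses, which the thread does not state. Comparing the observed min/max after 1 and after 1000 images would confirm or rule it out.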

I have tried setting the number of CPU threads with org.pytorch.PyTorchAndroid.setNumThreads(1); but it makes no difference. I have also tried thread counts of 1, 2, 3, and 4. Is this the correct way to set the thread count?