Slow inference on quantized MobileNetV3

Hi, I have quantized a MobileNetV3-like network with the `qnnpack` backend for use in an Android app. However, the quantized model is even slower than the original one.
All layers appear to be quantized correctly, and the model file shrank to 1/4 of the original size.
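For reference, my quantization setup follows the standard eager-mode post-training static quantization workflow, roughly like this (the `TinyNet` module below is a simplified stand-in for the actual network, not the real architecture):

```python
import torch
import torch.nn as nn
import torch.quantization as tq

# Hypothetical toy module standing in for the MobileNetV3-like network.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # float -> int8 boundary
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

# Select the ARM-oriented backend used on Android.
torch.backends.quantized.engine = 'qnnpack'

model = TinyNet().eval()
model.qconfig = tq.get_default_qconfig('qnnpack')
# Fusing conv+relu lets the backend run them as one quantized kernel.
tq.fuse_modules(model, [['conv', 'relu']], inplace=True)
tq.prepare(model, inplace=True)
model(torch.randn(1, 3, 224, 224))   # calibration pass with sample data
tq.convert(model, inplace=True)

out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 8, 224, 224])
```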

The model has ~2M parameters and the input resolution is 224x224.
Here are some inference time numbers:
Model (without quantization) on Ryzen 3700x: ~50ms
Model (without quantization) on RTX 2070: ~6ms
Model (without quantization) on Huawei Mate 10 lite: ~1s
Model (with quantization) on Huawei Mate 10 lite: ~1.5s

I did not expect inference to take ~1s on such a small model, even without quantization. Is this expected?
And why would the quantized model be slower than the float one? Are there any operations, layers, or architecture choices that should absolutely be avoided when quantizing?

Also, the output of the quantized model is extremely noisy. What could be causing this?
Here is an output example before and after model quantization:


cc @raghuramank100 @supriyar

@singularity thanks for sharing. Is the entire network quantized, or are some layers still running in float? If you can reproduce the behavior on a server (using qnnpack), you can use the autograd profiler to get an op-level breakdown and see which ops are causing the most slowdown.

It might also be easier to debug the accuracy issue on the server, in case the quantization noise is reproducible there.
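A minimal way to get that per-op breakdown (the small conv stack here is just a placeholder; run it on your actual quantized and float models):

```python
import torch

# Any model works here; a small conv stack stands in for the real network.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 3, padding=1),
).eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad(), torch.autograd.profiler.profile() as prof:
    model(x)

# Aggregate identical ops and sort by the time spent in the op itself.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

Comparing the sorted tables from the float and quantized runs should show immediately whether the time is going into the quantized conv kernels themselves or into overhead ops like `quantize_per_tensor` and `contiguous`.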

Here is the output from the autograd profiler before and after quantization:

Before

-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                     Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
conv2d                   0.67%            906.899us        0.67%            906.899us        906.899us        1                []                                   
convolution              0.67%            902.559us        0.67%            902.559us        902.559us        1                []                                   
_convolution             0.67%            900.649us        0.67%            900.649us        900.649us        1                []                                   
contiguous               0.00%            1.060us          0.00%            1.060us          1.060us          1                []                                   
contiguous               0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
contiguous               0.00%            0.120us          0.00%            0.120us          0.120us          1                []                                   
mkldnn_convolution       0.66%            885.439us        0.66%            885.439us        885.439us        1                []                                   
conv2d                   0.53%            711.017us        0.53%            711.017us        711.017us        1                []                                   
convolution              0.53%            710.247us        0.53%            710.247us        710.247us        1                []                                   
_convolution             0.52%            704.057us        0.52%            704.057us        704.057us        1                []                                   
contiguous               0.00%            0.290us          0.00%            0.290us          0.290us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
mkldnn_convolution       0.52%            698.567us        0.52%            698.567us        698.567us        1                []                                   
conv2d                   0.29%            389.754us        0.29%            389.754us        389.754us        1                []                                   
convolution              0.29%            389.274us        0.29%            389.274us        389.274us        1                []                                   
_convolution             0.29%            388.564us        0.29%            388.564us        388.564us        1                []                                   
contiguous               0.00%            0.340us          0.00%            0.340us          0.340us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
mkldnn_convolution       0.29%            384.324us        0.29%            384.324us        384.324us        1                []                                   
relu_                    0.02%            29.550us         0.02%            29.550us         29.550us         1                []                                   
conv2d                   0.34%            454.195us        0.34%            454.195us        454.195us        1                []                                   
convolution              0.34%            453.735us        0.34%            453.735us        453.735us        1                []                                   
_convolution             0.34%            453.145us        0.34%            453.145us        453.145us        1                []                                   
contiguous               0.00%            0.240us          0.00%            0.240us          0.240us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
mkldnn_convolution       0.33%            448.975us        0.33%            448.975us        448.975us        1                []                                   
relu_                    0.02%            21.830us         0.02%            21.830us         21.830us         1                []                                   
conv2d                   0.22%            291.363us        0.22%            291.363us        291.363us        1                []                                   
convolution              0.22%            290.863us        0.22%            290.863us        290.863us        1                []                                   
_convolution             0.22%            290.223us        0.22%            290.223us        290.223us        1                []                                   
contiguous               0.00%            0.220us          0.00%            0.220us          0.220us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
mkldnn_convolution       0.21%            280.402us        0.21%            280.402us        280.402us        1                []                                   
adaptive_avg_pool2d      0.04%            60.060us         0.04%            60.060us         60.060us         1                []                                   
contiguous               0.00%            0.250us          0.00%            0.250us          0.250us          1                []                                   
view                     0.00%            5.270us          0.00%            5.270us          5.270us          1                []                                   
mean                     0.03%            44.300us         0.03%            44.300us         44.300us         1                []                                   
view                     0.00%            1.870us          0.00%            1.870us          1.870us          1                []                                   
view                     0.00%            1.690us          0.00%            1.690us          1.690us          1                []                                   
unsigned short           0.00%            5.701us          0.00%            5.701us          5.701us          1                []                                   
matmul                   0.03%            40.580us         0.03%            40.580us         40.580us         1                []                                   
mm                       0.03%            34.970us         0.03%            34.970us         34.970us         1                []                                   
relu_                    0.00%            3.950us          0.00%            3.950us          3.950us          1                []                                   
unsigned short           0.00%            2.340us          0.00%            2.340us          2.340us          1                []                                   
matmul                   0.00%            4.830us          0.00%            4.830us          4.830us          1                []                                   
mm                       0.00%            4.360us          0.00%            4.360us          4.360us          1                []                                   
sigmoid                  0.01%            13.000us         0.01%            13.000us         13.000us         1                []                                   
view                     0.00%            2.561us          0.00%            2.561us          2.561us          1                []                                   
expand_as                0.00%            4.660us          0.00%            4.660us          4.660us          1                []                                   
expand                   0.00%            3.220us          0.00%            3.220us          3.220us          1                []                                   
mul                      0.02%            21.070us         0.02%            21.070us         21.070us         1                []                                   
relu_                    0.01%            8.960us          0.01%            8.960us          8.960us          1                []                                   
conv2d                   0.21%            286.703us        0.21%            286.703us        286.703us        1                []                                   
convolution              0.21%            286.053us        0.21%            286.053us        286.053us        1                []                                   
_convolution             0.21%            285.113us        0.21%            285.113us        285.113us        1                []                                   
contiguous               0.00%            0.230us          0.00%            0.230us          0.230us          1                []                                   
contiguous               0.01%            17.500us         0.01%            17.500us         17.500us         1                []                                   
contiguous               0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
mkldnn_convolution       0.20%            263.112us        0.20%            263.112us        263.112us        1                []                                   
conv2d                   0.33%            443.374us        0.33%            443.374us        443.374us        1                []                                   
convolution              0.33%            442.864us        0.33%            442.864us        442.864us        1                []                                   
_convolution             0.33%            442.304us        0.33%            442.304us        442.304us        1                []                                   
contiguous               0.00%            0.260us          0.00%            0.260us          0.260us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
mkldnn_convolution       0.33%            438.134us        0.33%            438.134us        438.134us        1                []                                   
relu_                    0.02%            27.920us         0.02%            27.920us         27.920us         1                []                                   
conv2d                   0.23%            310.863us        0.23%            310.863us        310.863us        1                []                                   
convolution              0.23%            310.383us        0.23%            310.383us        310.383us        1                []                                   
_convolution             0.23%            309.743us        0.23%            309.743us        309.743us        1                []                                   
contiguous               0.00%            0.230us          0.00%            0.230us          0.230us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
contiguous               0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
mkldnn_convolution       0.23%            305.503us        0.23%            305.503us        305.503us        1                []                                   
relu_                    0.01%            14.660us         0.01%            14.660us         14.660us         1                []                                   
conv2d                   0.19%            261.423us        0.19%            261.423us        261.423us        1                []                                   
convolution              0.19%            260.713us        0.19%            260.713us        260.713us        1                []                                   
_convolution             0.19%            255.493us        0.19%            255.493us        255.493us        1                []                                   
contiguous               0.00%            0.260us          0.00%            0.260us          0.260us          1                []                                   
contiguous               0.00%            0.120us          0.00%            0.120us          0.120us          1                []                                   
contiguous               0.00%            0.120us          0.00%            0.120us          0.120us          1                []                                   
mkldnn_convolution       0.19%            250.603us        0.19%            250.603us        250.603us        1                []                                   
conv2d                   0.20%            263.663us        0.20%            263.663us        263.663us        1                []                                   
convolution              0.20%            263.183us        0.20%            263.183us        263.183us        1                []                                   
_convolution             0.20%            262.683us        0.20%            262.683us        262.683us        1                []                                   
contiguous               0.00%            0.280us          0.00%            0.280us          0.280us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
contiguous               0.00%            0.090us          0.00%            0.090us          0.090us          1                []                                   
mkldnn_convolution       0.19%            258.533us        0.19%            258.533us        258.533us        1                []                                   
relu_                    0.01%            13.400us         0.01%            13.400us         13.400us         1                []                                   
conv2d                   0.15%            196.892us        0.15%            196.892us        196.892us        1                []                                   
convolution              0.15%            196.412us        0.15%            196.412us        196.412us        1                []                                   
_convolution             0.15%            195.812us        0.15%            195.812us        195.812us        1                []                                   
contiguous               0.00%            0.240us          0.00%            0.240us          0.240us          1                []                                   
contiguous               0.00%            0.110us          0.00%            0.110us          0.110us          1                []                                   
contiguous               0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
-----------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 134.621ms

(table truncated — hit the forum character limit…)

After

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
item                         0.01%            7.420us          0.01%            7.420us          7.420us          1                []                                   
_local_scalar_dense          0.01%            4.550us          0.01%            4.550us          4.550us          1                []                                   
aten::Int                    0.00%            1.720us          0.00%            1.720us          1.720us          1                []                                   
item                         0.00%            0.470us          0.00%            0.470us          0.470us          1                []                                   
_local_scalar_dense          0.00%            0.240us          0.00%            0.240us          0.240us          1                []                                   
quantize_per_tensor          0.09%            69.811us         0.09%            69.811us         69.811us         1                []                                   
quantized::conv2d            1.53%            1.183ms          1.53%            1.183ms          1.183ms          1                []                                   
contiguous                   0.22%            167.211us        0.22%            167.211us        167.211us        1                []                                   
empty_like                   0.01%            9.810us          0.01%            9.810us          9.810us          1                []                                   
qscheme                      0.00%            0.920us          0.00%            0.920us          0.920us          1                []                                   
q_zero_point                 0.00%            0.710us          0.00%            0.710us          0.710us          1                []                                   
q_scale                      0.00%            0.740us          0.00%            0.740us          0.740us          1                []                                   
_empty_affine_quantized      0.00%            2.930us          0.00%            2.930us          2.930us          1                []                                   
q_scale                      0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
contiguous                   0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
_empty_affine_quantized      0.00%            1.440us          0.00%            1.440us          1.440us          1                []                                   
quantize_per_tensor          0.01%            5.550us          0.01%            5.550us          5.550us          1                []                                   
_empty_affine_quantized      0.00%            1.320us          0.00%            1.320us          1.320us          1                []                                   
q_zero_point                 0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.130us          0.00%            0.130us          0.130us          1                []                                   
quantized::conv2d            0.34%            266.002us        0.34%            266.002us        266.002us        1                []                                   
contiguous                   0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
contiguous                   0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
quantize_per_tensor          0.01%            4.290us          0.01%            4.290us          4.290us          1                []                                   
_empty_affine_quantized      0.00%            1.180us          0.00%            1.180us          1.180us          1                []                                   
q_zero_point                 0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
q_scale                      0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
quantized::conv2d_relu       1.11%            856.897us        1.11%            856.897us        856.897us        1                []                                   
contiguous                   0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
contiguous                   0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
quantize_per_tensor          0.01%            4.370us          0.01%            4.370us          4.370us          1                []                                   
_empty_affine_quantized      0.00%            1.270us          0.00%            1.270us          1.270us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.130us          0.00%            0.130us          0.130us          1                []                                   
quantized::conv2d_relu       0.49%            378.753us        0.49%            378.753us        378.753us        1                []                                   
contiguous                   0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
q_scale                      0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
contiguous                   0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
_empty_affine_quantized      0.00%            1.290us          0.00%            1.290us          1.290us          1                []                                   
quantize_per_tensor          0.01%            4.700us          0.01%            4.700us          4.700us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_zero_point                 0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
q_scale                      0.00%            0.130us          0.00%            0.130us          0.130us          1                []                                   
quantized::conv2d            0.18%            140.401us        0.18%            140.401us        140.401us        1                []                                   
contiguous                   0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
contiguous                   0.00%            0.100us          0.00%            0.100us          0.100us          1                []                                   
_empty_affine_quantized      0.00%            1.060us          0.00%            1.060us          1.060us          1                []                                   
quantize_per_tensor          0.01%            3.920us          0.01%            3.920us          3.920us          1                []                                   
_empty_affine_quantized      0.00%            1.260us          0.00%            1.260us          1.260us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
size                         0.00%            0.970us          0.00%            0.970us          0.970us          1                []                                   
size                         0.00%            0.190us          0.00%            0.190us          0.190us          1                []                                   
adaptive_avg_pool2d          0.02%            16.650us         0.02%            16.650us         16.650us         1                []                                   
_adaptive_avg_pool2d         0.02%            14.410us         0.02%            14.410us         14.410us         1                []                                   
view                         0.01%            4.641us          0.01%            4.641us          4.641us          1                []                                   
quantized::linear            0.02%            14.480us         0.02%            14.480us         14.480us         1                []                                   
contiguous                   0.00%            0.160us          0.00%            0.160us          0.160us          1                []                                   
q_scale                      0.00%            0.340us          0.00%            0.340us          0.340us          1                []                                   
_empty_affine_quantized      0.00%            1.010us          0.00%            1.010us          1.010us          1                []                                   
quantize_per_tensor          0.01%            4.510us          0.01%            4.510us          4.510us          1                []                                   
_empty_affine_quantized      0.00%            0.930us          0.00%            0.930us          0.930us          1                []                                   
q_scale                      0.00%            0.220us          0.00%            0.220us          0.220us          1                []                                   
q_zero_point                 0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
relu_                        0.00%            3.630us          0.00%            3.630us          3.630us          1                []                                   
quantized::linear            0.01%            10.470us         0.01%            10.470us         10.470us         1                []                                   
contiguous                   0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
q_scale                      0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
_empty_affine_quantized      0.00%            0.850us          0.00%            0.850us          0.850us          1                []                                   
quantize_per_tensor          0.01%            4.310us          0.01%            4.310us          4.310us          1                []                                   
_empty_affine_quantized      0.00%            0.810us          0.00%            0.810us          0.810us          1                []                                   
q_scale                      0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
q_zero_point                 0.00%            0.150us          0.00%            0.150us          0.150us          1                []                                   
sigmoid                      0.01%            7.770us          0.01%            7.770us          7.770us          1                []                                   
view                         0.00%            1.290us          0.00%            1.290us          1.290us          1                []                                   
expand_as                    0.01%            5.330us          0.01%            5.330us          5.330us          1                []                                   
expand                       0.01%            4.190us          0.01%            4.190us          4.190us          1                []                                   
quantized::mul               0.20%            152.521us        0.20%            152.521us        152.521us        1                []                                   
qscheme                      0.00%            0.290us          0.00%            0.290us          0.290us          1                []                                   
qscheme                      0.00%            0.180us          0.00%            0.180us          0.180us          1                []                                   
qscheme                      0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
_empty_affine_quantized      0.00%            1.380us          0.00%            1.380us          1.380us          1                []                                   
q_zero_point                 0.00%            0.170us          0.00%            0.170us          0.170us          1                []                                   
q_scale                      0.00%            0.200us          0.00%            0.200us          0.200us          1                []                                   
q_zero_point                 0.00%            0.140us          0.00%            0.140us          0.140us          1                []                                   
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 77.276ms

Most of the time is spent in Conv2d/ReLU operations, even though those are quantized. So quantization does appear to be working on desktop, since CPU time decreases from 134ms to 77ms.
However, when I run the quantized model on my mobile device (Huawei Mate 10 lite), there are no performance gains. Any ideas?
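For anyone who wants to reproduce this kind of breakdown on a server build, the per-op table above can be generated with the autograd profiler roughly like this (a sketch using a toy stand-in model, not the actual network; it falls back to fbgemm if qnnpack is not compiled into the local build):

```python
import torch

# Toy stand-in for the real network, quantized post-training.
engine = 'qnnpack' if 'qnnpack' in torch.backends.quantized.supported_engines else 'fbgemm'
torch.backends.quantized.engine = engine

model = torch.nn.Sequential(
    torch.quantization.QuantStub(),
    torch.nn.Conv2d(3, 8, 3),
    torch.nn.ReLU(),
    torch.quantization.DeQuantStub(),
).eval()
model.qconfig = torch.quantization.get_default_qconfig(engine)
torch.quantization.prepare(model, inplace=True)
model(torch.rand(1, 3, 224, 224))  # one calibration pass
torch.quantization.convert(model, inplace=True)

# Profile a single forward pass and print the op-level breakdown.
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    model(torch.rand(1, 3, 224, 224))
table = prof.key_averages().table(sort_by='self_cpu_time_total', row_limit=15)
print(table)
```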

Also, I found that the noise is likely caused by this qnnpack bug: https://github.com/pytorch/pytorch/issues/36253. When I train my model for only a few epochs, I can quantize it without any errors. However, when I fully train the model, errors like “output scale: convolution scale 4.636909 is greater or equal to 1.0” are thrown during quantization, and the quantized model's output is extremely noisy. Is there a fix for this yet?

Regarding the performance, could you set the number of threads to 1 and see if it is still slower?
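On the desktop side, pinning the thread count looks roughly like this (a minimal sketch; the Linear layer is just a placeholder for the real model):

```python
import time
import torch

torch.set_num_threads(1)  # mimic the single-threaded comparison

model = torch.nn.Linear(128, 128).eval()  # placeholder for the real model
x = torch.rand(1, 128)
runs = 100
with torch.no_grad():
    t0 = time.time()
    for _ in range(runs):
        model(x)
avg_ms = (time.time() - t0) / runs * 1000
print(f"avg latency: {avg_ms:.3f} ms on {torch.get_num_threads()} thread(s)")
```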

Regarding the noise - Could you try with pytorch nightly build? There was a fix for the scale issue as mentioned here - https://github.com/pytorch/pytorch/issues/33466#issuecomment-627660191

Upgrading to PyTorch Nightly fixed the errors and the output is looking much better now!

Edit: I spoke a bit too soon… there is also something wrong with the calibration. It seems that more calibration actually makes the output worse. Here is the output of the quantized model after 1, 10, 100, and 1000 calibration images:


This is what it should look like:
(output before quantization)

I have tried setting the number of CPU threads with org.pytorch.PyTorchAndroid.setNumThreads(1); but it does not make a difference; values 2, 3, and 4 don't help either. Is this the correct way to set the thread count?

Hi singularity,

Have you solved the quantization performance issue on Android device?

I'm hitting a similar one with MobileNetV3: a performance gain can be obtained on a desktop PC, but not on Android devices.

Thanks.

Hi @supriyar ,
So I have a similar problem with mobilenet_v3: I'm testing the inference time of my float32 and quantized models. The quantized model is significantly slower than the float32 model, both with 'fbgemm' and 'qnnpack', and both on PC and Android:
PC, one thread, fbgemm: 0.011s vs 0.034s (average over 100 runs)
PC, one thread, qnnpack: 0.012s vs 0.035s (average over 100 runs)

What I basically did is:

  • took Duo Li implementation of MobileNetV3
  • added QuantStub at the beginning and DeQuantStub at the end of the model
  • changed all adds, muls, and divs to FloatFunctional for quantized tensor support
  • set model.qconfig to ‘fbgemm’ or ‘qnnpack’
  • prepared model for qat
  • converted model to quantized model
  • compared performance between quantized version and non-quantized
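For reference, the FloatFunctional change looks roughly like this (an illustrative sketch, not the exact code from the list above). In float mode the wrapper simply forwards to the regular op, so numerics are unchanged until the model is converted:

```python
import torch
import torch.nn as nn

class AddBlock(nn.Module):
    """Residual add rewritten for quantization support."""
    def __init__(self):
        super().__init__()
        # FloatFunctional carries the observer needed to quantize the add.
        self.q_add = nn.quantized.FloatFunctional()

    def forward(self, x, y):
        return self.q_add.add(x, y)  # instead of `x + y`

x, y = torch.rand(4), torch.rand(4)
out = AddBlock()(x, y)  # identical to x + y before conversion
```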

The quantized model is ~4x smaller, but inference is significantly slower.
Is there something I'm missing?
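One step that is easy to miss in a recipe like this is module fusion: the eager-mode quantization workflow normally fuses Conv+BN(+ReLU) with torch.quantization.fuse_modules before prepare/prepare_qat, so BatchNorm is folded into the conv instead of running as separate ops around every quantized conv. A minimal sketch (the Sequential and its '0'/'1'/'2' submodule names are illustrative):

```python
import torch
import torch.nn as nn

# A tiny Conv -> BN -> ReLU stack, as found throughout MobileNetV3.
m = nn.Sequential(
    nn.Conv2d(3, 8, 3, bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)
m.eval()  # post-training fusion expects eval mode

# Fold BN (and ReLU) into the conv; the BN/ReLU slots become Identity.
fused = torch.quantization.fuse_modules(m, [['0', '1', '2']])
print(type(fused[0]).__name__)  # fused conv module
print(type(fused[1]).__name__, type(fused[2]).__name__)
```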

to reproduce:
torchvision 0.8.2
pytorch 1.7.1
Windows10

"""
MIT License

Copyright (c) 2019 Duo LI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
"""

import torch
from tqdm import tqdm
import time
import torch.nn as nn
import math


def _make_divisible(v, divisor, min_value=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


##################################################################################
# FLOAT32 ARCHITECTURE
##################################################################################
class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)

    def forward(self, x):
        return self.relu(x + 3) / 6


class h_swish(nn.Module):
    def __init__(self, inplace=True):
        super(h_swish, self).__init__()
        self.sigmoid = h_sigmoid(inplace=inplace)

    def forward(self, x):
        return x * self.sigmoid(x)


class SELayer(nn.Module):
    def __init__(self, channel, reduction=4):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
                nn.Linear(channel, _make_divisible(channel // reduction, 8)),
                nn.ReLU(inplace=True),
                nn.Linear(_make_divisible(channel // reduction, 8), channel),
                h_sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y


def conv_3x3_bn(inp, oup, stride):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        h_swish()
    )


def conv_1x1_bn(inp, oup):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        h_swish()
    )


class InvertedResidual(nn.Module):
    def __init__(self, inp, hidden_dim, oup, kernel_size, stride, use_se, use_hs):
        super(InvertedResidual, self).__init__()
        assert stride in [1, 2]

        self.identity = stride == 1 and inp == oup

        if inp == hidden_dim:
            self.conv = nn.Sequential(
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # Squeeze-and-Excite
                SELayer(hidden_dim) if use_se else nn.Identity(),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
        else:
            self.conv = nn.Sequential(
                # pw
                nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                # Squeeze-and-Excite
                SELayer(hidden_dim) if use_se else nn.Identity(),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )

    def forward(self, x):
        if self.identity:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV3(nn.Module):
    def __init__(self, cfgs, mode, num_classes=1, width_mult=1.):
        super(MobileNetV3, self).__init__()
        # setting of inverted residual blocks
        self.cfgs = cfgs
        assert mode in ['large', 'small']

        # building first layer
        input_channel = _make_divisible(16 * width_mult, 8)
        layers = [conv_3x3_bn(3, input_channel, 2)]
        # building inverted residual blocks
        block = InvertedResidual
        for k, t, c, use_se, use_hs, s in self.cfgs:
            output_channel = _make_divisible(c * width_mult, 8)
            exp_size = _make_divisible(input_channel * t, 8)
            layers.append(block(input_channel, exp_size, output_channel, k, s, use_se, use_hs))
            input_channel = output_channel
        self.features = nn.Sequential(*layers)
        # building last several layers
        self.conv = conv_1x1_bn(input_channel, exp_size)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        output_channel = {'large': 1280, 'small': 1024}
        output_channel = _make_divisible(output_channel[mode] * width_mult, 8) if width_mult > 1.0 else output_channel[mode]
        self.classifier = nn.Sequential(
            nn.Linear(exp_size, output_channel),
            h_swish(),
            nn.Dropout(0.2),
            nn.Linear(output_channel, num_classes),
        )

        self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.conv(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                n = m.weight.size(1)
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()


def mobilenetv3_small(**kwargs):
    """
    Constructs a MobileNetV3-Small model
    """
    return MobileNetV3(cfgs, mode='small', **kwargs)


##################################################################################
# QUANTIZED ARCHITECTURE
##################################################################################
class h_sigmoid_quant(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid_quant, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)
        self.q_add = nn.quantized.FloatFunctional()

    def forward(self, x):
        return self.q_add.mul_scalar(self.relu(self.q_add.add_scalar(x, 3.)), 1/6)
        # return self.relu(x)


class h_swish_quant(nn.Module):
    def __init__(self, inplace=True):
        super(h_swish_quant, self).__init__()
        self.sigmoid = h_sigmoid_quant(inplace=inplace)
        self.q_mul = nn.quantized.FloatFunctional()

    def forward(self, x):
        return self.q_mul.mul(x, self.sigmoid(x))


class SELayerQuant(nn.Module):
    def __init__(self, channel, reduction=4):
        super(SELayerQuant, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
                nn.Linear(channel, _make_divisible(channel // reduction, 8)),
                nn.ReLU(inplace=True),
                nn.Linear(_make_divisible(channel // reduction, 8), channel),
                h_sigmoid_quant()
        )
        self.q_mul = nn.quantized.FloatFunctional()

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return self.q_mul.mul(x, y)


def conv_3x3_bn_quant(inp, oup, stride):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        h_swish_quant()
    )


def conv_1x1_bn_quant(inp, oup):
    return nn.Sequential(
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        h_swish_quant()
    )


class InvertedResidualQuant(nn.Module):
    def __init__(self, inp, hidden_dim, oup, kernel_size, stride, use_se, use_hs):
        super(InvertedResidualQuant, self).__init__()
        assert stride in [1, 2]

        self.identity = stride == 1 and inp == oup
        self.q_add = nn.quantized.FloatFunctional()

        if inp == hidden_dim:
            self.conv = nn.Sequential(
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish_quant() if use_hs else nn.ReLU(inplace=True),
                # Squeeze-and-Excite
                SELayerQuant(hidden_dim) if use_se else nn.Identity(),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
        else:
            self.conv = nn.Sequential(
                # pw
                nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish_quant() if use_hs else nn.ReLU(inplace=True),
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                # Squeeze-and-Excite
                SELayerQuant(hidden_dim) if use_se else nn.Identity(),
                h_swish_quant() if use_hs else nn.ReLU(inplace=True),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )

    def forward(self, x):
        if self.identity:
            return self.q_add.add(x, self.conv(x))
        else:
            return self.conv(x)


class MobileNetV3_quant(nn.Module):
    def __init__(self, cfgs, mode, num_classes=1, width_mult=1.):
        super(MobileNetV3_quant, self).__init__()
        # setting of inverted residual blocks
        self.cfgs = cfgs
        assert mode in ['large', 'small']

        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

        # building first layer
        input_channel = _make_divisible(16 * width_mult, 8)
        layers = [conv_3x3_bn_quant(3, input_channel, 2)]
        # building inverted residual blocks
        block = InvertedResidualQuant
        for k, t, c, use_se, use_hs, s in self.cfgs:
            output_channel = _make_divisible(c * width_mult, 8)
            exp_size = _make_divisible(input_channel * t, 8)
            layers.append(block(input_channel, exp_size, output_channel, k, s, use_se, use_hs))
            input_channel = output_channel
        self.features = nn.Sequential(*layers)
        # building last several layers
        self.conv = conv_1x1_bn_quant(input_channel, exp_size)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        output_channel = {'large': 1280, 'small': 1024}
        output_channel = _make_divisible(output_channel[mode] * width_mult, 8) if width_mult > 1.0 else output_channel[mode]
        self.classifier = nn.Sequential(
            nn.Linear(exp_size, output_channel),
            h_swish_quant(),
            nn.Dropout(0.2),
            nn.Linear(output_channel, num_classes),
        )

        self._initialize_weights()

    def forward(self, x):
        x = self.quant(x)
        x = self.features(x)
        x = self.conv(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        x = self.dequant(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                n = m.weight.size(1)
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()


def mobilenetv3_small_quant(**kwargs):
    """
    Constructs a MobileNetV3-Small model
    """
    return MobileNetV3_quant(cfgs, mode='small', **kwargs)
##################################################################################
# RUN COMPARISON
##################################################################################
def test_net(mimage, quant):

    if quant:
        model = mobilenetv3_small_quant()
        model.qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
        model.train()  # prepare_qat expects a model in training mode
        torch.quantization.prepare_qat(model, inplace=True)
        # NOTE: no QAT fine-tuning or calibration happens here; this
        # script only compares inference speed, not accuracy.
        model = torch.quantization.convert(model.eval())
    else:
        model = mobilenetv3_small()

    model.eval()
    model.to(torch.device("cpu"))

    t0 = time.time()
    with torch.no_grad():
        with tqdm(total=RUNS, ncols=100) as pbar:
            for _ in range(RUNS):
                model(mimage)
                pbar.update()

    return (time.time() - t0) / RUNS


if __name__ == '__main__':
    RUNS = 100

    cfgs = [
        # k, t, c, SE, HS, s
        [3,    1,  16, 1, 0, 2],
        [3,  4.5,  24, 0, 0, 2],
        [3, 3.67,  24, 0, 0, 1],
        [5,    4,  40, 1, 1, 2],
        [5,    6,  40, 1, 1, 1],
        [5,    6,  40, 1, 1, 1],
        [5,    3,  48, 1, 1, 1],
        [5,    3,  48, 1, 1, 1],
        [5,    6,  96, 1, 1, 2],
        [5,    6,  96, 1, 1, 1],
        [5,    6,  96, 1, 1, 1],
    ]

    torch.set_num_threads(1)
    image = torch.rand(1, 3, 224, 224)
    print(f"time float32: {test_net(image, False)}")
    print(f"time quant: {test_net(image, True)}")

Having an operator-level profile might help narrow down whether certain ops are causing the slowdown.
From the model code, it seems there are some 1x1 convs in the network, and my understanding is that these may not be as efficient on fbgemm.
cc @dskhudia in case anything else stands out that may be causing slower inference on fbgemm

So I’ve changed every 1x1 conv to a 3x3 conv (with padding=1 to preserve output shapes), and there is a significant slowdown in the float32 architecture (from 0.011s to 0.019s) and a small slowdown in the quantized architecture (from 0.034s to 0.037s), but it’s still ~2x slower than float32. Any other ideas?

@supriyar 1x1 convs should be performant with the fbgemm backend. Also, in the reply by singularity I do see that quantization improves inference time.

@Racek : Is this issue resolved?


Sadly no :frowning_face:

I’m facing the same issue on a UNet architecture.

What I’ve found out is that with a smaller input like (1, 3, 32, 32), the quantized model performs similarly to fp32. With an even smaller input like (1, 3, 16, 16), the quantized model performs slightly better (a 6% speedup on fbgemm). Nonetheless, a 16x16 input is quite an extreme scenario.

@supriyar @dskhudia maybe this is helpful to trace what is wrong?

I’m facing the same issue with FX tracing on timm's efficientnet_b3. I get a 10x speedup with the regular backend but a 4x slowdown with qnnpack (torch.backends.quantized.engine = 'qnnpack'). Here are the top 10 time eaters:

---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  model_inference         1.91%       9.592ms       100.00%     503.021ms     503.021ms             1  
                   quantized::add        41.99%     211.224ms        42.02%     211.352ms      11.124ms            19  
                quantized::conv2d        20.23%     101.774ms        20.45%     102.887ms     791.436us           130  
                   quantized::mul        11.50%      57.855ms        11.56%      58.131ms       2.236ms            26  
                    aten::sigmoid        11.39%      57.277ms        11.40%      57.321ms       2.205ms            26  
                 aten::dequantize        11.26%      56.655ms        11.29%      56.795ms     530.796us           107  
                      aten::silu_         0.05%     238.570us         0.72%       3.618ms      46.382us            78  
                       aten::silu         0.67%       3.379ms         0.67%       3.379ms      43.323us            78  
        aten::quantize_per_tensor         0.34%       1.694ms         0.34%       1.694ms      20.918us            81  
                       aten::mean         0.14%     719.232us         0.16%     803.762us      29.769us            27  
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  

It’s sad that so much time is being taken by add and mul. The average time for one add is 11 ms! The convs are taking about as long as the FP32 version.
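To check whether the elementwise ops themselves are the problem, quantized::add can be micro-benchmarked in isolation (a sketch; the tensor shape and scale/zero_point values are arbitrary stand-ins). Note that a quantized add must requantize its output, so it is not guaranteed to beat a plain float add:

```python
import time
import torch

# Activation-sized tensors, quantized with arbitrary per-tensor params.
x = torch.rand(1, 96, 56, 56)
y = torch.rand(1, 96, 56, 56)
qx = torch.quantize_per_tensor(x, scale=0.02, zero_point=128, dtype=torch.quint8)
qy = torch.quantize_per_tensor(y, scale=0.02, zero_point=128, dtype=torch.quint8)

def bench(fn, runs=200):
    """Average wall-clock time of fn() in microseconds."""
    t0 = time.time()
    for _ in range(runs):
        fn()
    return (time.time() - t0) / runs * 1e6

t_float = bench(lambda: x + y)
t_quant = bench(lambda: torch.ops.quantized.add(qx, qy, 0.04, 128))
print(f"float add: {t_float:.1f}us  quantized add: {t_quant:.1f}us")
```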

Why is this thread getting so little attention? Serious question, as I’m new to putting AI on edge devices and I’m starting to wonder if I’ve gone down some rarely followed path.