I am following the official mobile performance recipe to deploy several models on Android devices with PyTorch v2.0.0. The recipe includes these steps:
- Fuse conv and bn operations
- Conduct post training static quantization
- Apply mobile_optimizer on the scripted model
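For concreteness, here is a minimal sketch of the three steps above. `TinyNet` is a hypothetical stand-in module I made up for illustration (my actual models are larger, e.g. ResNet-18); the shapes and the fallback to `fbgemm` when `qnnpack` is unavailable are also my own assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Hypothetical stand-in for the real models (e.g. ResNet-18)
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.relu(self.bn(self.conv(self.quant(x))))
        return self.dequant(x)

model = TinyNet().eval()

# Step 1: fuse conv + bn (+ relu) -- requires eval mode
fused = torch.ao.quantization.fuse_modules(model, [["conv", "bn", "relu"]])

# Step 2: post-training static quantization; prefer the qnnpack engine
# (falling back to fbgemm on builds without qnnpack -- an assumption for
# running this sketch on a desktop)
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine
fused.qconfig = torch.ao.quantization.get_default_qconfig(engine)
prepared = torch.ao.quantization.prepare(fused)
prepared(torch.randn(1, 3, 32, 32))  # calibration pass
quantized = torch.ao.quantization.convert(prepared)

# Step 3: script and apply the mobile optimizer
scripted = torch.jit.script(quantized)
optimized = optimize_for_mobile(scripted)
```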
During my experiments, I observed performance degradation for some models after quantization. For example, ResNet-18 on a Pixel 5 (pinned to a single thread on the 2.4 GHz core) measures 132 ms without quantization but 213 ms with quantization.
One observation from the trace is that the fp32 convolutions always select the XNNPACK backend (i.e., conv2d_clamp_run), while the quantized models use the QNNPACK backend (e.g., qnnpackConv). I suppose the different backend implementations partly explain the performance gap, but I am still surprised that the quantized models end up slower.
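In case it helps others reproduce the observation, the per-operator kernel names can be inspected with the PyTorch profiler. A sketch (note this is an assumption about the desktop environment: on a non-mobile build the fp32 convolution may dispatch to oneDNN rather than XNNPACK, so the kernel names will differ from the on-device trace):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Any scripted model works; a single conv keeps the trace small
model = torch.jit.script(torch.nn.Conv2d(3, 8, 3).eval())
x = torch.randn(1, 3, 32, 32)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# The kernel names in the table show which backend handled each op
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```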
I also noticed that the quantized convolution (qconv) does check whether XNNPACK is applicable (i.e., can_use_xnnp), but the dtype of qconv is kQUInt8, which fails the following condition:
```cpp
bool supported_dtypes = dtype == c10::kQInt8;
```
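The mismatch is visible from Python as well. My understanding (an assumption on my part) is that PyTorch's XNNPACK path for quantized conv only accepts signed qint8, while the default qnnpack qconfig quantizes activations to unsigned quint8:

```python
import torch

# Activations quantized the way the default qnnpack qconfig does it:
# unsigned 8-bit with a zero point in [0, 255]
q = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=128,
                              dtype=torch.quint8)
print(q.dtype)                 # torch.quint8
print(q.dtype == torch.qint8)  # False -> fails the kQInt8 check above
```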
I am curious whether it is possible to make use of XNNPACK for quantized models, to see if there is any potential performance improvement.
I would really appreciate it if someone could share some insights.