I am following the official mobile performance recipe to deploy several models on Android devices with PyTorch v2.0.0. The recipe includes these steps:
- Fuse conv and bn operations
- Conduct post training static quantization
- Apply mobile_optimizer on the scripted model
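For concreteness, here is a minimal sketch of the three steps above. `TinyNet` is a hypothetical stand-in module I made up for illustration (my actual models are larger, e.g. ResNet-18); the shapes and the fallback to `fbgemm` when `qnnpack` is unavailable are also my own assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Hypothetical stand-in for the real models (e.g. ResNet-18)
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.relu(self.bn(self.conv(self.quant(x))))
        return self.dequant(x)

model = TinyNet().eval()

# Step 1: fuse conv + bn (+ relu) -- requires eval mode
fused = torch.ao.quantization.fuse_modules(model, [["conv", "bn", "relu"]])

# Step 2: post-training static quantization; prefer the qnnpack engine
# (falling back to fbgemm on builds without qnnpack -- an assumption for
# running this sketch on a desktop)
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine
fused.qconfig = torch.ao.quantization.get_default_qconfig(engine)
prepared = torch.ao.quantization.prepare(fused)
prepared(torch.randn(1, 3, 32, 32))  # calibration pass
quantized = torch.ao.quantization.convert(prepared)

# Step 3: script and apply the mobile optimizer
scripted = torch.jit.script(quantized)
optimized = optimize_for_mobile(scripted)
```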
During my experiments, I observed performance degradation for some models after quantization. For example, ResNet-18 on a Pixel 5 (pinned to a single thread on the 2.4 GHz core) measures 132 ms without quantization but 213 ms with quantization.
One observation from the trace is that the fp32 convolutions always select the XNNPACK backend (i.e., conv2d_clamp_run), while the quantized models use the QNNPACK backend (e.g., qnnpackConv). I suppose the different backend implementations partly explain the performance gap, but I am still surprised that the quantized models end up slower.
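In case it helps others reproduce the observation, the per-operator kernel names can be inspected with the PyTorch profiler. A sketch (note this is an assumption about the desktop environment: on a non-mobile build the fp32 convolution may dispatch to oneDNN rather than XNNPACK, so the kernel names will differ from the on-device trace):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Any scripted model works; a single conv keeps the trace small
model = torch.jit.script(torch.nn.Conv2d(3, 8, 3).eval())
x = torch.randn(1, 3, 32, 32)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# The kernel names in the table show which backend handled each op
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```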
I also noticed that the quantized convolution (qconv) does check whether XNNPACK is applicable (i.e., can_use_xnnp), but the dtype of qconv is kQUInt8, which fails the following condition:
```cpp
bool supported_dtypes = dtype == c10::kQInt8;
```
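The mismatch is visible from Python as well. My understanding (an assumption on my part) is that PyTorch's XNNPACK path for quantized conv only accepts signed qint8, while the default qnnpack qconfig quantizes activations to unsigned quint8:

```python
import torch

# Activations quantized the way the default qnnpack qconfig does it:
# unsigned 8-bit with a zero point in [0, 255]
q = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=128,
                              dtype=torch.quint8)
print(q.dtype)                 # torch.quint8
print(q.dtype == torch.qint8)  # False -> fails the kQInt8 check above
```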
I am curious whether it is possible to make use of XNNPACK for quantized models, to see if there is any potential performance improvement.
I would really appreciate it if someone could share some insights.