Quantized model is slow and GPU usage becomes high with qnnpack

Vasiliy_Kuznetsov · November 4, 2020, 3:25am

Hi @Sining_Sun, quantized LayerNorm currently has an efficient kernel in fbgemm (x86), but it does not have an efficient kernel in qnnpack (ARM). So, you are likely seeing the slow fallback path of the kernel on ARM.

A workaround for now could be to keep LayerNorm in fp32. You can do this by setting the qconfig to None for the LayerNorm module, and moving the dequant so that it runs before LayerNorm. That way LayerNorm receives a regular float tensor instead of a quantized one.
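
Here is a rough sketch of what that could look like with eager mode static quantization. The `Block` module, its `fc` layer, and the tensor sizes are made up for illustration, so adapt them to your model. The snippet also falls back to fbgemm if your build does not list qnnpack as a supported engine, which is only there so the example runs on more machines.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Hypothetical toy module: quantized Linear followed by fp32 LayerNorm."""
    def __init__(self, dim=16):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(dim, dim)
        self.dequant = torch.quantization.DeQuantStub()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.quant(x)      # fp32 -> quantized
        x = self.fc(x)         # runs as a quantized Linear after convert()
        x = self.dequant(x)    # quantized -> fp32, placed BEFORE LayerNorm
        return self.norm(x)    # LayerNorm runs on a regular float tensor

# Prefer qnnpack; fall back to fbgemm if this build does not support it.
engines = torch.backends.quantized.supported_engines
backend = "qnnpack" if "qnnpack" in engines else "fbgemm"
torch.backends.quantized.engine = backend

model = Block().eval()
model.qconfig = torch.quantization.get_default_qconfig(backend)
model.norm.qconfig = None  # exclude LayerNorm from quantization

torch.quantization.prepare(model, inplace=True)
with torch.no_grad():
    model(torch.randn(4, 16))  # calibration pass with dummy data
torch.quantization.convert(model, inplace=True)

with torch.no_grad():
    out = model(torch.randn(4, 16))
```

After `convert`, `model.norm` should still be a plain `nn.LayerNorm` and the output should be a float tensor, while `model.fc` is swapped for its quantized counterpart.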