Hi all,

Recently, I have been deploying my model to Android. I found that my quantized model has very high CPU usage, even much higher than the fp32 float model. So I tried to run a model with only one Linear block, like:

import torch

class Linear(torch.nn.Module):
    def __init__(self, idim, odim):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()
        self.linear = torch.nn.Sequential(
            torch.nn.Linear(idim, 60),
            torch.nn.Linear(60, odim),
            torch.nn.ReLU())

    def forward(self, x):
        x = self.quant(x)
        x = self.linear(x)
        x = self.dequant(x)
        return x

idim = 400
odim = 200
model = Linear(idim, odim)
model.eval()

qconfig = torch.quantization.get_default_qconfig('qnnpack')
model.qconfig = qconfig
torch.backends.quantized.engine = 'qnnpack'
torch.quantization.prepare(model, inplace=True)

# Calibration
for i in range(100):
    x = model(torch.randn(15, idim))

torch.quantization.convert(model, inplace=True)
script_module = torch.jit.script(model)
script_module.save('model.pt')

Then I load the exported script model and run inference from C++. To control the inference frequency, I run inference every 160 ms to imitate a streaming speech recognition task.

Even with this simple model, the quantized model's CPU usage is about 30%, while the fp32 model's is about 1%.

My PyTorch version is 1.6.0 and my NDK is r19c. I tested the model on both arm32 and aarch64.

My libtorch build command:

./script/build_android.sh

This picture shows the CPU usage of the quantized (rectangle) and fp32 (circle) models on arm32: