Recently I have been deploying my model to Android. I found that my quantized model has very high CPU usage, even much higher than the fp32 float model. So I tried to run a tiny model with only a couple of Linear layers, like this:
import torch

class Model(torch.nn.Module):
    def __init__(self, idim, odim):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()
        self.linear = torch.nn.Sequential(
            torch.nn.Linear(idim, 60), torch.nn.Linear(60, odim), torch.nn.ReLU())

    def forward(self, x):
        x = self.quant(x)
        x = self.linear(x)
        x = self.dequant(x)
        return x

model = Model(idim, odim).eval()
qconfig = torch.quantization.get_default_qconfig('qnnpack')
model.qconfig = qconfig
torch.backends.quantized.engine = 'qnnpack'
torch.quantization.prepare(model, inplace=True)   # insert observers
for i in range(1, 100):                           # calibration passes
    x = model(torch.randn(15, idim))
torch.quantization.convert(model, inplace=True)   # swap in quantized modules
script_module = torch.jit.script(model)
Then I load the exported script model and run inference from C++. To control the inference frequency, I run one inference every 160 ms to imitate a streaming speech recognition task.
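For reference, this is roughly what my C++ side looks like; it is a minimal sketch, not my exact application code. The file name "linear_qnnpack.pt" and the input dimension are placeholders (the scripted module is assumed to have been saved with script_module.save beforehand):

#include <torch/script.h>
#include <chrono>
#include <thread>
#include <vector>

int main() {
  // Hypothetical file name; replace with the path of the exported script module.
  torch::jit::script::Module module = torch::jit::load("linear_qnnpack.pt");
  module.eval();
  torch::NoGradGuard no_grad;

  const int64_t idim = 80;  // assumed input dimension
  while (true) {
    auto start = std::chrono::steady_clock::now();
    std::vector<torch::jit::IValue> inputs;
    inputs.emplace_back(torch::randn({15, idim}));
    at::Tensor out = module.forward(inputs).toTensor();
    // Sleep until 160 ms after the start of this iteration, so there is
    // one forward pass per 160 ms chunk, imitating a streaming front end.
    std::this_thread::sleep_until(start + std::chrono::milliseconds(160));
  }
  return 0;
}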
Even with this simple model, the quantized model's CPU usage is about 30%, while the fp32 model's is about 1%.
My PyTorch version is 1.6.0 and my NDK is r19c. I tested the model on both arm32 and aarch64.
My libtorch build command:
This picture shows the CPU usage of the quantized (rectangle) and fp32 (circle) models on arm32: