Recently I have been deploying my model to Android. I found that my quantized model has very high CPU usage, even much higher than the fp32 float model. So I tried to run a minimal model with just a couple of Linear layers:
def _init_(self, idim, odim):
self.dequant = torch.quantization.DequantStub()
self.linear = torch.nn.Sequential(torch.nn.Linear(idim, 60), torch.nn.Linear(60, odim), torch.nn.ReLU())
def forward(self, x):
x = self.quant(x)
x = self.Linear(x)
x = self.dequant(x)
model = Model(idim, odim)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.backends.quantized.engine = 'qnnpack'
torch.quantization.prepare(model, inplace=True)
for i in range(100):  # calibration
    model(torch.randn(15, idim))
torch.quantization.convert(model, inplace=True)
script_module = torch.jit.script(model)
Then I load the exported script model and run inference from C++. To control the inference frequency, I run inference every 160 ms to imitate a streaming speech recognition task.
Even with this simple model, the quantized model's CPU usage is about 30% while the fp32 model's is about 1%.
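To separate latency from CPU utilization, a simple wall-clock measurement of the two scripted models can help. This is only a sketch: the `bench` helper is mine, and the file names below are the scripted models from this thread, which you would replace with your own paths.

```python
import time
import torch

def bench(path, iters=1000, idim=400):
    # Load a TorchScript model and report the mean latency per call in ms.
    model = torch.jit.load(path)
    model.eval()
    x = torch.randn(15, idim)
    with torch.no_grad():
        for _ in range(10):  # warm-up, excluded from timing
            model(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - t0) / iters * 1000

# e.g. print(bench("fsmn_quant_script.pt"), bench("fsmn_noquant_script.pt"))
```

Comparing the two numbers alongside the CPU-usage readings makes it clearer whether the quantized model is actually slower or just using more cores.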
My PyTorch version is 1.6.0 and my NDK is r19c. I tested my model on both arm32 and aarch64.
My libtorch build command:
This picture shows the CPU usage of the quantized (rectangle) and fp32 (circle) models on arm32:
@ptrblck @jerryzh168 @tom Do you have any ideas about this issue?
I don’t think this is how that kind of comparison works.
- If you run it separately, do you see a difference in timings?
- If you run each separately, is there a difference in CPU utilization?
- Do you have enough cores to even run this reasonably?
I also tested the running time. My quantized model is much slower than the fp32 model.
If I run each (quantized and non-quantized) model separately, the CPU utilization looks just like the picture above.
I have 4 cores. For this simple model, I think it’s enough.
@tom, you can test my quantized and fp32 models with "speed_benchmark_torch" using the following commands:
./speed_benchmark_torch --model=fsmn_quant_script.pt --input_type=float --input_dims="15,400" --iter=10000
./speed_benchmark_torch --model=fsmn_noquant_script.pt --input_type=float --input_dims="15,400" --iter=10000
Here are my quant and noquant models
It seems that the quantized model runs faster, but its CPU usage is much higher than the fp32 model's.
Well, you could limit the number of threads to make it a fairer comparison. The overall work done should decrease somewhat.
What might be happening here is that the backends are set up differently, and that is what you see.
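Pinning the thread count before loading the model is a quick way to check this from Python (a sketch; the same effect can be achieved from C++ once you find the corresponding API):

```python
import torch

# Limit intra-op parallelism to a single thread before any inference runs.
torch.set_num_threads(1)
print(torch.get_num_threads())  # -> 1

# Inter-op parallelism can also be limited, but only once and before
# any parallel work has started:
# torch.set_num_interop_threads(1)
```

With both models restricted to one thread, CPU utilization and wall-clock time measure the same thing.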
@tom Thanks for your reply. How can I set the number of threads when I do inference from C++?
I think you have (from torch/include/ATen/Parallel.h):
void set_num_threads(int);
void set_num_interop_threads(int);
Thank you very much. This works for me!