Static quantization: different training and inference backends

What is the best way to handle different training and inference backends? In this blog post, it says:

static quantization must be performed on a machine with the same architecture as your deployment target. If you are using FBGEMM, you must perform the calibration pass on an x86 CPU; if you are using QNNPACK, calibration needs to happen on an ARM CPU

But I can’t find anything about this in the official tutorial. How accurate is this statement? Is it true for both options (post-training calibration and quantization-aware training) or only for the calibration-based one?

Hi, I meant to reply here, but forgot.

There are some subtle caveats, but I don’t think the author’s description is completely accurate here.

  • The quantization configuration is backend-specific and the operator coverage may differ. This means it is usually preferable to use the same quantization backend throughout, e.g. when I do this, I use QNNPACK for both training/conversion (on x86) and for inference on the ARM target (see the sketch after this list).
  • Barring bugs, I would actually expect less potential for variation in the computation results of the quantized model, as the machine-precision deviations originating from floating-point non-associativity should go away for the quantized (part of the) computation.
  • However, there is no firm requirement that calibration and inference happen on exactly the same architecture. If there were, we’d be in lots of trouble with Quantization-Aware Training…
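
As a minimal sketch of what I mean by using the same backend throughout (the model and calibration data below are stand-ins, not code from the blog), post-training static quantization with QNNPACK on an x86 machine could look like this:

```python
import torch
import torch.nn as nn

# A tiny stand-in model; QuantStub/DeQuantStub mark the quantized region.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyModel().eval()

# Use QNNPACK for both the qconfig and the quantized engine,
# even though this script runs on an x86 machine.
torch.backends.quantized.engine = 'qnnpack'
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')

# Insert observers, run a calibration pass, then convert.
torch.quantization.prepare(model, inplace=True)
with torch.no_grad():
    for _ in range(10):                      # stand-in calibration data
        model(torch.randn(1, 3, 32, 32))
torch.quantization.convert(model, inplace=True)
```

The converted model can then be saved and deployed to the ARM target, where the engine is likewise set to QNNPACK.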

Best regards

Thomas

P.S.: I have an ongoing four-part series on quantizing an audio model. In part 2, posted today, I try to cover my world-view of what is going on with quantization, and in the next part we’ll actually do the quantization.


Thanks @tom, this helps a lot!

One more thing I’d like to confirm: once you train, calibrate, and quantize the model with QNNPACK, is it still fine to evaluate the quantized model on an x86 machine? Can we expect an accurate score, or should we run the evaluation on the target device (target CPU architecture)?
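
To make the question concrete, here is roughly what I have in mind (reusing the converted `model` from the sketch above; the input is a stand-in for my real test data):

```python
import torch

# Still on the x86 machine: keep the quantized engine set to QNNPACK
# so the QNNPACK-converted model's ops can dispatch at all.
torch.backends.quantized.engine = 'qnnpack'

# Run the converted model on stand-in test data; in my real setup
# this loop would accumulate the evaluation metric over the test set.
with torch.no_grad():
    out = model(torch.randn(1, 3, 32, 32))
print(out.shape)  # quantized inference ran on x86
```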