Every quantization guide I read seems to gravitate towards using a single QuantStub instance and call: quantize the input, pass the quantized tensor through the model, and then dequantize the output.
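To make the pattern concrete, here is a minimal sketch of that one-quant/one-dequant setup using PyTorch's eager-mode API (`torch.ao.quantization`); the model architecture is an arbitrary illustration, not from any particular guide:

```python
import torch
import torch.nn as nn

class SingleStubModel(nn.Module):
    """The single-stub pattern most guides show: quantize once at the
    input, run everything in int8, dequantize once at the output."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # one quantize point
        self.fc1 = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()  # one dequantize point

    def forward(self, x):
        x = self.quant(x)                    # int8 from here on
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)               # back to float32

# Standard eager-mode flow: attach a qconfig, insert observers,
# calibrate on representative data, then convert to quantized modules.
model = SingleStubModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(8, 16))                    # calibration pass
torch.ao.quantization.convert(model, inplace=True)
```

With this layout, every activation in the model is quantized with parameters derived from the observers placed by `prepare`, but there is only one explicit input quantization point.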
However it makes sense for different layers/modules to use different quantization parameters (scale, zero-point).
- Am I wrong about this? Is there a reason no one mentions it? Is it that the de/quant ops add too much overhead and aren't worth the accuracy gain?
- In my initial experiments, using separate QuantStubs for different quant operations significantly improves accuracy. Is there something I'm not seeing?
At first I was forced to use multiple QuantStubs because there are parts of my model (primarily matmul/bmm) that cannot be quantized. Now I'm wondering whether there is an additional benefit beyond that.
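For reference, a multi-stub layout around an unquantizable matmul might look like the sketch below (an illustrative toy model, not the one from this post, assuming the eager-mode `torch.ao.quantization` API). Each QuantStub gets its own observer, so the re-quantization point after the float matmul learns its own scale and zero-point:

```python
import torch
import torch.nn as nn

class MultiStubModel(nn.Module):
    """Sketch: dequantize around a matmul that eager mode cannot
    quantize, then re-quantize with a second, independent QuantStub."""
    def __init__(self):
        super().__init__()
        self.quant_in = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 16)
        self.dequant_mid = torch.ao.quantization.DeQuantStub()
        # Second stub: observes the post-matmul distribution, so it
        # gets its own scale/zero-point rather than reusing quant_in's.
        self.quant_mid = torch.ao.quantization.QuantStub()
        self.fc_out = nn.Linear(16, 4)
        self.dequant_out = torch.ao.quantization.DeQuantStub()

    def forward(self, x):                    # x: (batch, seq, 16)
        x = self.quant_in(x)
        x = self.fc(x)
        x = self.dequant_mid(x)              # float island for the matmul
        attn = torch.matmul(x, x.transpose(-1, -2)).softmax(dim=-1)
        x = torch.matmul(attn, x)
        x = self.quant_mid(x)                # back to int8 with fresh params
        x = self.fc_out(x)
        return self.dequant_out(x)

model = MultiStubModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(4, 3, 16))                 # calibration pass
torch.ao.quantization.convert(model, inplace=True)
```

The accuracy question then comes down to whether the distributions before and after the float island differ enough that a dedicated scale/zero-point at `quant_mid` beats reusing a single shared one.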