Every quantization guide I read seems to gravitate towards using a single QuantStub instance and call: quantize the input, pass the quantized tensor through the model, and then dequantize the output.
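To make the pattern concrete, here is a minimal sketch of that one-quant/one-dequant setup using PyTorch's eager-mode API (`torch.ao.quantization`); the model architecture is an arbitrary illustration, not from any particular guide:

```python
import torch
import torch.nn as nn

class SingleStubModel(nn.Module):
    """The single-stub pattern most guides show: quantize once at the
    input, run everything in int8, dequantize once at the output."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # one quantize point
        self.fc1 = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()  # one dequantize point

    def forward(self, x):
        x = self.quant(x)                    # int8 from here on
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)               # back to float32

# Standard eager-mode flow: attach a qconfig, insert observers,
# calibrate on representative data, then convert to quantized modules.
model = SingleStubModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(8, 16))                    # calibration pass
torch.ao.quantization.convert(model, inplace=True)
```

With this layout, every activation in the model is quantized with parameters derived from the observers placed by `prepare`, but there is only one explicit input quantization point.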
However it makes sense for different layers/modules to use different quantization parameters (scale, zero-point).
- Am I wrong about this? Is there a reason no one mentions it? Is it that the de/quant ops add too much overhead and aren't worth the accuracy gain?
- In my initial experiments, using separate QuantStubs for different quant operations significantly improves accuracy. Is there something I'm not seeing?
At first I was forced to use multiple QuantStubs because there are parts of my model (primarily matmul/bmm) that cannot be quantized. Now I'm wondering whether there is an additional benefit beyond that.
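For reference, a multi-stub layout around an unquantizable matmul might look like the sketch below (an illustrative toy model, not the one from this post, assuming the eager-mode `torch.ao.quantization` API). Each QuantStub gets its own observer, so the re-quantization point after the float matmul learns its own scale and zero-point:

```python
import torch
import torch.nn as nn

class MultiStubModel(nn.Module):
    """Sketch: dequantize around a matmul that eager mode cannot
    quantize, then re-quantize with a second, independent QuantStub."""
    def __init__(self):
        super().__init__()
        self.quant_in = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 16)
        self.dequant_mid = torch.ao.quantization.DeQuantStub()
        # Second stub: observes the post-matmul distribution, so it
        # gets its own scale/zero-point rather than reusing quant_in's.
        self.quant_mid = torch.ao.quantization.QuantStub()
        self.fc_out = nn.Linear(16, 4)
        self.dequant_out = torch.ao.quantization.DeQuantStub()

    def forward(self, x):                    # x: (batch, seq, 16)
        x = self.quant_in(x)
        x = self.fc(x)
        x = self.dequant_mid(x)              # float island for the matmul
        attn = torch.matmul(x, x.transpose(-1, -2)).softmax(dim=-1)
        x = torch.matmul(attn, x)
        x = self.quant_mid(x)                # back to int8 with fresh params
        x = self.fc_out(x)
        return self.dequant_out(x)

model = MultiStubModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(4, 3, 16))                 # calibration pass
torch.ao.quantization.convert(model, inplace=True)
```

The accuracy question then comes down to whether the distributions before and after the float island differ enough that a dedicated scale/zero-point at `quant_mid` beats reusing a single shared one.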