Every quantization guide I read seems to gravitate towards using a single QuantStub instance and call: quantize the input, pass the quantized tensor through the model, and then dequantize the output.
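To make the pattern concrete, here is a minimal sketch of that one-quant/one-dequant setup using PyTorch's eager-mode API (`torch.ao.quantization`); the model architecture is an arbitrary illustration, not from any particular guide:

```python
import torch
import torch.nn as nn

class SingleStubModel(nn.Module):
    """The single-stub pattern most guides show: quantize once at the
    input, run everything in int8, dequantize once at the output."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # one quantize point
        self.fc1 = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()  # one dequantize point

    def forward(self, x):
        x = self.quant(x)                    # int8 from here on
        x = self.fc2(self.relu(self.fc1(x)))
        return self.dequant(x)               # back to float32

# Standard eager-mode flow: attach a qconfig, insert observers,
# calibrate on representative data, then convert to quantized modules.
model = SingleStubModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(8, 16))                    # calibration pass
torch.ao.quantization.convert(model, inplace=True)
```

With this layout, every activation in the model is quantized with parameters derived from the observers placed by `prepare`, but there is only one explicit input quantization point.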
However it makes sense for different layers/modules to use different quantization parameters (scale, zero-point).
- Am I wrong about this? Is there a reason no one mentions it? Is it that the de/quant ops add too much overhead and aren't worth the accuracy gain?
- In my initial experiments, using separate QuantStubs for different quant operations significantly improves accuracy. Is there something I'm not seeing?
At first I was forced to use multiple QuantStubs because there are parts of my model (primarily matmul/bmm) that cannot be quantized. Now I'm wondering whether there is an additional benefit beyond that.
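For reference, a multi-stub layout around an unquantizable matmul might look like the sketch below (an illustrative toy model, not the one from this post, assuming the eager-mode `torch.ao.quantization` API). Each QuantStub gets its own observer, so the re-quantization point after the float matmul learns its own scale and zero-point:

```python
import torch
import torch.nn as nn

class MultiStubModel(nn.Module):
    """Sketch: dequantize around a matmul that eager mode cannot
    quantize, then re-quantize with a second, independent QuantStub."""
    def __init__(self):
        super().__init__()
        self.quant_in = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 16)
        self.dequant_mid = torch.ao.quantization.DeQuantStub()
        # Second stub: observes the post-matmul distribution, so it
        # gets its own scale/zero-point rather than reusing quant_in's.
        self.quant_mid = torch.ao.quantization.QuantStub()
        self.fc_out = nn.Linear(16, 4)
        self.dequant_out = torch.ao.quantization.DeQuantStub()

    def forward(self, x):                    # x: (batch, seq, 16)
        x = self.quant_in(x)
        x = self.fc(x)
        x = self.dequant_mid(x)              # float island for the matmul
        attn = torch.matmul(x, x.transpose(-1, -2)).softmax(dim=-1)
        x = torch.matmul(attn, x)
        x = self.quant_mid(x)                # back to int8 with fresh params
        x = self.fc_out(x)
        return self.dequant_out(x)

model = MultiStubModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(4, 3, 16))                 # calibration pass
torch.ao.quantization.convert(model, inplace=True)
```

The accuracy question then comes down to whether the distributions before and after the float island differ enough that a dedicated scale/zero-point at `quant_mid` beats reusing a single shared one.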