Isn't the bias normally integer-quantized in an INT8 PTSQ model?

I am a student studying AI accelerators.
We are doing related research using PyTorch's eager mode quantization.
Looking at the converted model, we confirmed that the weights and activations come out quantized (the weights as qint8), but the bias does not.
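For reference, here is a minimal sketch of the kind of eager-mode flow we are running (the model below is just a placeholder for illustration, not our actual network):

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc = nn.Linear(16, 8)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)          # fp32 -> quantized
        x = self.relu(self.fc(x))
        return self.dequant(x)     # quantized -> fp32

model = SmallNet().eval()
model.qconfig = tq.get_default_qconfig("fbgemm")
tq.prepare(model, inplace=True)

with torch.no_grad():              # calibration passes
    for _ in range(8):
        model(torch.randn(4, 16))

tq.convert(model, inplace=True)

print(model.fc.weight().dtype)     # torch.qint8   -> weight is quantized
print(model.fc.bias().dtype)       # torch.float32 -> bias is not
```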

I have a few questions:

1. Isn't the bias usually quantized to INT8 in a quantized model?

2. Is eager mode the recommended way to experiment with an INT8 quantized model? It usually seems to require custom code, but I'm not very good at coding.

3. I've read that when experimenting with INT8 quantization, the qint8 values are dequantized to FP32 and then computed. If I want to study computation purely in INT8, what approach should I use? (Eager mode, FX, ONNX, or TensorFlow…)

4. Also, most backends (fbgemm or other quantization backends) don't seem to support pure INT8 operation well. Is there a way to get proper INT8 support?

Please share whatever you know.

For your first question: no, the bias is not usually quantized to int8; it is quantized to int32 (or even int64). From my understanding, the goal is to speed up the multiplications, which are done in int8 with the results accumulated in int32, which in turn explains why the bias needs to be in int32.
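To make that concrete, here is a rough sketch of the integer arithmetic (made-up scales, symmetric quantization with zero points of 0, and an explicit broadcast instead of a real GEMM kernel, so this is illustrative only):

```python
import torch

# Made-up scales for illustration; real ones come from calibration.
x_scale, w_scale = 0.05, 0.02

x_int8 = torch.randint(-128, 128, (1, 16), dtype=torch.int8)   # quantized activation
w_int8 = torch.randint(-128, 128, (8, 16), dtype=torch.int8)   # quantized weight

# The bias is quantized with scale x_scale * w_scale so it can be added
# directly to the int32 accumulator.
bias_fp32 = torch.randn(8)
bias_scale = x_scale * w_scale
bias_int32 = torch.round(bias_fp32 / bias_scale).to(torch.int32)

# int8 x int8 products, accumulated in int32.
acc_int32 = (x_int8.to(torch.int32).unsqueeze(1) * w_int8.to(torch.int32)).sum(dim=2)
acc_int32 = acc_int32 + bias_int32

# Dequantize the accumulator back to fp32 (or requantize to the output's int8 scale).
y_fp32 = acc_int32.to(torch.float32) * bias_scale
print(acc_int32.dtype, y_fp32.dtype)   # torch.int32 torch.float32
```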

Here is an extract from Google's quantization paper, which I recommend reading if you want to get to know quantization better:

Note that the biases are not quantized because they are represented as 32-bit integers in the inference process, with a much higher range and precision compared to the 8 bit weights and activations. Furthermore, quantization parameters used for biases are inferred from the quantization parameters of the weights and activations.
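That last sentence is the key point: the bias scale is not calibrated separately, it is taken as input_scale * weight_scale. Here is a small hypothetical check in eager mode, assuming the default fbgemm qconfig (which gives per-channel weight scales):

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

# Tiny throwaway model: QuantStub -> Linear -> DeQuantStub.
m = nn.Sequential(tq.QuantStub(), nn.Linear(16, 8), tq.DeQuantStub()).eval()
m.qconfig = tq.get_default_qconfig("fbgemm")
tq.prepare(m, inplace=True)
m(torch.randn(32, 16))                          # one calibration pass
tq.convert(m, inplace=True)

quant, fc = m[0], m[1]
act_scale = quant.scale.item()                  # scale of the Linear's quantized input
w_scales = fc.weight().q_per_channel_scales()   # per-channel weight scales

# Per the quoted paper, the fp32 bias would map to int32 with scale act_scale * w_scale.
bias_int32 = torch.round(fc.bias() / (act_scale * w_scales)).to(torch.int32)
print(fc.bias().dtype, bias_int32.dtype)        # torch.float32 torch.int32
```

As far as I know, this is also why you see an unquantized bias when you inspect a converted model: the eager/fbgemm path keeps the bias in fp32 on the module and the backend requantizes it with these scales at kernel time.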
