How to use a quantized model on INT8 hardware?

Recently I used PyTorch quantization-aware training to quantize my model.
The result still has good accuracy, and it uses per-channel scales.

However, our hardware colleagues told me that because the model has per-channel FP scales and zero-points, the hardware would still need to support FP in order to run it.

They also argued that in each internal stage, the values (in-channels) should be dequantized and converted to FP and quantized again for the next layer.

I’m not sure whether this argument is valid, or whether such a limitation applies to this kind of quantization.

For the first argument you are right: since the scales and zero-points are FP, the hardware needs to support FP for the computation.

The second argument may not be true. For static quantization, the output of the previous layer can be fed into the next layer without dequantizing to FP. Maybe they are thinking about dynamic quantization, which keeps the tensors between two layers in FP.


Are these per channel scales determined for quantizing the network weights or for the tensors (layer outputs)?

If it’s the first case, then can’t we compute INT8 weights offline (using the obtained scales) and upload them to the hardware without the need for having FP support? I only use the hardware for the inference phase.
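To make the question concrete, here is a minimal sketch of what "computing INT8 weights offline" could look like. It is plain Python, not PyTorch code; the function name `quantize_weights_per_channel`, the example weights, and the scales are all made up for illustration, and it assumes symmetric per-channel quantization (zero-point 0) into the signed range [-128, 127].

```python
def quantize_weights_per_channel(weights, scales, zero_points=None,
                                 qmin=-128, qmax=127):
    """Quantize each output channel (row) of `weights` with its own scale.

    Hypothetical helper for illustration: weights is a list of rows,
    one row per output channel; scales has one FP scale per row.
    """
    if zero_points is None:
        # Assumption: symmetric quantization, so every zero-point is 0.
        zero_points = [0] * len(weights)
    q_weights = []
    for row, scale, zp in zip(weights, scales, zero_points):
        # q = clamp(round(w / scale) + zero_point, qmin, qmax)
        q_row = [max(qmin, min(qmax, round(w / scale) + zp)) for w in row]
        q_weights.append(q_row)
    return q_weights


# Two output channels, each with its own (made-up) scale.
weights = [[0.50, -0.25, 0.10],
           [1.20, 0.00, -0.60]]
scales = [0.005, 0.01]

q_weights = quantize_weights_per_channel(weights, scales)
print(q_weights)  # [[100, -50, 20], [120, 0, -60]]
```

This step only needs FP on the machine that prepares the model. The scales themselves still matter at inference time, because they enter the rescaling of each layer's accumulated output.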

If these scales are required for quantizing the internal tensors (layer outputs/inputs), then doesn’t it mean that the tensors have to be dequantized first at each stage and then quantized again with a different scale?

For static quantization, you can feed the outputs directly to the next layer, but how do you add the zero-points if they are FP32? And when exactly in the flow does that happen?

If it helps, I found out that they use the scales to convert the weights and data to INT8 (each feature map, kernel, and input has its own scale). After that they do the multiplications in INT8, accumulate the result in INT32 for each layer, and requantize to INT8 using another scale, which you can only find when you access the model as a dictionary.
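The flow described here can be sketched in plain Python for a single dot product. This is only an illustration under assumptions, not how any particular backend implements it: the helper names (`quantize`, `quantized_dot`, `requantize_fp`), the scales, and the zero-points are all hypothetical, activations are taken as unsigned 8-bit, and weights as signed 8-bit with zero-point 0.

```python
def quantize(x, scale, zero_point, qmin, qmax):
    """Map a real value to an integer: q = clamp(round(x / scale) + zp)."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def quantized_dot(q_x, x_zp, q_w, w_zp=0):
    """Integer-only dot product; `acc` plays the role of the INT32 accumulator."""
    acc = 0
    for xv, wv in zip(q_x, q_w):
        # Zero-points are integers, so they are subtracted in integer math.
        acc += (xv - x_zp) * (wv - w_zp)
    return acc


def requantize_fp(acc, in_scale, w_scale, out_scale, out_zp, qmin=0, qmax=255):
    """Rescale the accumulator to 8 bits with one floating-point multiply."""
    real_multiplier = in_scale * w_scale / out_scale
    return max(qmin, min(qmax, round(acc * real_multiplier) + out_zp))


# Made-up example values: the real computation is 0.5 - 0.5 + 0.3 = 0.3.
x, in_scale, in_zp = [1.0, 2.0, 3.0], 0.05, 10
w, w_scale = [0.50, -0.25, 0.10], 0.005
out_scale, out_zp = 0.01, 5

q_x = [quantize(v, in_scale, in_zp, 0, 255) for v in x]    # [30, 50, 70]
q_w = [quantize(v, w_scale, 0, -128, 127) for v in w]      # [100, -50, 20]

acc = quantized_dot(q_x, in_zp, q_w)                       # 1200
q_out = requantize_fp(acc, in_scale, w_scale, out_scale, out_zp)
print(acc, q_out)  # 1200 35
```

In this sketch the only floating-point operation at inference time is the multiply inside `requantize_fp`. The output `q_out` is already in the next layer's input format, so nothing is dequantized to FP between layers.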


So, does this requantization step in each layer (converting its output to INT8 with another scale) require the hardware to support FP32 arithmetic?

Also, how expensive would this conversion be compared to doing everything in FP32?

I think the answer is yes if you want to do the whole inference in hardware at once… I do not know how expensive it is in comparison; I think it depends on the piece of hardware.

Yes, in our implementations we do a floating point multiply to requantize from 32 bit to 8 bit. You can also do this with integer arithmetic if needed. We found that on ARM and x86 CPUs, doing the requantization with fp32 is more efficient.
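For readers wondering what requantization "with integer arithmetic" might look like, one common approach is to approximate the FP multiplier by a fixed-point integer multiplier plus a right shift, both computed offline. The sketch below is an assumption-laden illustration, not the implementation referred to above: the names `quantize_multiplier` and `requantize_int` are hypothetical, and it assumes the real multiplier lies strictly between 0 and 1.

```python
import math


def quantize_multiplier(real_multiplier, bits=31):
    """Offline step: express real_multiplier as m * 2**(-shift).

    Assumes 0 < real_multiplier < 1. `m` fits in a signed 32-bit integer.
    """
    mantissa, exponent = math.frexp(real_multiplier)  # mantissa in [0.5, 1)
    m = round(mantissa * (1 << bits))
    if m == (1 << bits):  # rounding pushed the mantissa up to 1.0
        m //= 2
        exponent += 1
    return m, bits - exponent


def requantize_int(acc, m, shift, out_zp, qmin=0, qmax=255):
    """On-device step: integer multiply, rounding right shift, add zero-point.

    The product acc * m needs a 64-bit intermediate on real hardware.
    """
    rounded = (acc * m + (1 << (shift - 1))) >> shift  # rounds half up
    return max(qmin, min(qmax, rounded + out_zp))


# Same made-up numbers as a multiplier of 0.05 * 0.005 / 0.01 = 0.025.
m, shift = quantize_multiplier(0.025)
print(m, shift)                          # 1717986918 36
print(requantize_int(1200, m, shift, 5)) # 35
```

With this scheme the device only needs integer multiply, add, and shift. The cost is a small approximation error in the multiplier and possibly different tie-breaking from an FP32 `round`, so results can differ from the FP path by one quantization step in rare cases.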

Are there other possibilities, like having INT8/INT32 scale factors, or using INT32 with non-per-channel quantization?

Per-channel quantization is great for accuracy, but I do not need that much accuracy; instead, I need to perform all the calculations in INT8 or, at most, INT32.

I see. We don’t have this right now; you will need to write a new Quantizer to enable this:

Is there a way to fake quantize scale factors or zero points (instead of defining new Quantizers)?

If you only need this in quantization-aware training, you’ll need to define your own fake quantize module and fake quantize op.