How to use a quantized model on INT8 hardware?

Recently I used PyTorch quantization-aware training to quantize my model.
The result still has good accuracy, and it uses per-channel scales.

However, our hardware colleagues told me that because the model has per-channel FP scales and zero-points, the hardware would still need to support FP in order to implement it.

They also argued that at each internal stage the values (per channel) would have to be dequantized to FP and then quantized again for the next layer.

I’m not sure whether this argument is valid, or whether such a limitation really applies to this kind of quantization.

For the first argument you are right: since scales and zero-points are FP, the hardware needs to support FP for the computation.

The second argument may not be true: with static quantization, the output of the previous layer can be fed into the next layer without dequantizing to FP. Maybe they are thinking about dynamic quantization, which keeps the tensors between two layers in FP.
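
Here is a minimal sketch of eager-mode static quantization (post-training, with made-up layer sizes) showing where the FP boundaries are: the tensor is quantized once at the QuantStub, stays INT8 between the layers, and is only dequantized at the DeQuantStub.

```python
import torch
import torch.nn as nn
import torch.quantization as tq

# Minimal sketch (toy model, made-up shapes): after convert(), the tensor is
# quantized once at QuantStub, stays quantized through conv/relu/linear, and is
# only dequantized once at DeQuantStub.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(8 * 30 * 30, 10)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)            # FP32 -> INT8, once, at the input boundary
        x = self.relu(self.conv(x))  # quantized ops, no dequantization in between
        x = self.fc(x.flatten(1))    # consumes the quantized tensor directly
        return self.dequant(x)       # INT8 -> FP32, once, at the output boundary

model = SmallNet().eval()
model.qconfig = tq.get_default_qconfig("fbgemm")
tq.prepare(model, inplace=True)
model(torch.randn(1, 3, 32, 32))         # calibration pass
tq.convert(model, inplace=True)          # now a statically quantized model
out = model(torch.randn(1, 3, 32, 32))   # inference stays INT8 between layers
```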


Are these per-channel scales determined for quantizing the network weights, or for the tensors (layer outputs)?

If it’s the first case, can’t we compute the INT8 weights offline (using the obtained scales) and upload them to the hardware, without needing FP support at all? I only use the hardware for the inference phase.

If these scales are required for quantizing the internal tensors (layer outputs/inputs), doesn’t that mean they have to be dequantized first at each stage in order to be quantized again with a different scale?
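
For context, here is roughly how I can already read the weights and their scales out of a converted model offline (a toy sketch with a single conv; the fbgemm default qconfig is just what I happen to use):

```python
import torch
import torch.nn as nn
import torch.quantization as tq

# Toy sketch: one conv, quantized post-training with the default fbgemm qconfig
# (which uses per-channel weight quantization), to show what is readable offline.
m = nn.Sequential(tq.QuantStub(), nn.Conv2d(3, 8, 3), tq.DeQuantStub()).eval()
m.qconfig = tq.get_default_qconfig("fbgemm")
tq.prepare(m, inplace=True)
m(torch.randn(1, 3, 32, 32))          # one calibration pass
tq.convert(m, inplace=True)

qconv = m[1]                          # the quantized Conv2d
w = qconv.weight()                    # per-channel quantized weight tensor
print(w.q_per_channel_scales())       # one FP scale per output channel (weights)
print(w.int_repr()[0, 0])             # the raw INT8 values, computable offline
print(qconv.scale, qconv.zero_point)  # per-tensor qparams of the conv OUTPUT
```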

For static quantization, you can feed the outputs directly to the next layer, but how do you add the zero-points if they are FP32? And where exactly in the flow does that happen?

If it helps, I found out that the scales are used to convert the weights and data to INT8 (each feature map, kernel, and input has its own scale), and then the multiplications are done in INT8. The results are accumulated in INT32 for each layer and requantized to INT8 using another scale, which you can only find when you access the model as a dictionary.
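
In other words, something like this (a toy numeric sketch with made-up scales and zero-points taken as 0):

```python
import torch

# Toy numeric sketch of that flow (scales are made up, zero-points taken as 0).
s_x, s_w, s_y = 0.02, 0.005, 0.04     # input, weight and output (requantization) scales

x_q = torch.randint(-128, 128, (16,)).to(torch.int8)
w_q = torch.randint(-128, 128, (16,)).to(torch.int8)

# INT8 multiplies accumulated in INT32 (one output value of a conv/linear)
acc = (x_q.to(torch.int32) * w_q.to(torch.int32)).sum(dtype=torch.int32)

# Requantize the INT32 accumulator into the next layer's INT8 domain.
# The combined multiplier M is the only floating-point value involved.
M = (s_x * s_w) / s_y
y_q = torch.clamp(torch.round(acc * M), -128, 127).to(torch.int8)
```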


So, does this requantization step in each layer (converting its output to INT8 with another scale) require the hardware to support FP32 arithmetic?

Also, how expensive would this conversion be compared to doing everything in FP32?

I think the answer is yes, if you want to run the whole inference in hardware at once… I don’t know how expensive it is in comparison; I think it really depends on the piece of hardware.

Yes, in our implementations we do a floating-point multiply to requantize from 32-bit to 8-bit. You can also do this with integer arithmetic if needed. We found that on ARM and x86 CPUs, doing the requantization with FP32 is more efficient.
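
If you do want to stay integer-only, the usual trick is to turn the FP32 multiplier into a fixed-point multiplier plus a right shift, roughly like this (a sketch, not our actual kernel code; rounding and saturation details vary between backends):

```python
# Sketch (gemmlowp-style): approximate the FP32 requantization multiplier as an
# INT32 fixed-point multiplier plus a right shift. Rounding details are simplified.
def quantize_multiplier(m_real):
    # express m_real ~= m_fixed * 2**-(31 + extra_shift), with m_fixed fitting in INT32
    extra_shift = 0
    while m_real < 0.5:
        m_real *= 2.0
        extra_shift += 1
    m_fixed = int(round(m_real * (1 << 31)))
    return m_fixed, 31 + extra_shift

acc = 12345                          # INT32 accumulator from the INT8 dot product
m_real = (0.02 * 0.005) / 0.04       # combined scale s_x * s_w / s_y (made-up values)
m_fixed, shift = quantize_multiplier(m_real)

# 64-bit multiply, rounding right shift, then saturate to INT8
y = (acc * m_fixed + (1 << (shift - 1))) >> shift
y_q = max(-128, min(127, y))
```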

Are there other possibilities, like having INT8/INT32 scale factors, or having INT32 scales but non-per-channel quantization?

Per-channel quantization is great for accuracy, but I don’t need that much accuracy; instead I need to perform all the calculations in INT8 or even INT32.

I see. We don’t have this right now; you would need to write a new Quantizer to enable it: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/quantized/Quantizer.h
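
For reference, both of the quantization schemes that exist today keep FP scales; only the number of scales differs (a quick sketch with made-up values):

```python
import torch

# Made-up values, just to contrast the two existing schemes: both use FP scales.
w = torch.randn(4, 3, 3, 3)          # e.g. a conv weight with 4 output channels

# per-tensor: one FP scale and zero-point for the whole tensor
w_pt = torch.quantize_per_tensor(w, scale=0.05, zero_point=0, dtype=torch.qint8)

# per-channel: one FP scale and zero-point per output channel (axis 0)
scales = torch.tensor([0.04, 0.06, 0.05, 0.07])
zero_points = torch.zeros(4, dtype=torch.int64)
w_pc = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)
```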

Is there a way to fake-quantize the scale factors or zero-points (instead of defining new Quantizers)?

If you only need this in quantization-aware training, you’ll need to define your own fake-quantize module (https://github.com/pytorch/pytorch/blob/v1.3.1/torch/quantization/fake_quantize.py) and fake-quantize op (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/fake_quant_per_tensor_affine.cpp).
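
As a rough starting point, a custom fake-quantize module could, for example, constrain the observed scale to a power of two so that requantization can be done with shifts. This is only a sketch (not an existing PyTorch API), and you would still need to wire it into a QConfig the same way the linked fake_quantize.py module is used:

```python
import torch
import torch.nn as nn
from torch.quantization import MovingAverageMinMaxObserver

# Hypothetical sketch: a fake-quantize module that rounds the observed scale to
# the nearest power of two, so a backend could requantize with bit shifts instead
# of an FP32 multiply. Not an existing PyTorch class.
class PowerOfTwoFakeQuantize(nn.Module):
    def __init__(self, quant_min=-128, quant_max=127):
        super().__init__()
        self.observer = MovingAverageMinMaxObserver(
            dtype=torch.qint8, qscheme=torch.per_tensor_affine)
        self.quant_min, self.quant_max = quant_min, quant_max

    def forward(self, x):
        self.observer(x.detach())                      # track running min/max
        scale, zero_point = self.observer.calculate_qparams()
        scale = 2.0 ** torch.round(torch.log2(scale))  # constrain scale to a power of two
        return torch.fake_quantize_per_tensor_affine(
            x, float(scale), int(zero_point), self.quant_min, self.quant_max)
```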