INT8 inference with my own scaling factors

Hi, first of all, thank you for your great work.

I’m trying to perform INT8 inference with a model quantized by post-training quantization methods such as PTQ4ViT and EasyQuant, and I’ve reviewed the GitHub code released with these papers. Although the implementations are in PyTorch and do compute scaling factors and other quantization parameters, they only clamp values to the integer range (e.g., [-128, 127] for INT8). The actual computation still appears to run in FP32, so it does not accelerate inference. (See (1) PTQ4ViT: https://github.com/hahnyuan/PTQ4ViT, and (2) PD-Quant [CVPR 2023], Post-Training Quantization Based on Prediction Difference Metric: https://github.com/hustvl/PD-Quant.)
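
To make sure I’m reading those repos correctly, the pattern seems to boil down to something like the following (my own minimal sketch of simulated “fake” quantization, not code taken from either repo):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, n_bits: int = 8) -> torch.Tensor:
    """Simulated ('fake') quantization: snap values onto the INT8 grid,
    then immediately dequantize. Every operation here runs in FP32."""
    qmin = -(2 ** (n_bits - 1))        # -128 for INT8
    qmax = 2 ** (n_bits - 1) - 1       #  127 for INT8
    x_int = torch.clamp(torch.round(x / scale), qmin, qmax)
    return x_int * scale               # back to FP32: no integer kernel runs

x = torch.randn(4, 8)
print(fake_quantize(x, scale=0.05))
```

So the accuracy impact of quantization is modeled, but the matmuls themselves are still FP32 GEMMs.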

So I’m wondering: is there a way to run genuinely low-precision inference once I’ve obtained my own quantization parameters?
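
For example, could I plug my calibrated scales directly into PyTorch’s native quantized kernels along these lines? This is only a sketch of what I’d like to do, assuming a recent PyTorch with the fbgemm/qnnpack backend; the scale values below are placeholders, not real calibration results.

```python
import torch
import torch.ao.nn.quantized as nnq  # torch.nn.quantized on older versions

# Placeholder quantization parameters; in practice these would come
# from my own PTQ calibration (e.g., PTQ4ViT / EasyQuant).
act_scale, act_zp = 0.02, 64     # activation scale / zero point (quint8)
w_scale,   w_zp   = 0.01, 0      # weight scale / zero point (qint8)
out_scale, out_zp = 0.05, 64     # output requantization parameters

# Quantize the activation with my own scale (stored as INT8 data + scale).
x_fp = torch.randn(1, 64)
x_q = torch.quantize_per_tensor(x_fp, act_scale, act_zp, torch.quint8)

# Build a quantized Linear and load my externally quantized weight.
w_q = torch.quantize_per_tensor(torch.randn(32, 64), w_scale, w_zp, torch.qint8)
qlinear = nnq.Linear(64, 32)
qlinear.set_weight_bias(w_q, torch.zeros(32))
qlinear.scale, qlinear.zero_point = out_scale, out_zp

y_q = qlinear(x_q)               # dispatches to an integer GEMM kernel
print(y_q.dequantize()[0, :4])
```

Is this the intended way to inject custom scaling factors, or is there a better-supported path (e.g., exporting the parameters to a dedicated inference runtime)?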

Thanks.