SmoothQuant's reference implementation can only run on GPUs with CUTLASS support.
There are many SmoothQuant-quantized models on Hugging Face, and I want to run inference with them on an ARM CPU-only server, performance notwithstanding. I have struggled with this for a long time but haven't found a viable approach.
- Should I modify SmoothQuant or torch-int?
- PyTorch supports quantization via the QNNPACK backend, and it provides both module interfaces (e.g., `torch.ao.nn.quantized.Linear`, with unspecified input tensor dtype) and functional interfaces (e.g., `torch.ao.nn.quantized.functional.linear`, which takes a `quint8` input and a `qint8` weight). Neither matches the CUTLASS-based interfaces in torch-int.
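For concreteness, here is the kind of replacement I have in mind; this is a rough sketch, not working SmoothQuant code. The idea is to swap each of torch-int's W8A8 linear layers for PyTorch's built-in quantized ops, which dispatch to QNNPACK on ARM. `w8a8_linear` is a hypothetical name of my own, and the activation scale is computed dynamically here for simplicity, whereas SmoothQuant uses static per-tensor activation scales.

```python
import torch
import torch.ao.nn.quantized.functional as qF

# On an ARM build the QNNPACK engine should be available; select it if so.
if "qnnpack" in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = "qnnpack"

def w8a8_linear(x_fp32, w_int8, w_scale, out_scale, bias=None):
    """Hypothetical W8A8 linear built on PyTorch's quantized kernels.

    x_fp32:    float activations
    w_int8:    int8 weight from SmoothQuant, shape (out_features, in_features)
    w_scale:   per-tensor weight scale from SmoothQuant
    out_scale: requantization scale for the quint8 output
    """
    # Quantize activations to quint8, PyTorch's expected input dtype.
    # (Dynamic scale here; SmoothQuant would use a static calibrated scale.)
    x_scale = max(float(x_fp32.abs().max()) / 127.0, 1e-8)
    qx = torch.quantize_per_tensor(x_fp32, x_scale, 128, torch.quint8)
    # Wrap the raw int8 weights as a qint8 quantized tensor (zero_point 0).
    qw = torch._make_per_tensor_quantized_tensor(w_int8, w_scale, 0)
    # Dispatches to the active quantized engine (QNNPACK on ARM).
    qy = qF.linear(qx, qw, bias=bias, scale=out_scale, zero_point=128)
    return qy.dequantize()

# Sanity check against a float reference. Weights are kept small here to
# avoid the int16-accumulation overflow of the x86 FBGEMM engine, which
# otherwise requires reduced-range activations.
torch.manual_seed(0)
x = torch.randn(4, 8)
w = torch.randint(-63, 64, (16, 8), dtype=torch.int8)
w_scale = 0.05
ref = x @ (w.float() * w_scale).t()
y = w8a8_linear(x, w, w_scale, out_scale=float(ref.abs().max()) / 100.0)
```

Is replacing the kernels this way sound, or does SmoothQuant's static activation scaling make the module-swap approach (rewriting torch-int's `W8A8Linear` modules) the better path?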
I would greatly appreciate any thoughts, workflows, or example code/pseudocode.