Hi!
I ran some experiments comparing quantized LLM inference speed with quantized matrix multiplication speed on CPU, on both an x86 Ubuntu machine and an aarch64 Kunpeng EulerOS machine.
Matrix multiplication speed (a timing sketch follows the results below):
x86 Ubuntu Python 1000x1000 bfloat16 0.87-0.88 sec
x86 Ubuntu Python 1000x1000 float16 0.83-0.84 sec
x86 Ubuntu Python 1000x1000 float32 0.015-0.03 sec
aarch64 EulerOS Python 1000x1000 float32 0.021-0.022 sec
aarch64 EulerOS Python 1000x1000 float16 0.062-0.064 sec
aarch64 EulerOS Python 1000x1000 bfloat16 0.043-0.045 sec
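The matmul timings can be reproduced with a small loop along these lines (a minimal sketch, not my exact script; the warm-up and the 10-run averaging are illustrative choices):

import time
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    # Create the operands in float32, then cast to the dtype under test.
    a = torch.randn(1000, 1000).to(dtype)
    b = torch.randn(1000, 1000).to(dtype)
    torch.matmul(a, b)  # warm-up so dispatch/allocation cost is not timed

    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        torch.matmul(a, b)
    print(f"{dtype}: {(time.perf_counter() - start) / runs:.4f} s per matmul")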
LLM inference speed (a timing sketch follows the results below):
Qwen2.5-0.5B-Instruct:
x86 Ubuntu CPU inference, bfloat16: 11.21 sec
x86 Ubuntu CPU inference, float16: 2.70 sec
x86 Ubuntu CPU inference, float32: 24.84 sec
aarch64 EulerOS CPU inference, bfloat16: 26.96 sec
aarch64 EulerOS CPU inference, float16: 11.62 sec
aarch64 EulerOS CPU inference, float32: 27.2 sec
Video-Llama-3-2b:
x86 Ubuntu CPU, bfloat16: 11.73 sec
x86 Ubuntu CPU, float16: 23.00 sec
x86 Ubuntu CPU, float32: 140.00 sec
aarch64 EulerOS CPU, bfloat16: 333.67 sec
aarch64 EulerOS CPU, float16: 49.55 sec
aarch64 EulerOS CPU, float32: 189 sec
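The inference numbers come from timing a single generate() call per dtype, roughly like the sketch below (the prompt, greedy decoding, and 128 new tokens are illustrative assumptions, not necessarily my exact settings; the Video-Llama-3-2b run follows the same pattern with its own processor):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Explain the difference between float16 and bfloat16."  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    # Reload the model in the dtype under test and time one generation pass.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(f"{dtype}: {time.perf_counter() - start:.2f} s")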
So for some LLMs, inference runs faster with float16/bfloat16 than with float32 on CPU, as expected. But for plain PyTorch matrix multiplication, it never does: the float16/bfloat16 matmuls are consistently slower than float32. I'm wondering how this works. Basically, how do certain LLM implementations achieve speed-ups for quantized/lower-precision calculations that go beyond what PyTorch's own matmul can do on CPU?