Hi!
I ran some experiments comparing quantized LLM inference speed with quantized matrix multiplication speed on CPU, on both an x86 Ubuntu machine and an aarch64 Kunpeng EulerOS machine.
Matrix multiplication speed (a timing sketch follows the results below):
x86 Ubuntu Python 1000x1000 bfloat16 0.87-0.88 sec
x86 Ubuntu Python 1000x1000 float16 0.83-0.84 sec
x86 Ubuntu Python 1000x1000 float32 0.015-0.03 sec
aarch64 EulerOS Python 1000x1000 float32 0.021-0.022 sec
aarch64 EulerOS Python 1000x1000 float16 0.062-0.064 sec
aarch64 EulerOS Python 1000x1000 bfloat16 0.043-0.045 sec
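The matmul timings can be reproduced with a small loop along these lines (a minimal sketch, not my exact script; the warm-up and the 10-run averaging are illustrative choices):

import time
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    # Create the operands in float32, then cast to the dtype under test.
    a = torch.randn(1000, 1000).to(dtype)
    b = torch.randn(1000, 1000).to(dtype)
    torch.matmul(a, b)  # warm-up so dispatch/allocation cost is not timed

    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        torch.matmul(a, b)
    print(f"{dtype}: {(time.perf_counter() - start) / runs:.4f} s per matmul")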
LLM inference speed (a timing sketch follows the results below):
Qwen2.5-0.5B-Instruct:
x86 Ubuntu CPU inference, bfloat16: 11.21 sec
x86 Ubuntu CPU inference, float16: 2.70 sec
x86 Ubuntu CPU inference, float32: 24.84 sec
aarch64 EulerOS CPU inference, bfloat16: 26.96 sec
aarch64 EulerOS CPU inference, float16: 11.62 sec
aarch64 EulerOS CPU inference, float32: 27.2 sec
Video-Llama-3-2b:
x86 Ubuntu CPU, bfloat16: 11.73 sec
x86 Ubuntu CPU, float16: 23.00 sec
x86 Ubuntu CPU, float32: 140.00 sec
aarch64 EulerOS CPU, bfloat16: 333.67 sec
aarch64 EulerOS CPU, float16: 49.55 sec
aarch64 EulerOS CPU, float32: 189 sec
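The inference numbers come from timing a single generate() call per dtype, roughly like the sketch below (the prompt, greedy decoding, and 128 new tokens are illustrative assumptions, not necessarily my exact settings; the Video-Llama-3-2b run follows the same pattern with its own processor):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Explain the difference between float16 and bfloat16."  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    # Reload the model in the dtype under test and time one generation pass.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(f"{dtype}: {time.perf_counter() - start:.2f} s")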
So for some LLMs, inference runs faster with float16/bfloat16 than with float32 on CPU, as expected. But for plain PyTorch matrix multiplication, it never does: the float16/bfloat16 matmuls are consistently slower than float32. I'm wondering how this works. Basically, how do certain LLM implementations achieve speed-ups for quantized/lower-precision calculations that go beyond what PyTorch's own matmul can do on CPU?