Quantized LLM inference vs. quantized matrix multiplication speed on CPU

In the end these will run the quantized ops. For int4 weight-only quantization on CUDA GPUs, for example, the op is declared in pytorch/aten/src/ATen/native/native_functions.yaml in the pytorch/pytorch repo (at commit cfea55dbecf93a88a40290a69c5e3b324dcec69c).
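To make the math concrete, here is a minimal pure-Python sketch of what a group-wise int4 weight-only matmul computes. This is not the fused kernel behind the native_functions.yaml entry; the function name, tensor layout, and group size below are assumptions chosen for illustration, and the real op works on a packed weight format instead of dequantizing explicitly.

```python
import torch

def int4_weight_only_mm_reference(x, w_q, scales, zeros, group_size=128):
    """Reference (non-fused) int4 weight-only matmul: dequantize the weight
    group-wise, then fall back to a regular matmul so the math is visible.

    x:      (M, K) activation in bfloat16/float16
    w_q:    (N, K) int4 values stored in an int8 tensor (range 0..15)
    scales: (N, K // group_size) per-group scales
    zeros:  (N, K // group_size) per-group zero points
    """
    # Expand per-group scale/zero to per-element so dequant is one expression.
    scales_full = scales.repeat_interleave(group_size, dim=1)  # (N, K)
    zeros_full = zeros.repeat_interleave(group_size, dim=1)    # (N, K)
    w_dequant = (w_q.to(x.dtype) - zeros_full) * scales_full   # (N, K) in x.dtype
    return x @ w_dequant.t()                                   # (M, N)

# Toy usage (group_size must divide K):
M, K, N, g = 4, 256, 8, 128
x = torch.randn(M, K, dtype=torch.bfloat16)
w_q = torch.randint(0, 16, (N, K), dtype=torch.int8)
scales = torch.rand(N, K // g, dtype=torch.bfloat16)
zeros = torch.full((N, K // g), 8.0, dtype=torch.bfloat16)
print(int4_weight_only_mm_reference(x, w_q, scales, zeros, group_size=g).shape)
# torch.Size([4, 8])
```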

For bfloat16/float16 etc., they just use the same op; there is some dispatch inside the op based on dtype that routes the compute to a different kernel for each dtype.
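The sketch below illustrates that dispatch pattern in Python, assuming hypothetical per-dtype kernel functions; in ATen the equivalent routing happens in C++ inside the op implementation (e.g. via the AT_DISPATCH_* macros), so callers see a single op regardless of activation dtype.

```python
import torch

# Hypothetical per-dtype "kernels" standing in for the real C++/CUDA
# implementations; the names are made up for illustration.
def _mm_kernel_fp16(x, w):
    return x @ w

def _mm_kernel_bf16(x, w):
    return x @ w

def quantized_mm_frontend(x, w):
    """One op entry point that routes to a per-dtype kernel, mirroring how
    the dtype-based dispatch inside an ATen op works conceptually."""
    if x.dtype == torch.float16:
        return _mm_kernel_fp16(x, w)
    elif x.dtype == torch.bfloat16:
        return _mm_kernel_bf16(x, w)
    else:
        raise TypeError(f"unsupported activation dtype: {x.dtype}")

x = torch.randn(4, 16, dtype=torch.bfloat16)
w = torch.randn(16, 8, dtype=torch.bfloat16)
print(quantized_mm_frontend(x, w).shape)  # torch.Size([4, 8])
```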