How to convert a 32-bit operation to a 4-bit or 8-bit operation on CPU?

To the best of my knowledge, existing quantization methods still operate on 32-bit values.
I want to quantize the weights of a CNN to reduce its memory footprint and then port the quantized model to a mobile device. How can I convert a 32-bit operation to a 4-bit or 8-bit operation on the CPU?

PyTorch quantization supports int8 (but not int4), with fast CPU kernels for mobile via QNNPACK. The PyTorch quantization documentation has some information to get started, and you would want to set the backend to qnnpack to target mobile CPUs.
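
To make the 32-bit to 8-bit step concrete, here is a minimal pure-Python sketch of affine (scale and zero-point) int8 quantization of a small weight list. It only illustrates the arithmetic and is not PyTorch's or QNNPACK's implementation; the helper names choose_qparams, quantize and dequantize, and the sample weights, are made up for this example.

```python
from array import array

QMIN, QMAX = -128, 127  # value range of a signed 8-bit integer


def choose_qparams(values):
    """Pick a scale and zero point mapping [min, max] onto [QMIN, QMAX].

    The range is widened to include 0.0 so that zero is exactly representable.
    """
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 if all values are zero
    zero_point = int(round(QMIN - lo / scale))
    zero_point = max(QMIN, min(QMAX, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point):
    """Map floats to int8: q = clamp(round(v / scale) + zero_point)."""
    return array(
        "b",  # signed char, 1 byte per element
        (max(QMIN, min(QMAX, int(round(v / scale)) + zero_point)) for v in values),
    )


def dequantize(q, scale, zero_point):
    """Recover approximate floats: v ~= scale * (q - zero_point)."""
    return [scale * (x - zero_point) for x in q]


# Hypothetical weights standing in for one layer of a CNN.
weights = [-0.75, -0.1, 0.0, 0.33, 1.25]

scale, zero_point = choose_qparams(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)

fp32_bytes = array("f", weights).itemsize * len(weights)  # 4 bytes per weight
int8_bytes = q_weights.itemsize * len(q_weights)          # 1 byte per weight

print("scale:", scale, "zero_point:", zero_point)
print("int8 weights:", list(q_weights))
print("restored:", restored)
print("bytes fp32 vs int8:", fp32_bytes, int8_bytes)
```

Storing the weights as int8 takes a quarter of the memory of float32, at the cost of a rounding error of roughly one quantization step (the scale) per weight. A real int8 backend also runs the arithmetic itself on the 8-bit values, which this sketch does not attempt.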