How to convert a 32-bit operation to a 4-bit or 8-bit operation on CPU?

To the best of my knowledge, existing quantization methods still perform their arithmetic in 32-bit.
I want to quantize the weights of a CNN to reduce its memory footprint and then deploy the quantized model on a mobile device. How can a 32-bit operation be converted to a 4-bit or 8-bit operation on CPU?

PyTorch quantization supports int8 (but not int4), with fast CPU kernels for mobile provided by QNNPACK. The static quantization tutorial at https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html is a good starting point; to target mobile CPUs you would set the quantized backend to qnnpack, as sketched below.
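
Here is a minimal sketch of eager-mode post-training static quantization with the qnnpack backend. The `SmallCNN` model, its layer shapes, and the random calibration data are all illustrative stand-ins, not anything from the tutorial itself:

```python
import torch
import torch.nn as nn

# Illustrative model; QuantStub/DeQuantStub mark where tensors
# enter and leave the quantized (int8) region of the graph.
class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 16, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)        # fp32 -> int8
        x = self.relu(self.conv(x))
        return self.dequant(x)   # int8 -> fp32

model = SmallCNN().eval()

# Target mobile CPUs via the QNNPACK backend.
torch.backends.quantized.engine = 'qnnpack'
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')

# Fuse conv + relu so they execute as a single quantized op.
model_fused = torch.quantization.fuse_modules(model, [['conv', 'relu']])

# Insert observers, calibrate on representative inputs
# (random tensors here purely for illustration), then convert.
model_prepared = torch.quantization.prepare(model_fused)
with torch.no_grad():
    for _ in range(10):
        model_prepared(torch.randn(1, 3, 32, 32))

model_int8 = torch.quantization.convert(model_prepared)
```

After `convert`, the conv weights are stored as int8 and the conv/relu computation runs through QNNPACK's quantized kernels, which is where the memory and speed savings on mobile come from.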