I’d like to do low-level quantization of a tensor x of shape (num_channels, blocksize). I know how to use MovingAveragePerChannelMinMaxObserver to obtain the quantization params, and I also understand how to then use torch.quantize_per_channel to get q_x. From the docs, the quant params seem to somehow be attached to q_x.
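For concreteness, roughly what I do so far (the shape, dtype and symmetric qscheme here are just an example setup):

```python
import torch
from torch.ao.quantization.observer import MovingAveragePerChannelMinMaxObserver

num_channels, blocksize = 8, 64           # example shape
x = torch.randn(num_channels, blocksize)

# Observer tracks per-channel min/max and derives quantization params
obs = MovingAveragePerChannelMinMaxObserver(
    ch_axis=0, dtype=torch.qint8, qscheme=torch.per_channel_symmetric
)
obs(x)                                    # update running statistics
scales, zero_points = obs.calculate_qparams()

# Quantize per channel along axis 0; the params end up attached to q_x
q_x = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)
print(q_x.q_per_channel_scales(), q_x.q_per_channel_zero_points())
```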
But I’d now (for other reasons) like to store the quantization params separately from q_x.int_repr(). Even though this is not documented anywhere, I suspect that torch.dequantize would do the right thing when applied to q_x. But in my case, I store q_x.int_repr() plus the quantization params separately. What is a good way to dequantize then, given that I have the internal representation and the parameters? How can I safely get a “quantized tensor” (i.e., like the output of quantize_per_channel) from its parts?
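To make it concrete, by “dequantizing from the parts” I just mean the per-channel affine formula; a minimal sketch (the helper name is mine, not a torch API):

```python
import torch

def dequantize_from_parts(int_repr, scales, zero_points, axis=0):
    # int_repr: integer tensor from q_x.int_repr()
    # scales / zero_points: per-channel params, one entry per slice along `axis`
    shape = [1] * int_repr.dim()
    shape[axis] = -1
    s = scales.to(torch.float32).reshape(shape)
    z = zero_points.to(torch.float32).reshape(shape)
    # affine dequantization: x_hat = (q - zero_point) * scale
    return (int_repr.to(torch.float32) - z) * s

# should match torch.dequantize(q_x):
# dequantize_from_parts(q_x.int_repr(), q_x.q_per_channel_scales(),
#                       q_x.q_per_channel_zero_points(), q_x.q_per_channel_axis())
```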
I found some issue pages mentioning torch._make_per_channel_quantized_tensor, but I am not sure this is even meant to be called. Why is this so hidden? Especially given that tensors with torch.qint8 do not really implement many operations; for example, torch.cat fails for the per_channel variant.
This actually works. I still wonder why this is not properly documented, and why I need to call an internal method to do something that seems pretty useful to me.
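For reference, roughly the round trip I ended up with (assuming channel axis 0 and torch.qint8):

```python
import torch

x = torch.randn(8, 64)
scales = (x.abs().amax(dim=1) / 127.0).clamp(min=1e-12)
zero_points = torch.zeros(8, dtype=torch.int64)
q_x = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)

# what gets stored: plain int8 data plus the per-channel params
int_repr = q_x.int_repr()

# rebuild a quantized tensor from the stored parts via the internal API
q_rebuilt = torch._make_per_channel_quantized_tensor(
    int_repr, scales.double(), zero_points.long(), axis=0
)
x_hat = torch.dequantize(q_rebuilt)
assert torch.allclose(x_hat, torch.dequantize(q_x))
```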
Just a heads up: quantization is not only useful for what your high-level tooling uses it for, namely quantizing the weights of a big model. We also need it at inference time, say to compress KV cache content, and that does need clean access to the low-level primitives.
I wanted to use bitsandbytes, but found out that it works on GPU only, and only on some GPUs, so I’ll implement the baseline with plain torch code. If you know a good way to get 4-bit quantization that always works (also on CPU), please let me know. I don’t want to implement it myself, though.