Question on quantize_per_channel and dequantize

I’d like to apply low-level quantization to a tensor x of shape (num_channels, blocksize). I know how to use MovingAveragePerChannelMinMaxObserver to obtain the quantization params.

I also understand how to then call torch.quantize_per_channel to get q_x. From the docs, the quant params seem to be attached to q_x somehow.
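
For reference, this is roughly what I am doing so far (a minimal sketch; the shapes and observer settings are just an example, and depending on the torch version the observer may live under torch.quantization.observer instead):

```python
import torch
from torch.ao.quantization.observer import MovingAveragePerChannelMinMaxObserver

# Per-channel observer over dim 0 (num_channels), qint8 affine quantization.
obs = MovingAveragePerChannelMinMaxObserver(ch_axis=0, dtype=torch.qint8)

x = torch.randn(4, 16)  # (num_channels, blocksize)
obs(x)                  # update running min/max statistics

scales, zero_points = obs.calculate_qparams()
q_x = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)

int_repr = q_x.int_repr()  # plain torch.int8 tensor, qparams are no longer attached
```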

But I’d now (for other reasons) like to store the quantization params separately from q_x.int_repr().

Even though this is not documented anywhere, I suspect that torch.dequantize would do the right thing when applied to q_x directly. But in my case I store q_x.int_repr() plus the quantization params separately. What is a good way to dequantize then, given only the internal representation and the parameters? How can I safely reassemble a “quantized tensor” (i.e., something like the output of quantize_per_channel) from its parts?
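
What I would like to avoid is hand-rolling the affine dequantization myself, i.e. something like this sketch (the helper name is mine, and I am assuming the usual per-channel formula x_hat = (q - zero_point) * scale):

```python
import torch

def dequantize_from_parts(int_repr, scales, zero_points, axis=0):
    """Reconstruct floats from a stored int8 representation and per-channel
    qparams, assuming x_hat = (q - zero_point) * scale along `axis`."""
    # Reshape params so they broadcast along the channel axis.
    shape = [1] * int_repr.dim()
    shape[axis] = -1
    scales = scales.to(torch.float32).reshape(shape)
    zero_points = zero_points.to(torch.float32).reshape(shape)
    return (int_repr.to(torch.float32) - zero_points) * scales

# Example: should agree with q_x.dequantize() for the tensor from above.
# x_hat = dequantize_from_parts(q_x.int_repr(), scales, zero_points, axis=0)
# assert torch.allclose(x_hat, q_x.dequantize())
```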

I found some issue pages mentioning torch._make_per_channel_quantized_tensor.

But I am not sure this is even meant to be called by users. Why is it so hidden? Especially given that tensors with dtype torch.qint8 do not implement many operations; for example, torch.cat fails for the per-channel variant.

This actually works. I still wonder why this is not properly documented, and why I need to call an internal method to do something that seems pretty useful to me.
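
For the record, this is roughly how I reassemble the quantized tensor; I am assuming the argument order (int_repr, scales, zero_points, axis) based on the issue pages, and that axis must match the one used at quantization time:

```python
import torch

# Reassemble a per-channel quantized tensor from its stored parts,
# then dequantize via the regular API.
q_x2 = torch._make_per_channel_quantized_tensor(int_repr, scales, zero_points, 0)
x_hat = q_x2.dequantize()
```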

Just a heads up: quantization is not only useful for what the high-level APIs here target, namely quantizing the weights of a big model. We also need it at inference time, say to compress KV cache content, and that does require clean access to the low-level primitives.

I wanted to use bitsandbytes, but found out it works on GPU only, and only on some GPUs, so I’ll implement the baseline in plain torch code. If you know of a good way to get 4-bit quantization that always works (also on CPU), please let me know. I’d rather not implement it myself, though.