I’d like to do low-level quantization of a tensor x of shape (num_channels, blocksize). I know how to use MovingAveragePerChannelMinMaxObserver to obtain the quantization params, and I also understand how to then use torch.quantize_per_channel to get q_x. From the docs, the quant params seem to somehow be attached to q_x.
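For concreteness, roughly what I do so far (the shape, dtype and symmetric qscheme here are just an example setup):

```python
import torch
from torch.ao.quantization.observer import MovingAveragePerChannelMinMaxObserver

num_channels, blocksize = 8, 64           # example shape
x = torch.randn(num_channels, blocksize)

# Observer tracks per-channel min/max and derives quantization params
obs = MovingAveragePerChannelMinMaxObserver(
    ch_axis=0, dtype=torch.qint8, qscheme=torch.per_channel_symmetric
)
obs(x)                                    # update running statistics
scales, zero_points = obs.calculate_qparams()

# Quantize per channel along axis 0; the params end up attached to q_x
q_x = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)
print(q_x.q_per_channel_scales(), q_x.q_per_channel_zero_points())
```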
But I’d now (for other reasons) like to store the quantization params separately from q_x.int_repr(). Even though this is not documented anywhere, I suspect that torch.dequantize would do the right thing when applied to q_x. But in my case, I store q_x.int_repr() plus the quantization params separately. What is a good way to dequantize then, given that I have the internal representation and the parameters? How can I safely get a “quantized tensor” (i.e., like the output of quantize_per_channel) from its parts?
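To make it concrete, by “dequantizing from the parts” I just mean the per-channel affine formula; a minimal sketch (the helper name is mine, not a torch API):

```python
import torch

def dequantize_from_parts(int_repr, scales, zero_points, axis=0):
    # int_repr: integer tensor from q_x.int_repr()
    # scales / zero_points: per-channel params, one entry per slice along `axis`
    shape = [1] * int_repr.dim()
    shape[axis] = -1
    s = scales.to(torch.float32).reshape(shape)
    z = zero_points.to(torch.float32).reshape(shape)
    # affine dequantization: x_hat = (q - zero_point) * scale
    return (int_repr.to(torch.float32) - z) * s

# should match torch.dequantize(q_x):
# dequantize_from_parts(q_x.int_repr(), q_x.q_per_channel_scales(),
#                       q_x.q_per_channel_zero_points(), q_x.q_per_channel_axis())
```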
I found some issue pages mentioning torch._make_per_channel_quantized_tensor, but I am not sure this is even meant to be called. Why is this so hidden? Especially given that tensors with torch.qint8 do not really implement many operations; for example, torch.cat fails for the per_channel variant.
This actually works. I still wonder why this is not properly documented, and why I need to call an internal method to do something that seems pretty useful to me.
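For reference, roughly the round trip I ended up with (assuming channel axis 0 and torch.qint8):

```python
import torch

x = torch.randn(8, 64)
scales = (x.abs().amax(dim=1) / 127.0).clamp(min=1e-12)
zero_points = torch.zeros(8, dtype=torch.int64)
q_x = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)

# what gets stored: plain int8 data plus the per-channel params
int_repr = q_x.int_repr()

# rebuild a quantized tensor from the stored parts via the internal API
q_rebuilt = torch._make_per_channel_quantized_tensor(
    int_repr, scales.double(), zero_points.long(), axis=0
)
x_hat = torch.dequantize(q_rebuilt)
assert torch.allclose(x_hat, torch.dequantize(q_x))
```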
Just a heads up: quantization is not only useful for what your high-level tooling uses it for, namely quantizing the weights of a big model. We also need it at inference time, say to compress KV cache content, and that does need clean access to the low-level primitives.
I wanted to use bitsandbytes, but found out that it works on GPU only, and only on some GPUs, so I’ll implement the baseline with plain torch code. If you know a good way to get 4-bit quantization that always works (also on CPU), please let me know. I don’t want to implement it myself, though.