Why is Quantization on the GPU actually not supported?

CPU quantization works really well, and the basic quantization algorithms seem mature and, at the conceptual level, not tied to any particular device. I understand that very large models present new challenges for quantization (outlier features), and I am thinking exclusively of PTQ here.
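To illustrate the "device-agnostic on the conceptual level" point: the core of basic PTQ is just per-tensor affine quantization, which is plain arithmetic with no device dependence. A minimal NumPy sketch (the helper names `quantize`/`dequantize` are illustrative, not any library's API):

```python
import numpy as np

def quantize(x, num_bits=8):
    # Per-tensor affine (asymmetric) quantization: map the observed
    # float range [min, max] onto the signed integer grid.
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Reconstruct the float approximation from the integer codes.
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale, zp = quantize(x)
err = np.abs(dequantize(q, scale, zp) - x).max()
```

Nothing in this math cares whether the tensor lives on a CPU or a GPU; the device-specific part is executing the quantized matmuls/convolutions fast, which is where the kernel story below comes in.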

So, out of genuine curiosity: What makes GPU quantization different from CPU quantization? Why is it difficult to implement?

We haven’t had a major use case for int8 quantization on GPU, since the speedup from fp16 is sufficient for most models at inference. Moreover, fast int8 inference depends on a third-party backend such as TensorRT, or on custom CUDA/cuDNN int8 kernels from NVIDIA.
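Part of why dedicated kernels matter can be shown without a GPU at all: an int8 GEMM cannot accumulate in int8, because a dot product of many int8×int8 terms overflows 8 (and even 16) bits, so the kernel must widen to int32 accumulation. A small NumPy sketch of that requirement (illustrative only, not how any backend is implemented):

```python
import numpy as np

np.random.seed(0)

# int8 operands, as a quantized linear layer would have.
a = np.random.randint(-128, 128, size=(4, 64), dtype=np.int8)
b = np.random.randint(-128, 128, size=(64, 4), dtype=np.int8)

# What a real int8 kernel does: multiply int8 inputs but
# accumulate the products in int32.
acc = a.astype(np.int32) @ b.astype(np.int32)

# Naive int8-everywhere arithmetic silently wraps around modulo 2^8,
# producing garbage results for any nontrivial reduction length.
naive = a @ b
```

Hardware int8 paths (e.g. NVIDIA tensor-core IMMA instructions) bake this widened accumulation in, which is exactly the kind of thing a generic fp32/fp16 kernel does not give you for free.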

However, we are starting to look into enabling int8 quantization on GPU using Triton, in the context of speeding up transformer-based models. cc @cdhernandez