Fused 4-bit quantization support

Lewuathe · March 31, 2021, 4:55am

Hi

I found a closed issue about supporting 4-bit quantization, although it’s not officially described in the docs.

However, model-compiler does not seem to support Int4 quantization:

model-compiler: for the -quantization-precision option: Cannot find option named 'Int4'!

We want to try this quantization mode. Is it available in the latest Glow? If so, how can we use it?

jfix · April 2, 2021, 4:13am

Hi, I believe -quantization-precision only supports Int8 and Int16 right now. 4-bit quantization is currently supported by only a few ops, such as EmbeddingBag and SparseLengthsSum. These ops often load from extremely large embedding tables, which 4-bit quantization can shrink significantly.

If you wanted to use it for other operators, we’d need to add 4-bit support to each of them.
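To give a rough sense of the savings on embedding tables, here is a back-of-the-envelope sketch. It assumes a fused row-wise layout in which each row stores two 4-bit values per byte plus 4 bytes of per-row quantization parameters (for example an fp16 scale and an fp16 offset). The function names, the table shape, and the 4-byte overhead are illustrative assumptions, not values taken from Glow.

```python
def table_bytes_fp32(rows: int, dim: int) -> int:
    """Size of an unquantized float32 embedding table."""
    return rows * dim * 4


def table_bytes_fused_4bit(rows: int, dim: int, qparam_bytes: int = 4) -> int:
    """Size of a row-wise 4-bit table with per-row parameters stored inline.

    qparam_bytes is an assumption: 2 bytes of scale + 2 bytes of offset.
    """
    packed = (dim + 1) // 2  # two 4-bit values per byte, rounded up
    return rows * (packed + qparam_bytes)


# Hypothetical table: 10 million rows, 64-dimensional embeddings.
rows, dim = 10_000_000, 64
fp32 = table_bytes_fp32(rows, dim)
int4 = table_bytes_fused_4bit(rows, dim)

print(f"fp32 : {fp32 / 2**30:.2f} GiB")
print(f"4-bit: {int4 / 2**30:.2f} GiB")
print(f"ratio: {fp32 / int4:.1f}x")
```

The ratio stays below the ideal 8x because every row carries its own parameters; the narrower the rows, the larger that overhead becomes.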

Lewuathe · April 6, 2021, 3:36am

Thank you for the information. I wanted to try 4-bit quantization if possible to compare the accuracy and performance of models generated by Glow.

When we want 4-bit quantization for operators like EmbeddingBag and SparseLengthsSum, how can we enable it? -quantization-precision does not support Int4 for now.

Is that automatically enabled?

jfix · April 7, 2021, 4:40pm

So right now I don’t think we have automatic Glow-based quantization support for this. We have Glow kernels that execute these ops on pre-quantized data, but the ops are only ever loaded already quantized from the input model, i.e. they are quantized in PyTorch, Caffe2, etc. before Glow ever loads them.

In order to support this we’d need to extend the Glow profiler to (1) support per-row profiling, and then (2) use that per-row profiling to do the 4-bit quantization.

(Note that our 4-bit quantization support for EmbeddingBag and SparseLengthsSum is row-wise quantized in both cases.)
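For readers unfamiliar with row-wise quantization, the two steps above can be sketched in plain Python: (1) record each row’s min/max, then (2) map the row onto 16 levels using that range. This is only an illustration of the idea, not Glow’s profiler or kernels. The function names are hypothetical, and the nibble order (even index in the low 4 bits) and the min/max-based scale are assumptions.

```python
from typing import List, Tuple


def quantize_row_4bit(row: List[float]) -> Tuple[bytes, float, float]:
    """Quantize one row to 4-bit codes; returns (packed, scale, offset)."""
    # Step 1: per-row "profile" (just the observed min and max).
    lo, hi = min(row), max(row)
    # Step 2: 4 bits give 16 levels (0..15); guard against constant rows.
    scale = (hi - lo) / 15.0
    if scale == 0.0:
        scale = 1.0
    codes = [min(15, max(0, round((x - lo) / scale))) for x in row]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes fill whole bytes
    # Pack two codes per byte: even index in the low nibble (assumed order).
    packed = bytes(codes[i] | (codes[i + 1] << 4)
                   for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_row_4bit(packed: bytes, scale: float, offset: float,
                        dim: int) -> List[float]:
    """Recover approximate float values for a row of length dim."""
    codes: List[int] = []
    for b in packed:
        codes.append(b & 0x0F)  # low nibble first
        codes.append(b >> 4)
    return [c * scale + offset for c in codes[:dim]]


def quantize_table_4bit(table: List[List[float]]):
    """Quantize every row independently, each with its own scale/offset."""
    return [quantize_row_4bit(row) for row in table]


# Tiny made-up table; each row gets its own range.
table = [[0.0, 1.0, 2.0, 3.0], [-1.0, 1.0, 0.0], [2.5, 2.5]]
for row, (packed, scale, offset) in zip(table, quantize_table_4bit(table)):
    restored = dequantize_row_4bit(packed, scale, offset, len(row))
    print(len(packed), "bytes ->", restored)
```

Because each row has its own scale and offset, a row with a small value range is not penalized by outliers in other rows, which is why a per-row profile is needed rather than a single per-tensor range.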

Lewuathe · April 16, 2021, 2:20am

Got it. Thank you for the tailored explanation!