Variable-bit (sub-8-bit) quantization for custom hardware deployment with power-of-two (PoT) scales

Hey everyone! I am looking for a way to perform Quantization-Aware Training (QAT) using PyTorch.

My use case concerns deploying trained PyTorch models on custom hardware (silicon), so I have a few requirements:

  • Needs to support nn.Conv1d (as this is part of the network that I want to deploy)
  • Needs to support some form of batch-norm folding
  • Needs to have power-of-two scales (as this avoids integer divisions in hardware)
  • Preferably does not require me to redefine my model with quantized modules
  • Supports up-to-date Python and PyTorch versions

In my search, I checked out every quantization-aware-training framework I could find and made a list of them (see below); parentheses () give the last commit date, and square brackets [] indicate whether I tried it.

However, none of these options actually works or has all the features I need. Does anyone have suggestions on what I could use or do?


Hi @d0uwe,

We have a tutorial for QAT here that you might find helpful.

Thanks for the quick response! I have read this tutorial, but it is unclear to me how I can quantize the weights to fewer than 8 bits (for example, 5).

As far as I know, PyTorch 2.0 does not natively support weights quantized to fewer than 8 bits. But you can emulate it numerically with a customized observer.

For example, if you want to quantize weight to int4, you can try the following setting:

import torch
from torch.ao.quantization.observer import MinMaxObserver

# Emulate int4 by restricting the signed int8 range to [-8, 7]
custom_observer = MinMaxObserver(dtype=torch.qint8, quant_min=-8, quant_max=7)
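
To actually train with this, you can wrap the observer in a FakeQuantize and plug it into a QConfig. A minimal sketch for the eager-mode QAT workflow (the int4_weight_fake_quant name is just mine):

import torch
from torch.ao.quantization import QConfig, FakeQuantize, default_fake_quant
from torch.ao.quantization.observer import MinMaxObserver

# Weights are fake-quantized to the emulated int4 range; activations
# keep the default 8-bit fake quantization.
int4_weight_fake_quant = FakeQuantize.with_args(
    observer=MinMaxObserver,
    quant_min=-8,
    quant_max=7,
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
)
qconfig = QConfig(activation=default_fake_quant, weight=int4_weight_fake_quant)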

Thanks, that looks good! How would I then implement, for example, that the scales of the quantizer are always a power of two? I.e., a scale factor of 1.9 is rounded to 2 (2^1), a scale factor of 59 is rounded to 64 (2^6), etc.

You could create your own observer class and override this function (pytorch/observer.py at 5cc2e4d7c939852f6de6f8497dc89d311e333dce · pytorch/pytorch · GitHub) to calculate the scales with a power-of-two restriction.
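
For example, a minimal sketch (PowerOfTwoMinMaxObserver is a name I made up, not a PyTorch API) that subclasses MinMaxObserver and rounds the computed scale to the nearest power of two in log space:

import torch
from torch.ao.quantization.observer import MinMaxObserver

class PowerOfTwoMinMaxObserver(MinMaxObserver):
    # Tracks min/max as usual, but snaps the resulting scale to the
    # closest power of two before returning it.
    @torch.jit.export
    def calculate_qparams(self):
        scale, zero_point = super().calculate_qparams()
        scale = torch.pow(2.0, torch.round(torch.log2(scale)))
        return scale, zero_point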

So if I understand correctly, the scale and zero-point are currently not learned but calculated, right?

For creating my own observer class, from which class would I inherit? From UniformQuantizationObserverBase (or any other observer) or from ObserverBase?

Then, for a learned scale and zero-point, is the approach below correct?

import torch
import torch.nn as nn

def quantize_to_closest_power_of_two(x: torch.Tensor):
    # Snap each element to the nearest power of two in log space,
    # e.g. 1.9 -> 2.0 and 59.0 -> 64.0.
    return torch.pow(2.0, torch.round(torch.log2(x)))

class MyObserverClass(...):  # which base class should this be?
    def __init__(self, ...):
        ...

        self.scale = nn.Parameter(torch.ones(required_shape))
        self.zero_point = nn.Parameter(torch.zeros(required_shape))  # zero-point starts at 0

    @torch.jit.export
    def calculate_qparams(self):
        return quantize_to_closest_power_of_two(self.scale), self.zero_point
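
As a quick sanity check of the rounding helper, matching the examples above:

>>> quantize_to_closest_power_of_two(torch.tensor([1.9, 59.0]))
tensor([ 2., 64.])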

Then a related question: what if I also want the weights themselves to be powers of two, in addition to the scales? Do I create a new class with FakeQuantize as a base and then, in the forward pass, clamp the weights to the closest powers of two?

Hi @d0uwe @Vasiliy_Kuznetsov - did you ever figure this out? Just running into the same issue at the moment.

I think you can create a custom Observer/FakeQuantize class that makes sure the weights are snapped to powers of two.
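
A minimal sketch of that idea (the class name is mine; it assumes the standard FakeQuantize flow plus a straight-through estimator for the extra rounding step):

import torch
from torch.ao.quantization import FakeQuantize

class PowerOfTwoWeightFakeQuantize(FakeQuantize):
    # Runs the usual fake quantization, then snaps the result to signed
    # powers of two; gradients pass straight through the snapping.
    def forward(self, x):
        x = super().forward(x)
        sign = torch.sign(x)
        magnitude = torch.clamp(x.abs(), min=1e-8)  # avoid log2(0)
        pot = sign * torch.pow(2.0, torch.round(torch.log2(magnitude)))
        return x + (pot - x).detach()  # straight-through estimator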

Hi Tobias, no, I didn’t figure out how to properly use the PyTorch APIs for this. I finally decided to use Brevitas because it offers a lot of flexibility.

However, since PoT weight quantization was not supported, I built it myself here as part of a mini-library that makes Brevitas 10x easier to use.