Extract quantized parameters for use in custom compute unit

Hello,

I could not find the following key points in the related documentation (needed for achieving what the topic title says):

  1. How to extract the quantized parameters from a quantized model?

  2. How can these parameters be used for integer-only computation?

  3. What transformations are required to keep the arithmetic integer-only when moving from one layer to the next?
    E.g. when two consecutive layers have different quantization parameters, the output of the first layer must be adjusted before it can be used by the next layer - how is this adjustment done in PyTorch and how is it implemented?

Please provide a simple example of two consecutive quantized layers along with the required arithmetic for integer-only computation.

The goal is to implement this arithmetic on a custom computing machine, so that models trained and quantized with PyTorch can be executed on it.

Thanks in advance,
Panagiotis

  1. You can get the parameters from model.state_dict() - the integer weight tensors, scales and zero_points are all there (see the sketch right after this list).
  2. They aren't. The quantized ops use a combination of float and integer arithmetic (the scale quantization parameter is a float after all). A good description of the math can be found here: Quantization for Neural Networks - Lei Mao's Log Book
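
A minimal eager-mode sketch of where those numbers end up after post-training static quantization (the model, layer names and sizes here are purely illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.quantized as nnq

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 2)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model, inplace=True)
model(torch.randn(32, 4))                       # calibration pass with dummy data
torch.quantization.convert(model, inplace=True)

# Each converted module carries integer weights plus float scales and integer
# zero_points; the same values also appear in model.state_dict() under keys
# like 'fc1.scale', 'fc1.zero_point' and 'fc1._packed_params._packed_params'.
for name, module in model.named_modules():
    if isinstance(module, nnq.Linear):
        qw = module.weight()                    # quantized weight tensor
        print(name)
        print("  int8 weights :", qw.int_repr().flatten()[:5])
        print("  weight scale :", qw.q_per_channel_scales()
              if qw.qscheme() in (torch.per_channel_affine,
                                  torch.per_channel_symmetric)
              else qw.q_scale())
        print("  output scale :", module.scale)        # float
        print("  output zero_point:", module.zero_point)  # int
```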
  3. What transformations are required to keep the arithmetic integer-only when moving from one layer to the next?
    E.g. when two consecutive layers have different quantization parameters, the output of the first layer must be adjusted before it can be used by the next layer - how is this adjustment done in PyTorch and how is it implemented?

Please provide a simple example of two consecutive quantized layers along with the required arithmetic for integer-only computation.

I believe both of these questions will be answered by reading the link above, which may clarify some misconceptions about the process. Beyond that, the specifics of the quantized op implementations depend on the backend in question. FBGEMM (Facebook GEMM; GitHub - pytorch/FBGEMM, https://code.fb.com/ml-applications/fbgemm/) and QNNPACK (GitHub - pytorch/QNNPACK, a mobile-optimized implementation of quantized neural network operators) handle this differently, and you may be able to find better answers by digging into the specific backend you wish to emulate.
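
For concreteness, here is a rough NumPy sketch of that arithmetic for two consecutive quantized linear layers, following the math in the linked article rather than any particular backend. Everything here is illustrative: the sizes, the symmetric-int8 choice (PyTorch actually uses asymmetric uint8 activations and per-channel weights), and the float multiplier M, which FBGEMM/QNNPACK replace with a fixed-point multiply and shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, scale, zp, qmin=-128, qmax=127):
    """Affine-quantize a float array to int8-range values (stored as int32)."""
    return np.clip(np.round(x / scale) + zp, qmin, qmax).astype(np.int32)

def quant_linear(x_q, s_x, zp_x, w_q, s_w, b_fp, s_y, zp_y):
    """One quantized linear layer: int32 accumulation + requantization."""
    acc = (x_q - zp_x) @ w_q.T                              # integer-only accumulation
    acc += np.round(b_fp / (s_x * s_w)).astype(np.int32)    # bias quantized with scale s_x*s_w
    # Requantization: map the int32 accumulator into the output's scale.
    # M is a float here; an integer-only unit would approximate it offline as a
    # fixed-point multiplier M0 and a right shift n, with M ~= M0 * 2**-n.
    M = (s_x * s_w) / s_y
    return np.clip(np.round(acc * M) + zp_y, -128, 127).astype(np.int32)

# Float reference: two linear layers, 4 -> 8 -> 2 (sizes are arbitrary).
x = rng.standard_normal((1, 4))
w1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
w2, b2 = rng.standard_normal((2, 8)), rng.standard_normal(2)
y1_ref = x @ w1.T + b1
y2_ref = y1_ref @ w2.T + b2

# Symmetric int8 everywhere (zero_point = 0) to keep the sketch short.
s_x = np.abs(x).max() / 127.0
s_w1, s_w2 = np.abs(w1).max() / 127.0, np.abs(w2).max() / 127.0
s_y1, s_y2 = np.abs(y1_ref).max() / 127.0, np.abs(y2_ref).max() / 127.0

x_q = quantize(x, s_x, 0)
w1_q = quantize(w1, s_w1, 0)
w2_q = quantize(w2, s_w2, 0)

# Layer 1 emits integers already expressed in (s_y1, zp=0), which is exactly
# the input format layer 2 expects: the layer-to-layer "adjustment" is the
# requantization multiplier M applied inside each layer.
y1_q = quant_linear(x_q,  s_x,  0, w1_q, s_w1, b1, s_y1, 0)
y2_q = quant_linear(y1_q, s_y1, 0, w2_q, s_w2, b2, s_y2, 0)

print("dequantized integer result:", y2_q * s_y2)
print("float reference           :", y2_ref)
```

An integer-only compute unit would precompute M = (s_x * s_w) / s_y for each layer offline and apply it as an int32 multiply plus shift at runtime, which is essentially the transformation asked about in question 3.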

Thank you very much for the very informative answer!