I could not find the following key points in the related documentation (for achieving what the topic title says):
How to extract the quantized parameters from a quantized model?
How these parameters can be used for integer only computation?
What are the transformations required for keeping integer arithmetic when transitioning from one layer to the next one?
E.g. when two consecutive layers have different quantization parameters, the results of the first layer must be adjusted before they can be used in the next layer - what adjustment is done in PyTorch, and how is it implemented?
Please, provide a simple example of two consecutive quantized layers along with the required arithmetic for integer-only computation.
The goal is to implement this arithmetic on a custom computing machine, in order to execute on it models trained and quantized using PyTorch.
You can get the parameters from model.state_dict().
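A minimal sketch of what that looks like, assuming a recent PyTorch with eager-mode static quantization and the fbgemm backend (the toy module and names here are illustrative, not from the thread):

```python
# Hedged sketch: statically quantize a tiny model, then read the
# quantization parameters back out of state_dict().
# Assumes eager-mode quantization (torch.ao.quantization) and fbgemm.
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(4, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

m = M().eval()
m.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
torch.ao.quantization.prepare(m, inplace=True)
m(torch.randn(8, 4))                        # one calibration pass
torch.ao.quantization.convert(m, inplace=True)

sd = m.state_dict()
print(list(sd.keys()))                      # includes fc.scale, fc.zero_point, ...
print(sd["fc.scale"], sd["fc.zero_point"])  # output scale / zero_point of fc
print(m.fc.weight().int_repr())             # raw int8 weight values
```

Note that the activation scale/zero_point show up as `fc.scale` / `fc.zero_point`, while the quantized weight (and its own scales) is stored inside `fc._packed_params` and is easiest to reach via `m.fc.weight()`.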
They aren’t. The quantized ops use a combination of float and integer arithmetic (the scale quantization parameter is a float, after all). A good description of the math can be found here: Quantization for Neural Networks - Lei Mao's Log Book
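To make that concrete, here is a small, PyTorch-free sketch of the adjustment ("requantization") between two consecutive layers with different quantization parameters. All names and values are illustrative; the fixed-point trick at the end (approximating the float multiplier as M0 * 2^-n) is the standard way integer-only backends avoid float arithmetic, not PyTorch's literal implementation:

```python
# Affine quantization helpers (per-tensor, int8 range).
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Layer 1 emits int8 values with params (s1, z1); layer 2 expects (s2, z2).
# The adjustment maps q1 -> q2:   q2 = z2 + (s1 / s2) * (q1 - z1)
s1, z1 = 0.05, 3
s2, z2 = 0.08, -5
q1 = 100                            # some int8 output of layer 1

# Float reference: dequantize, then requantize.
q2_float = quantize(dequantize(q1, s1, z1), s2, z2)

# Integer-only version: the float ratio s1/s2 is folded into a
# fixed-point multiplier M0 (int32) and a right shift n.
M = s1 / s2
n = 31
M0 = round(M * (1 << n))            # fixed-point approximation of M
q2_int = z2 + (((q1 - z1) * M0 + (1 << (n - 1))) >> n)  # rounding shift

print(q2_float, q2_int)             # both give 56 here
```

For a full quantized linear/conv layer the same idea applies to the accumulator: the int32 sum of products (with the bias quantized at scale s_x * s_w) is rescaled by the multiplier M = s_x * s_w / s_out, which is the only float involved, and which a custom machine can replace by the M0/shift pair as above.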