Does a Post-Training Static Quantization (PTSQ) model perform integer-arithmetic-only inference?

I followed the PyTorch post-training static quantization flow on my model and finally converted it to quantized_model.
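In case it helps, this is roughly the flow I followed (a minimal sketch of the eager mode API; model and calibration_loader are placeholders for my own network and calibration data):

import torch

model.eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)   # insert observers
with torch.no_grad():
    for batch, _ in calibration_loader:           # collect activation statistics
        prepared(batch)
quantized_model = torch.ao.quantization.convert(prepared)  # swap in quantized modules

The resulting quantized_model's structure is shown below: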

  (quant): Quantize(scale=tensor([0.0256]), zero_point=tensor([17]), dtype=torch.quint8)
  (dequant): DeQuantize()
  (features): Sequential(
    (0): QuantizedConvReLU2d(1, 16, kernel_size=(3, 3), stride=(2, 2), scale=0.13089396059513092, zero_point=0, padding=(1, 1))
    (1): Identity()
    (2): Identity()
    (3): QuantizedConvReLU2d(16, 32, kernel_size=(3, 3), stride=(2, 2), scale=0.1375075727701187, zero_point=0, padding=(1, 1))
    (4): Identity()
    (5): Identity()
    (6): QuantizedConvReLU2d(32, 64, kernel_size=(3, 3), stride=(2, 2), scale=0.12878330051898956, zero_point=0, padding=(1, 1))
    (7): Identity()
    (8): Identity()
    (9): QuantizedConvReLU2d(64, 64, kernel_size=(2, 2), stride=(2, 2), scale=0.02376161329448223, zero_point=0)
    (10): Identity()
    (11): Identity()
  )
  (classifier): Sequential(
    (0): QuantizedDropout(p=0.2, inplace=False)
    (1): QuantizedLinearReLU(in_features=256, out_features=100, scale=0.291303813457489, zero_point=0, qscheme=torch.per_tensor_affine)
    (2): Identity()
    (3): QuantizedDropout(p=0.2, inplace=False)
    (4): QuantizedLinear(in_features=100, out_features=10, scale=5.715658187866211, zero_point=114, qscheme=torch.per_tensor_affine)
  )

Then I use quantized_model to perform inference simply by:

output = quantized_model(input)

Here are my issues:
1. I want to know whether the quantized_model can perform integer-arithmetic-only inference. I noticed a paper called Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, which introduces such an inference scheme. I've tried to step into my code to find out the PyTorch implementation of a quantized model's inference flow, but I eventually fall into a function that seems to be implemented in C++, so I can't see any of its details. The PyTorch quantization documentation explicitly indicates that:

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

Is there any method to see how the int8 calculations are actually implemented in the quantized_model's inference flow?
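For what it's worth, the closest I can get to the integers so far is inspecting the stored representation of the quantized tensors, e.g. for the model above:

w = quantized_model.features[0].weight()  # the quantized weight tensor
print(w.dtype)                            # torch.qint8
print(w.int_repr()[0, 0])                 # the raw integer values actually stored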
2. A quantized_model involves not only quantized weights, but also statically quantized activations for every layer, right? So how can I get the activations' scales and zero_points for every layer? Thanks a lot :blush:

Hi Rocket, you are using the eager mode quantization flow, a tool that quantizes the model and lowers it to a specific backend, in this case probably the fbgemm or qnnpack backend. The kernels are implemented in those backend libraries, but it might still be hard to follow the implementation; some of the real implementation might be written in assembly, I think.
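You can at least confirm which backend your build dispatches to (a quick check only, not a view into the kernel source):

import torch

print(torch.backends.quantized.engine)             # e.g. 'fbgemm' or 'qnnpack'
print(torch.backends.quantized.supported_engines)  # engines available in this build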

  2. Yes, static quantization quantizes both activations and weights. The scale/zero_point for a module's output are attributes of the quantized module (e.g. the quantized conv2d-relu module). For the model you printed, you can do something like model.features[0].scale and model.features[0].zero_point, I think.
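For example, something like this should print the output quantization parameters for every quantized module (a sketch, untested against your exact model):

print(quantized_model.features[0].scale)       # output scale of the first conv
print(quantized_model.features[0].zero_point)  # output zero_point of the first conv

# walk every module that carries output quantization parameters
for name, m in quantized_model.named_modules():
    if hasattr(m, "scale") and hasattr(m, "zero_point"):
        print(name, float(m.scale), int(m.zero_point))

The weights' own scales/zero_points live on the quantized weight tensor itself, e.g. quantized_model.features[0].weight().q_per_channel_scales() with the default fbgemm qconfig (which quantizes weights per channel).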