Does a Post-Training Static Quantization (PTSQ) model perform integer-arithmetic-only inference?

I followed the PyTorch post-training static quantization flow on my model and finally converted it to quantized_model.
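In case it helps, this is roughly the flow I followed (a minimal sketch of the eager mode API; model and calibration_loader are placeholders for my own network and calibration data):

import torch

model.eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)   # insert observers
with torch.no_grad():
    for batch, _ in calibration_loader:           # collect activation statistics
        prepared(batch)
quantized_model = torch.ao.quantization.convert(prepared)  # swap in quantized modules

The resulting quantized_model's structure is shown below: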

  (quant): Quantize(scale=tensor([0.0256]), zero_point=tensor([17]), dtype=torch.quint8)
  (dequant): DeQuantize()
  (features): Sequential(
    (0): QuantizedConvReLU2d(1, 16, kernel_size=(3, 3), stride=(2, 2), scale=0.13089396059513092, zero_point=0, padding=(1, 1))
    (1): Identity()
    (2): Identity()
    (3): QuantizedConvReLU2d(16, 32, kernel_size=(3, 3), stride=(2, 2), scale=0.1375075727701187, zero_point=0, padding=(1, 1))
    (4): Identity()
    (5): Identity()
    (6): QuantizedConvReLU2d(32, 64, kernel_size=(3, 3), stride=(2, 2), scale=0.12878330051898956, zero_point=0, padding=(1, 1))
    (7): Identity()
    (8): Identity()
    (9): QuantizedConvReLU2d(64, 64, kernel_size=(2, 2), stride=(2, 2), scale=0.02376161329448223, zero_point=0)
    (10): Identity()
    (11): Identity()
  )
  (classifier): Sequential(
    (0): QuantizedDropout(p=0.2, inplace=False)
    (1): QuantizedLinearReLU(in_features=256, out_features=100, scale=0.291303813457489, zero_point=0, qscheme=torch.per_tensor_affine)
    (2): Identity()
    (3): QuantizedDropout(p=0.2, inplace=False)
    (4): QuantizedLinear(in_features=100, out_features=10, scale=5.715658187866211, zero_point=114, qscheme=torch.per_tensor_affine)
  )

Then I use quantized_model to perform inference simply by:

output = quantized_model(input)

Here are my issues:
1. I want to know whether the quantized_model can perform integer-arithmetic-only inference. I noticed a paper called Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, which introduces such an inference scheme. I've tried to step into my code to find out the PyTorch implementation of a quantized model's inference flow, but I eventually fall into a function that seems to be implemented in C++, so I can't see any of its details. The PyTorch quantization documentation explicitly indicates that:

# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)

Is there any method to see how the int8 calculations are actually implemented in the quantized_model's inference flow?
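For what it's worth, the closest I can get to the integers so far is inspecting the stored representation of the quantized tensors, e.g. for the model above:

w = quantized_model.features[0].weight()  # the quantized weight tensor
print(w.dtype)                            # torch.qint8
print(w.int_repr()[0, 0])                 # the raw integer values actually stored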
2. A quantized_model involves not only quantized weights, but also statically quantized activations for every layer, right? So how can I get the activations' scales and zero_points for every layer? Thanks a lot :blush:

Hi Rocket, you are using the eager mode quantization flow, a tool that quantizes the model and lowers it to a specific backend, in this case probably the fbgemm or qnnpack backend. The kernels are implemented in those backend libraries, but it might still be hard to follow the implementation; some of the real implementation might be written in assembly, I think.
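You can at least confirm which backend your build dispatches to (a quick check only, not a view into the kernel source):

import torch

print(torch.backends.quantized.engine)             # e.g. 'fbgemm' or 'qnnpack'
print(torch.backends.quantized.supported_engines)  # engines available in this build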

  2. Yes, static quantization quantizes both activations and weights. The scale/zero_point for a module's output are attributes of the quantized module (e.g. the quantized conv2d-relu module). For the model you printed, you can do something like model.features[0].scale and model.features[0].zero_point, I think.
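For example, something like this should print the output quantization parameters for every quantized module (a sketch, untested against your exact model):

print(quantized_model.features[0].scale)       # output scale of the first conv
print(quantized_model.features[0].zero_point)  # output zero_point of the first conv

# walk every module that carries output quantization parameters
for name, m in quantized_model.named_modules():
    if hasattr(m, "scale") and hasattr(m, "zero_point"):
        print(name, float(m.scale), int(m.zero_point))

The weights' own scales/zero_points live on the quantized weight tensor itself, e.g. quantized_model.features[0].weight().q_per_channel_scales() with the default fbgemm qconfig (which quantizes weights per channel).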