I followed the PyTorch Post-Training Static Quantization flow on my model and converted it to quantized_model. The structure of quantized_model is shown below:
Net(
  (quant): Quantize(scale=tensor([0.0256]), zero_point=tensor([17]), dtype=torch.quint8)
  (dequant): DeQuantize()
  (features): Sequential(
    (0): QuantizedConvReLU2d(1, 16, kernel_size=(3, 3), stride=(2, 2), scale=0.13089396059513092, zero_point=0, padding=(1, 1))
    (1): Identity()
    (2): Identity()
    (3): QuantizedConvReLU2d(16, 32, kernel_size=(3, 3), stride=(2, 2), scale=0.1375075727701187, zero_point=0, padding=(1, 1))
    (4): Identity()
    (5): Identity()
    (6): QuantizedConvReLU2d(32, 64, kernel_size=(3, 3), stride=(2, 2), scale=0.12878330051898956, zero_point=0, padding=(1, 1))
    (7): Identity()
    (8): Identity()
    (9): QuantizedConvReLU2d(64, 64, kernel_size=(2, 2), stride=(2, 2), scale=0.02376161329448223, zero_point=0)
    (10): Identity()
    (11): Identity()
  )
  (classifier): Sequential(
    (0): QuantizedDropout(p=0.2, inplace=False)
    (1): QuantizedLinearReLU(in_features=256, out_features=100, scale=0.291303813457489, zero_point=0, qscheme=torch.per_tensor_affine)
    (2): Identity()
    (3): QuantizedDropout(p=0.2, inplace=False)
    (4): QuantizedLinear(in_features=100, out_features=10, scale=5.715658187866211, zero_point=114, qscheme=torch.per_tensor_affine)
  )
)
Then I run inference with quantized_model simply by calling:
output = quantized_model(input)
Here are my questions:
1. Can quantized_model perform integer-arithmetic-only inference? I came across the paper
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
which describes an integer-arithmetic-only inference scheme. I tried stepping through my code to find the PyTorch implementation of a quantized model's inference flow, but I eventually end up in the following function:
torch.ops.quantized.conv2d_relu
It seems to be implemented in C++, so I can't see any details of this function. The PyTorch quantization documentation explicitly states that:
# run the model, relevant calculations will happen in int8
res = model_int8(input_fp32)
Is there any way to see how the int8 calculations are actually implemented in quantized_model's inference flow?
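To make clear what I mean by integer-arithmetic-only, here is a minimal pure-Python sketch of the requantization scheme as I understand it from the paper: the int32 accumulator is rescaled with an integer multiplier and a bit shift instead of a float multiply. This is only an illustration of the paper's idea, not a claim about what torch.ops.quantized.conv2d_relu does internally. The helper names, the weight scale 0.01, and the sample values are hypothetical; the input and output scales and zero points are taken from the printout above, and bias is omitted.

```python
# Sketch of integer-only requantization in the style of the paper.
# Assumption: illustrative only; this is not PyTorch's backend kernel.

def quantize_multiplier(real_multiplier):
    """Approximate a real multiplier in (0, 1) as m0 / 2**total_shift,
    where m0 is an integer in [2**30, 2**31)."""
    assert 0.0 < real_multiplier < 1.0
    shift = 0
    while real_multiplier < 0.5:
        real_multiplier *= 2.0
        shift += 1
    m0 = round(real_multiplier * (1 << 31))
    if m0 == (1 << 31):  # rounding reached 1.0, so renormalize
        m0 >>= 1
        shift -= 1
    return m0, 31 + shift


def int_dot_requant(q_x, z_x, q_w, z_w, m0, total_shift, z_out, relu=False):
    """Integer dot product followed by fixed-point rescaling to uint8."""
    # Integer accumulator (int32 in a real kernel): sum of (q_x - z_x) * (q_w - z_w).
    acc = sum((a - z_x) * (b - z_w) for a, b in zip(q_x, q_w))
    # Multiply by m0 and shift right with round-half-up; no floats involved.
    scaled = (acc * m0 + (1 << (total_shift - 1))) >> total_shift
    q = scaled + z_out
    lower = z_out if relu else 0  # a fused ReLU clamps at the output zero point
    return max(lower, min(255, q))


# Input/output quantization parameters from the printout above.
s_x, z_x = 0.0256, 17
s_out, z_out = 0.13089396059513092, 0
# Hypothetical weight quantization parameters and sample values.
s_w, z_w = 0.01, 0
q_x = [20, 100, 255, 17]
q_w = [50, -30, 127, 5]

# The only float work happens once, offline, when converting the multiplier.
m0, total_shift = quantize_multiplier(s_x * s_w / s_out)
q_out = int_dot_requant(q_x, z_x, q_w, z_w, m0, total_shift, z_out, relu=True)
```

What I would like to know is whether the quantized kernels follow this kind of scheme or use floating point for the rescaling step.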
2. A quantized model involves not only quantized weights but also statically quantized activations for every layer, right? If so, how can I get the scale and zero_point of each layer's activations? Thanks a lot.
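One approach I am considering, assuming the scale and zero_point shown in the printout are exposed as attributes on each quantized module, is to walk the model and collect them. The function name collect_output_qparams is my own, and I am not sure this is the intended way to read the activation parameters.

```python
def collect_output_qparams(model):
    """Return {module_name: (scale, zero_point)} for every submodule that
    exposes both a `scale` and a `zero_point` attribute.

    Assumption: for quantized modules these attributes describe the
    quantization of the module's output activation.
    """
    qparams = {}
    for name, module in model.named_modules():
        if hasattr(module, "scale") and hasattr(module, "zero_point"):
            # float()/int() also accept one-element tensors such as tensor([0.0256]).
            qparams[name] = (float(module.scale), int(module.zero_point))
    return qparams
```

I would then call collect_output_qparams(quantized_model) and print each entry. Modules without these attributes, such as Identity, would simply be skipped.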