Input(activation) quantization

Hi, all.
I have some questions about quantization.
When I quantize a model, I get the following results.
I know that the scale and zero_point in torch.per_channel_affine are what produce the int8 weights.
Then, what do fc1.scale and fc1.zero_point mean?
Also, I wonder which zero_point and scale are used to quantize the input (activation).

import torch

# QAT setup: attach the fbgemm QAT qconfig and insert observers
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model_q = torch.quantization.prepare_qat(model, inplace=False)
model_q.load_state_dict(torch.load('220307_quant_mnist.pt'))
model_q.eval().to('cpu')  # convert() expects the prepared model in eval mode
model_int8 = torch.quantization.convert(model_q, inplace=False)
torch.save(model_int8.state_dict(), './220307_quantized_mnist.pt')
test()  # user-defined evaluation function
params = torch.load('220307_quantized_mnist.pt', map_location="cpu")
print(params)
# raw int8 values of the packed fc1 weight
print(params['fc1._packed_params._packed_params'][0].int_repr())
OrderedDict([('fc1.scale', tensor(8.2687)), ('fc1.zero_point', tensor(44)), ('fc1._packed_params.dtype', torch.qint8), ('fc1._packed_params._packed_params', (tensor([[-0.5753, -0.6056, -0.6056,  ..., -0.5753, -0.6056, -0.6056],
        [ 0.0289,  0.0289,  0.0000,  ...,  0.0867,  0.0867,  0.0289],
        [ 0.2502,  0.0834,  0.0556,  ...,  0.0556,  0.1390,  0.0556],
        ...,
        [-0.0295, -0.0295,  0.0000,  ...,  0.0000,  0.0000, -0.0295],
        [-0.0677,  0.0000, -0.0677,  ..., -0.0338,  0.0000, -0.0677],
        [ 0.0000,  0.0000,  0.0625,  ...,  0.0937,  0.0937,  0.0000]],
       size=(512, 784), dtype=torch.qint8,
       quantization_scheme=torch.per_channel_affine,
       scale=tensor([0.0303, 0.0289, 0.0278, 0.0299, 0.0323, 0.0261, 0.0319, 0.0226, 0.0196,
        0.0299, 0.0369, 0.0324, 0.0264, 0.0330, 0.0228, 0.0257, 0.0376, 0.0149,
        0.0333, 0.0402, 0.0261, 0.0377, 0.0397, 0.0280, 0.0422, 0.0287, 0.0244,
        0.0251, 0.0341, 0.0225, 0.0246, 0.0277, 0.0177, 0.0381, 0.0314, 0.0341,
        0.0274, 0.0269, 0.0248, 0.0344, 0.0365, 0.0262, 0.0267, 0.0302, 0.0271,
        0.0299, 0.0246, 0.0228, 0.0263, 0.0267, 0.0329, 0.0165, 0.0367, 0.0304,

Most quantized ops for static quantization take as input:

  1. a qint8 activation
  2. a packedparams object (which is essentially the weight and bias)
  3. a scale
  4. a zero_point

The op then uses the activation and the packed params to calculate the output, which is quantized using the scale and zero_point to give a qint8 output. An example of this can be seen with quantized linear: pytorch/linear.py at 39470387cf17896cf38f3e04b5b81d4647734275 · pytorch/pytorch · GitHub
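
To make that concrete, here is a minimal sketch of the same flow using torch.nn.quantized.Linear directly. The scales, zero_points, and tensor shapes are illustrative, and per-tensor weight quantization is used instead of the per-channel scheme shown above:

import torch

# Quantize an fp32 activation (fbgemm quantized ops expect quint8 activations).
x_fp32 = torch.randn(1, 4)
x_q = torch.quantize_per_tensor(x_fp32, scale=0.05, zero_point=64, dtype=torch.quint8)

# Quantize the weight (qint8, zero_point 0) and keep the bias in fp32.
w_q = torch.quantize_per_tensor(torch.randn(2, 4), scale=0.03, zero_point=0, dtype=torch.qint8)
bias = torch.zeros(2)

# Build the quantized op: packed weight/bias plus *output* scale/zero_point.
qlin = torch.nn.quantized.Linear(4, 2)
qlin.set_weight_bias(w_q, bias)
qlin.scale = 0.1        # plays the role of fc1.scale
qlin.zero_point = 44    # plays the role of fc1.zero_point

y_q = qlin(x_q)          # output is re-quantized with (0.1, 44)
print(y_q.int_repr())    # raw integer representation
print(y_q.dequantize())  # fp32 view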

So fc1.scale and fc1.zero_point are just the output quantization parameters.
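
For instance, reusing model_int8 from the snippet above, you can read those output parameters (and the per-channel weight scales) directly off the converted module:

print(model_int8.fc1.scale, model_int8.fc1.zero_point)  # output quantization params
print(model_int8.fc1.weight().q_per_channel_scales())   # per-channel weight scales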

Oh, thanks for replying to my question.
I understand fc1.scale and fc1.zero_point now.
Then, how are the qint8 activations quantized?

Well, let's say we have a chain of quantized ops. The activation input to op 2 is already quantized because it's just the output of op 1, and the same holds for every op n and op n-1. This leaves op 1 with a non-quantized input but a quantized output. If you look at the various tutorials, you have to add a QuantStub() at the start and a DeQuantStub() at the end of your model; these handle the initial quantize and final dequantize steps for the activations. During the convert step they are converted to quantize_per_tensor/_per_channel ops, depending on your quantization scheme.
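
Here is a minimal sketch of that stub pattern (the fc1 shape mirrors the dump above; everything else is illustrative):

import torch
import torch.nn as nn

class QuantMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> quantized input
        self.fc1 = nn.Linear(784, 512)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(512, 10)
        self.dequant = torch.quantization.DeQuantStub()  # quantized -> fp32 output

    def forward(self, x):
        x = self.quant(x)            # becomes a quantize op after convert()
        x = self.relu(self.fc1(x))   # runs fully quantized
        x = self.fc2(x)
        return self.dequant(x)       # becomes a dequantize op after convert()

model = QuantMLP().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
model_q = torch.quantization.prepare_qat(model)
# ... QAT training loop goes here ...
model_int8 = torch.quantization.convert(model_q.eval())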

Oh, thanks to your clear explanation, I understand it perfectly now.