I am trying to understand the computation flow of the PyTorch MobileNet V2 int8 model, and I want to know how the bias, scale, and zero point are applied in a fused convolution layer. For example, in the state_dict dump below, this layer has four parameters: weight, bias, scale, and zero_point. The weight is quantized from FP32 to INT8 with its own scale (about 0.1069) and zero point, and the layer-level scale 0.0693 is supposed to convert the accumulated result from FP32 to INT8 for the next layer. But how is the bias applied? Is it added to the accumulated result after the multiplications? The bias values look quite small compared to the accumulated results.
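To make the question concrete, here is a minimal sketch of the arithmetic I assume is happening for a single output value; it is my guess, not a confirmed description of what the PyTorch backend does. The input scale `x_scale`, input zero point `x_zp`, and the input patch `x_q` are hypothetical values I made up (they come from the previous layer and are not in this layer's state_dict). The weight integers are the first 3x3 kernel from the dump divided by the weight scale, and the bias is the first bias entry.

```python
# Sketch of one output value of a quantized conv, under assumed input quantization.
W_SCALE = 0.10687889158725739   # weight scale from the state_dict
OUT_SCALE, OUT_ZP = 0.0693, 0   # layer scale / zero_point from the state_dict
BIAS_FP32 = -1.1895e-02         # first bias entry from the state_dict

# First 3x3 kernel as int8 values (dequantized value / W_SCALE, weight zero_point = 0).
w_q = [-1, -1, -1, -1, 0, 8, -1, -1, 1]

# Hypothetical input patch and input quantization parameters (NOT from the dump).
x_q = [10, 20, 30, 40, 50, 60, 70, 80, 90]
x_scale, x_zp = 0.02, 0


def requantize(real_value, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to the quint8 range, clamping at the ends."""
    q = round(real_value / scale) + zero_point
    return max(qmin, min(qmax, q))


# Integer multiply-accumulate; the accumulator is in units of x_scale * W_SCALE.
acc = sum((x - x_zp) * w for x, w in zip(x_q, w_q))
acc_scale = x_scale * W_SCALE

# Path A: dequantize the accumulator, add the FP32 bias, then requantize.
out_float_bias = requantize(acc * acc_scale + BIAS_FP32, OUT_SCALE, OUT_ZP)

# Path B: quantize the bias to the accumulator's scale and add it as an integer.
bias_q = round(BIAS_FP32 / acc_scale)
out_int_bias = requantize((acc + bias_q) * acc_scale, OUT_SCALE, OUT_ZP)

print(acc, bias_q, out_float_bias, out_int_bias)
```

If this guess is right, the bias only looks small next to the raw integer accumulator: the accumulator is expressed in units of `x_scale * W_SCALE`, so the two have to be brought to the same scale before they can be compared or added.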
('features.1.conv.0.0.weight',
tensor([[[[ -0.1069, -0.1069, -0.1069],
[ -0.1069, 0.0000, 0.8550],
[ -0.1069, -0.1069, 0.1069]]],
…
[[[ 0.9619, -0.4275, -0.7482],
[ 4.3820, -0.3206, -3.9545],
[ 0.9619, -0.2138, -0.5344]]]], size=(32, 1, 3, 3),
dtype=torch.qint8, quantization_scheme=torch.per_tensor_affine,
scale=0.10687889158725739, zero_point=0)),
('features.1.conv.0.0.bias',
tensor([-1.1895e-02, 8.7035e-01, -6.8617e-02, 3.8501e-01, 3.2915e-01,
…
8.4619e-01, -1.9708e-01], requires_grad=True)),
('features.1.conv.0.0.scale', tensor(0.0693)),
('features.1.conv.0.0.zero_point', tensor(0)),
('features.1.conv.1.weight',