Is bias quantized while doing pytorch static quantization?

Why is bias not quantized upon pytorch static quantization? Or is it not required for deployment?

if you break down the quantized operations into the integer components used to speed up the computation, all the integer stuff happens before the bias is added in so it wouldn’t speed anything up.

Given the float values of bias vectors, can I convert them to torch.qint32 using this rule?

bias_int32 = torch.quantize_per_tensor(bias_float_vector, scale=S1*S2, zero_point=0, torch.qint32)

Is this what happens during inference?

At a high level it depends on the kernel/backend being used, you’d be better off asking the fbgemm or qnnpack folks. At a lower level though I believe the broad strokes are correct.

Here is a document that may be more helpful: gemmlowp/ at master · google/gemmlowp · GitHub

yeah this is true, we would quantize the bias with the scale of input and weight.
This is what we do, from our internal notes:

z = qconv(wq, xq)
# z is at scale (weight_scale*input_scale) and at int32
# Convert to int32 and perform 32 bit add
bias_q = round(bias/(input_scale*weight_scale))
z_int = z + bias_q
# rounding to 8 bits
z_out  = round[(z_int)*(input_scale*weight_scale)/output_scale) - z_zero_point]
z_out = saturate(z_out)

Not exactly sure if we add bias with floating point add or int32 add, but it’s one of them.
We’ll add documentations for this somewhere in the future.

1 Like

btw, if you want to do quantization differently, e.g. like passing in int32 bias, and evaluate the impact on accuracy, here is the design that support this: rfcs/ at master · pytorch/rfcs · GitHub, this will be more mature in beta release

1 Like

Thanks for your reply.

@HDCharles @jerryzh168

Can you tell me the equivalent equation for the residual_block skip addition operation?

  • bias_q = round(bias/(input_scale*weight_scale))

bias_q = torch.quantize_per_tensor(bias_float_vector, scale=input_scale*weight_scale, zero_point=0, dtype=torch.qint32)

Is this code-snippet true for the above statement? I wanted to verify with you.

yeah, and this happens in the quantized operator itself, we still just pass in float bias to the quantized operator

there’s also a clamp operation, but otherwise yes the integer values of the quantize_per_tensor output will match the output of the other equation.

you mean the formula for quantized add? I think you can derive through the definition of quantization function and add:

out_int8 = out_fp32 / out_scale + out_zero_point
= (a_fp32 + b_fp32) / out_scale + out_zero_point
= ((a_int8 - a_zero_point) * a_scale + (b_int8 - b_zero_point) * b_scale)/out_scale + out_zero_point
= …

1 Like