Yeah, this is true: we quantize the bias with the product of the input scale and the weight scale.
This is what we do, from our internal notes:
z = qconv(wq, xq)
# z is int32, at scale (input_scale * weight_scale)
# quantize the bias to int32 at that same scale, then do a 32-bit add
bias_q = round(bias / (input_scale * weight_scale))
z_int = z + bias_q
# requantize down to 8 bits
z_out = round(z_int * (input_scale * weight_scale) / output_scale) + z_zero_point
z_out = saturate(z_out)
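To make that concrete, here is a minimal runnable NumPy sketch of the int32-add path. It's just an illustration, not the actual internals: I'm using a plain matmul in place of qconv, assuming per-tensor quantization with zero-valued input/weight zero points (so no zero-point correction terms show up in the accumulation), and the helper names (quantize, dequantize, the np.clip saturation) are mine.

import numpy as np

def quantize(x, scale, zero_point, dtype=np.int8):
    # affine quantization: q = round(x / scale) + zero_point, then saturate
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# float reference data (a matmul stands in for the conv)
x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 3).astype(np.float32)
bias = np.random.randn(3).astype(np.float32)

# made-up example quantization parameters
input_scale, weight_scale, output_scale = 0.05, 0.03, 0.1
z_zero_point = 0

xq = quantize(x, input_scale, 0)
wq = quantize(w, weight_scale, 0)

# int32 accumulation; the accumulator scale is input_scale * weight_scale
z = xq.astype(np.int32) @ wq.astype(np.int32)

# quantize the bias to int32 at the accumulator scale, add in int32
bias_q = np.round(bias / (input_scale * weight_scale)).astype(np.int32)
z_int = z + bias_q

# requantize down to 8 bits and saturate
z_out = np.round(z_int * (input_scale * weight_scale) / output_scale) + z_zero_point
z_out = np.clip(z_out, -128, 127).astype(np.int8)

print(dequantize(z_out, output_scale, z_zero_point))  # should be close to:
print(x @ w + bias)                                   # the float reference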
I'm not exactly sure whether we add the bias with a floating-point add or an int32 add, but it's one of the two.
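For what it's worth, the float-add variant would look like this (continuing the sketch above, same caveats):

# Float-add variant: dequantize the int32 accumulator, add the float
# bias, then quantize directly to the output parameters.
z_float = z.astype(np.float32) * (input_scale * weight_scale) + bias
z_out_f = quantize(z_float, output_scale, z_zero_point)
# Up to rounding, z_out_f matches z_out from the int32 path, since the
# bias quantization error is tiny relative to output_scale.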
We’ll add documentation for this somewhere in the future.