Static/Dynamic Quantization

I tried quantizing a model using both static and dynamic quantization. Both schemes quantized the weights of the layers but not the biases. Is there a reason for this, and how can I quantize the biases?

My implementation is similar to this

Biases are not quantized; they are kept in fp32. For convs and linears, the bias is dynamically quantized before the addition while the conv/linear runs.
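A quick way to observe this (a sketch, assuming a recent PyTorch where dynamic quantization lives under `torch.ao.quantization`): dynamically quantize a small model and inspect the dtypes the quantized Linear actually stores.

```python
import torch
import torch.nn as nn

# Tiny model with a single Linear layer
model = nn.Sequential(nn.Linear(4, 2))

# Dynamically quantize only the Linear modules to int8
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

lin = qmodel[0]
print(lin.weight().dtype)  # weights are quantized (torch.qint8)
print(lin.bias().dtype)    # bias is still torch.float32
```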

Hi, thanks for your reply.
If you're saying that biases are not quantized, what do you then mean by "biases are dynamically quantized" for linears?

When the linear is run, it converts the bias to int32 before adding it to the matmul result.
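Here is a hypothetical 1x1 "matmul" in plain Python (all scales and values are made up) showing what that run-time conversion amounts to: the int8-times-int8 accumulator carries scale `act_scale * weight_scale`, so the fp32 bias is rounded to int32 at that same scale and added directly.

```python
# Assumed (made-up) quantization scales
act_scale = 0.05      # activation scale, only known at run time
weight_scale = 0.02   # weight scale, fixed at quantization time

x_fp, w_fp, b_fp = 1.25, -0.5, 0.3   # fp32 input, weight, bias

q_x = round(x_fp / act_scale)        # int8 activation
q_w = round(w_fp / weight_scale)     # int8 weight
acc = q_x * q_w                      # int32 accumulator, scale = act*weight

# Bias converted to int32 at the accumulator's scale, then added
q_b = round(b_fp / (act_scale * weight_scale))
acc += q_b

# Dequantize the result back to fp32
y = acc * (act_scale * weight_scale)
print(y)  # matches x*w + b = -0.325
```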

I see. Do you have any idea why PyTorch does it like that?

If quantized, biases are usually quantized with scale = activation_scale * weight_scale, so that the quantized bias can be added directly to the matmul output in the quantized domain. In PyTorch eager mode, due to the dynamic nature of the PyTorch graph, knowing the activation scale statically is impossible.
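A small plain-Python sketch (made-up scales) of why the bias integer cannot be precomputed in the dynamic case: the activation scale is derived from each incoming batch, so the same fp32 bias maps to a different int32 value per batch.

```python
weight_scale = 0.02   # fixed at quantization time
bias = 0.3            # fp32 bias kept by the module

def act_scale(batch):
    # Simple symmetric scale: map the batch's max magnitude to int8 range
    return max(abs(v) for v in batch) / 127.0

q_biases = []
for batch in ([0.5, -1.0], [2.0, 0.1]):
    s = act_scale(batch)  # only known once the batch arrives
    q_biases.append(round(bias / (s * weight_scale)))

print(q_biases)  # two different int32 values for the same fp32 bias
```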