Quantize all weights in model

Hi all, not sure if this is possible or not, but I was wondering if there is a way to quantize all layers in a model uniformly, rather than per-layer or per-channel.

On a similar note, I have a question about how data flows through a per-layer quantized model. Say I have a layer conv1 feeding directly into another layer conv2. The weights of these two layers have different quantization parameters (zero point/scale) but the same dtype, say qint8. Using hooks, I was able to verify that during a forward pass the input data to conv2 has the same quantization parameters as conv1, while its output has the quantization parameters of conv2. My question is: is the conv2 layer performing a transform on the input data before running the quantized convolution? In other words, are the integer values output by quantized conv1 the same integer values that the quantized convolution in conv2 uses as inputs?

what do you mean by ‘uniformly quantized’? generally the term uniform quantization is used in contrast to techniques like ‘power of 2’ quantization, i.e. to describe whether the possible quantized data points are equally (uniformly) spaced or not.

for statically quantized ops there are 2 sets of parameters: one scale/zero_point for the weights and one scale/zero_point for the output activation. the (re)quantization always happens on the output, not the input. so yes, if you have “input → conv1 → conv2” then the output of conv1 is passed directly as the input to conv2, and it carries the scale/zero_point of conv1’s output activation quantization parameters.
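
here’s a quick eager-mode sketch you can use to check this yourself with a hook on conv2 (the layer shapes are just placeholders, and the fbgemm backend is assumed to be available, i.e. an x86 machine):

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TwoConv(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv1 = nn.Conv2d(3, 8, 3)
        self.conv2 = nn.Conv2d(8, 8, 3)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv1(x)
        x = self.conv2(x)
        return self.dequant(x)

torch.backends.quantized.engine = "fbgemm"   # assumes fbgemm is available
m = TwoConv().eval()
m.qconfig = tq.get_default_qconfig("fbgemm")
tq.prepare(m, inplace=True)
m(torch.randn(1, 3, 32, 32))                 # calibration pass
tq.convert(m, inplace=True)

def show(name):
    def hook(mod, inputs, output):
        x = inputs[0]
        print(f"{name} input : scale={x.q_scale():.6f} zp={x.q_zero_point()}")
        print(f"{name} output: scale={output.q_scale():.6f} zp={output.q_zero_point()}")
    return hook

m.conv2.register_forward_hook(show("conv2"))
m(torch.randn(1, 3, 32, 32))
# conv2's input scale/zp should match m.conv1.scale / m.conv1.zero_point
# (conv1's output activation qparams); conv2's output uses m.conv2.scale / zero_point.
```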

Thanks for the response! Sorry, my terminology is probably off; I’m still very new to all of this. By uniformly quantized, I mean that every layer would share a single zero point/scale.

Thank you for that explanation! Do you have any details/code pointers you can share about how/where precisely that conv1 output quantization conversion happens? Is there a way it gets factored into the quantized computation? Thanks again!

it depends on the kernel being used, for example, for FBGEMM i believe it happens here for linear: https://github.com/pytorch/pytorch/blob/7d2f1cd2115ec333767aef8087c8ea3ba6e90ea5/aten/src/ATen/native/quantized/cpu/qlinear.cpp#L165

i.e. the kernel accumulates the quantized matmul to int32, then requantizes the result going from int32 to int8/qint8
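
to make that concrete, here’s a toy version of that arithmetic in plain python/numpy (made-up scales/zero points, weights assumed symmetric with zero_point 0; this is just the idea, not the actual FBGEMM code):

```python
import numpy as np

# input / weight / output quantization parameters (made up for illustration)
x_scale, x_zp = 0.05, 10
w_scale, w_zp = 0.02, 0
y_scale, y_zp = 0.10, 5

q_x = np.array([[12, 20, 7]], dtype=np.int8)      # quantized input row
q_w = np.array([[3], [-4], [9]], dtype=np.int8)   # quantized weight column

# accumulate (q_x - x_zp) @ (q_w - w_zp) in int32
acc = (q_x.astype(np.int32) - x_zp) @ (q_w.astype(np.int32) - w_zp)

# requantize: the real-valued result is (x_scale * w_scale) * acc,
# so mapping it onto the output grid y = y_scale * (q_y - y_zp) gives:
multiplier = (x_scale * w_scale) / y_scale
q_y = np.clip(np.rint(acc * multiplier) + y_zp, -128, 127).astype(np.int8)

print(acc, q_y)   # int32 accumulator vs. requantized int8 output
```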

there’s some general theory that can be found here: https://github.com/google/gemmlowp/blob/master/doc/quantization.md
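
the short version of that doc: with the affine mapping r = S·(q − Z), a quantized matmul C = A·B works out to

q_C = Z_C + (S_A·S_B / S_C) · Σ_j (q_A − Z_A)(q_B − Z_B)

where the sum is the int32 accumulator and the multiplier S_A·S_B / S_C is what the requantization step applies (in practice as a fixed-point multiply plus shift rather than a float multiply).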

otherwise you could ask the actual kernel team for more specifics, i.e. FBGEMM or QNNPACK