Quantization: How to avoid zero-point subtraction at run time in the dequantization process

I’m applying quantization to my fp32 model, which contains convolution layers. Once quantization is done, I have int8 weights. I use those int8 weights, and in the dequantization process I need to subtract the zero_point from every weight and multiply the result by the scale.

Since the zero_point subtraction is inside the loop (it has to be done for every weight), we are seeing high CPU usage. We now want to avoid that subtraction in the loop, and the plan is to store weight - zero_point as the weights so that only the scale multiplication remains in the dequantization process.
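In other words, the per-element dequantization I’m doing today, and the folded version I’d like to move to, look roughly like the sketch below (a minimal sketch with illustrative values and names, not my actual code):

#include <cstdio>

int main() {
   const float scale = 0.05f;             // illustrative scale / zero point, not my real values
   const unsigned char zero_point = 128;
   const unsigned char w_q = 200;         // one quantized weight

   // Today: subtract the zero point and multiply by the scale for every weight.
   float w_today = scale * ((int)w_q - (int)zero_point);

   // Plan: precompute w_folded = w_q - zero_point offline,
   // so that only the scale multiplication is left at run time.
   char w_folded = (char)((int)w_q - (int)zero_point);
   float w_planned = scale * (int)w_folded;

   printf("%f %f\n", w_today, w_planned);
   return 0;
}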

But when I tried this, I’m facing an output mismatch. Is there any way to avoid this zero_point subtraction at run time? If not, what is the reason? Please help.

The reference we are following: Lei Mao's Log Book – Quantization for Neural Networks

Thanks in advance!

Can you give a repro of the issue? I’m not sure exactly what you are trying to do; the quantization APIs should be handling all of that for you.

We are not using the quantization APIs; I’m doing the quantization in C code.
The piece of code I’m using for quantization is this:


#include <algorithm>
#include <cmath>
#include <cstdint>

struct QuantizationParams {
   float scale;
   unsigned char zero_point;
};

// Maps each float to uint8: q = round(clamp(zero_point + real / scale, 0, 255))
void Quantize(const QuantizationParams& qparams, const float* src, unsigned char* dst, int size) {
   for (int i = 0; i < size; i++) {
      const float real_val = src[i];
      const float transformed_val = qparams.zero_point + real_val / qparams.scale;
      const float clamped_val = std::max(0.f, std::min(255.f, transformed_val));
      dst[i] = static_cast<std::uint8_t>(std::round(clamped_val));
   }
}
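For what it’s worth, I call it along these lines (the values here are only an illustration):

int main() {
   QuantizationParams qparams;
   qparams.scale = 0.1f;        // illustrative scale and zero point
   qparams.zero_point = 128;

   float weights_fp32[4] = {-1.0f, 0.0f, 0.5f, 1.0f};
   unsigned char weights_q[4];

   // With these parameters: -1.0 -> 118, 0.0 -> 128, 0.5 -> 133, 1.0 -> 138
   Quantize(qparams, weights_fp32, weights_q, 4);
   return 0;
}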

Let’s say I have a function with an input and a weight, and I’m using the quantized weight for the computation. The weight is unsigned char, the zero point is unsigned char, and the scale is float.
My function implementation is something like the code below, and it works absolutely fine:


// Walk over the quantized input (QA_lp) and the quantized weights (B).
for (i_l = 0; i_l <= size1; i_l += size2)
{
   A_lp = &QA_lp[i_l];
   for (j_l = 0; j_l < size5; j_l++)
   {
      B_lp = (unsigned char*)&B[j_l * size3];
      sum_l = 0;
      // Accumulate, subtracting the zero point from every weight.
      for (k_l = 0; k_l < size4; k_l++)
      {
         sum_l += (A_lp[k_l]) * (B_lp[k_l] - ZERO_POINT);
      }
      // Only the scale multiplication is applied to the accumulated sum.
      *C++ = (float)sum_l * SCALE;
   }
}
 

sum_l += (A_lp[k_l]) * (B_lp[k_l] - ZERO_POINT);

Now, what I’m trying to do is perform the ZERO_POINT subtraction ahead of time and store the result as my weights. So my stored weight becomes weight - ZERO_POINT, and the inner part of my function becomes:
sum_l = 0;
for (k_l = 0; k_l < size4; k_l++)
{
   sum_l += (A_lp[k_l]) * (B_lp[k_l]);   // no ZERO_POINT subtraction here any more
}
*C++ = (float)sum_l * SCALE;
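And the offline folding step I have in mind is roughly this (a sketch with illustrative names like FoldZeroPoint and B_folded, not my exact code):

// Fold the zero point into the stored weights once, before inference.
// B_q holds the original unsigned char quantized weights;
// B_folded is what the modified inner loop above would read.
void FoldZeroPoint(const unsigned char* B_q, char* B_folded, int size, unsigned char zero_point)
{
   for (int i = 0; i < size; i++)
   {
      int diff = (int)B_q[i] - (int)zero_point;   // the difference is computed in int here
      B_folded[i] = (char)diff;                   // then stored as char
   }
}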

I’m trying to avoid that subtraction in the loop, and here the data type of my precomputed weights (weight - ZERO_POINT) is char.
When I do this, the output does not match the original one. What could be the reason?

Please help me with this.