Quantization: How to avoid zero-point subtraction at run time in the dequantization process

I’m applying quantization to my fp32 model, which contains convolution layers. Once quantization is done, I have int8 weights. I use those int8 weights, and in the dequantization process I need to subtract the zero_point from every weight and multiply the result by the scale.

Since the zero_point subtraction is inside the loop (it has to be done for every weight), we are seeing high CPU usage. We now want to avoid that subtraction in the loop, and the plan is to store weight - zero_point as the weights so that only the scale multiplication remains in the dequantization process.
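In other words, the per-element dequantization I’m doing today, and the folded version I’d like to move to, look roughly like the sketch below (a minimal sketch with illustrative values and names, not my actual code):

#include <cstdio>

int main() {
   const float scale = 0.05f;             // illustrative scale / zero point, not my real values
   const unsigned char zero_point = 128;
   const unsigned char w_q = 200;         // one quantized weight

   // Today: subtract the zero point and multiply by the scale for every weight.
   float w_today = scale * ((int)w_q - (int)zero_point);

   // Plan: precompute w_folded = w_q - zero_point offline,
   // so that only the scale multiplication is left at run time.
   char w_folded = (char)((int)w_q - (int)zero_point);
   float w_planned = scale * (int)w_folded;

   printf("%f %f\n", w_today, w_planned);
   return 0;
}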

But when I tried this, I’m facing an output mismatch. Is there any way to avoid this zero_point subtraction at run time? If not, what is the reason? Please help.

The reference we are following: Lei Mao's Log Book – Quantization for Neural Networks

Thanks in advance!

Can you give a repro of the issue? I’m not sure exactly what you are trying to do; the quantization APIs should be handling all of that for you.

We are not using the quantization APIs; I’m doing the quantization in C code.
The piece of code I’m using for quantization is this:


#include <algorithm>
#include <cmath>
#include <cstdint>

struct QuantizationParams {
   float scale;
   unsigned char zero_point;
};

// Maps each float to uint8: q = round(clamp(zero_point + real / scale, 0, 255))
void Quantize(const QuantizationParams& qparams, const float* src, unsigned char* dst, int size) {
   for (int i = 0; i < size; i++) {
      const float real_val = src[i];
      const float transformed_val = qparams.zero_point + real_val / qparams.scale;
      const float clamped_val = std::max(0.f, std::min(255.f, transformed_val));
      dst[i] = static_cast<std::uint8_t>(std::round(clamped_val));
   }
}
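For what it’s worth, I call it along these lines (the values here are only an illustration):

int main() {
   QuantizationParams qparams;
   qparams.scale = 0.1f;        // illustrative scale and zero point
   qparams.zero_point = 128;

   float weights_fp32[4] = {-1.0f, 0.0f, 0.5f, 1.0f};
   unsigned char weights_q[4];

   // With these parameters: -1.0 -> 118, 0.0 -> 128, 0.5 -> 133, 1.0 -> 138
   Quantize(qparams, weights_fp32, weights_q, 4);
   return 0;
}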

Let’s say I have a function with an input and a weight, and I’m using the quantized weight for the computation. The weight is unsigned char, the zero point is unsigned char, and the scale is float.
My function implementation is something like the code below, and it works absolutely fine:


// Walk over the quantized input (QA_lp) and the quantized weights (B).
for (i_l = 0; i_l <= size1; i_l += size2)
{
   A_lp = &QA_lp[i_l];
   for (j_l = 0; j_l < size5; j_l++)
   {
      B_lp = (unsigned char*)&B[j_l * size3];
      sum_l = 0;
      // Accumulate, subtracting the zero point from every weight.
      for (k_l = 0; k_l < size4; k_l++)
      {
         sum_l += (A_lp[k_l]) * (B_lp[k_l] - ZERO_POINT);
      }
      // Only the scale multiplication is applied to the accumulated sum.
      *C++ = (float)sum_l * SCALE;
   }
}
 

sum_l += (A_lp[k_l]) * (B_lp[k_l] - ZERO_POINT);

Now, what I’m trying to do is perform the ZERO_POINT subtraction ahead of time and store the result as my weights. So my stored weight becomes weight - ZERO_POINT, and the inner part of my function becomes:
sum_l = 0;
for (k_l = 0; k_l < size4; k_l++)
{
   sum_l += (A_lp[k_l]) * (B_lp[k_l]);   // no ZERO_POINT subtraction here any more
}
*C++ = (float)sum_l * SCALE;
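And the offline folding step I have in mind is roughly this (a sketch with illustrative names like FoldZeroPoint and B_folded, not my exact code):

// Fold the zero point into the stored weights once, before inference.
// B_q holds the original unsigned char quantized weights;
// B_folded is what the modified inner loop above would read.
void FoldZeroPoint(const unsigned char* B_q, char* B_folded, int size, unsigned char zero_point)
{
   for (int i = 0; i < size; i++)
   {
      int diff = (int)B_q[i] - (int)zero_point;   // the difference is computed in int here
      B_folded[i] = (char)diff;                   // then stored as char
   }
}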

I’m trying to avoid that subtraction in the loop, and here the data type of my precomputed weights (weight - ZERO_POINT) is char.
When I do this, the output does not match the original one. What could be the reason?

Please help me with this.