I’m applying quantization to my model, which is fp32 and contains convolution layers. After quantization I have 8-bit weights. I’m using those quantized weights, and in the dequantization process I need to subtract the zero_point from every weight and multiply the output by the scale.
Since the zero_point subtraction is inside the loop (it has to be done for every weight), we are seeing high CPU usage. We now want to avoid that subtraction in the loop by storing weight - zero_point as the weights up front, so that only the scale multiplication remains in the dequantization process.
But when I tried this, I ran into an output mismatch. Is there any way to avoid the zero_point subtraction at run time? If not, what is the reason? Please help.
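For reference, the per-weight dequantization I’m describing is the standard affine scheme, roughly like this (a minimal sketch; the function name is illustrative):

/* Affine dequantization of one weight (sketch; dequantize is an illustrative name). */
float dequantize(unsigned char q, unsigned char zero_point, float scale)
{
    /* real_value ~= scale * (quantized_value - zero_point) */
    return scale * ((int)q - (int)zero_point);
}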
Let’s say I have a function that takes an input and a weight, and I use the quantized weight for the computation. The weight is unsigned char, the zero point is unsigned char, and the scale is float.
My function implementation is something like the line below, and it works absolutely fine:
sum_l += (A_lp[k_l]) * (B_lp[k_l] - ZERO_POINT);
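In context, the surrounding loop looks roughly like this (a minimal sketch; K, the pointer setup, and the accumulator type are simplified from my real code):

/* Baseline: subtract the zero point from every weight inside the loop. */
int sum_l = 0;
for (int k_l = 0; k_l < K; ++k_l) {
    sum_l += (A_lp[k_l]) * (B_lp[k_l] - ZERO_POINT);  /* B_lp is unsigned char */
}
*C++ = (float)sum_l * SCALE;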
Now what I’m trying is to do the ZERO_POINT subtraction ahead of time and save the result as my weights. So my stored weight becomes weight - ZERO_POINT, and my function code becomes:
for (int k_l = 0; k_l < K; ++k_l) {
    sum_l += (A_lp[k_l]) * (B_lp[k_l]);  /* zero point already baked into B_lp */
}
*C++ = (float)sum_l * SCALE;
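The offline step that prepares the pre-subtracted weights is essentially this (a sketch; B_q, B_pre, and prepare_weights are illustrative names):

/* Offline: bake the zero point into the stored weights (sketch). */
#define K 9  /* illustrative weight count */
char B_pre[K];

void prepare_weights(const unsigned char *B_q, unsigned char zero_point)
{
    for (int k = 0; k < K; ++k) {
        /* The result is stored in a (signed) char, as in my current attempt. */
        B_pre[k] = (char)(B_q[k] - zero_point);
    }
}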
I’m trying to avoid that subtraction in the loop, and the data type of my pre-subtracted weights (weight - ZERO_POINT) is char.
When I do this, the output does not match the original one. What could be the reason?
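Here is a minimal repro of the mismatch on a single weight (a sketch; the values are picked arbitrarily, and I’m assuming a platform where plain char is signed and 8 bits wide):

#include <stdio.h>

int main(void)
{
    unsigned char w = 240;          /* quantized weight */
    unsigned char zero_point = 10;
    float scale = 0.05f;

    /* Original path: subtract inside the loop. */
    int ref = w - zero_point;               /* 230 */

    /* Modified path: pre-subtracted weight stored as char.
       230 is outside the signed char range, so the stored value
       is implementation-defined (typically -26 on two's complement). */
    char pre = (char)(w - zero_point);
    int got = pre;

    printf("ref=%d got=%d refout=%f gotout=%f\n",
           ref, got, ref * scale, got * scale);
    return 0;
}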