Hi,

I’m working with a quantized convolutional network that I want to deploy on an ARM-based edge device, and I’m looking to perform as many operations as possible in integer/fixed-point arithmetic.

After computing the convolution, and following the reasoning in the TFLite integer-only quantization paper (Section 2.2, Equations 5 and 6), I’m planning to convert the rescaling factor M described there into the form

M = 2^(-n) * M0_mant * 2^(M0_exp)

This conversion will, of course, introduce a slight variation from the floating-point implementation, so I expect a certain deviation from the network’s expected output.
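For concreteness, here is a minimal Python sketch of the conversion I have in mind (the helper names `quantize_multiplier` and `requantize` are my own, and the 31-bit mantissa width is an assumption of my scheme, i.e. n = 31 in the notation above):

```python
import math

def quantize_multiplier(M):
    """Decompose a float rescaling factor M into an integer mantissa and a
    power-of-two exponent, so that M ~= M0_mant * 2^(M0_exp) * 2^(-31).
    The 31-bit mantissa width is a choice made for this sketch."""
    assert 0.0 < M < 1.0
    M0, exp = math.frexp(M)        # M = M0 * 2^exp, with M0 in [0.5, 1)
    mant = round(M0 * (1 << 31))   # 31-bit fixed-point mantissa
    if mant == (1 << 31):          # rounding overflowed into bit 32
        mant //= 2
        exp += 1
    return mant, exp

def requantize(acc, mant, exp):
    """Integer-only rescale of an int32 accumulator: acc * M, using the
    decomposed multiplier and a round-half-up right shift (valid here
    for non-negative accumulators)."""
    shift = 31 - exp               # total right shift after the multiply
    prod = acc * mant              # widened integer product
    return (prod + (1 << (shift - 1))) >> shift
```

On the device this would be a widening 32x32-to-64-bit multiply followed by a rounding shift; Python’s arbitrary-precision integers stand in for that here.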

The custom hardware implementation will be taken care of by my own code, but I was wondering whether it’s possible to use this scheme in native PyTorch, too (on CPU, for example). That would let me validate the results from my edge device against the results from the CPU implementation.

I tried looking up the code for the quantized operators, but I couldn’t find where this step of the convolution is implemented; maybe I overlooked it. If so, would you be able to point me in the right direction?