I’m trying to implement a PyTorch quantized linear layer in NumPy so I can verify the arithmetic before translating it to my own embedded C code. My basic implementation below quantizes the inputs correctly, but the output of the linear layer doesn’t match PyTorch’s output.
My current code is based on this post, and I’m using the qnnpack backend.
Any tips are appreciated!
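For context, the model I’m quantizing follows PyTorch’s standard eager-mode quantization flow; a minimal sketch is below (the layer sizes and the `Net` module are placeholders, not my real network):

import torch
import torch.nn as nn

torch.backends.quantized.engine = 'qnnpack'

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.quantx = torch.quantization.QuantStub()  # observes input scale/zero point
        self.fc = nn.Linear(64, 32)                   # placeholder sizes
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quantx(x)))

model = Net().eval()
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')
torch.quantization.prepare(model, inplace=True)
model(torch.randn(8, 64))                             # calibration pass
torch.quantization.convert(model, inplace=True)       # fc becomes a quantized Linear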
import numpy as np

# I have float input 'x', quantized linear layer 'fc', and quant stub 'quantx'
# quantize x to quint8 using the stub's scale and zero point
x_q = np.round((x / quantx_scale) + quantx_zero_point)
x_q = np.clip(x_q, 0, 255).astype(np.uint8)
# linear layer: int32 matmul, then requantize to the output scale/zero point
matmul_out = np.matmul(x_q.astype(np.int32), fc_weight.astype(np.int32).T)  # accumulate in int32 (raw uint8/int8 would overflow)
bias_q = np.round(fc_bias / (quantx_scale * fc_weight_scale)).astype(np.int32)  # bias in the accumulator's scale
scale_factor = quantx_scale * fc_weight_scale / fc_scale  # requantization multiplier
outq = np.round((matmul_out + bias_q) * scale_factor + fc_zero_point)
outq = np.clip(outq, 0, 255).astype(np.uint8)  # clamp to quint8 range
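For reference, the affine-quantization identity I believe the layer should satisfy (assuming per-tensor qint8 weights with zero point 0, which I understand is the default for qnnpack) is out_q = z_out + (s_x * s_w / s_out) * [(x_q - z_x) @ W_q.T + b / (s_x * s_w)]. A minimal NumPy sketch of that full expansion:

# Sketch of the full affine expansion (my understanding; assumes the weight
# zero point is 0, matching PyTorch's qint8 weight quantization)
acc = np.matmul(x_q.astype(np.int32) - quantx_zero_point,
                fc_weight.astype(np.int32).T)            # note the zero-point shift
acc += np.round(fc_bias / (quantx_scale * fc_weight_scale)).astype(np.int32)
ref_q = np.round(acc * (quantx_scale * fc_weight_scale / fc_scale) + fc_zero_point)
ref_q = np.clip(ref_q, 0, 255).astype(np.uint8)

The only difference from my code above is subtracting quantx_zero_point before the matmul, so maybe that’s where the discrepancy comes from?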
I’m extracting the quantized parameters like this:
fc_weight = fc.weight().int_repr().numpy()        # qint8 weight values (int8)
fc_weight_scale = fc.weight().q_scale()           # per-tensor weight scale
fc_bias = fc.bias().detach().numpy()              # bias is kept in float
fc_scale = fc.scale                               # output scale
fc_zero_point = fc.zero_point                     # output zero point
quantx_scale = quantx.scale.numpy()[0]            # input scale from the quant stub
quantx_zero_point = quantx.zero_point.numpy()[0]  # input zero point
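And to compare, I look at the integer representation of the torch output right after the quantized linear, roughly like this (a sketch; `model` refers to the converted module above and the shapes are placeholders):

x_t = torch.randn(1, 64)                     # torch copy of the float input
x = x_t.numpy()                              # this is the 'x' fed to the NumPy code
with torch.no_grad():
    torch_out = model.fc(model.quantx(x_t))  # quantized output tensor from torch
print(torch_out.int_repr().numpy())          # torch's uint8 result
print(outq)                                  # my NumPy result from above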