Implementing a Quantized Linear Layer in NumPy

I’m trying to implement a Torch quantized linear layer in NumPy so that I can verify the implementation before translating it to my own embedded C code. I have a basic implementation below that correctly quantizes the inputs, but the output of the linear layer doesn’t match the Torch output.

My current code is based on this post, and I’m using the qnnpack backend.

Any tips are appreciated!
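For context, the quantized model I’m comparing against is set up roughly like this (the layer sizes, calibration data, and input here are just placeholders, not my real model):

import numpy as np
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quantx = torch.quantization.QuantStub()
        self.fc = nn.Linear(8, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quantx(x)))

torch.backends.quantized.engine = "qnnpack"
model = TinyModel().eval()
model.qconfig = torch.quantization.get_default_qconfig("qnnpack")
torch.quantization.prepare(model, inplace=True)
model(torch.randn(32, 8))                       # calibration pass
torch.quantization.convert(model, inplace=True)

fc, quantx = model.fc, model.quantx             # converted quantized modules
x = np.random.randn(1, 8).astype(np.float32)    # float input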

# I have float input 'x', linear layer 'fc', and quant stub 'quantx'

# quantize x
x_q = np.round((x / quantx_scale) + quantx_zero_point)
x_q = np.clip(x_q, 0, 255).astype(np.uint8)

# linear layer
matmul_out = np.matmul(x_q, fc_weight.T)
bias_q = np.round(fc_bias / (quantx_scale * fc_weight_scale)).astype(np.int32)
scale_factor = quantx_scale * fc_weight_scale / fc_scale
outq = np.round((matmul_out + bias_q) * scale_factor + fc_zero_point)
outq = np.clip(outq, 0, 255).astype(np.uint8)

I’m extracting the weights and quantization parameters like this:

fc_weight = fc.weight().int_repr().numpy()
fc_weight_scale = fc.weight().q_scale()
fc_bias = fc.bias().detach().numpy()
fc_scale = fc.scale
fc_zero_point = fc.zero_point
quantx_scale = quantx.scale.numpy()[0]
quantx_zero_point = quantx.zero_point.numpy()[0]
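The mismatch shows up when I compare the raw uint8 outputs, roughly like this (assuming the setup sketched above):

# run the same input through the quantized Torch layer and compare raw uint8 values
out_t = fc(quantx(torch.from_numpy(x)))
print(np.abs(outq.astype(np.int32) - out_t.int_repr().numpy().astype(np.int32)).max())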

Your process looks fine. I referred to this for reverse engineering the QuantizedLinear op. The only point you need to take care of is that overflow can happen in

matmul_out = np.matmul(x_q, fc_weight.T)

Better to try it again with the operands cast to int32:

matmul_out = np.matmul(x_q.astype(np.int32), fc_weight.T.astype(np.int32))
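To make the overflow concrete: NumPy only promotes a uint8/int8 matmul to int16, which silently wraps around for realistic layer widths, while the int32 cast keeps the accumulator wide enough. A quick illustration (the shapes and values are made up):

import numpy as np

a = np.full((1, 512), 200, dtype=np.uint8)   # quantized activations
b = np.full((512, 4), 100, dtype=np.int8)    # quantized weights

print(np.matmul(a, b).dtype)                                    # int16
print(np.matmul(a, b)[0, 0])                                    # wrapped (overflowed) value
print(np.matmul(a.astype(np.int32), b.astype(np.int32))[0, 0])  # 10240000, the real sum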