Implementing Quantized Linear Layer in Numpy

chenster_liu · March 21, 2024, 2:33pm

Your process looks fine. I referred to this when reverse-engineering the QuantizedLinear op.
The one thing to watch out for is that overflow can happen in:

matmul_out = np.matmul(x_q, fc_weight.T)

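The product is accumulated in the dtype of the inputs, so narrow integer types can wrap around. Try again with both operands cast to int32 first: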

matmul_out = np.matmul(x_q.astype(np.int32), fc_weight.T.astype(np.int32))
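To see why the cast matters, here is a minimal sketch. It assumes both x_q and fc_weight are stored as int8, which the thread does not actually state. The shapes and values are made up for illustration.

```python
import numpy as np

# Hypothetical quantized activations and weights; int8 is an assumption.
x_q = np.full((1, 4), 100, dtype=np.int8)
fc_weight = np.full((2, 4), 100, dtype=np.int8)

# int8 @ int8 keeps the int8 dtype, so the accumulated sum wraps around silently.
overflowed = np.matmul(x_q, fc_weight.T)

# Casting both operands first makes the accumulation happen in int32.
matmul_out = np.matmul(x_q.astype(np.int32), fc_weight.T.astype(np.int32))

print(overflowed.dtype, overflowed)   # int8 result, cannot represent 4 * 100 * 100
print(matmul_out.dtype, matmul_out)   # int32 result holding the true sum 40000
```

NumPy raises no error or warning in the first case, so a quick dtype check on the matmul output is an easy way to catch this.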