# Fundamental question on weight conversion from fp32 to int8

I have a basic question about quantizing a floating-point number to int8, and I would like to understand why my own computation differs from what PyTorch produces.

For example, given the floating-point number 0.033074330538511, I converted it to int8 using the following formula:

quantized_weight = floor(float_weight.*(2^quant_bits))./(2^quant_bits)


With quant_bits set to 8, the quantized value would be 0.031250000000000. But using PyTorch quantization, I get 0.032944630831480.
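The formula above can be sketched in Python as follows. It is a direct transcription of the fixed-step approach, with a hypothetical helper name (`fixed_point_quantize`), and is not meant to mirror anything PyTorch does internally.

```python
import math


def fixed_point_quantize(weight: float, quant_bits: int = 8) -> float:
    """Snap `weight` down to the nearest multiple of 2**-quant_bits."""
    steps_per_unit = 2 ** quant_bits
    # floor(weight * 256) is the integer that would be stored (8 here);
    # dividing by 256 again maps it back to a float.
    return math.floor(weight * steps_per_unit) / steps_per_unit


w = 0.033074330538511
print(f"{fixed_point_quantize(w):.15f}")  # 0.031250000000000
```

Note that the step between neighbouring values is fixed at 1/256 here, regardless of the range the weights actually cover.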

How can an int8 model have that much precision in its weight values?

To elaborate, here is an example model:

import torch
from torch import nn

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1)
        # Fixed weights so the quantized values can be compared by hand
        self.conv.weight = torch.nn.Parameter(torch.tensor([[[[ 0.03307433053851127625, -0.13484150171279907227, -0.21625524759292602539],
            [ 0.14247404038906097412, -0.14247404038906097412, -0.24932956695556640625],
            [ 0.32311078906059265137, -0.14501821994781494141, -0.21371106803417205811]]]]))
        self.conv.bias = torch.nn.Parameter(torch.tensor([0.1095]))
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

model_fp32 = M()
model_fp32.eval()
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
# Insert observers, then convert to a quantized model (no calibration data is passed)
model_fp32_prepared = torch.quantization.prepare(model_fp32)
model_fp32_converted = torch.quantization.convert(model_fp32_prepared, inplace=True)


I didn’t pass any input through the model. Printing the quantized model weights gives:

model_fp32_converted.conv.weight()

tensor([[[[ 0.03294463083148002625, -0.13431271910667419434,
           -0.21540719270706176758],
          [ 0.14191532135009765625, -0.14191532135009765625,
           -0.24835182726383209229],
          [ 0.32184368371963500977, -0.14444953203201293945,
           -0.21287299692630767822]]]], size=(1, 1, 3, 3), dtype=torch.qint8,
       quantization_scheme=torch.per_channel_affine,
       scale=tensor([0.00253420230001211166], dtype=torch.float64),
       zero_point=tensor([0]), axis=0)
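The printed scale looks consistent with a symmetric scheme in which the largest absolute weight in the channel is divided by 127.5 (half of the 255 steps between -128 and 127). The sketch below checks that reading against the numbers above. The 127.5 divisor is an assumption inferred from this output and may not hold for other qconfigs or backends.

```python
# Original fp32 weights of the single output channel
float_weights = [
    0.03307433053851127625, -0.13484150171279907227, -0.21625524759292602539,
    0.14247404038906097412, -0.14247404038906097412, -0.24932956695556640625,
    0.32311078906059265137, -0.14501821994781494141, -0.21371106803417205811,
]

printed_scale = 0.00253420230001211166  # scale reported in the tensor printout

# Assumed symmetric rule: spread the largest magnitude over 127.5 steps
max_abs = max(abs(w) for w in float_weights)
estimated_scale = max_abs / 127.5

print(estimated_scale)
print(abs(estimated_scale - printed_scale))  # tiny, on the order of float32 rounding
```

The two values agree to roughly seven significant digits, which is about what you would expect if the scale was computed in float32.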


That said, I can approximately reproduce the value with this formula:

quant_weight = floor(float_weight/scale)*scale
floor(0.03307433053851127625/0.00253420230001211166)*0.00253420230001211166
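A quick way to test this formula is to run it over all nine weights with the printed scale. The sketch below compares `floor` with round-to-nearest, clamping the integer codes to the int8 range [-128, 127]. The helper name `dequantized` and the 1e-7 tolerance are choices made for this illustration. As far as I know, PyTorch's affine quantization rounds to nearest, which would explain why `floor` only agrees for the positive entries.

```python
import math

SCALE = 0.00253420230001211166  # scale from the printed quantized tensor

float_weights = [
    0.03307433053851127625, -0.13484150171279907227, -0.21625524759292602539,
    0.14247404038906097412, -0.14247404038906097412, -0.24932956695556640625,
    0.32311078906059265137, -0.14501821994781494141, -0.21371106803417205811,
]
printed_weights = [
    0.03294463083148002625, -0.13431271910667419434, -0.21540719270706176758,
    0.14191532135009765625, -0.14191532135009765625, -0.24835182726383209229,
    0.32184368371963500977, -0.14444953203201293945, -0.21287299692630767822,
]


def dequantized(weight, rounding):
    """Quantize to an int8 code with the given rounding, then map back to float."""
    code = max(-128, min(127, rounding(weight / SCALE)))
    return code * SCALE


pairs = list(zip(float_weights, printed_weights))
floor_matches = [abs(dequantized(w, math.floor) - p) < 1e-7 for w, p in pairs]
round_matches = [abs(dequantized(w, round) - p) < 1e-7 for w, p in pairs]

print(sum(floor_matches), "of 9 match with floor")  # 3 of 9 match with floor
print(sum(round_matches), "of 9 match with round")  # 9 of 9 match with round
```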


My question is:

• How can int8 store a weight with that level of precision, and how does PyTorch quantization differ from the method I showed?

I think this may be more related to floating point representation/arithmetic than quantization.

I don’t think you actually get the level of precision that you see when printing the values in Python.
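To make the precision point concrete: an int8 tensor stores only an integer code per weight, and the printed float is that code multiplied by the scale. The sketch below compares the fixed 1/256 step from the question with the scale PyTorch reported, for the first weight only. The variable names are illustrative.

```python
import math

SCALE = 0.00253420230001211166  # per-channel scale reported by PyTorch
FIXED_STEP = 1 / 2 ** 8         # step implied by the formula in the question

w = 0.03307433053851127625      # first fp32 weight

# Integer codes that an 8-bit store would actually hold
fixed_code = math.floor(w / FIXED_STEP)
scaled_code = round(w / SCALE)

# Values recovered from those codes, and their distance from the original
fixed_value = fixed_code * FIXED_STEP
scaled_value = scaled_code * SCALE

print(fixed_code, fixed_value, abs(w - fixed_value))
print(scaled_code, scaled_value, abs(w - scaled_value))
```

Both grids have at most 256 levels. The PyTorch grid is simply finer here because the scale is fitted to the weight range, so with round-to-nearest the error stays within half a step, about 0.0013.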

This might be related:
https://docs.python.org/3/tutorial/floatingpoint.html
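As a small illustration of the floating-point side, the sketch below multiplies the integer code 13 by the printed scale, rounds the product to float32 (assuming the dequantized weights are held as float32, PyTorch's default dtype), and prints it with different numbers of digits. `to_float32` is a hypothetical helper built on the standard `struct` module.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


SCALE = 0.00253420230001211166  # scale from the printed quantized tensor
code = 13                       # integer code for the first weight

value = to_float32(code * SCALE)

# Asking for 20 decimals prints 20 decimals, but float32 carries only ~7 significant digits
print(f"{value:.20f}")
print(f"{value:.4f}")
```

The long tail of digits is just the decimal expansion of a binary float. It does not mean the int8 weight stores that much information.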