Quantized model weights

Hello everyone,
I am quantizing RetinaNet using the standard PyTorch methods, namely PTQ and QAT (a simplified sketch of the workflow is further down in this post), and got great results. The model size was reduced from 139 MB to 39 MB, and inference time on the CPU dropped from 90 min to 20 min on a large validation dataset, with an accuracy loss of less than 1%. So although the results are great, I wanted to check the weights of the quantized network and found the following. If I use

print(model.head.cls_subnet[0].conv.weight().int_repr())

I get a genuinely quantized integer tensor like
     [[-33,   6, -56],
      [-36,  47,  24],
      [ 12,   1,  25]],

     [[-22,  18,  22],
      [-45,  43, -55],
      [  4,   1, -58]],


     [[ 19,  27,  10],
      [-73,   9, -53],
      [  2, -38, -24]]]], dtype=torch.int8)

But if I access a weight without int_repr()

print(model.head.cls_subnet[0].conv.weight())

I get a tensor like
     [[-0.0096,  0.0017, -0.0163],
      [-0.0105,  0.0137,  0.0070],
      [ 0.0035,  0.0003,  0.0073]],

     [[-0.0064,  0.0052,  0.0064],
      [-0.0131,  0.0125, -0.0160],
      [ 0.0012,  0.0003, -0.0169]],


     [[ 0.0055,  0.0079,  0.0029],
      [-0.0212,  0.0026, -0.0154],
      [ 0.0006, -0.0111, -0.0070]]]], size=(256, 256, 3, 3),
   dtype=torch.qint8, quantization_scheme=torch.per_channel_affine,
   scale=tensor([0.0003, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003,

So the question is: was the quantization done correctly, or am I still using the full-precision weights? Why does the output look like this? Is it an internal representation of the weights?
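For reference, the kind of eager-mode PTQ workflow I mean looks roughly like this simplified sketch (a toy module stands in for the real RetinaNet head and random tensors stand in for the calibration data, so it is not my actual pipeline):

import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class TinyHead(nn.Module):
    # toy stand-in for one conv block of the detection head
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()       # quantizes the fp32 input
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()   # turns the int8 output back into fp32

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model_fp32 = TinyHead().eval()
model_fp32.qconfig = get_default_qconfig("fbgemm")   # x86 CPU backend, per-channel weights
prepared = prepare(model_fp32)                       # inserts observers
with torch.no_grad():
    for _ in range(10):                              # calibration pass (placeholder data)
        prepared(torch.randn(1, 3, 64, 64))
model_int8 = convert(prepared)                       # swaps in quantized modules

print(model_int8.conv.weight().dtype)                # torch.qint8

For QAT the idea is the same, except that prepare_qat is used and the prepared model is fine-tuned before convert.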

Thank you in advance.
Best regards,
yayapa

If you print a quantized tensor, it's expected that you see the floating-point values together with a scale and a zero point. The internal representation is stored as integers, and you can see it with int_repr(). To go from the int representation plus scale and zero point back to float, use fp = (q - zp) * scale; the rounding and clamping belong to the opposite direction, quantization: q = clamp(round(fp / scale) + zp, qmin, qmax).
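For example, you can check this directly on the weight you printed (assuming model is your converted RetinaNet, as in your snippets above; the per-channel scale and zero point live on the weight tensor itself):

import torch

w = model.head.cls_subnet[0].conv.weight()        # quantized (torch.qint8) weight
q = w.int_repr().float()                          # the stored int8 values
scale = w.q_per_channel_scales().float()          # one scale per output channel
zp = w.q_per_channel_zero_points().float()        # one zero point per output channel

# dequantize by hand: fp = (q - zp) * scale;
# w.q_per_channel_axis() is 0 for conv weights, hence the broadcast over dim 0
fp_manual = (q - zp.view(-1, 1, 1, 1)) * scale.view(-1, 1, 1, 1)

print(torch.allclose(fp_manual, w.dequantize()))  # should print True

If that prints True, the weights really are stored as int8, and the floating-point numbers you see in the printout are just the dequantized view of them.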
