Casting from 32 bits to 8 bits after accumulation in a multiplication

Hi,

I was reading about the QNNPACK and FBGEMM configurations, which explain the multiplication really well, but I am left with one question: what happens after you multiply 8-bit by 8-bit values and accumulate the result in 16 or 32 bits? How do you cast back to 8 bits to feed the next convolution?

cc @dskhudia @jianyuhuang

In FBGEMM, we have requantization as a post-processing step to convert the 32-bit accumulation result to 8 bits. The requantization basically does the following op, in the language of numpy:

X_q = (np.round(X / X_scale) + X_zp).clip(0, 255).astype(np.uint8)
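A slightly fuller sketch of that step, assuming the int32 accumulator holds the integer dot product with zero points already removed, so that its real-valued scale is the product of the input and weight scales. The helper name `requantize` and all sample values are made up for illustration; this is not FBGEMM's actual API:

```python
import numpy as np

def requantize(acc_int32, in_scale, w_scale, out_scale, out_zp):
    """Map int32 accumulators to uint8 (illustrative helper, not FBGEMM's API)."""
    # Real value represented by each accumulator (assumes zero points were
    # already subtracted before the multiply-accumulate).
    x = acc_int32.astype(np.float64) * (in_scale * w_scale)
    # Same op as the one-liner above: scale, round, shift, saturate.
    x_q = np.round(x / out_scale) + out_zp
    return x_q.clip(0, 255).astype(np.uint8)

# Hypothetical accumulators and scales, chosen to show both ends of the range.
acc = np.array([-500, 0, 1234, 100000], dtype=np.int32)
out = requantize(acc, in_scale=0.1, w_scale=0.05, out_scale=0.5, out_zp=10)
print(out)
```

The last accumulator is far too large for the output scale, so it saturates at 255 instead of wrapping around, which is the job of the `clip` call.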

The modularized requantization wrapper is here:


And inside it an efficient AVX2 kernel is implemented:


OK, this makes sense now. But working through an example raised 3 questions for me:

  1. How do I access the scale of the requantization? So far I have only managed to see what the scales for converting from 32-bit float to 8-bit int look like:
    BASIC EXAMPLE:
input: tensor([[[[ 1.0074,  2.0148,  3.0222,  4.0297],
                 [ 5.0371,  6.0445,  7.0519,  8.0593],
                 [ 8.9408,  9.9482, 10.9556, 11.9630],
                 [12.9704, 13.9779, 14.9853, 15.9927]]]], size=(1, 1, 4, 4),
                 dtype=torch.quint8, quantization_scheme=torch.per_tensor_affine,
                 scale=0.125926584005, zero_point=0)
Weight:  <bound method Conv2d.weight of QuantizedConv2d(1, 1, kernel_size=(3, 3), stride=(1, 1), 
          scale=4.51094579697, zero_point=0)
output: tensor([[[[347.3428, 392.4523],
                  [527.7806, 572.8901]]]], size=(1, 1, 2, 2), dtype=torch.quint8,
                  quantization_scheme=torch.per_tensor_affine, scale=4.51094579697,
                  zero_point=0)

PRINTING OUT THE DICTIONARY OF 8INT MODEL:

[(u'conv1.weight', tensor([[[[0.9882, 1.9765, 2.9647],
                             [4.0235, 5.0118, 6.0000],
                             [6.9882, 7.9765, 8.9647]]]], size=(1, 1, 3, 3), dtype=torch.qint8,
                             quantization_scheme=torch.per_channel_affine,
                             scale=tensor([0.0706], dtype=torch.float64), zero_point=tensor([0]),
                             axis=0)),
(u'conv1.scale', tensor(4.5109)),
(u'conv1.zero_point', tensor(0)), 
(u'conv1.bias', tensor([0.], requires_grad=True)), 
(u'quant.scale', tensor([0.1259])), 
(u'quant.zero_point', tensor([0]))]

Because what I have is:

  • scale 0.1259 to convert the input from 32-bit float to 8-bit int
  • scale 0.0706 to convert the weights from 32-bit float to 8-bit int
  • scale 4.51 to convert the output from 32-bit float to 8-bit int (just to see how all the values look as ints)

I was expecting the scale of requantization to be something like 8.33 for this example.
  2. Are these scales trainable parameters, or are they computed using basic numpy operations?

  3. In the FBGEMM documentation on Facebook Engineering I’ve read that the accumulation is done in 16 bits… so where does the difference (16 vs. 32 bits) come from?
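
On the first question (the requantization scale), here is a rough check with the numbers printed above. It assumes the conv is a plain cross-correlation with all zero points equal to 0, and that the requantization multiplier is input_scale * weight_scale / output_scale, which is the usual affine-quantization relation rather than something read out of the model. The integer tensors are recovered by dividing the printed values by their scales, and 0.0706 is the weight scale as printed (rounded):

```python
import numpy as np

# Scales taken from the printed state dict (weight scale is the rounded value).
in_scale, w_scale, out_scale = 0.125926584005, 0.0706, 4.51094579697

# Integer values recovered from the printed tensors (value / scale, rounded).
x_q = np.array([[8, 16, 24, 32],
                [40, 48, 56, 64],
                [71, 79, 87, 95],
                [103, 111, 119, 127]], dtype=np.int32)
w_q = np.array([[14, 28, 42],
                [57, 71, 85],
                [99, 113, 127]], dtype=np.int32)

# 3x3 "valid" cross-correlation, accumulated in int32.
acc = np.zeros((2, 2), dtype=np.int32)
for i in range(2):
    for j in range(2):
        acc[i, j] = np.sum(x_q[i:i + 3, j:j + 3] * w_q)

# Assumed requantization multiplier: accumulator scale over output scale.
multiplier = in_scale * w_scale / out_scale
y_q = np.round(acc * multiplier).clip(0, 255).astype(np.uint8)

print(acc)              # int32 accumulators
print(multiplier)       # roughly 0.00197
print(y_q)              # 8-bit outputs
print(y_q * out_scale)  # dequantized outputs
```

Under these assumptions the 8-bit outputs come out as 77, 87, 117 and 127, which times 4.51 gives the printed 347.34, 392.45, 527.78 and 572.89. So the requantization factor in this example would be about 0.00197 rather than 8.33, and it is derived from the three scales already in the state dict rather than stored as a separate entry.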