I have a model that’s quite small: 2 inputs, a hidden layer with 3 nodes, and a 7-class output. Eventually I’d like to load this model onto hardware and use a fixed-point representation for some of the values. What I’m confused about is **how** the quantization happens and which scales/zero points to use when.
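For reference, here is the affine mapping I’m assuming throughout (I believe this is what `torch.per_tensor_affine` refers to):

```
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # float -> int: clamp(round(x / scale) + zero_point, qmin, qmax)
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax)

def dequantize(q, scale, zero_point):
    # int -> float: x ~= scale * (q - zero_point)
    return scale * (q - zero_point)
```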

For instance, this is my `state_dict`:

```
OrderedDict([('input_layer_input_scale_0', tensor(0.0039)),
             ('input_layer_input_zero_point_0', tensor(0)),
             ('input_layer.scale', tensor(0.0297)),
             ('input_layer.zero_point', tensor(0)),
             ('input_layer._packed_params.dtype', torch.qint8),
             ('input_layer._packed_params._packed_params',
              (tensor([[-0.1180,  0.1180],
                       [-0.2949, -0.5308],
                       [-3.3029, -7.5496]], size=(3, 2), dtype=torch.qint8,
                      quantization_scheme=torch.per_tensor_affine,
                      scale=0.05898105353116989, zero_point=0),
               Parameter containing:
               tensor([-0.4747, -0.3563,  7.7603], requires_grad=True))),
             ('out.scale', tensor(1.5963)),
             ('out.zero_point', tensor(243)),
             ('out._packed_params.dtype', torch.qint8),
             ('out._packed_params._packed_params',
              (tensor([[  0.4365,   0.4365, -55.4356],
                       [  0.4365,   0.0000,   1.3095],
                       [  0.4365,   0.0000, -13.9680],
                       [  0.4365,  -0.4365,   4.3650],
                       [  0.4365,   0.4365,  -3.0555],
                       [  0.4365,   0.0000,  -1.3095],
                       [  0.4365,   0.0000,   3.0555]], size=(7, 3), dtype=torch.qint8,
                      quantization_scheme=torch.per_tensor_affine,
                      scale=0.43650051951408386, zero_point=0),
               Parameter containing:
               tensor([ 19.2761,  -1.0785,  14.2602, -22.3171,  10.1059,
                         7.2197, -11.7253], requires_grad=True)))])
```
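In case it helps, this is how I’ve been pulling the raw integers and scales out of the layers, using the standard quantized-tensor accessors (`mq` is my quantized model, used below):

```
w_q = mq.input_layer.weight()              # the quantized weight tensor
print(w_q.int_repr())                      # the raw int8 values
print(w_q.q_scale(), w_q.q_zero_point())   # 0.05898..., 0  (matches the state_dict)
print(mq.input_layer.scale, mq.input_layer.zero_point)  # 0.0297, 0
```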

If I then give it a set of inputs like this:

```
import numpy as np

inputs = np.array(
    [[1.  , 1.  ],   # class 0 example
     [1.  , 0.  ],   # class 1 example
     [0.  , 1.  ],   # class 2 example
     [0.  , 0.  ],   # class 3 example
     [0.  , 0.9 ],   # class 4 example
     [0.  , 0.75],   # class 5 example
     [0.  , 0.25]])  # class 6 example
```

I can verify decent accuracy with this (`mq` is the quantized model):

```
>>> mq(torch.from_numpy(inputs).float()).argmax(-1)
tensor([0, 1, 2, 3, 4, 5, 1])
```

It gets the last one wrong but the others right, which doesn’t matter here since I’m just trying to reproduce this result. My question then becomes: how do I use the scales and zero points to get this same result?
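On the input side, I’m assuming `input_layer_input_scale_0` / `input_layer_input_zero_point_0` are what quantize the raw floats, i.e. something like this (assuming quint8 activations with range 0–255, which I believe matches `torch.quantize_per_tensor(x, s_x, z_x, torch.quint8)`):

```
s_x, z_x = 0.0039, 0   # input_layer_input_scale_0 / input_layer_input_zero_point_0
quantized_input = np.clip(np.round(inputs / s_x) + z_x, 0, 255)
```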

For the layer itself, I thought it would be something like this:

```
# weight scale (0.05898...) and weight zero point (0), from what I understand
W1 = input_layer_weights / 0.05898105353116989 + 0
b1 = input_layer_bias / 0.05898105353116989 + 0

Z1 = saturate(round(inputs @ W1.T + b1.T))
```

But even at this point the result is different from `mq.input_layer(quantized_input)`, so I believe I’m doing the math incorrectly.
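For completeness, here is the full integer pipeline I *think* should reproduce the layer, pieced together from the usual affine-quantization formulation (quantize the bias with scale `s_x * s_w`, accumulate in int32, then requantize with the ratio `s_x * s_w / s_y`). `W_float` and `b_float` are just my names for the float weight and bias shown in the state_dict above, and I’m not certain this is exactly what PyTorch does internally:

```
s_x, z_x = 0.0039, 0                # input scale / zero point
s_w, z_w = 0.05898105353116989, 0   # weight scale / zero point
s_y, z_y = 0.0297, 0                # input_layer.scale / input_layer.zero_point

q_w = np.round(W_float / s_w) + z_w     # int8 weights; should match int_repr()
q_b = np.round(b_float / (s_x * s_w))   # int32 bias, quantized with scale s_x * s_w
q_x = np.clip(np.round(inputs / s_x) + z_x, 0, 255)

acc = (q_x - z_x) @ q_w.T + q_b                                  # int32 accumulator
q_y = np.clip(np.round(acc * (s_x * s_w / s_y)) + z_y, 0, 255)   # requantize to s_y, z_y
```

Is that the right picture, or am I combining the scales in the wrong places?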

What am I missing with the scales and bias? And why are there three different values: `input_layer_input_scale_0`, `input_layer.scale`, and a scale associated with the weight matrix?