Those details are abstracted away in practice.
num_calibration_batches = 32

myModel = load_model(saved_model_dir + float_model_file).to('cpu')
myModel.eval()

# Fuse Conv, BN and ReLU
myModel.fuse_model()

# Specify quantization configuration
# Start with simple min/max range estimation and per-tensor quantization of weights
myModel.qconfig = torch.quantization.default_qconfig

# Insert observers into the model
torch.quantization.prepare(myModel, inplace=True)
After this point, the model has been fused and prepared: observers have been inserted into the model that passively record the activations and the weight values of each layer. You can see this change by printing the model; a number of observer modules will have been added. Over time these observers refine the scale and zero_point parameters that define the quantization process, updating each time the model does a forward pass.
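What an individual observer does can be seen in isolation. Below is a minimal sketch using `torch.quantization.MinMaxObserver`, one of the basic observer types (the specific tensor values are just for illustration): the forward pass is a pass-through that records the running min/max, and `calculate_qparams()` derives scale and zero_point from that recorded range.

```python
import torch
from torch.quantization import MinMaxObserver

# An observer is a pass-through module: forward() returns its input
# unchanged while recording the running min/max of everything it sees.
obs = MinMaxObserver(dtype=torch.quint8)

x = torch.tensor([-1.0, 0.0, 2.0])
y = obs(x)  # records min=-1.0, max=2.0; y is identical to x

# The recorded range is turned into quantization parameters on demand
scale, zero_point = obs.calculate_qparams()
print(scale, zero_point)
```

Inside a prepared model the same thing happens implicitly: every tensor flowing through an observed point updates that point's observer statistics.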
This is similar to how a batchnorm module operates: it passively records incoming data and refines its running statistics based on those observations.
For this reason, a prepared network (like a network containing a batchnorm module in training mode) is stateful: each forward pass can update the recorded statistics, so calling the model twice, even on the exact same input, can leave its internal parameters in a different state each time.
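This statefulness is easy to demonstrate with a standalone observer. A sketch, again using `MinMaxObserver` with made-up batches: a later batch with a wider range changes the recorded statistics, and therefore the quantization parameters derived from them.

```python
import torch
from torch.quantization import MinMaxObserver

obs = MinMaxObserver()

# First batch: recorded range is [0, 1]
obs(torch.tensor([0.0, 1.0]))
scale1, zp1 = obs.calculate_qparams()

# A later batch widens the recorded range to [-2, 3], so the
# derived scale and zero_point change too
obs(torch.tensor([-2.0, 3.0]))
scale2, zp2 = obs.calculate_qparams()
print(scale1.item(), scale2.item())
```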
The documentation for the observers can be found here; there are a few different types depending on your quantization methodology.
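As a rough illustration, a few of the observer classes exported by `torch.quantization` can be run side by side on the same data (a sketch; the exact set of available observers varies by PyTorch version):

```python
import torch
from torch.quantization import (
    MinMaxObserver,               # plain running min/max
    MovingAverageMinMaxObserver,  # exponential moving average of min/max
    HistogramObserver,            # histogram-based range search
)

torch.manual_seed(0)
x = torch.randn(1000)

for obs_cls in (MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver):
    obs = obs_cls()
    obs(x)  # record statistics for one batch
    scale, zero_point = obs.calculate_qparams()
    print(obs_cls.__name__, scale.item(), zero_point.item())
```

The min/max variants track only the extremes, while the histogram observer searches for a range that minimizes quantization error, which is why it can give noticeably different parameters on heavy-tailed data.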