How does calibration take place in static quantization?

aditya_gupta · May 18, 2021, 2:50pm

While going through the code in this tutorial, I’m not able to understand how this step

# Calibrate with the training set
evaluate(myModel, criterion, data_loader, neval_batches=num_calibration_batches)

affects the model, because all we do in the evaluate function is inference. What am I missing here?

Thank you!

HDCharles · May 18, 2021, 4:36pm

Theoretically, all you need to do to quantize a model is round all the weights and set the model to round all the activations.

The issue is that we have to choose what numbers to round to. In real life, we round to integers, but that won’t necessarily work well for ML. If we are doing int8 quantization, we only have 256 possible values, forcing us to choose where on the number line to place those rounding points. We want to choose the spacing and range of these 256 possible values to define our rounding process in a way that minimizes the error induced by quantization.

At a high level, the quantization framework ‘observes’ the activations and inputs that come into the model to get a better understanding of the distribution of values, and then chooses the spacing and range of the quantized values (encoded as scale and zero_point) based on this.
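To make scale and zero_point concrete, here is a minimal pure-Python sketch of 8-bit affine quantization. It assumes an unsigned range of 0..255 and a simple min/max rule; the helper names (choose_qparams, quantize, dequantize) are made up for illustration and are not PyTorch APIs.

```python
def choose_qparams(min_val, max_val, qmin=0, qmax=255):
    """Pick scale and zero_point so [min_val, max_val] maps onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate case: every observed value was zero
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Round x to the nearest grid point and clamp to the integer range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer grid point back to a real value."""
    return (q - zero_point) * scale


# Assume calibration observed activations between -1.0 and 3.0.
scale, zero_point = choose_qparams(-1.0, 3.0)
x = 0.5
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
print(scale, zero_point, x_hat)
```

The wider the observed range, the larger the scale and the coarser the rounding, which is why the range needs to be estimated from representative data.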

In practice, the different observer types collect different statistics and use them to decide which scale and zero_point to use. This is similar to how a batchnorm layer operates.
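As a rough illustration of "different statistics", the sketch below compares two toy observers: one tracks the global min/max, the other an exponential moving average of the per-batch min/max. PyTorch ships observers along these lines, but the classes here are simplified stand-ins with hypothetical names, and the averaging constant of 0.5 is chosen only to keep the arithmetic easy to follow.

```python
class ToyMinMaxObserver:
    """Tracks the smallest and largest value ever seen."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch):
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))


class ToyMovingAverageObserver:
    """Tracks an exponential moving average of the per-batch min and max."""

    def __init__(self, averaging_constant=0.5):
        self.c = averaging_constant
        self.min_val = None
        self.max_val = None

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        if self.min_val is None:
            # The first batch initializes the statistics directly.
            self.min_val, self.max_val = lo, hi
        else:
            self.min_val += self.c * (lo - self.min_val)
            self.max_val += self.c * (hi - self.max_val)


# Made-up calibration data; the last batch contains an outlier (-8.0).
batches = [[-1.0, 0.5, 2.0], [-0.5, 0.0, 1.0], [-8.0, 0.25, 2.0]]

minmax = ToyMinMaxObserver()
moving = ToyMovingAverageObserver()
for batch in batches:
    minmax.observe(batch)
    moving.observe(batch)

print(minmax.min_val, minmax.max_val)
print(moving.min_val, moving.max_val)
```

The min/max observer adopts the outlier as its lower bound, while the moving average is pulled only part of the way toward it, so the two would lead to different scale and zero_point values.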

aditya_gupta · May 19, 2021, 12:56am

@HDCharles Thank you for the explanation.
This is exactly what I’m trying to understand in the code. I see the evaluate function

def evaluate(model, criterion, data_loader, neval_batches):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            print('.', end = '')
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
            if cnt >= neval_batches:
                return top1, top5

    return top1, top5

which is here https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html#helper-functions, and I can’t relate the code to what you explained.

HDCharles · May 19, 2021, 1:06am

Those details are abstracted away in practice.

num_calibration_batches = 32

myModel = load_model(saved_model_dir + float_model_file).to('cpu')
myModel.eval()

# Fuse Conv, bn and relu
myModel.fuse_model()

# Specify quantization configuration
# Start with simple min/max range estimation and per-tensor quantization of weights
myModel.qconfig = torch.quantization.default_qconfig
print(myModel.qconfig)
torch.quantization.prepare(myModel, inplace=True)

After this point, the model has been fused and prepared. This means that observers have been inserted into the model which passively record the activations and the values of the weights of each layer. You can see this change from the original model if you print the model: a number of observers will have been added. Over time they refine the scale and zero_point parameters that define the quantization process, updating each time the model does a forward pass.

This is similar to how the batchnorm module operates, i.e. it passively records incoming data and refines the parameters of the module depending on these observations.
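The following toy sketch shows why an inference-only loop can still change the model's state. It uses no PyTorch at all: ObservedScale is a hypothetical stand-in for a prepared layer, and evaluate only mirrors the shape of the tutorial helper (without the accuracy bookkeeping), so treat it as an analogy rather than the library's implementation.

```python
class ObservedScale:
    """Toy layer: multiplies inputs by a fixed weight and records the output range."""

    def __init__(self, weight):
        self.weight = weight
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def __call__(self, batch):
        out = [self.weight * x for x in batch]
        # Side effect of the forward pass: update the recorded range.
        self.min_val = min(self.min_val, min(out))
        self.max_val = max(self.max_val, max(out))
        return out


def evaluate(model, data_loader, neval_batches):
    """Inference-only loop: no loss, no gradients, no explicit update step."""
    cnt = 0
    for batch in data_loader:
        model(batch)
        cnt += 1
        if cnt >= neval_batches:
            break
    return cnt


model = ObservedScale(weight=2.0)
data_loader = [[-1.0, 0.5], [0.25, 3.0], [100.0, -100.0]]  # made-up data

before = (model.min_val, model.max_val)
evaluate(model, data_loader, neval_batches=2)  # only the first two batches are seen
after = (model.min_val, model.max_val)
print(before, after)
```

Nothing in evaluate touches the statistics explicitly; they change only because calling the model runs the recording code inside the layer, and limiting the number of batches limits which data shapes the recorded range.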

For this reason, if you take any quantized net in the prepare phase (or a net with a batchnorm module) and call the model twice on the exact same input, you will get different results.

The documentation for the observers can be found here:

https://pytorch.org/docs/stable/torch.quantization.html#torch.quantization.MinMaxObserver

There are a few different types, depending on your quantization methodology.