Hi all,
I am having issues trying to create a fully quantized model for my own backend (which will ultimately be a hardware AI accelerator). For simplicity, I want to use qint8 exclusively for now; the details will differ later, since they depend a lot on the memory bandwidth available to the different layers in hardware, etc.
So, what I want to do now is create a simple model and quantize it completely (that means weights, inputs, outputs, biases… everything). I have tried all kinds of different setups for the QConfigs, but I never managed to create a model that is completely quantized.
import torch
from torch.ao.quantization import (
    default_weight_observer,
    get_default_qconfig_mapping,
    get_default_qconfig,
    MinMaxObserver,
    QConfig,
    QConfigMapping,
)
from torch.ao.quantization.backend_config import (
    BackendConfig,
    BackendPatternConfig,
    DTypeConfig,
    DTypeWithConstraints,
    ObservationType,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization.fx.custom_config import PrepareCustomConfig
from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver
class Model(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear1 = torch.nn.Linear(input_size, 16)
        self.relu1 = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(16, 16)
        self.relu2 = torch.nn.ReLU()
        self.linear3 = torch.nn.Linear(16, output_size)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu1(x)
        x = self.linear2(x)
        x = self.relu2(x)
        x = self.linear3(x)
        return x
# Instantiate simple example model
input_size = 10
output_size = 3
model = Model(input_size, output_size)
model.eval()
# Define backend configuration, all QInt8 to keep things simple for now, no fusing
linear_int8_dtype = DTypeConfig(
    input_dtype=torch.qint8,
    output_dtype=torch.qint8,
    weight_dtype=torch.qint8,
    bias_dtype=torch.qint8)
linear_config = BackendPatternConfig(torch.nn.Linear) \
    .set_observation_type(ObservationType.OUTPUT_SHARE_OBSERVER_WITH_INPUT) \
    .add_dtype_config(linear_int8_dtype) \
    .set_root_module(torch.nn.Linear) \
    .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)
backend_config = BackendConfig("my_backend") \
    .set_backend_pattern_config(linear_config)
# Create global QConfig mapping, using qint8 for everything
qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.qint8),
    weight=MinMaxObserver.with_args(dtype=torch.qint8))
qconfig_mapping = QConfigMapping().set_global(qconfig)
# Setup s.t. input and outputs are expected to be already quantized
prepare_custom_config = PrepareCustomConfig()
prepare_custom_config.set_input_quantized_indexes(list(range(input_size)))
prepare_custom_config.set_output_quantized_indexes(list(range(output_size)))
# generate some example data
example_input_float = torch.rand(20, input_size, dtype=torch.float)
example_input_int = torch.quantize_per_tensor(
    example_input_float, 1.0 / 255, 0, torch.qint8)
# Preparing, calibrating, converting...
prepared = prepare_fx(model, qconfig_mapping, example_input_int,
                      prepare_custom_config, backend_config=backend_config)
prepared(example_input_int) # fails here
converted = convert_fx(prepared, backend_config=backend_config)
In the second-to-last line I get a runtime error: RuntimeError: Creation of quantized tensor requires quantized dtype like torch.quint8.
After adding some debugging output, it looks like the observers are working with 32-bit floats rather than qint8, for reasons that are not quite clear to me.
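This is roughly the kind of debugging I added, just a minimal sketch that dumps the observers that prepare_fx inserted and the dtype each one was configured with:

from torch.ao.quantization.observer import ObserverBase

# List every observer module in the prepared GraphModule and its configured dtype
for name, module in prepared.named_modules():
    if isinstance(module, ObserverBase):
        print(name, type(module).__name__, module.dtype)

# The prepared graph also shows where the observers were placed
print(prepared.graph)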
When I call the prepared model with float inputs instead, calibration works, but after conversion I do not get quantized Linear layers. So it looks to me like my configuration is wrong somewhere and the prepare step is not set up the way I intend.
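For reference, this is the variant that runs through, shown here as a simplified sketch of what I actually did: calibrating with the float tensor and then converting completes without errors, but the linear layers in the converted module are not quantized:

prepared(example_input_float)  # calibration with float inputs runs without errors
converted = convert_fx(prepared, backend_config=backend_config)
print(converted)  # the linear layers do not show up as quantized modules here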
Does somebody know what the issue is?