Understanding differences in the default qconfig for fbgemm and qnnpack

FBGEMM
QConfig(activation=functools.partial(<class 'torch.ao.quantization.fake_quantize.FusedMovingAvgObsFakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MovingAverageMinMaxObserver'>, quant_min=0, quant_max=255, reduce_range=True){'factory_kwargs': <function add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device at 0x7f89352b3b90>}, weight=functools.partial(<class 'torch.ao.quantization.fake_quantize.FusedMovingAvgObsFakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MovingAveragePerChannelMinMaxObserver'>, quant_min=-128, quant_max=127, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){'factory_kwargs': <function add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device at 0x7f89352b3b90>})

qnnpack
QConfig(activation=functools.partial(<class 'torch.ao.quantization.fake_quantize.FusedMovingAvgObsFakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MovingAverageMinMaxObserver'>, quant_min=0, quant_max=255, reduce_range=False){'factory_kwargs': <function add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device at 0x7f706669b9e0>}, weight=functools.partial(<class 'torch.ao.quantization.fake_quantize.FusedMovingAvgObsFakeQuantize'>, observer=<class 'torch.ao.quantization.observer.MovingAverageMinMaxObserver'>, quant_min=-128, quant_max=127, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){'factory_kwargs': <function add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device at 0x7f706669b9e0>})
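For reference, these two dumps can be reproduced with something along the following lines (a minimal sketch; I believe the factory_kwargs wrapper shown in the dumps is only added when the qconfig is attached to a module during prepare):

import torch
from torch.ao.quantization import get_default_qat_qconfig

# Compare the default QAT qconfigs for the two backends: fbgemm uses
# reduce_range=True for activations and per-channel symmetric weights,
# while qnnpack uses reduce_range=False and per-tensor symmetric weights.
for backend in ("fbgemm", "qnnpack"):
    qconfig = get_default_qat_qconfig(backend)
    print(backend)
    print("  activation:", qconfig.activation)
    print("  weight:    ", qconfig.weight)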

The default qconfig for fbgemm has reduce_range set to True. Why is it enabled only for the fbgemm backend?

Also, only the fbgemm backend's default qconfig uses per-channel quantization for weights. Why is this the case?

Thank you!

The reduce_range of the activation in fbgemm's config also confuses me. Is the clamp range reduced to [0, 127] instead of the original [0, 255] during both training and inference? I can only find arguments of quantize_per_tensor related to torch.dtype, but nothing about qmin and qmax. Are there any documents describing the details?
Can @jerryzh168 help us? Thanks.
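If it helps, here is a minimal sketch (with made-up values) of why qmin/qmax don't show up there: torch.quantize_per_tensor only takes a dtype, so the clamp range is implied by the dtype itself.

import torch

# torch.quantize_per_tensor has no quant_min/quant_max arguments; the clamp
# range comes from the dtype alone (quint8 -> [0, 255]), so reduce_range is
# not visible at this level.
x = torch.tensor([300.0, -10.0])
q = torch.quantize_per_tensor(x, scale=1.0, zero_point=0, dtype=torch.quint8)
print(q.int_repr())  # tensor([255, 0], dtype=torch.uint8) -- clamped to the full uint8 range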

Some docs are here: Quantization — PyTorch main documentation, and here: https://github.com/pytorch/pytorch/blob/main/test/quantization/core/test_quantized_op.py#L52

Basically it's because fbgemm/onednn uses an instruction for matrix multiplication that can only work with 8-bit activations and 7-bit weights, so we need reduce_range for the weight in order to not overflow.

I have also read the docs carefully, but I still cannot understand the pipeline clearly. To my understanding, the reduce_range flag means the activation's qmax should be 127 instead of the original 255, so the clamp range is also [0, 127]. But when I set the clamp value to 127, the result cannot be aligned when I simulate the quantization process. Let me give an example:

import torch 
import torch.nn.functional as F
from torch.ao.quantization.qconfig_mapping import QConfigMapping
from torch.ao.quantization.backend_config.fbgemm import get_fbgemm_backend_config
from torch.ao.quantization.qconfig import get_default_qat_qconfig
from torch.ao.quantization.quantize_fx import prepare_qat_fx
from torch.ao.quantization.quantize_fx import convert_fx

class Debug(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_debug=torch.nn.Conv2d(in_channels=3, out_channels=5, kernel_size=(3,3), padding=0, stride=1, groups=1,bias=True)
    def forward(self,x):
        x=self.conv_debug(x)
        return x

if __name__=='__main__':
    torch.manual_seed(4)
    print('The default quantized engine is {}'.format(torch.backends.quantized.engine))

    Q_MAX_LIST=[225,127]
    for Q_MAX in  Q_MAX_LIST:
        print('.....................When Q_MAX is {}, the result:.................................'.format(Q_MAX))
        net=Debug()
        backend_config=get_fbgemm_backend_config()
        qconfig=get_default_qat_qconfig('fbgemm') 
        qconfig_mapping = QConfigMapping().set_global(qconfig)

        net.train()
        net_prepare=prepare_qat_fx(net,qconfig_mapping,torch.randn(1,3,10,10),backend_config=backend_config)
        net_prepare(torch.randn(1,3,10,10))
        net_prepare(torch.randn(1,3,10,10))
        net_converted=convert_fx(net_prepare,qconfig_mapping=qconfig_mapping,backend_config=backend_config)

        
        input=torch.randn(1,3,10,10)
        result_quant_ref=net_converted(input)

        net_converted_state_dict=net_converted.state_dict()
        ##############input scale and zero_point####################################
        scale_quant_input=net_converted_state_dict['conv_debug_input_scale_0']
        zero_point_quant_input=net_converted_state_dict['conv_debug_input_zero_point_0']

        ##############Conv2d scale and zero_point, int weight and float bias####################################
        weight_conv_debug_float=net_converted_state_dict['conv_debug.weight']
        weight_conv_debug_int=weight_conv_debug_float.int_repr()
        scale_conv_debug=weight_conv_debug_float.q_per_channel_scales()
        zero_point_debug_int=weight_conv_debug_float.q_per_channel_zero_points()
        bias_conv_debug_float=net_converted_state_dict['conv_debug.bias']


        ##############output scale and zero_point####################################
        scale_quant_output=net_converted_state_dict['conv_debug.scale']
        zero_point_quant_output=net_converted_state_dict['conv_debug.zero_point']

        ####################simulate the process of quantization################################
        ####################z_scale(z_quant-z_zeropoint)=x_scale(x_quant-x_zeropoint)*y_scale(y_quant-y_zeropoint),y_zeropoint=0
        #get x_quant 
        input_quant=torch.clamp(torch.round(input/scale_quant_input+zero_point_quant_input),min=0,max=Q_MAX)

        input_quant_ref = torch.quantize_per_tensor(input, scale=scale_quant_input, zero_point=zero_point_quant_input,dtype=torch.quint8)
        input_quant_ref_int=input_quant_ref.int_repr().type(torch.float32)
        print('input quant result: {}'.format(torch.allclose(input_quant,input_quant_ref_int)))

        #get (x_quant-x_zeropoint)(y_quant-y_zeropoint),simulate the integer multiply of kernel and data
        conv3d_result_quant=F.conv2d(input_quant-zero_point_quant_input,weight_conv_debug_int.type(torch.float32),bias=None,stride=1,padding=0,dilation=1,groups=1)
        scale_input_weight=scale_quant_input*scale_conv_debug
        #get x_scale*y_scale
        scale_input_weight=scale_input_weight[None,:,None,None]  
        #get x_scale(x_quant-x_zeropoint)*y_scale(y_quant-y_zeropoint) 
        result=scale_input_weight*conv3d_result_quant 
        #plus float bias
        result=result+bias_conv_debug_float[None,:,None,None]    
        #quant output
        output_quant=torch.clamp(torch.round(result.detach()/scale_quant_output+zero_point_quant_output),min=0,max=Q_MAX)  
        #dequant
        output_dequant=scale_quant_output*(output_quant-zero_point_quant_output) 
        #check the result with original quantized model   
        close_flag=torch.allclose(result_quant_ref,output_dequant.to(torch.float32))  

        print('close_flag={}'.format(close_flag))                              
        diff=result_quant_ref-output_dequant
        if not close_flag:
            count_big_diff=(torch.abs(diff)>0.0000001).sum()
            diff_shape=diff.shape
            count_total=diff_shape[0]*diff_shape[1]*diff_shape[2]*diff_shape[3]
            print('non algin ratio:{:.2%}'.format(count_big_diff/count_total))

The result:

The default quantized engine is x86
.....................When Q_MAX is 225, the result:.................................
input quant result: True
close_flag=True
.....................When Q_MAX is 127, the result:.................................
input quant result: False
close_flag=False
non algin ratio:13.75%

I saw Q_MAX=225 instead of 255 in the first test; is that a typo?

Oh, my mistake, but the result is the same.

So the numerics of quantized_conv in qnnpack vs. "dq → F.conv2d → q" will not match exactly, so that difference is expected, I think. I'm not sure which part of the reduce_range stuff is unclear, though.

The quantization in my code is also:
q → F.conv2d → dq (the inputs/outputs are actually integers, just stored in fp32 form).
The whole process is almost the same as the reference_xxx_op functions defined in https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/representation/rewrite.py
(_reference_quantize_per_tensor_int8 → _reference_quantized_conv2d → _dequantize_per_tensor_int8).

My question is: if reduce_range is enabled (using fbgemm), the activation's max clamp value during inference should be 127, but the result is:

.....................When Q_MAX is 127, the result:.................................
input quant result: False
close_flag=False
non algin ratio:13.75%

It cannot be aligned with the quantized model produced by convert_fx.

But when I set Q_MAX to 255, the result is aligned:

.....................When Q_MAX is 255, the result:.................................
input quant result: True
close_flag=True

This is the point that confuses me.

Thanks.

convert_fx does not produce that pattern; it produces quantized ops instead. Could you try using convert_to_reference_fx to get the pattern and compare?
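A minimal sketch of that, reusing net_prepare from the code above (qconfig_mapping and backend_config are optional here):

from torch.ao.quantization.quantize_fx import convert_to_reference_fx

# Produces the reference pattern (quantize -> dequantize -> fp32 conv -> quantize -> dequantize)
# instead of fused quantized ops.
net_converted_ref = convert_to_reference_fx(net_prepare)
print(net_converted_ref)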

Also, I'm not sure what the motivation for doing this comparison is; could you clarify a bit more?

I just want to extract the parameters and align the operators so I can deploy the model on my own inference engine.
As you suggested, I used the model produced by convert_to_reference_fx and simulated the process; the results show that it is only aligned when the clamp value is 255. Does reduce_range only take effect in the training phase?
Code:

import torch 
import torch.nn.functional as F
from torch.ao.quantization.qconfig_mapping import QConfigMapping
from torch.ao.quantization.backend_config.fbgemm import get_fbgemm_backend_config
from torch.ao.quantization.qconfig import get_default_qat_qconfig
from torch.ao.quantization.quantize_fx import prepare_qat_fx
from torch.ao.quantization.quantize_fx import convert_fx,convert_to_reference_fx
import os
import copy

def quantize_per_tensor_uint8(x_fp32, scale, zero_point, quant_min, quant_max):
    x = x_fp32 / scale  # fp32
    x = torch.round(x)  # fp32
    x = x.to(dtype=torch.int32)  # int32
    x = x + zero_point  # int32
    x = torch.clamp(x, quant_min, quant_max)  # int32
    x = x.to(dtype=torch.uint8)
    return x


def dequantize_per_tensor_uint8(x_i8, scale, zero_point):
    return ((x_i8.to(torch.float32) - zero_point) * scale).to(dtype=torch.float32)

class Debug(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_debug=torch.nn.Conv2d(in_channels=3, out_channels=5, kernel_size=(3,3), padding=0, stride=1, groups=1,bias=True)
    def forward(self,x):
        x=self.conv_debug(x)
        return x

if __name__=='__main__':
    # torch.manual_seed(4)
    print('The default quantized engine is {}'.format(torch.backends.quantized.engine))
    Q_MAX_LIST=[127,255]
    for Q_MAX in  Q_MAX_LIST:
        for i in range(10):
            print('.....................When Q_MAX is {}, the result(loop {}):.................................'.format(Q_MAX,i))
            net=Debug()
            backend_config=get_fbgemm_backend_config()
            qconfig=get_default_qat_qconfig('fbgemm') 
            qconfig_mapping = QConfigMapping().set_global(qconfig)

            net.train()
            net_prepare=prepare_qat_fx(net,qconfig_mapping,torch.randn(1,3,10,10),backend_config=backend_config)
            net_prepare(torch.randn(1,3,10,10))
            net_prepare(torch.randn(1,3,10,10))
            net_converted=convert_fx(copy.deepcopy(net_prepare),qconfig_mapping=qconfig_mapping,backend_config=backend_config)
            net_converted_ref=convert_to_reference_fx(copy.deepcopy(net_prepare),qconfig_mapping=qconfig_mapping,backend_config=backend_config)
            # print('reference model:')
            # print(net_converted_ref)
            
            input=torch.randn(1,3,10,10)
            result_quant_ref=net_converted_ref(input)

            net_converted_state_dict=net_converted.state_dict()
            ##############input scale and zero_point####################################
            scale_quant_input=net_converted_state_dict['conv_debug_input_scale_0']
            zero_point_quant_input=net_converted_state_dict['conv_debug_input_zero_point_0']

            ##############Conv2d scale and zero_point, int weight and float bias####################################
            weight_conv_debug_float=net_converted_state_dict['conv_debug.weight']
            weight_conv_debug_float_dequantize=weight_conv_debug_float.dequantize()
            scale_conv_debug=weight_conv_debug_float.q_per_channel_scales()
            zero_point_debug_int=weight_conv_debug_float.q_per_channel_zero_points()
            bias_conv_debug_float=net_converted_state_dict['conv_debug.bias']


            ##############output scale and zero_point####################################
            scale_quant_output=net_converted_state_dict['conv_debug.scale']
            zero_point_quant_output=net_converted_state_dict['conv_debug.zero_point']

            ####################simulate the process of quantization################################
            ####################z_scale(z_quant-z_zeropoint)=x_scale(x_quant-x_zeropoint)*y_scale(y_quant-y_zeropoint),y_zeropoint=0
            #get x_quant 
            input_quant=quantize_per_tensor_uint8(
                x_fp32=input, 
                scale=scale_quant_input, 
                zero_point=zero_point_quant_input, 
                quant_min=0, 
                quant_max=Q_MAX)

            input_quant_dequant=dequantize_per_tensor_uint8(
                x_i8=input_quant, 
                scale=scale_quant_input, 
                zero_point=zero_point_quant_input)

            quantize_conv2d_reference = F.conv2d(
                input_quant_dequant, 
                weight_conv_debug_float_dequantize,
                bias_conv_debug_float, 
                1,0, 1, 1)

            output_quant=quantize_per_tensor_uint8(
                x_fp32=quantize_conv2d_reference, 
                scale=scale_quant_output, 
                zero_point=zero_point_quant_output, 
                quant_min=0, 
                quant_max=Q_MAX)

            output_dequant=dequantize_per_tensor_uint8(
                x_i8=output_quant, 
                scale=scale_quant_output, 
                zero_point=zero_point_quant_output)
            #check the result with original quantized model   
            close_flag=torch.allclose(result_quant_ref,output_dequant.to(torch.float32))  

            print('close_flag={}'.format(close_flag))                              
            diff=result_quant_ref-output_dequant
            if not close_flag:
                count_big_diff=(torch.abs(diff)>0.0000001).sum()
                diff_shape=diff.shape
                count_total=diff_shape[0]*diff_shape[1]*diff_shape[2]*diff_shape[3]
                print('non algin ratio:{:.2%}'.format(count_big_diff/count_total))

The result:

The default quantized engine is x86
.....................When Q_MAX is 127, the result(loop 0):.................................
close_flag=False
non algin ratio:0.31%
.....................When Q_MAX is 127, the result(loop 1):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 2):.................................
close_flag=False
non algin ratio:0.31%
.....................When Q_MAX is 127, the result(loop 3):.................................
close_flag=False
non algin ratio:0.31%
.....................When Q_MAX is 127, the result(loop 4):.................................
close_flag=False
non algin ratio:0.63%
.....................When Q_MAX is 127, the result(loop 5):.................................
close_flag=False
non algin ratio:0.31%
.....................When Q_MAX is 127, the result(loop 6):.................................
close_flag=False
non algin ratio:16.88%
.....................When Q_MAX is 127, the result(loop 7):.................................
close_flag=False
non algin ratio:3.75%
.....................When Q_MAX is 127, the result(loop 8):.................................
close_flag=False
non algin ratio:0.63%
.....................When Q_MAX is 127, the result(loop 9):.................................
close_flag=False
non algin ratio:25.94%
.....................When Q_MAX is 255, the result(loop 0):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 1):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 2):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 3):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 4):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 5):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 6):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 7):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 8):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 9):.................................
close_flag=True

Does reduce_range only take effect in the training phase?

I think so; the old quantize/dequantize ops do not support clamping, I think. Could you try _convert_to_reference_decomposed_fx instead?


Yes, you are right. I changed convert_to_reference_fx to _convert_to_reference_decomposed_fx, and the result is as expected (the reduce_range flag works).
Code:

import torch 
import torch.nn.functional as F
from torch.ao.quantization.qconfig_mapping import QConfigMapping
from torch.ao.quantization.backend_config.fbgemm import get_fbgemm_backend_config
from torch.ao.quantization.qconfig import get_default_qat_qconfig
from torch.ao.quantization.quantize_fx import prepare_qat_fx
from torch.ao.quantization.quantize_fx import convert_fx,convert_to_reference_fx,_convert_to_reference_decomposed_fx
import os
import copy

def quantize_per_tensor_uint8(x_fp32, scale, zero_point, quant_min, quant_max):
    x = x_fp32 / scale  # fp32
    x = torch.round(x)  # fp32
    x = x.to(dtype=torch.int32)  # int32
    x = x + zero_point  # int32
    x = torch.clamp(x, quant_min, quant_max)  # int32
    x = x.to(dtype=torch.uint8)
    return x


def dequantize_per_tensor_uint8(x_i8, scale, zero_point):
    return ((x_i8.to(torch.float32) - zero_point) * scale).to(dtype=torch.float32)

class Debug(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_debug=torch.nn.Conv2d(in_channels=3, out_channels=5, kernel_size=(3,3), padding=0, stride=1, groups=1,bias=True)
    def forward(self,x):
        x=self.conv_debug(x)
        return x

if __name__=='__main__':
    # torch.manual_seed(4)
    print('The default quantized engine is {}'.format(torch.backends.quantized.engine))
    Q_MAX_LIST=[127,255]
    for Q_MAX in  Q_MAX_LIST:
        for i in range(10):
            print('.....................When Q_MAX is {}, the result(loop {}):.................................'.format(Q_MAX,i))
            net=Debug()
            backend_config=get_fbgemm_backend_config()
            qconfig=get_default_qat_qconfig('fbgemm') 
            qconfig_mapping = QConfigMapping().set_global(qconfig)

            net.train()
            net_prepare=prepare_qat_fx(net,qconfig_mapping,torch.randn(1,3,10,10),backend_config=backend_config)
            net_prepare(torch.randn(1,3,10,10))
            net_prepare(torch.randn(1,3,10,10))
            net_converted=convert_fx(copy.deepcopy(net_prepare),qconfig_mapping=qconfig_mapping,backend_config=backend_config)
            net_converted_ref=_convert_to_reference_decomposed_fx(copy.deepcopy(net_prepare),qconfig_mapping=qconfig_mapping,backend_config=backend_config)

            input=torch.randn(1,3,10,10)
            result_quant_ref=net_converted_ref(input)

            net_converted_state_dict=net_converted.state_dict()
            ##############input scale and zero_point####################################
            scale_quant_input=net_converted_state_dict['conv_debug_input_scale_0']
            zero_point_quant_input=net_converted_state_dict['conv_debug_input_zero_point_0']

            ##############Conv2d scale and zero_point, int weight and float bias####################################
            weight_conv_debug_float=net_converted_state_dict['conv_debug.weight']
            weight_conv_debug_float_dequantize=weight_conv_debug_float.dequantize()
            scale_conv_debug=weight_conv_debug_float.q_per_channel_scales()
            zero_point_debug_int=weight_conv_debug_float.q_per_channel_zero_points()
            bias_conv_debug_float=net_converted_state_dict['conv_debug.bias']


            ##############output scale and zero_point####################################
            scale_quant_output=net_converted_state_dict['conv_debug.scale']
            zero_point_quant_output=net_converted_state_dict['conv_debug.zero_point']

            ####################simulate the process of quantization################################
            ####################z_scale(z_quant-z_zeropoint)=x_scale(x_quant-x_zeropoint)*y_scale(y_quant-y_zeropoint),y_zeropoint=0
            #get x_quant 
            input_quant=quantize_per_tensor_uint8(
                x_fp32=input, 
                scale=scale_quant_input, 
                zero_point=zero_point_quant_input, 
                quant_min=0, 
                quant_max=Q_MAX)

            input_quant_dequant=dequantize_per_tensor_uint8(
                x_i8=input_quant, 
                scale=scale_quant_input, 
                zero_point=zero_point_quant_input)

            quantize_conv2d_reference = F.conv2d(
                input_quant_dequant, 
                weight_conv_debug_float_dequantize,
                bias_conv_debug_float, 
                1,0, 1, 1)

            output_quant=quantize_per_tensor_uint8(
                x_fp32=quantize_conv2d_reference, 
                scale=scale_quant_output, 
                zero_point=zero_point_quant_output, 
                quant_min=0, 
                quant_max=Q_MAX)

            output_dequant=dequantize_per_tensor_uint8(
                x_i8=output_quant, 
                scale=scale_quant_output, 
                zero_point=zero_point_quant_output)
            #check the result with original quantized model   
            close_flag=torch.allclose(result_quant_ref,output_dequant.to(torch.float32))  

            print('close_flag={}'.format(close_flag))                              
            diff=result_quant_ref-output_dequant
            if not close_flag:
                count_big_diff=(torch.abs(diff)>0.0000001).sum()
                diff_shape=diff.shape
                count_total=diff_shape[0]*diff_shape[1]*diff_shape[2]*diff_shape[3]
                print('non algin ratio:{:.2%}'.format(count_big_diff/count_total))

The result:

The default quantized engine is x86
.....................When Q_MAX is 127, the result(loop 0):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 1):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 2):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 3):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 4):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 5):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 6):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 7):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 8):.................................
close_flag=True
.....................When Q_MAX is 127, the result(loop 9):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 0):.................................
close_flag=False
non algin ratio:0.63%
.....................When Q_MAX is 255, the result(loop 1):.................................
close_flag=False
non algin ratio:0.31%
.....................When Q_MAX is 255, the result(loop 2):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 3):.................................
close_flag=False
non algin ratio:27.19%
.....................When Q_MAX is 255, the result(loop 4):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 5):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 6):.................................
close_flag=False
non algin ratio:1.56%
.....................When Q_MAX is 255, the result(loop 7):.................................
close_flag=False
non algin ratio:2.50%
.....................When Q_MAX is 255, the result(loop 8):.................................
close_flag=True
.....................When Q_MAX is 255, the result(loop 9):.................................
close_flag=False
non algin ratio:0.31%

I also printed the converted models produced by the three converters and compared them.
_convert_to_reference_decomposed_fx:

GraphModule(
  (conv_debug): QuantizedConv2d(Reference)(3, 5, kernel_size=(3, 3), stride=(1, 1))
)



def forward(self, x):
    conv_debug_input_scale_0 = self.conv_debug_input_scale_0
    conv_debug_input_zero_point_0 = self.conv_debug_input_zero_point_0
    quantize_per_tensor = torch.ops.quantized_decomposed.quantize_per_tensor(x, conv_debug_input_scale_0, conv_debug_input_zero_point_0, 0, 127, torch.uint8);  x = None
    dequantize_per_tensor = torch.ops.quantized_decomposed.dequantize_per_tensor(quantize_per_tensor, conv_debug_input_scale_0, conv_debug_input_zero_point_0, 0, 127, torch.uint8);  quantize_per_tensor = conv_debug_input_scale_0 = conv_debug_input_zero_point_0 = None
    conv_debug = self.conv_debug(dequantize_per_tensor);  dequantize_per_tensor = None
    conv_debug_scale_0 = self.conv_debug_scale_0
    conv_debug_zero_point_0 = self.conv_debug_zero_point_0
    quantize_per_tensor_1 = torch.ops.quantized_decomposed.quantize_per_tensor(conv_debug, conv_debug_scale_0, conv_debug_zero_point_0, 0, 127, torch.uint8);  conv_debug = None
    dequantize_per_tensor_1 = torch.ops.quantized_decomposed.dequantize_per_tensor(quantize_per_tensor_1, conv_debug_scale_0, conv_debug_zero_point_0, 0, 127, torch.uint8);  quantize_per_tensor_1 = conv_debug_scale_0 = conv_debug_zero_point_0 = None
    return dequantize_per_tensor_1

convert_to_reference_fx:

GraphModule(
  (conv_debug): QuantizedConv2d(Reference)(3, 5, kernel_size=(3, 3), stride=(1, 1))
)



def forward(self, x):
    conv_debug_input_scale_0 = self.conv_debug_input_scale_0
    conv_debug_input_zero_point_0 = self.conv_debug_input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, conv_debug_input_scale_0, conv_debug_input_zero_point_0, torch.quint8);  x = conv_debug_input_scale_0 = conv_debug_input_zero_point_0 = None
    dequantize = quantize_per_tensor.dequantize();  quantize_per_tensor = None
    conv_debug = self.conv_debug(dequantize);  dequantize = None
    conv_debug_scale_0 = self.conv_debug_scale_0
    conv_debug_zero_point_0 = self.conv_debug_zero_point_0
    quantize_per_tensor_1 = torch.quantize_per_tensor(conv_debug, conv_debug_scale_0, conv_debug_zero_point_0, torch.quint8);  conv_debug = conv_debug_scale_0 = conv_debug_zero_point_0 = None
    dequantize_1 = quantize_per_tensor_1.dequantize();  quantize_per_tensor_1 = None
    return dequantize_1

convert_fx:

GraphModule(
  (conv_debug): QuantizedConv2d(3, 5, kernel_size=(3, 3), stride=(1, 1), scale=0.02617141790688038, zero_point=65)
)



def forward(self, x):
    conv_debug_input_scale_0 = self.conv_debug_input_scale_0
    conv_debug_input_zero_point_0 = self.conv_debug_input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, conv_debug_input_scale_0, conv_debug_input_zero_point_0, torch.quint8);  x = conv_debug_input_scale_0 = conv_debug_input_zero_point_0 = None
    conv_debug = self.conv_debug(quantize_per_tensor);  quantize_per_tensor = None
    dequantize_1 = conv_debug.dequantize();  conv_debug = None
    return dequantize_1

The quantize function in the models produced by convert_fx and convert_to_reference_fx is torch.quantize_per_tensor, whose signature is:

def quantize_per_tensor(input: Tensor, scale: Tensor, zero_point: Tensor, dtype: _dtype) -> Tensor: ...

But the quantize function in the model produced by _convert_to_reference_decomposed_fx is torch.ops.quantized_decomposed.quantize_per_tensor, whose signature is:

quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor

This function has extra quant_min and quant_max inputs compared to the one above, and this is the essential difference.
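To make that concrete, a minimal sketch with made-up values (assuming torch.ao.quantization.fx._decomposed is the module that registers the quantized_decomposed ops):

import torch
import torch.ao.quantization.fx._decomposed  # noqa: F401 -- registers torch.ops.quantized_decomposed (assumed module path)

scale, zp = 0.05, 0
x = torch.tensor([10.0])  # 10.0 / 0.05 = 200 in the integer domain

# Old op: the range is implied by the dtype (quint8 -> [0, 255]), so 200 survives.
q_old = torch.quantize_per_tensor(x, scale, zp, torch.quint8)
print(q_old.int_repr())  # tensor([200], dtype=torch.uint8)

# Decomposed op: quant_min/quant_max are explicit, so reduce_range (qmax=127) actually clamps.
q_new = torch.ops.quantized_decomposed.quantize_per_tensor(x, scale, zp, 0, 127, torch.uint8)
print(q_new)  # tensor([127], dtype=torch.uint8)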
Is it incorrect to use the fbgemm backend following the tutorial in Quantization — PyTorch 1.13 documentation if we use convert_fx to convert the QAT model?

Yeah, exactly. I think that's an issue in the old flow; we have seen this internally as well. Now that we have the new flow, we'd just encourage people to move to it, and the issue will be gone.

This does sound like a problem, but we haven't looked at it in detail before; I guess that's why it hasn't been solved.

Do you mean 7-bit activations and 8-bit weights?

Yeah, that's correct; it's because the fbgemm backend is using some special instructions.
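For completeness, a rough sketch of the overflow argument, assuming the relevant fbgemm instruction (vpmaddubsw-style) sums two uint8 x int8 products into a single int16 lane; this is my reading, not an official statement:

# Worst-case accumulation of one such step; int16 saturates at 32767.
INT16_MAX = 2**15 - 1  # 32767

full_range = 2 * 255 * 128  # 8-bit activations (0..255) x 8-bit weights (|w| <= 128)
reduced    = 2 * 127 * 128  # 7-bit activations (0..127, i.e. reduce_range=True) x 8-bit weights

print(full_range, full_range > INT16_MAX)  # 65280 True  -> can saturate/overflow
print(reduced, reduced > INT16_MAX)        # 32512 False -> always fits in int16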