I am trying to incorporate fuse_fx into my quantization pipeline:
from torch.ao.quantization import QConfigMapping
from torch.ao.quantization.quantize_fx import fuse_fx, prepare_fx, convert_fx
from torch.fx import symbolic_trace

qconfig_mapping = QConfigMapping().set_global(qconfig)
model_fx = symbolic_trace(original_model)
fused_model_fx = fuse_fx(model_fx)
prepared_model = prepare_fx(fused_model_fx, qconfig_mapping, example_inputs)
calibrate_model(prepared_model, calibration_loader, calibration_batches, device)
quantized_model = convert_fx(prepared_model)
However, it seems that convert_fx expects the model's submodules to be updated as well, not just its graph representation; it fails in _lower_dynamic_weighted_ref_module with:
type(named_modules[str(n.target)]) not in \
KeyError: 'layer1.0.conv1.1'
(such a node exists when I call prepared_model.graph.print_tabular(), but no corresponding submodule exists when I print the model's named modules)
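To show the mismatch concretely, a quick check along these lines (assuming the prepared_model from my snippet above) turns up call_module targets that have no matching submodule:

# compare the graph's call_module targets with the actual submodule names
graph_targets = {str(n.target) for n in prepared_model.graph.nodes if n.op == "call_module"}
module_names = {name for name, _ in prepared_model.named_modules()}
print(graph_targets - module_names)  # in my case this is non-empty, e.g. 'layer1.0.conv1.1'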
I could manually synchronize the model's submodules with its graph after fuse_fx, but I don't think that is the point of using PyTorch graph mode quantization.
The tutorial ((prototype) FX Graph Mode Post Training Static Quantization — PyTorch Tutorials 2.3.0+cu121 documentation) is very ambiguous: it performs fuse_fx after convert_fx, which is not what is supposed to happen in a normal pipeline, and that ordering hides potential problems. I can do fuse_fx on the original model as well.
Furthermore, it is unclear to me whether prepare_fx actually does what the comment in the tutorial indicates:
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs) # fuse modules and insert observers
The comment indicates that prepare_fx already fuses modules, which would mean that adding fuse_fx is pointless. Is that correct?
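If that reading is right, the whole pipeline would collapse to something like this (a sketch under that assumption; calibrate_model and the calibration variables are my own helpers from above, and the qconfig choice is just an example):

from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = QConfigMapping().set_global(get_default_qconfig("x86"))
# prepare_fx traces the float model itself, so no explicit symbolic_trace or fuse_fx call
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)
calibrate_model(prepared_model, calibration_loader, calibration_batches, device)
quantized_model = convert_fx(prepared_model)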
I would greatly appreciate your help and guidance.
Details: I work with torchvision's resnet18 and expect the following output after the aforementioned pipeline (fuse_fx → prepare_fx → convert_fx):
Quantized model:
GraphModule(
  (conv1): QuantizedConvReLU2d(3, 64, kernel_size=(7, 7), stride=(2, 2), scale=0.011580255813896656, zero_point=0, padding=(3, 3))
  (bn1): Identity()
  (relu): Identity()
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.00825551524758339, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.021675927564501762, zero_point=151, padding=(1, 1))
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.007387497462332249, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.030672406777739525, zero_point=164, padding=(1, 1))
    )
  )
  (layer2): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(64, 128, kernel_size=(3, 3), stride=(2, 2), scale=0.007220075465738773, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.021219059824943542, zero_point=113, padding=(1, 1))
      (downsample): Module(
        (0): QuantizedConv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), scale=0.016369296237826347, zero_point=131)
      )
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(128, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.008431993424892426, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.02427751012146473, zero_point=131, padding=(1, 1))
    )
  )
  (layer3): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(128, 256, kernel_size=(3, 3), stride=(2, 2), scale=0.00851518101990223, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.025926809757947922, zero_point=93, padding=(1, 1))
      (downsample): Module(
        (0): QuantizedConv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), scale=0.00756411487236619, zero_point=166)
      )
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.008150133304297924, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.02820182591676712, zero_point=164, padding=(1, 1))
    )
  )
  (layer4): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(256, 512, kernel_size=(3, 3), stride=(2, 2), scale=0.006357187870889902, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.02578684873878956, zero_point=135, padding=(1, 1))
      (downsample): Module(
        (0): QuantizedConv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), scale=0.019855372607707977, zero_point=124)
      )
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.014804747886955738, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.12023842334747314, zero_point=94, padding=(1, 1))
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): QuantizedLinear(in_features=512, out_features=1000, scale=0.14396773278713226, zero_point=68, qscheme=torch.per_channel_affine)
)
or something like that. This output was achieved with the torch.ao.quantization.fuse_modules API preceding prepare_fx and convert_fx, although the benchmark shows that approach doesn't work very well for
modules_to_fuse = [
    ['conv1', 'bn1', 'relu']
]
which is why I am trying to understand the fuse_fx API. Am I supposed to be doing what I'm doing, or should I switch to the PT2 Export workflow already? I was under the impression that Eager Mode was v1 and FX Graph Mode was v2; now it seems both are legacy and PT2 Export is the modern one?
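For reference, the fuse_modules-based variant I mentioned above looks roughly like this (a sketch; original_model, qconfig_mapping, example_inputs and calibrate_model are the same names as in my first snippet):

import copy
from torch.ao.quantization import fuse_modules
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# eager-mode fusion on the float model before the FX prepare/convert steps;
# every (conv, bn[, relu]) group has to be listed by name, block by block
modules_to_fuse = [['conv1', 'bn1', 'relu']]  # the stem fusion that benchmarked poorly for me
fused_model = fuse_modules(copy.deepcopy(original_model).eval(), modules_to_fuse)

prepared_model = prepare_fx(fused_model, qconfig_mapping, example_inputs)
calibrate_model(prepared_model, calibration_loader, calibration_batches, device)
quantized_model = convert_fx(prepared_model)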