I am trying to incorporate fuse_fx into my quantization pipeline:
from torch.ao.quantization import QConfigMapping
from torch.ao.quantization.quantize_fx import fuse_fx, prepare_fx, convert_fx
from torch.fx import symbolic_trace

qconfig_mapping = QConfigMapping().set_global(qconfig)
model_fx = symbolic_trace(original_model)
fused_model_fx = fuse_fx(model_fx)
prepared_model = prepare_fx(fused_model_fx, qconfig_mapping, example_inputs)
calibrate_model(prepared_model, calibration_loader, calibration_batches, device)
quantized_model = convert_fx(prepared_model)
However, it seems that convert_fx expects the model's submodules to be updated as well, not just its graph representation; it fails in _lower_dynamic_weighted_ref_module with:
type(named_modules[str(n.target)]) not in \
KeyError: 'layer1.0.conv1.1'
(such a node exists when I call prepared_model.graph.print_tabular(), but no corresponding submodule exists when I print the model's named modules)
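To show the mismatch concretely, a quick check along these lines (assuming the prepared_model from my snippet above) turns up call_module targets that have no matching submodule:

# compare the graph's call_module targets with the actual submodule names
graph_targets = {str(n.target) for n in prepared_model.graph.nodes if n.op == "call_module"}
module_names = {name for name, _ in prepared_model.named_modules()}
print(graph_targets - module_names)  # in my case this is non-empty, e.g. 'layer1.0.conv1.1'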
I could manually synchronize the model's submodules with its graph after fuse_fx, but I don't think that is the point of using PyTorch graph mode quantization.
The tutorial ((prototype) FX Graph Mode Post Training Static Quantization — PyTorch Tutorials 2.3.0+cu121 documentation) is very ambiguous: it performs fuse_fx after convert_fx, which is not what is supposed to happen in a normal pipeline, and that ordering hides potential problems. I can do fuse_fx on the original model as well.
Furthermore, it is unclear to me whether prepare_fx actually does what the comment in the tutorial indicates:
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs) # fuse modules and insert observers
The comment indicates that prepare_fx already fuses modules, which would mean that adding fuse_fx is pointless. Is that correct?
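If that reading is right, the whole pipeline would collapse to something like this (a sketch under that assumption; calibrate_model and the calibration variables are my own helpers from above, and the qconfig choice is just an example):

from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = QConfigMapping().set_global(get_default_qconfig("x86"))
# prepare_fx traces the float model itself, so no explicit symbolic_trace or fuse_fx call
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)
calibrate_model(prepared_model, calibration_loader, calibration_batches, device)
quantized_model = convert_fx(prepared_model)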
I would greatly appreciate your help and guidance.
Details: I work with torchvision's resnet18 and expect the following output after the aforementioned pipeline (fuse_fx → prepare_fx → convert_fx):
Quantized model:
GraphModule(
  (conv1): QuantizedConvReLU2d(3, 64, kernel_size=(7, 7), stride=(2, 2), scale=0.011580255813896656, zero_point=0, padding=(3, 3))
  (bn1): Identity()
  (relu): Identity()
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.00825551524758339, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.021675927564501762, zero_point=151, padding=(1, 1))
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.007387497462332249, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.030672406777739525, zero_point=164, padding=(1, 1))
    )
  )
  (layer2): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(64, 128, kernel_size=(3, 3), stride=(2, 2), scale=0.007220075465738773, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.021219059824943542, zero_point=113, padding=(1, 1))
      (downsample): Module(
        (0): QuantizedConv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), scale=0.016369296237826347, zero_point=131)
      )
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(128, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.008431993424892426, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), scale=0.02427751012146473, zero_point=131, padding=(1, 1))
    )
  )
  (layer3): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(128, 256, kernel_size=(3, 3), stride=(2, 2), scale=0.00851518101990223, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.025926809757947922, zero_point=93, padding=(1, 1))
      (downsample): Module(
        (0): QuantizedConv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), scale=0.00756411487236619, zero_point=166)
      )
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.008150133304297924, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.02820182591676712, zero_point=164, padding=(1, 1))
    )
  )
  (layer4): Module(
    (0): Module(
      (conv1): QuantizedConvReLU2d(256, 512, kernel_size=(3, 3), stride=(2, 2), scale=0.006357187870889902, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.02578684873878956, zero_point=135, padding=(1, 1))
      (downsample): Module(
        (0): QuantizedConv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), scale=0.019855372607707977, zero_point=124)
      )
    )
    (1): Module(
      (conv1): QuantizedConvReLU2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.014804747886955738, zero_point=0, padding=(1, 1))
      (conv2): QuantizedConv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), scale=0.12023842334747314, zero_point=94, padding=(1, 1))
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): QuantizedLinear(in_features=512, out_features=1000, scale=0.14396773278713226, zero_point=68, qscheme=torch.per_channel_affine)
)
or something like that. This output was achieved with the torch.ao.quantization.fuse_modules API preceding prepare_fx and convert_fx, although the benchmark shows that approach doesn't work very well for
modules_to_fuse = [
    ['conv1', 'bn1', 'relu']
]
which is why I am trying to understand the fuse_fx API. Am I supposed to be doing what I'm doing, or should I switch to the PT2 Export workflow already? I was under the impression that Eager Mode was v1 and FX Graph Mode was v2; now it seems both are legacy and PT2 Export is the modern one?
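For reference, the fuse_modules-based variant I mentioned above looks roughly like this (a sketch; original_model, qconfig_mapping, example_inputs and calibrate_model are the same names as in my first snippet):

import copy
from torch.ao.quantization import fuse_modules
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# eager-mode fusion on the float model before the FX prepare/convert steps;
# every (conv, bn[, relu]) group has to be listed by name, block by block
modules_to_fuse = [['conv1', 'bn1', 'relu']]  # the stem fusion that benchmarked poorly for me
fused_model = fuse_modules(copy.deepcopy(original_model).eval(), modules_to_fuse)

prepared_model = prepare_fx(fused_model, qconfig_mapping, example_inputs)
calibrate_model(prepared_model, calibration_loader, calibration_batches, device)
quantized_model = convert_fx(prepared_model)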