How to run inference with a SmoothQuant-quantized model in PyTorch?

The SmoothQuant reference implementation runs only on GPUs with CUTLASS support.

There are many SmoothQuant-quantized models on Hugging Face. I want to run inference with them on an ARM CPU-only server, even if performance suffers. I have struggled for a long time but haven’t found a viable way to achieve this.

  • Should I modify SmoothQuant or torch-int?
  • PyTorch supports quantization with QNNPACK, and it provides both module interfaces (e.g., a quantized Linear, with an unspecified input tensor dtype) and functional interfaces (e.g., linear, with quint8 input and qint8 weight). Neither matches the CUTLASS interface used by torch-int, as the sketch below shows.
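
For concreteness, here is roughly what the PyTorch functional interface looks like (a sketch; the scales/zero points are made-up values, and I’m assuming a build that ships the QNNPACK engine):

```python
import torch
import torch.ao.nn.quantized.functional as qF

# Sketch of PyTorch's static quantized functional linear: the activation
# is quint8 (affine) and the weight is qint8, unlike torch-int's CUTLASS
# kernels, which take int8 activations. Scales/zero points are made up.
torch.backends.quantized.engine = "qnnpack"  # assuming the build ships QNNPACK

x = torch.randn(4, 16)
w = torch.randn(8, 16)

qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)
qw = torch.quantize_per_tensor(w, scale=0.02, zero_point=0, dtype=torch.qint8)

out = qF.linear(qx, qw, bias=None, scale=0.1, zero_point=128)
print(out.dtype)  # torch.quint8
```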

I would greatly appreciate it if someone could share thoughts, workflows, or example code/pseudocode.

Unless I am mistaken, SmoothQuant is just input-weight equalization applied to a quantized linear, with dynamic quantization of some kind on the activation. Let me know if you want something different.
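
To spell that out, a minimal sketch of the smoothing step as I read the paper (assuming `act_absmax` holds per-channel activation statistics collected on calibration data, and `alpha` is the migration-strength hyperparameter):

```python
import torch

def smooth_linear(act_absmax: torch.Tensor, linear: torch.nn.Linear, alpha: float = 0.5):
    """Offline smoothing for one linear layer, per the SmoothQuant paper.

    act_absmax: per-input-channel max |activation|, shape (in_features,).
    Scales the weight up by s and (conceptually) the activation down by s,
    so X @ W.T is unchanged but activation outliers shrink.
    """
    w_absmax = linear.weight.abs().amax(dim=0)             # per input channel
    s = act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)  # s_j = max|X_j|^a / max|W_j|^(1-a)
    s = s.clamp(min=1e-5)
    linear.weight.data.mul_(s)                             # W' = W * diag(s)
    return s  # fold 1/s into whatever produces X (e.g., the preceding LayerNorm)
```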

We have input-weight equalization for FX quantization,

and we have dynamic quant with FX (Quantization — PyTorch 2.1 documentation), which looks like it works for QNNPACK. I think you’re stuck with per-tensor quantization on the activation, though.

Those two are intended to compose, so let us know if that doesn’t work.
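
Something along these lines should be the shape of it (an untested sketch; `_equalization_config` is a private argument and the `_equalize` import path may move between releases):

```python
import copy
import torch
from torch.ao.quantization import QConfigMapping, default_dynamic_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization.fx._equalize import default_equalization_qconfig

torch.backends.quantized.engine = "qnnpack"

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4)
).eval()
example_inputs = (torch.randn(1, 16),)

# Dynamic quantization everywhere, plus input-weight equalization.
qconfig_mapping = QConfigMapping().set_global(default_dynamic_qconfig)
equalization_mapping = QConfigMapping().set_global(default_equalization_qconfig)

prepared = prepare_fx(
    copy.deepcopy(model),
    qconfig_mapping,
    example_inputs,
    _equalization_config=equalization_mapping,
)
prepared(torch.randn(8, 16))  # calibration pass to collect equalization scales
quantized = convert_fx(prepared)
print(quantized(torch.randn(2, 16)))
```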

Also, I think you’re looking at the wrong quantized linear op; what you’d want is https://github.com/pytorch/pytorch/blob/main/torch/ao/nn/quantized/dynamic/modules/linear.py
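
The usual entry point that swaps that module in for you is quantize_dynamic; a sketch, assuming a QNNPACK-enabled build:

```python
import torch

torch.backends.quantized.engine = "qnnpack"

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4)
).eval()

# Replaces every nn.Linear with torch.ao.nn.quantized.dynamic.Linear:
# qint8 weights, activations quantized per-tensor on the fly, fp32 in/out.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(type(qmodel[0]))                   # the dynamic Linear from the link above
print(qmodel(torch.randn(2, 16)).dtype)  # torch.float32
```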


Thank you for your responses to both of my questions.

Actually, what I’m interested in is solely LLM inference with PyTorch. There are many SmoothQuant-quantized LLMs on Hugging Face (e.g., opt-125m-smoothquant). I want to test one or a few of them on an ARM CPU-only platform, but I don’t believe I’ve found the correct method yet.

Here’s what I’ve tried: I wanted to run the model with SmoothQuant, but it calls torch-int, which is built on CUTLASS. I then tried to replace all of the quantization interfaces in torch-int or SmoothQuant, but found that torch-int’s quantized linear takes qint8 activations, while I could only find quint8 activations in the PyTorch backend.

I haven’t found the right place to eliminate CUTLASS while still matching an interface PyTorch actually supports.

When you say you want to do SmoothQuant, do you mean something you can define mathematically, or a particular repo?

SmoothQuant is a technique from https://arxiv.org/pdf/2211.10438.pdf that can be reproduced as outlined above on CPU, or from a variety of repos on CUDA.

The repo GitHub - mit-han-lab/smoothquant: [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models is geared towards CUDA as far as I can tell. If you want to use it on CPU, you’re trying to fit a square peg into a round hole, and that’s reflected in the interface difference you’re seeing. I also think the function you’re looking at isn’t a drop-in replacement: it’s doing static quantization rather than the dynamic quantization that SmoothQuant needs (mathematically).

What I’m interested in is a particular repository, as mentioned above: the CUDA-based SmoothQuant repository. However, instead of quantizing a model, what I’m really concerned with is how to run inference with an already SmoothQuant-quantized model on an ARM CPU server. I found that I can only replace the CUTLASS interfaces that SmoothQuant calls through torch-int, such as linear_a8_w8_bfp32_ofp32. So my course of action should be either to find replacements for those interfaces in QNNPACK or to implement them myself, as sketched below. Is this correct?
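
If I implement them myself, I imagine an unoptimized CPU stand-in along these lines (the signature is guessed from torch-int’s naming convention; I haven’t verified it against the CUDA kernel):

```python
import torch

def linear_a8_w8_bfp32_ofp32(a: torch.Tensor, w: torch.Tensor,
                             b: torch.Tensor, alpha: float, beta: float) -> torch.Tensor:
    """Slow, dependency-free stand-in for torch-int's CUTLASS kernel.

    a: int8 activation (M, K); w: int8 weight (N, K); b: fp32 bias (N,).
    alpha rescales the int32 accumulator back to fp32; beta scales the bias.
    """
    acc = a.to(torch.int32) @ w.to(torch.int32).t()   # int32 accumulation on CPU
    return alpha * acc.to(torch.float32) + beta * b   # fp32 output
```

It would be slow, but it should at least let the model run while I look for a QNNPACK-backed equivalent.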

It sounds like your question is less about quantization and more about running a model from an external repo on CPU. I can speculate, but you’re probably better off asking them directly.

As mentioned above, QNNPACK does have a dynamically quantized linear op, but it uses affine per-tensor activation quantization and per-channel weight quantization.

I think this is what you are looking for: Intel’s Neural Compressor integrates SmoothQuant, and I’ve tested it on my CPU-only machine.
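
A minimal sketch of how the recipe is typically invoked (Neural Compressor 2.x API as I remember it; check their docs for current argument names, and the model/prompt here are just examples):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calib_func(m):
    # Run a bit of representative text through the model so the SmoothQuant
    # pass can collect per-channel activation statistics.
    inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                       return_tensors="pt")
    with torch.no_grad():
        m(**inputs)

conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True,
             "smooth_quant_args": {"alpha": 0.5}},  # migration strength
)
q_model = quantization.fit(model, conf, calib_func=calib_func)
q_model.save("./opt-125m-sq-int8")
```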