Pointers for bringing quantized models to devices

Dear team & community,

Can anyone describe their workflow to bring quantized models onto devices (mobile, edge devices, etc.)?

  1. AFAIU, the quantization API does not support exporting to ONNX.

  2. What do people usually do, then?

  3. Currently, the road feels incomplete, no?
    • 3a. Where can we read the roadmap?
    • 3b. What’s missing in PyTorch to achieve something similar to TFLite?
    • 3c. Can you please mention the existing players (rising stars :star2: and partners) so that people can better decide how to organize their projects?

Thanks in advance & keep it up!

P.S. Take my words with a dose of honey, and read them as forward-looking criticism. I believe AI will be ubiquitous as long as it’s accessible in users’ pockets (e.g., mobile devices and owned hardware), not just in big cluster farms (e.g., via APIs). Thus, I truly appreciate that PyTorch opened up space for quantization.

How about exporting the model to ONNX first and then quantizing the ONNX model with ONNX Runtime?

Thanks for chiming in @111357

I’ve read about your suggested path somewhere on the web (forums, GitHub, etc.). Feel free to share more details about your experience walking along that trail (e.g., by linking a GitHub project or mentioning the models) :blush:

For the model I’m currently working on (YOLOv5-x), people suggested the following pipeline:

pytorch → onnx → SNPE/DLC (Qualcomm specific SDK & hardware)

Thus, I’m doing most of the development in PyTorch (ML pipeline and quantization work).
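For the "quantization in PyTorch" part, a minimal sketch of eager-mode post-training static quantization with torch.ao.quantization might look like this. SmallConvNet and the random calibration batches are hypothetical placeholders; a real pipeline would calibrate on representative images and would still need the separate ONNX/SNPE export step.

```python
import torch
import torch.ao.quantization as tq
import torch.nn as nn

# Prefer the mobile-oriented QNNPACK backend when this PyTorch build supports it.
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine


class SmallConvNet(nn.Module):
    """Hypothetical toy model; QuantStub/DeQuantStub mark where int8 starts and ends."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 2)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        x = torch.flatten(self.pool(x), 1)
        x = self.fc(x)
        return self.dequant(x)


model = SmallConvNet().eval()

# Fold Conv + BatchNorm + ReLU into a single module (fusion expects eval mode).
tq.fuse_modules(model, [["conv", "bn", "relu"]], inplace=True)

# Attach a backend-specific qconfig and insert observers.
model.qconfig = tq.get_default_qconfig(engine)
tq.prepare(model, inplace=True)

# Calibrate the observers; random tensors are only a placeholder for real data.
with torch.no_grad():
    for _ in range(8):
        model(torch.randn(4, 3, 32, 32))

# Replace float modules with their int8 counterparts.
tq.convert(model, inplace=True)

out = model(torch.randn(1, 3, 32, 32))
print(type(model.fc).__module__, tuple(out.shape))
```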

BTW, a colleague also suggested cutting out the middleman, i.e., jumping directly to the hardware SDK (DLC in my case). I didn’t, as it’s my first venture into that frontier.

Hi, I recommend AIMET, a tool supported by Qualcomm.
Here is an example of quantizing a torch model:

Hi there, we have a recipe for running quantized models on device: https://pytorch.org/tutorials/recipes/ptmobile_recipes_summary.html

We are working on building an executor stack for running exported PyTorch programs on device. The plan for the open-source release is still in progress, but expect to hear more later this year. cc @Martin_Yuan, @raziel
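As a rough sketch of the TorchScript-based mobile flow those recipes describe (quantize, script, optimize, save for the lite interpreter), assuming a stock PyTorch install; TinyNet and the .ptl file name are hypothetical. This path predates the ExecuTorch stack mentioned in this thread, so treat it as the "current" route at the time of writing rather than the long-term one.

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Prefer QNNPACK (the mobile backend) when available in this build.
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine


class TinyNet(nn.Module):
    """Hypothetical stand-in for a real model."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))


model = TinyNet().eval()

# Dynamic quantization: nn.Linear weights are stored as int8.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Script, apply mobile-specific graph optimizations, and save a lite-interpreter file.
scripted = torch.jit.script(qmodel)
mobile_model = optimize_for_mobile(scripted)
mobile_model._save_for_lite_interpreter("tiny_int8.ptl")

# Quick desktop-side check that the optimized module still runs.
out = mobile_model(torch.randn(1, 16))
print(tuple(out.shape))
```

The saved .ptl file is what the Android/iOS lite-interpreter runtime loads on device.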

That’s right.

Moving forward…

We’re building a new on-device stack within PyTorch, and quantization is a core concern, as is building the ecosystem to target the heterogeneous hardware landscape.

Right now I cannot share much more than what we presented last December at the PyTorch Conference: https://youtu.be/XJJzJbDEAic

Thanks! I’ll definitely take a look at AIMET in the future.

Have you used QuantizationSimModel.export, mentioned at the end of the QAT tutorial?


  • Did you manage to export a useful ONNX file?
  • What did your pipeline with SNPE (Snapdragon SDK) look like with the quantized ONNX file?
    Apparently, SNPE only generates DLC files from float32 models. I wouldn’t be surprised if it’s just a matter of casting the weights to float32.
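Whether SNPE needs exactly this is an assumption; the sketch below only illustrates, with PyTorch's own quantized-tensor API, what "casting int8 weights back to float32" means numerically. With symmetric per-tensor int8 quantization, q = round(w / scale) with zero point 0, and dequantizing gives w_dq = (q - zero_point) * scale: the result is float32 again but still lies on the int8 grid, so the quantization error is preserved. The weight matrix is a hypothetical random tensor.

```python
import torch

torch.manual_seed(0)
w = torch.randn(64, 32)  # hypothetical float32 weight matrix

# Symmetric per-tensor int8 quantization: scale maps max |w| to 127, zero point is 0.
scale = w.abs().max().item() / 127
q = torch.quantize_per_tensor(w, scale=scale, zero_point=0, dtype=torch.qint8)

# "Casting back" to float32 is dequantization: w_dq = (q - zero_point) * scale.
w_dq = q.dequantize()

max_err = (w - w_dq).abs().max().item()            # bounded by scale / 2 (round-to-nearest)
levels = torch.unique(q.int_repr()).numel()        # at most 255 distinct int8 values
print(w_dq.dtype, f"max error {max_err:.5f}, scale/2 {scale / 2:.5f}, {levels} levels")
```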


  • A few colleagues of mine have used AIMET. It might be a good fit for research (assuming you’re doing something similar to the Qualcomm AI papers).
  • Should you go for it, then? We’ve hit sharp edges when bringing models onto mobile or other devices :slightly_frowning_face:. See the comments and feedback on their GitHub.
  • Does anyone know the roadmap of the PyTorch quantization API?

Here are some tutorials for the new quantization API:

And PyTorch Mobile will also release ExecuTorch (a native PyTorch platform/runtime for edge models) at the PyTorch Conference (Schedule | Linux Foundation Events), which is 3 weeks away.
