Can anyone describe their workflow to bring quantized models onto devices (mobile, edge devices, etc.)?
1. AFAIU, the quantization API does not support exporting quantized models to ONNX (see the sketch after these questions).
2. What do people often do then?
Currently, the road feels incomplete, no?
3a. Where can we read the roadmap?
3b. What’s missing in PyTorch to achieve something similar to, e.g., TFLite?
3c. Can you please mention the existing players (rising stars and partners) so that people can better decide how to organize their projects?
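To make (1) concrete, here is a minimal sketch of the eager-mode post-training static quantization flow I mean (the `TinyNet` module is a toy example I made up for illustration); as far as I understand, the final `torch.onnx.export` call on the converted module is where things break down:

```python
import torch

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(3, 8, 3)
        self.relu = torch.nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)
prepared(torch.randn(1, 3, 32, 32))  # calibration pass with representative data
quantized = torch.quantization.convert(prepared)

# This is where the road ends for me: exporting the quantized
# module to ONNX is not (fully) supported, AFAIU.
torch.onnx.export(quantized, torch.randn(1, 3, 32, 32), "quantized.onnx")
```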
Thanks in advance & keep it up!
-Victor
P.S. Take my words with a dose of honey, and fast-forward the criticism. I believe AI will be ubiquitous only as long as it’s accessible in users’ pockets (e.g., mobile devices & owned hardware), not just run in big cluster farms (e.g., via APIs). Thus, I truly appreciate that PyTorch opened up space for quantization.
I read somewhere on the web (Forums, GitHub, etc.) about your suggested path. Feel free to provide more details about your experience walking that trail (e.g., linking a GitHub project or mentioning the models).
For the model I’m currently working on (YOLOv5-x), people suggested the following pipeline:
PyTorch → ONNX → SNPE/DLC (Qualcomm-specific SDK & hardware)
Thus, I’m doing most of the development in PyTorch (ML pipeline & quantization business).
BTW, a colleague also suggested cutting out the middleman, i.e., jumping directly to the hardware SDK (DLC in my case). I didn’t, as this is my first venture into that frontier. (The export step is sketched below.)
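For reference, the first hop of that pipeline looks roughly like this. This is only a sketch, assuming the `ultralytics/yolov5` hub entry point and YOLOv5’s default 640×640 input; adjust the checkpoint and input size for your own setup:

```python
import torch

# Load a float32 YOLOv5-x model via the ultralytics hub entry point
# (substitute your own trained checkpoint as needed).
model = torch.hub.load("ultralytics/yolov5", "yolov5x", pretrained=True).eval()

dummy = torch.randn(1, 3, 640, 640)  # YOLOv5's default input size
torch.onnx.export(
    model,
    dummy,
    "yolov5x.onnx",
    opset_version=11,  # SNPE's ONNX converter tends to track older opsets
    input_names=["images"],
    output_names=["output"],
)
```

From there, if memory serves, Qualcomm’s `snpe-onnx-to-dlc` tool converts the ONNX file into a DLC, and `snpe-dlc-quantize` handles post-training quantization on the SNPE side.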
We are working on building an executor stack for exported PyTorch programs to run on-device. The OSS release plan is still in progress, but expect to hear something later this year. cc @Martin_Yuan, @raziel
We’re building a new on-device stack within PyTorch, and quantization is a core concern, as is building the ecosystem to target the heterogeneous hardware landscape.
What did your SNPE (Snapdragon SDK) pipeline look like with the quantized ONNX file?
Apparently, SNPE only generates DLC files from float32 models. I wouldn’t be surprised if it’s just a matter of casting the weights back to float32.
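At the tensor level, at least, that cast is trivial; whether SNPE’s converter accepts the resulting model end to end is the open question. A minimal sketch:

```python
import torch

# Quantize a float32 tensor to int8, then cast it back.
x = torch.randn(4, 4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.qint8)

# dequantize() yields a float32 tensor holding the rounded quantized
# values, i.e., the "fake-quantized" weights a float32-only converter
# such as SNPE's could presumably ingest.
fx = qx.dequantize()
print(fx.dtype)  # torch.float32
```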