I have locally compiled the Llama 3.2-3B model following the steps in this guide: executorch/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md at main · pytorch/executorch · GitHub
I converted the model to .pte with the command below (4-bit weight-only quantization). Note the quotes must be plain ASCII quotes for the shell to parse them:
python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"
The model gives a garbled, stuttering response:
Prompt: HI
Response: ”,!!!
This is not what I expected. Could you please help me figure out how to get a correct response?
cccclai
(Chen Lai)
February 12, 2025, 7:52pm
Yeah, it needs to be calibrated properly to produce reasonable results. SpinQuant is an option if you follow Building and Running Llama 3 8B Instruct with Qualcomm AI Engine Direct Backend — ExecuTorch 0.5 documentation.
However, the rotation matrices aren't available out of the box either; you'd need to run GitHub - facebookresearch/SpinQuant: Code repo for the paper "SpinQuant LLM quantization with learned rotations" to generate them.
We understand this is quite painful right now, and we'll try to make it easier.
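To give some intuition for why calibration matters so much here: the "4w" in qnn_16a4w means weights get only 16 quantization levels, so a calibration range inflated by outliers destroys nearly all of the usable resolution, which is one way you end up with degenerate output like the above. The toy sketch below is not the QNN quantizer (all numbers and the helper are illustrative); it just shows the effect of a well-chosen vs. an outlier-inflated range on 4-bit affine quantization:

```python
# Toy illustration: 4-bit affine quantize/dequantize round trip.
# A calibration range that matches the data preserves the values;
# a range blown up by one outlier collapses them all to zero.

def quantize_dequantize(values, qmin, qmax, lo, hi):
    """Affine-quantize to integers in [qmin, qmax] using calibrated
    range [lo, hi], then dequantize back to floats."""
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out

weights = [0.10, -0.23, 0.31, -0.05, 0.18]

# Range matches the actual weight distribution: small round-trip error.
good = quantize_dequantize(weights, -8, 7, -0.32, 0.32)

# Range inflated by a single outlier seen during calibration:
# every weight rounds to the same level and information is lost.
bad = quantize_dequantize(weights, -8, 7, -10.0, 10.0)

def mean_abs_err(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

print(mean_abs_err(weights, good))  # small
print(mean_abs_err(weights, bad))   # much larger
```

The same principle applies to the real PT2E flow: the observers need representative prompts during calibration so the recorded min/max ranges reflect the tensors the model actually sees at inference time.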
Is there a way we can enable Llama 3.2-3B? Also, you mentioned model calibration in the comment above; could you share the exact steps to export it to .pte?