ExecuTorch QNN backend

I have locally compiled the Llama 3.2-3B model by following the steps in executorch/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md at main · pytorch/executorch · GitHub.

I converted the model to .pte with the command below.

4-bit weight quantization (16a4w):

python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"

The model gives a garbled response:

Prompt: HI
Response: ",!!!

This is not expected. Please help me understand how to get a correct response.

@shoumikhin @cccclai

Yeah, it needs to be calibrated properly to get a reasonable result; the PT2E flow calibrates by running representative prompts through the prepared model, roughly as sketched below.
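Here is a minimal sketch of what that calibration step looks like, assuming a QnnQuantizer along the lines of the one in executorch.backends.qualcomm and a tokenizer with an encode() helper; the exact module paths, quantizer configuration, and export entry point are assumptions and vary across ExecuTorch releases (the export_llama script above wires this up internally):

```python
# Minimal PT2E calibration sketch (illustrative; module paths and quantizer
# configuration are assumptions and differ between ExecuTorch releases).
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

def calibrate(graph_module, tokenizer, prompts):
    # Insert observers so activation ranges can be measured.
    quantizer = QnnQuantizer()  # configure for 16a4w to match the export flags
    prepared = prepare_pt2e(graph_module, quantizer)

    # Feed representative text through the model; the observers record
    # realistic activation statistics that determine the quantization scales.
    for prompt in prompts:
        tokens = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)
        prepared(tokens)

    # Replace observers with actual quantize/dequantize ops.
    return convert_pt2e(prepared)
```

If calibration is skipped or run on unrepresentative data, the chosen scales clip real activations, which is one way you end up with gibberish like the response above.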

SpinQuant is an option, if you follow Building and Running Llama 3 8B Instruct with Qualcomm AI Engine Direct Backend — ExecuTorch 0.5 documentation. However, the rotation matrices aren't available out of the box either, and you'd need to run GitHub - facebookresearch/SpinQuant: Code repo for the paper "SpinQuant LLM quantization with learned rotations" to get them.
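For intuition on why rotations help: for a linear layer y = W x, any orthogonal R satisfies W x = (W R)(R^T x), so folding R into the weights leaves the float output unchanged while redistributing the outlier values that would otherwise dominate low-bit quantization scales. A toy demonstration with a random orthogonal matrix (SpinQuant learns the rotations rather than sampling them):

```python
# Toy illustration: folding an orthogonal rotation into a linear layer's
# weights does not change the float output, but it does change the weight
# and activation distributions that the quantizer sees.
import torch

d = 8
W = torch.randn(4, d)   # weights of a small linear layer
x = torch.randn(d)      # an input activation

# Random orthogonal matrix; SpinQuant learns R instead of sampling it.
R, _ = torch.linalg.qr(torch.randn(d, d))

y_ref = W @ x                    # original layer output
y_rot = (W @ R) @ (R.T @ x)      # rotated weights, counter-rotated input

print(torch.allclose(y_ref, y_rot, atol=1e-5))  # True: outputs match
```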

We understand it's quite a pain right now and will try to make it easier.

Is there a way we can enable 3.2-3B? Also, you mentioned calibrating the model in the comment above; could you share the exact steps to export it to .pte?