
# Summary

## Overview
This file provides instructions for running LLAMA2 with different parameters via the Qualcomm HTP backend. The following settings support Stories 110M.

Please check the corresponding section for more information.
## Stories 110M
This example demonstrates how to run a smaller LLAMA2 model, stories110M, on mobile via the Qualcomm HTP backend. The model architecture is fine-tuned specifically for HTP to accelerate performance. The weights are quantized via PTQ so the model fits on a phone.
### Instructions
#### Step 1: Setup
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
2. Follow the [tutorial](https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html) to build the Qualcomm AI Engine Direct backend.
#### Step 2: Prepare Model
Download and prepare the stories110M model:
```bash
# tokenizer.model & stories110M.pt:
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"

# tokenizer.bin:
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin

# params.json:
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
```
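If shell quoting is a concern, the same `params.json` can equivalently be written from Python instead of `echo` (a sketch producing the identical file, not part of the official flow):

```python
import json

# stories110M hyperparameters, identical to the echo command above.
params = {
    "dim": 768,
    "multiple_of": 32,
    "n_heads": 12,
    "n_layers": 12,
    "norm_eps": 1e-05,
    "vocab_size": 32000,
}

# Write the config where llama.py expects to find it.
with open("params.json", "w") as f:
    json.dump(params, f)
```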
#### Step 3: Run default examples
The default example generates a story based on the given prompt, "Once".
```bash
# 16a4w quant:
python examples/qualcomm/oss_scripts/llama2/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --ptq 16a4w --checkpoint stories110M --params params.json --tokenizer_model tokenizer.model --tokenizer_bin tokenizer.bin --prompt "Once"
```
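Here `16a4w` denotes 16-bit activations and 4-bit weights. A back-of-the-envelope estimate (assuming roughly 110M parameters, as the model name suggests) shows why 4-bit weights help the model fit on a phone:

```python
# Rough weight-storage estimate for a ~110M-parameter model.
n_params = 110_000_000

fp32_mib = n_params * 4 / 2**20    # 32-bit floats: 4 bytes per weight
int4_mib = n_params * 0.5 / 2**20  # 4-bit weights: half a byte per weight

print(f"fp32 weights: ~{fp32_mib:.0f} MiB")   # ~420 MiB
print(f"4-bit weights: ~{int4_mib:.0f} MiB")  # ~52 MiB
```

Actual on-disk size also depends on quantization metadata (scales, zero points) and the embedding/output layers, so treat these as order-of-magnitude figures.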
#### (Note) Customized PTQ data set
User prompts are used as the PTQ calibration data. In the example above, the word "Once" is the only calibration input. To expose more data during calibration, add more prompts via the `--prompt` argument.