Name | Date | Size | #Lines | LOC
---|---|---|---|---
.. | - | - | - | -
model/ | 25-Apr-2025 | - | 332 | 279
runner/ | 25-Apr-2025 | - | 953 | 750
CMakeLists.txt | 25-Apr-2025 | 1.1 KiB | 39 | 35
README.md | 25-Apr-2025 | 1.9 KiB | 39 | 29
llama.py | 25-Apr-2025 | 19.2 KiB | 574 | 479
qnn_llama_runner.cpp | 25-Apr-2025 | 2.8 KiB | 95 | 57
README.md
# Summary

## Overview
This file provides instructions for running LLAMA2 with different parameters via the Qualcomm HTP backend. The following settings are supported for Stories 110M.

Please check the corresponding section for more information.

## Stories 110M
This example demonstrates how to run a smaller LLAMA2 model, stories110M, on mobile via the Qualcomm HTP backend. The model architecture is fine-tuned specifically for HTP to accelerate performance, and the weights are quantized via PTQ so the model fits on a phone.

### Instructions
#### Step 1: Setup
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
2. Follow the [tutorial](https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html) to build the Qualcomm AI Engine Direct Backend.

#### Step 2: Prepare Model
Download and prepare the stories110M model:

```bash
# tokenizer.model & stories110M.pt:
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"

# tokenizer.bin:
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin

# params.json:
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
```

#### Step 3: Run default examples
The default example generates a story from the given prompt, "Once".
```bash
# 16a4w quant:
python examples/qualcomm/oss_scripts/llama2/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --ptq 16a4w --checkpoint stories110M --params params.json --tokenizer_model tokenizer.model --tokenizer_bin tokenizer.bin --prompt "Once"
```

#### (Note) Customized PTQ data set
User prompts are used as the PTQ calibration data. In the example above, the word "Once" is the only calibration input. If you want the calibration to observe more data, add more prompts to the `--prompt` args.
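As a quick sanity check on the `params.json` written in Step 2, the values should describe a consistent attention configuration: `dim` must divide evenly by `n_heads`. The sketch below is illustrative only (the JSON literal is copied from the step above, not read from disk):

```python
import json

# Same contents as the params.json produced in Step 2.
params = json.loads(
    '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, '
    '"norm_eps": 1e-05, "vocab_size": 32000}'
)

# Multi-head attention requires dim to split evenly across heads.
assert params["dim"] % params["n_heads"] == 0
head_dim = params["dim"] // params["n_heads"]
print(head_dim)  # 64
```

If you edit `params.json` for a different checkpoint, re-run this check before exporting.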
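To build intuition for why `--ptq 16a4w` lets the model fit on a phone: assuming the label denotes 16-bit activations with 4-bit weights (a common reading of this naming scheme, not stated in this README), a rough back-of-envelope estimate of the weight payload for a ~110M-parameter model is:

```python
# Back-of-envelope estimate; stories110M has roughly 110M parameters
# (per its name), and 4-bit weights take half a byte each.
n_params = 110_000_000
bits_per_weight = 4
approx_mb = n_params * bits_per_weight / 8 / 1e6
print(round(approx_mb))  # ~55 MB of weights, before any metadata/overhead
```

The actual exported `.pte` size will differ due to quantization metadata and non-weight tensors, so treat this only as an order-of-magnitude guide.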