# Summary
For Llama enablement, please see the [Llama README page](../llama/README.md) for complete details.

This page contains Llama2-specific instructions and information.


## Enablement

We have verified that Llama 2 7B runs efficiently in [mobile applications](#step-6-build-mobile-apps) on select devices, including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.

Since Llama 2 7B needs at least 4-bit quantization to fit even on some high-end phones, the results presented here correspond to a 4-bit groupwise post-training quantized model.
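
For intuition, below is a minimal sketch of what "4-bit groupwise" weight quantization means. It is an illustration only, not the ExecuTorch implementation (the actual export path uses the `8da4w` quantization mode via `export_llama`): each group of 128 consecutive weights shares one scale, and values are rounded to the signed 4-bit range.

```python
# Illustrative sketch of symmetric groupwise 4-bit weight quantization.
# Not the ExecuTorch implementation; the real export uses -qmode 8da4w.
import torch

def quantize_groupwise_int4(w: torch.Tensor, group_size: int = 128):
    """Quantize a 2-D weight matrix with one scale per group of `group_size` weights."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # One symmetric scale per group, mapping the max magnitude onto the int4 range [-8, 7].
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize_groupwise_int4(q, scales, original_shape):
    return (q.float() * scales).reshape(original_shape)

w = torch.randn(4096, 4096)
q, s = quantize_groupwise_int4(w, group_size=128)
w_hat = dequantize_groupwise_int4(q, s, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
# At 4 bits per weight, 7B parameters take roughly 3.5 GB (plus scales),
# versus roughly 28 GB at fp32, which is why 4-bit quantization is needed on phones.
```

A larger group size (e.g. 256) means fewer scales stored and slightly lower accuracy, which matches the perplexity differences in the Results section below.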

## Results

### Llama2 7B
Llama 2 7B performance was measured on Samsung Galaxy S22, Galaxy S24, and OnePlus 12 devices. Performance is expressed in tokens per second, measured using an [adb binary-based approach](#step-5-run-benchmark-on).

|Device  | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
|--------| ---------------------- | --------------- |
|Galaxy S22  | 8.15 tokens/second | 8.3 tokens/second |
|Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
|OnePlus 12 | 11.55 tokens/second | 11.6 tokens/second |

Below are WikiText perplexity results (lower is better) for the two group sizes, measured with max_seq_length 2048 and limit 1000 using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness).

|Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
|--------|-----------------| ---------------------- | --------------- |
|Llama 2 7B | 9.2 | 10.2 | 10.7 |

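Perplexity is the exponential of the average per-token negative log-likelihood over the evaluation text. The numbers above come from the LM Eval harness; the sketch below only illustrates the quantity being reported, assuming a model that returns `(batch, seq, vocab)` logits.

```python
# Illustrative perplexity computation over a tokenized evaluation text.
# The results in the table were produced with the LM Eval harness, not this code.
import math
import torch
import torch.nn.functional as F

def perplexity(model, token_ids: torch.Tensor, max_seq_length: int = 2048) -> float:
    """token_ids: 1-D tensor of token ids for the full evaluation text."""
    nll_sum, token_count = 0.0, 0
    for start in range(0, token_ids.numel() - 1, max_seq_length):
        chunk = token_ids[start : start + max_seq_length + 1]
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:].unsqueeze(0)
        with torch.no_grad():
            logits = model(inputs)  # assumed to return (batch, seq, vocab) logits
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="sum"
        )
        nll_sum += nll.item()
        token_count += targets.numel()
    return math.exp(nll_sum / token_count)
```
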
## Prepare model

You can export and run the original Llama 2 7B model.

1. Llama 2 pretrained parameters can be downloaded from [Meta's official website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b).

2. Edit the `params.json` file and replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround (a scripted version of this edit is sketched after this list).

3. Export the model and generate a `.pte` file:
    ```
    python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
    ```
4. Create `tokenizer.bin`:
    ```
    python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
    ```

    Pass the converted `tokenizer.bin` file instead of `tokenizer.model` for subsequent steps.
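
For step 2 above, the `vocab_size` workaround can also be applied with a small script rather than a manual edit. A minimal sketch, assuming `params.json` is in the current directory:

```python
# Sketch of the step-2 workaround: set vocab_size to 32000 in params.json.
import json

path = "params.json"  # adjust to wherever the downloaded params.json lives

with open(path) as f:
    params = json.load(f)

if params.get("vocab_size", -1) == -1:
    params["vocab_size"] = 32000  # vocabulary size of the Llama 2 tokenizer

with open(path, "w") as f:
    json.dump(params, f, indent=2)
```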


# Run

Running the model is the same as described in [this step of the Llama README](../llama/README.md#step-4-run-on-your-computer-to-validate).