# Utility tools for Llama enablement

## Stories110M model

If you want to deploy and run a smaller model for educational purposes, you can try the stories110M model. It has the same architecture as Llama, just smaller, which also makes it useful for fast iteration and verification during development.

### Export:

From the `executorch` root:

1. Download `stories110M.pt` and `tokenizer.model` from GitHub.
   ```
   wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
   wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
   ```
2. Create the params file.
   ```
   echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
   ```
3. Export the model and generate a `.pte` file.
   ```
   python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
   ```

## Smaller model delegated to other backends

We currently support lowering the stories model to other backends, including CoreML, MPS, and QNN. Please refer to the instructions for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower the model. After the backend library is installed, export a lowered model with one of the following commands:

- CoreML: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json`
- MPS: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json`
- QNN: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json`

The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it there.

For CoreML, there are two additional optional arguments (combined in the sketch after this list):
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on the [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and the [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though).
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML.
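For example, the two arguments can be combined with the CoreML export command above. A minimal sketch that only combines flags documented on this page; verify them against your ExecuTorch version:

```
# Target iOS 18 optimizations and apply per-block 4-bit weight-only quantization
python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml \
  --coreml-ios 18 --coreml-quantize b4w \
  -c stories110M.pt -p params.json
```

As noted above, the resulting `.pte` file will then require at least iOS 18 to run.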
To deploy the large 8B model on the above backends, [please visit this section](non_cpu_backends.md).

## Download models from Hugging Face and convert from safetensors format to state dict

You can also download the above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like the one below can be used to convert the Hugging Face safetensors format to PyTorch's state dict. It leverages the utilities provided by [TorchTune](https://github.com/pytorch/torchtune).

```Python
from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights
import torch

# Convert from safetensors to TorchTune format. Suppose the model has been
# downloaded from Hugging Face.
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir='/home/.cache/huggingface/hub/models/snapshots/hash-number',
    checkpoint_files=['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors'],
    output_dir='/the/destination/dir',
    model_type='LLAMA3'  # or other types that TorchTune supports
)

print("loading checkpoint")
sd = checkpointer.load_checkpoint()

# Convert from TorchTune to Meta (PyTorch native) format
sd = convert_weights.tune_to_meta(sd['model'])

print("saving checkpoint")
torch.save(sd, "/the/destination/dir/checkpoint.pth")
```

## Finetuning

If you want to finetune your model on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune), a native PyTorch library for easily authoring, fine-tuning, and experimenting with LLMs.

Once you have [TorchTune installed](https://github.com/pytorch/torchtune?tab=readme-ov-file#get-started), you can finetune the Llama2 7B model with LoRA on a single GPU using the following command. This produces a checkpoint in which the LoRA weights are merged with the base model, so the output checkpoint is in the same format as the original Llama2 model.

```
tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```

To run full finetuning with Llama2 7B on a single device, you can use the following command.

```
tune run full_finetune_single_device \
--config llama2/7B_full_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```
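Because the merged checkpoint keeps the original Llama2 format, it can be fed back into the export flow shown at the top of this page. A hedged sketch follows: `<finetuned_checkpoint>.pt` is a placeholder, not a real filename, so substitute whatever file torchtune actually wrote to your checkpoint folder, and point `-p` at the `params.json` that came with the original model download.

```
# <finetuned_checkpoint>.pt is a placeholder -- substitute the file torchtune produced
python -m examples.models.llama.export_llama \
  -c <path_to_checkpoint_folder>/<finetuned_checkpoint>.pt \
  -p <path_to_checkpoint_folder>/params.json \
  -X -kv
```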