# Building and Running ExecuTorch with the Vulkan Backend

The [ExecuTorch Vulkan Delegate](./native-delegates-executorch-vulkan-delegate.md)
is a native GPU delegate for ExecuTorch.

<!----This will show a grid card on the page----->
::::{grid} 2
:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-content
* How to export the Llama3.2-1B parameter model with partial GPU delegation
* How to execute the partially delegated model on Android
:::
:::{grid-item-card} Prerequisites:
:class-card: card-prerequisites
* Follow [**Setting up ExecuTorch**](./getting-started-setup.md)
* It is also recommended that you read through [**ExecuTorch Vulkan Delegate**](./native-delegates-executorch-vulkan-delegate.md) and follow the example on that page
:::
::::

## Prerequisites

Note that all the steps below should be performed from the ExecuTorch repository
root directory, and assume that you have gone through the steps of setting up
ExecuTorch.

It is also assumed that the Android NDK and Android SDK are installed, and that
the following environment variables are set.

```shell
export ANDROID_NDK=<path_to_ndk>
# Select an appropriate Android ABI for your device
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from the ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

## Lowering the Llama3.2-1B model to Vulkan

::::{note}
The resultant model will only be partially delegated to the Vulkan backend. In
particular, only binary arithmetic operators (`aten.add`, `aten.sub`,
`aten.mul`, `aten.div`), matrix multiplication operators (`aten.mm`, `aten.bmm`),
and linear layers (`aten.linear`) will be executed on the GPU via the Vulkan
delegate. The rest of the model will be executed using Portable operators.

Operator support for LLaMA models is currently in active development; please
check out the `main` branch of the ExecuTorch repo for the latest capabilities.
::::
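To get a feel for what this partial lowering looks like, the same `VulkanPartitioner` that the export script uses under the hood can be applied to a toy model directly. The following is a minimal sketch based on the flow described on the [ExecuTorch Vulkan Delegate](./native-delegates-executorch-vulkan-delegate.md) page; the module and output file names are illustrative, and the exact API may differ between versions of the repo.

```python
import torch
from torch.export import export

from executorch.backends.vulkan.partitioner.vulkan_partitioner import (
    VulkanPartitioner,
)
from executorch.exir import to_edge


class MulAddModule(torch.nn.Module):
    """Hypothetical toy model built only from ops the Vulkan delegate supports."""

    def forward(self, x, y):
        return x * y + x


sample_inputs = (torch.rand(4, 4), torch.rand(4, 4))

# Export to ATen dialect, convert to Edge dialect, then hand the
# supported subgraphs to the Vulkan delegate via the partitioner.
edge_program = to_edge(export(MulAddModule(), sample_inputs))
edge_program = edge_program.to_backend(VulkanPartitioner())

# Serialize to a .pte file that can be pushed to a device.
exec_program = edge_program.to_executorch()
with open("vulkan_mul_add.pte", "wb") as f:
    exec_program.write_to_file(f)
```

The `export_llama` script used below performs this same partitioning step on the Llama model as part of a larger export flow.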
First, obtain the `consolidated.00.pth`, `params.json` and `tokenizer.model`
files for the `Llama3.2-1B` model from the [Llama website](https://www.llama.com/llama-downloads/).

Once the files have been downloaded, the `export_llama` script can be used to
partially lower the Llama model to Vulkan.

```shell
# The files will usually be downloaded to ~/.llama
python -m examples.models.llama.export_llama \
  --disable_dynamic_shape --vulkan -kv --use_sdpa_with_kv_cache -d fp32 \
  -c ~/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \
  -p ~/.llama/checkpoints/Llama3.2-1B/params.json \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
```

A `vulkan_llama2.pte` file should have been created as a result of running the
script.

Push the tokenizer binary and `vulkan_llama2.pte` onto your Android device:

```shell
adb push ~/.llama/tokenizer.model /data/local/tmp/
adb push vulkan_llama2.pte /data/local/tmp/
```

## Build and Run the LLaMA runner binary on Android

First, build and install ExecuTorch libraries, then build the LLaMA runner
binary using the Android NDK toolchain.

```shell
(rm -rf cmake-android-out && \
  cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=$ANDROID_ABI \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_VULKAN=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-android-out && \
  cmake --build cmake-android-out -j16 --target install)

# Build LLaMA runner binary
(rm -rf cmake-android-out/examples/models/llama && \
  cmake examples/models/llama \
    -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=$ANDROID_ABI \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DPYTHON_EXECUTABLE=python \
    -Bcmake-android-out/examples/models/llama && \
  cmake --build cmake-android-out/examples/models/llama -j16)
```

Finally, push and run the LLaMA runner binary on your Android device. Note that
your device must have sufficient GPU memory to execute the model.

```shell
adb push cmake-android-out/examples/models/llama/llama_main /data/local/tmp/llama_main

adb shell /data/local/tmp/llama_main \
    --model_path=/data/local/tmp/vulkan_llama2.pte \
    --tokenizer_path=/data/local/tmp/tokenizer.model \
    --prompt "Hello"
```

Note that model inference will currently be very slow due to the large number of
delegate blobs in the lowered graph; each delegated subgraph requires a transfer
to and from the GPU. Performance is expected to improve drastically as more of
the model can be lowered to the Vulkan delegate and as techniques such as
quantization become supported.
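Because each delegated subgraph adds a GPU round trip, it can be useful to check how many subgraphs a lowered program contains and which operators were delegated. If you lower a model in Python as in the earlier sketch, one way to do this is shown below; it assumes the `get_delegation_info` helper from `executorch.devtools.backend_debug` on a recent `main` branch, with `edge_program` being the partitioned program from the earlier sketch.

```python
from executorch.devtools.backend_debug import get_delegation_info

# `edge_program` is the EdgeProgramManager from the lowering sketch
# above, after to_backend() has partitioned it for Vulkan.
graph_module = edge_program.exported_program().graph_module
delegation_info = get_delegation_info(graph_module)

# Prints counts of delegated vs. non-delegated operators and the number
# of delegated subgraphs (each subgraph is one GPU round trip).
print(delegation_info.get_summary())
```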