Name | Date | Size | #Lines | LOC
---|---|---|---|---
README.md | 25-Apr-2025 | 1.2 KiB | 37 | 24
TARGETS | 25-Apr-2025 | 232 B | 9 | 5
generate.py | 25-Apr-2025 | 5.6 KiB | 191 | 131
load_gguf_q4_0.py | 25-Apr-2025 | 6.5 KiB | 184 | 139
subclass.py | 25-Apr-2025 | 8.2 KiB | 243 | 179
targets.bzl | 25-Apr-2025 | 490 B | 24 | 21
test_subclass.py | 25-Apr-2025 | 868 B | 29 | 12
README.md
# Experimental Features

This subdirectory is under heavy development, so proceed with caution.

We demonstrate how to import a Llama model in GGUF format back into the PyTorch/ExecuTorch world, run it, and apply different optimizations using PyTorch/ExecuTorch APIs.

The first and most important step is loading a GGUF model into PyTorch.

## Load GGUF Q4_0 Quantized Model

Suppose we have gone through the process of building and running llama.cpp and have quantized a Llama model with the following commands:

```bash
# checkpoint download to models/llama7b
<omitted>
# build
mkdir build
cd build
cmake ..
cmake --build . --config Release

# prepare model
python3 convert.py models/llama7b

# quantize. Note that we use --pure to keep Q6_K from showing up.
build/bin/quantize --pure models/llama7b/ggml-model-f16.gguf models/llama7b/ggml-model-Q4_0.gguf Q4_0
```

We want to load the quantized model back into a `torch.nn.Module` and run it in eager mode. The way this works is through a Tensor subclass; a minimal sketch of the idea appears at the end of this README.

## Generate Tokens in PyTorch Eager

```bash
python3 generate.py --prompt "Once upon a time" --gguf_file models/llama7b/ggml-model-Q4_0.gguf --tokenizer_path models/llama7b/tokenizer.model
```
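Below is a minimal sketch of the Tensor-subclass idea, assuming a Q4_0 layout of 32-value blocks packed two 4-bit values per byte with one scale per block. The class name, attribute names, and dequantize-and-fallback dispatch strategy are illustrative; this is not the actual `subclass.py` implementation.

```python
# Hypothetical sketch of a Q4_0 weight-holding tensor subclass; names and
# layout details are illustrative, not the ones used in subclass.py.
import torch
from torch.utils._pytree import tree_map


class Q4_0Weight(torch.Tensor):
    @staticmethod
    def __new__(cls, packed, scales, shape):
        # The wrapper advertises the logical (dequantized) shape and dtype,
        # while the real storage lives in the attributes set in __init__.
        return torch.Tensor._make_wrapper_subclass(
            cls, shape, dtype=torch.float32, requires_grad=False
        )

    def __init__(self, packed, scales, shape):
        self.packed = packed   # (n_blocks, 16) uint8, two 4-bit values per byte
        self.scales = scales   # (n_blocks,) float, one scale per 32-value block

    def dequantize(self):
        # Q4_0: low nibbles hold elements 0..15 of a block, high nibbles 16..31,
        # stored as unsigned values offset by 8.
        low = (self.packed & 0x0F).to(torch.int8) - 8
        high = (self.packed >> 4).to(torch.int8) - 8
        blocks = torch.cat([low, high], dim=-1).float()  # (n_blocks, 32)
        return (blocks * self.scales.unsqueeze(-1)).reshape(self.shape)

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Simplest possible strategy: dequantize and fall back to the dense op.
        # A real implementation would intercept ops such as linear/matmul.
        unwrap = lambda t: t.dequantize() if isinstance(t, cls) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
```

In practice a subclass like this can be swapped in for the weight tensors of existing `nn.Linear` modules, so the rest of the model code runs unmodified in eager mode while the weights stay in their packed 4-bit form until an op needs them.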
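For orientation, an eager-mode greedy decoding loop of the kind a script like `generate.py` drives could look roughly like the sketch below. The `model(inp)` call returning per-position logits and the SentencePiece tokenizer usage are assumptions, not the actual `generate.py` interface.

```python
# Hypothetical greedy-decoding loop in eager PyTorch; the model interface is
# assumed to return (1, seq_len, vocab_size) logits for a (1, seq_len) input.
import torch
from sentencepiece import SentencePieceProcessor


def generate(model: torch.nn.Module, tokenizer: SentencePieceProcessor,
             prompt: str, max_new_tokens: int = 64) -> str:
    tokens = [tokenizer.bos_id()] + tokenizer.encode(prompt)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            inp = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)  # (1, seq)
            logits = model(inp)                        # (1, seq, vocab) assumed
            next_token = int(torch.argmax(logits[0, -1]))
            if next_token == tokenizer.eos_id():
                break
            tokens.append(next_token)
    return tokenizer.decode(tokens)
```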