# Experimental Features

This subdirectory is under heavy development, so proceed with caution.

We demonstrate how to import a Llama model in GGUF format back into the PyTorch/ExecuTorch world, run it, and apply different optimizations using PyTorch/ExecuTorch APIs.

The first and most important step is loading a GGUF model into PyTorch.

## Load GGUF Q4_0 Quantized Model

Let's say we've gone through the process of building and running llama.cpp and were able to quantize a Llama model using the following commands:

```bash
# checkpoint download to models/llama7b
<omitted>
# build
mkdir build
cd build
cmake ..
cmake --build . --config Release

# prepare model
python3 convert.py models/llama7b

# quantize. Note: we use --pure to prevent Q6_K from showing up.
build/bin/quantize --pure models/llama7b/ggml-model-f16.gguf models/llama7b/ggml-model-Q4_0.gguf Q4_0
```

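To sanity-check what landed in the file, you can inspect it with llama.cpp's `gguf` Python package (`pip install gguf`). A minimal sketch, assuming the quantized file produced above; this is for inspection only and is not this repo's loader (see `load_gguf_q4_0.py` for that):

```python
from gguf import GGUFReader  # llama.cpp's gguf-py package

reader = GGUFReader("models/llama7b/ggml-model-Q4_0.gguf")

# Metadata key-value pairs (architecture, vocab, hyperparameters, ...).
for field in reader.fields.values():
    print(field.name)

# One entry per tensor: name, ggml quantization type, shape.
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type, tensor.shape)
```
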
We want to load it back into a `torch.nn.Module` and run it in eager mode. This works through a Tensor subclass.

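To illustrate the idea (this is not the actual `subclass.py` implementation), here is a minimal sketch of a wrapper Tensor subclass that keeps the Q4_0-packed bytes around and dequantizes on demand; all names here (`Q4_0Tensor`, `packed`, `scales`) are hypothetical:

```python
import torch
from torch.utils._pytree import tree_map

class Q4_0Tensor(torch.Tensor):
    """Hypothetical sketch: holds Q4_0-packed data, dequantizes on use."""

    @staticmethod
    def __new__(cls, packed, scales, shape):
        # Wrapper subclass: advertises a float32 tensor of `shape` but
        # stores no dense data of its own.
        return torch.Tensor._make_wrapper_subclass(
            cls, shape, dtype=torch.float32, device=packed.device
        )

    def __init__(self, packed, scales, shape):
        self.packed = packed  # (n_blocks, 16) uint8, two 4-bit values per byte
        self.scales = scales  # (n_blocks,) float32, one scale per 32-value block

    def dequantize(self):
        # Q4_0: value = scale * (quant - 8); low nibbles hold values 0..15
        # of a block, high nibbles hold values 16..31.
        lo = (self.packed & 0x0F).to(torch.int8) - 8
        hi = (self.packed >> 4).to(torch.int8) - 8
        q = torch.cat([lo, hi], dim=-1).to(torch.float32)
        return (q * self.scales.unsqueeze(-1)).reshape(self.shape)

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Eager fallback: materialize any Q4_0Tensor argument to dense
        # float32 and run the requested op on the result.
        def unwrap(t):
            return t.dequantize() if isinstance(t, Q4_0Tensor) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
```

With something like this in place, quantized weights can be wrapped once at load time, and every downstream op sees ordinary float32 values, at the cost of dequantizing on each use.
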
## Generate Tokens in PyTorch Eager

```bash
python3 generate.py --prompt "Once upon a time" --gguf_file models/llama7b/ggml-model-Q4_0.gguf --tokenizer_path models/llama7b/tokenizer.model
```
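
Under the hood, eager generation is a token-by-token decode loop. Here is a minimal sketch of greedy decoding, assuming a hypothetical `model` that returns logits and a SentencePiece-style `tokenizer`; `generate.py` may use different sampling and arguments:

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokenizer, prompt, max_new_tokens=64):
    # Hypothetical loop: encode the prompt, then repeatedly append the
    # most likely next token until the budget is exhausted.
    tokens = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(tokens)            # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()  # greedy pick
        tokens = torch.cat([tokens, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(tokens[0].tolist())
```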