Name | Date | Size | #Lines | LOC
---|---|---|---|---
README.md | 25-Apr-2025 | 1.2 KiB | 37 | 24
TARGETS | 25-Apr-2025 | 232 B | 9 | 5
generate.py | 25-Apr-2025 | 5.6 KiB | 191 | 131
load_gguf_q4_0.py | 25-Apr-2025 | 6.5 KiB | 184 | 139
subclass.py | 25-Apr-2025 | 8.2 KiB | 243 | 179
targets.bzl | 25-Apr-2025 | 490 B | 24 | 21
test_subclass.py | 25-Apr-2025 | 868 B | 29 | 12
README.md
# Experimental Features

This subdirectory is under heavy development, so proceed with caution.

We demonstrate how to import a Llama model in GGUF format back into the PyTorch/ExecuTorch world, run it, and apply different optimizations using PyTorch/ExecuTorch APIs.

The first and most important step is loading a GGUF model into PyTorch.

## Load GGUF Q4_0 Quantized Model

Suppose we have gone through the process of building and running llama.cpp and have quantized a Llama model with the following commands:

```bash
# checkpoint download to models/llama7b
<omitted>
# build
mkdir build
cd build
cmake ..
cmake --build . --config Release

# prepare model
python3 convert.py models/llama7b

# quantize. Note that we use --pure to keep Q6_K from showing up.
build/bin/quantize --pure models/llama7b/ggml-model-f16.gguf models/llama7b/ggml-model-Q4_0.gguf Q4_0
```

We want to load the quantized model back into a `torch.nn.Module` and run it in eager mode. The way this works is through a Tensor subclass; a minimal sketch of the idea appears at the end of this README.

## Generate Tokens in PyTorch Eager

```bash
python3 generate.py --prompt "Once upon a time" --gguf_file models/llama7b/ggml-model-Q4_0.gguf --tokenizer_path models/llama7b/tokenizer.model
```
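Below is a minimal sketch of the Tensor-subclass idea, assuming a Q4_0 layout of 32-value blocks packed two 4-bit values per byte with one scale per block. The class name, attribute names, and dequantize-and-fallback dispatch strategy are illustrative; this is not the actual `subclass.py` implementation.

```python
# Hypothetical sketch of a Q4_0 weight-holding tensor subclass; names and
# layout details are illustrative, not the ones used in subclass.py.
import torch
from torch.utils._pytree import tree_map


class Q4_0Weight(torch.Tensor):
    @staticmethod
    def __new__(cls, packed, scales, shape):
        # The wrapper advertises the logical (dequantized) shape and dtype,
        # while the real storage lives in the attributes set in __init__.
        return torch.Tensor._make_wrapper_subclass(
            cls, shape, dtype=torch.float32, requires_grad=False
        )

    def __init__(self, packed, scales, shape):
        self.packed = packed   # (n_blocks, 16) uint8, two 4-bit values per byte
        self.scales = scales   # (n_blocks,) float, one scale per 32-value block

    def dequantize(self):
        # Q4_0: low nibbles hold elements 0..15 of a block, high nibbles 16..31,
        # stored as unsigned values offset by 8.
        low = (self.packed & 0x0F).to(torch.int8) - 8
        high = (self.packed >> 4).to(torch.int8) - 8
        blocks = torch.cat([low, high], dim=-1).float()  # (n_blocks, 32)
        return (blocks * self.scales.unsqueeze(-1)).reshape(self.shape)

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Simplest possible strategy: dequantize and fall back to the dense op.
        # A real implementation would intercept ops such as linear/matmul.
        unwrap = lambda t: t.dequantize() if isinstance(t, cls) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
```

In practice a subclass like this can be swapped in for the weight tensors of existing `nn.Linear` modules, so the rest of the model code runs unmodified in eager mode while the weights stay in their packed 4-bit form until an op needs them.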
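For orientation, an eager-mode greedy decoding loop of the kind a script like `generate.py` drives could look roughly like the sketch below. The `model(inp)` call returning per-position logits and the SentencePiece tokenizer usage are assumptions, not the actual `generate.py` interface.

```python
# Hypothetical greedy-decoding loop in eager PyTorch; the model interface is
# assumed to return (1, seq_len, vocab_size) logits for a (1, seq_len) input.
import torch
from sentencepiece import SentencePieceProcessor


def generate(model: torch.nn.Module, tokenizer: SentencePieceProcessor,
             prompt: str, max_new_tokens: int = 64) -> str:
    tokens = [tokenizer.bos_id()] + tokenizer.encode(prompt)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            inp = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)  # (1, seq)
            logits = model(inp)                        # (1, seq, vocab) assumed
            next_token = int(torch.argmax(logits[0, -1]))
            if next_token == tokenizer.eos_id():
                break
            tokens.append(next_token)
    return tokenizer.decode(tokens)
```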