runtime-overview.md (revision 523fa7a60841cd1ecfb9cc4201f1ca8b03ed023a) - OpenGrok cross reference for /aosp_15_r20/external/executorch/docs/source/runtime-overview.md

# ExecuTorch Runtime Overview

This document discusses the design of the ExecuTorch runtime, which executes
ExecuTorch program files on edge devices like smartphones, wearables, and
embedded devices. The code for the main execution API is under
[`executorch/runtime/executor/`](https://github.com/pytorch/executorch/tree/main/runtime/executor).

Before reading this document we recommend that you read [How ExecuTorch
Works](intro-how-it-works.md).

At the highest level, the ExecuTorch runtime is responsible for:

* Loading binary `.pte` program files that were generated by the
  [`to_executorch()`](./tutorials/export-to-executorch-tutorial) step of the
  model-lowering process.
* Executing the series of instructions that implement a lowered model.

Note that as of late 2023, the ExecuTorch runtime only supports model inference,
and does not yet support training.

This diagram shows the high-level flow of, and components involved with,
exporting and executing an ExecuTorch program:

![High-level diagram of the ExecuTorch
Runtime](/_static/img/runtime-overview-high-level.png)

The runtime is also responsible for:

* Managing the memory used during load and execution, potentially across
  multiple memory banks like SRAM and DRAM.
* Mapping symbolic operator names like `"aten::add.out"` to concrete C++
  functions or [_kernels_](kernel-library-overview.md) that implement the
  semantics of those operators.
* Dispatching predetermined sections of the model to [backend
  delegates](compiler-delegate-and-partitioner.md) for acceleration.
* Optionally gathering [profiling data](runtime-profiling.md) during load and
  execution.

## Design Goals

The ExecuTorch runtime was designed to run on a wide variety of edge devices,
from modern smartphone CPUs to resource-constrained microcontrollers and DSPs.
It has first-class support for
[delegating](compiler-delegate-and-partitioner.md) execution to one or more
backends to take advantage of architecture-specific optimizations and modern
heterogeneous architectures. It is small and portable enough to run directly in
bare-metal embedded environments with no operating systems, dynamic memory, or
threads.

### Low Execution Overhead

#### Memory

* The core runtime library is less than 50kB when built without kernels or
  backends.
* Constant tensors point directly into the `.pte` file data, avoiding copies of
  that data. The alignment of these data chunks can be adjusted at `.pte`
  creation time.
* Backend delegates can choose to unload their precompiled data after model
  initialization, reducing peak memory usage.
* Mutable tensor memory layout is planned ahead of time and packed into a small
  set of user-allocated buffers, providing fine-grained control over memory
  location. This is especially useful on systems with heterogeneous memory
  hierarchies, allowing placement onto (e.g.) SRAM or DRAM close to the core
  that will operate on the data.

#### CPU

* Model execution is a simple loop over an array of instructions, most of which
  are function pointers to kernels and backend delegates. This keeps the
  execution overhead small, on the order of microseconds to nanoseconds per
  operation.
* The implementation of an operation (like "add" or "conv3d") can be fully
  customized for a particular target system without needing to modify the
  original model or generated `.pte` file.

### Familiar PyTorch Semantics

ExecuTorch is a first-class component of the PyTorch stack, and reuses APIs and
semantics whenever possible.

* The C++ types used by ExecuTorch are source-compatible with the corresponding
  types from core PyTorch's `c10::` and `at::` libraries, and ExecuTorch
  provides
  [`aten_bridge`](https://github.com/pytorch/executorch/blob/main/extension/aten_util/aten_bridge.h)
  to convert between the two. This can be helpful for projects that already use
  PyTorch C++ types.
* The semantics of operators like `aten::add` and `aten::sigmoid` are identical
  between ExecuTorch and core PyTorch. ExecuTorch provides a testing framework
  to ensure this, and to help test future implementations of these operators.

### Portable Code and Architecture

The ExecuTorch runtime is implemented with portability in mind, so that users
can build it for a wide variety of target systems.

#### C++ Language Considerations

* The code is C++17-compatible to work with older toolchains.
* The runtime does not use exceptions or RTTI, although it is not antagonistic
  to them.
* The code is compatible with GCC and Clang, and has also been built with
  several proprietary embedded toolchains.
* The repo provides CMake build system to make integration easier.

#### Operating System Considerations

The runtime makes no direct system calls. All access to memory, files, logging,
and clocks are abstracted through the [_Runtime Platform Abstraction Layer
(PAL)_](runtime-platform-abstraction-layer.md) and injected interfaces like
`DataLoader` and `MemoryAllocator`. See the [runtime api reference](executorch-runtime-api-reference.rst) to learn more.

Applications can control all memory allocation through the `MemoryManager`,
`MemoryAllocator`, `HierarchicalAllocator`, and `DataLoader` classes. The core
runtime makes no direct calls to `malloc()` or `new`, or to types like
`std::vector` that allocate under the hood. This makes it possible to:

* Run in environments without a heap, but still use the heap if desired.
* Avoid synchronization on the heap during model load and execution.
* Control which memory region to use for different types of data. For example,
  one set of mutable tensors could live in SRAM while another set lived in DRAM.
* Easily monitor how much memory the runtime uses.

However, please note that specific kernel or backend implementations may use
arbitrary runtime or operating system features. Users should double-check the
docs for the kernel and backend libraries that they use.

#### Threading Considerations

The core runtime does no threading or locking, and does not use thread local
variables. But, it plays well with higher-level synchronization.

* Each `Program` instance is immutable and therefore _[fully
  thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#thread-safe)_.
  Multiple threads may concurrently access a single `Program` instance.
* Each `Method` instance is mutable but self-contained, and therefore
  _[conditionally
  thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#conditionally-thread-safe)_.
  Multiple threads can concurrently access and execute independent `Method`
  instances, but access and execution of a single instance must be serialized.

However, please note:

* There are two global tables that may be read during `Program::load_method()`:
  the kernel registration table and the backend registration table.
    * In practice, these tables are only modified at process/system load time,
      and are effectively frozen before the first `Program` is loaded. But some
      applications may need to be aware of these tables, especially if they
      manually mutate them after process/system load time.
* Specific kernel or backend implementations may have their own threading
  restrictions. Users should double-check the docs for the kernel and backend
  libraries that they use.

## Further Reading

For more details about the ExecuTorch runtime, please see:

* [Detailed Runtime APIs Tutorial](running-a-model-cpp-tutorial.md)
* [Simplified Runtime APIs Tutorial](extension-module.md)
* [Runtime Build and Cross Compilation](runtime-build-and-cross-compilation.md)
* [Runtime Platform Abstraction Layer](runtime-platform-abstraction-layer.md)
* [Runtime Profiling](runtime-profiling.md)
* [Backends and Delegates](compiler-delegate-and-partitioner.md)
* [Backend Delegate Implementation](runtime-backend-delegate-implementation-and-linking.md)
* [Kernel Library Overview](kernel-library-overview.md)