# ExecuTorch Runtime Overview

This document discusses the design of the ExecuTorch runtime, which executes
ExecuTorch program files on edge devices like smartphones, wearables, and
embedded devices. The code for the main execution API is under
[`executorch/runtime/executor/`](https://github.com/pytorch/executorch/tree/main/runtime/executor).

Before reading this document we recommend that you read [How ExecuTorch
Works](intro-how-it-works.md).

At the highest level, the ExecuTorch runtime is responsible for:

* Loading binary `.pte` program files that were generated by the
  [`to_executorch()`](./tutorials/export-to-executorch-tutorial) step of the
  model-lowering process.
* Executing the series of instructions that implement a lowered model.

Note that as of late 2023, the ExecuTorch runtime only supports model inference,
and does not yet support training.

This diagram shows the high-level flow of, and components involved with,
exporting and executing an ExecuTorch program:

[Figure: high-level flow of exporting and executing an ExecuTorch program]

The runtime is also responsible for:

* Managing the memory used during load and execution, potentially across
  multiple memory banks like SRAM and DRAM.
* Mapping symbolic operator names like `"aten::add.out"` to concrete C++
  functions or [_kernels_](kernel-library-overview.md) that implement the
  semantics of those operators.
* Dispatching predetermined sections of the model to [backend
  delegates](compiler-delegate-and-partitioner.md) for acceleration.
* Optionally gathering [profiling data](runtime-profiling.md) during load and
  execution.

## Design Goals

The ExecuTorch runtime was designed to run on a wide variety of edge devices,
from modern smartphone CPUs to resource-constrained microcontrollers and DSPs.
It has first-class support for
[delegating](compiler-delegate-and-partitioner.md) execution to one or more
backends to take advantage of architecture-specific optimizations and modern
heterogeneous architectures. It is small and portable enough to run directly in
bare-metal embedded environments with no operating systems, dynamic memory, or
threads.

### Low Execution Overhead

#### Memory

* The core runtime library is less than 50kB when built without kernels or
  backends.
* Constant tensors point directly into the `.pte` file data, avoiding copies of
  that data. The alignment of these data chunks can be adjusted at `.pte`
  creation time.
* Backend delegates can choose to unload their precompiled data after model
  initialization, reducing peak memory usage.
* Mutable tensor memory layout is planned ahead of time and packed into a small
  set of user-allocated buffers, providing fine-grained control over memory
  location. This is especially useful on systems with heterogeneous memory
  hierarchies, allowing placement onto (e.g.) SRAM or DRAM close to the core
  that will operate on the data.

#### CPU

* Model execution is a simple loop over an array of instructions, most of which
  are function pointers to kernels and backend delegates. This keeps the
  execution overhead small, on the order of microseconds to nanoseconds per
  operation.
* The implementation of an operation (like "add" or "conv3d") can be fully
  customized for a particular target system without needing to modify the
  original model or generated `.pte` file.

### Familiar PyTorch Semantics

ExecuTorch is a first-class component of the PyTorch stack, and reuses APIs and
semantics whenever possible.

* The C++ types used by ExecuTorch are source-compatible with the corresponding
  types from core PyTorch's `c10::` and `at::` libraries, and ExecuTorch
  provides
  [`aten_bridge`](https://github.com/pytorch/executorch/blob/main/extension/aten_util/aten_bridge.h)
  to convert between the two. This can be helpful for projects that already use
  PyTorch C++ types.
* The semantics of operators like `aten::add` and `aten::sigmoid` are identical
  between ExecuTorch and core PyTorch. ExecuTorch provides a testing framework
  to ensure this, and to help test future implementations of these operators.

### Portable Code and Architecture

The ExecuTorch runtime is implemented with portability in mind, so that users
can build it for a wide variety of target systems.

#### C++ Language Considerations

* The code is C++17-compatible to work with older toolchains.
* The runtime does not use exceptions or RTTI, although it is not antagonistic
  to them.
* The code is compatible with GCC and Clang, and has also been built with
  several proprietary embedded toolchains.
* The repo provides a CMake build system to make integration easier.

#### Operating System Considerations

The runtime makes no direct system calls. All access to memory, files, logging,
and clocks is abstracted through the [_Runtime Platform Abstraction Layer
(PAL)_](runtime-platform-abstraction-layer.md) and injected interfaces like
`DataLoader` and `MemoryAllocator`. See the [runtime API
reference](executorch-runtime-api-reference.rst) to learn more.

Applications can control all memory allocation through the `MemoryManager`,
`MemoryAllocator`, `HierarchicalAllocator`, and `DataLoader` classes. The core
runtime makes no direct calls to `malloc()` or `new`, or to types like
`std::vector` that allocate under the hood. This makes it possible to:

* Run in environments without a heap, but still use the heap if desired.
* Avoid synchronization on the heap during model load and execution.
* Control which memory region to use for different types of data. For example,
  one set of mutable tensors could live in SRAM while another set lives in DRAM.
* Easily monitor how much memory the runtime uses.

However, please note that specific kernel or backend implementations may use
arbitrary runtime or operating system features. Users should double-check the
docs for the kernel and backend libraries that they use.

#### Threading Considerations

The core runtime does no threading or locking, and does not use thread-local
variables. However, it works well with higher-level synchronization.

* Each `Program` instance is immutable and therefore _[fully
  thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#thread-safe)_.
  Multiple threads may concurrently access a single `Program` instance.
* Each `Method` instance is mutable but self-contained, and therefore
  _[conditionally
  thread-safe](https://faithlife.codes/blog/2008/03/degrees_of_thread_safety/#conditionally-thread-safe)_.
  Multiple threads can concurrently access and execute independent `Method`
  instances, but access and execution of a single instance must be serialized.

However, please note:

* There are two global tables that may be read during `Program::load_method()`:
  the kernel registration table and the backend registration table.
  * In practice, these tables are only modified at process/system load time,
    and are effectively frozen before the first `Program` is loaded. But some
    applications may need to be aware of these tables, especially if they
    manually mutate them after process/system load time.
* Specific kernel or backend implementations may have their own threading
  restrictions. Users should double-check the docs for the kernel and backend
  libraries that they use.
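
The threading rules above can be illustrated with a minimal, self-contained
sketch. It does not use the ExecuTorch API: `ToyProgram` and `ToyMethod` are
hypothetical stand-ins for `Program` and `Method`, and the multiplication stands
in for real model execution. The sketch assumes one immutable program shared by
reference across threads, with each thread owning its own mutable method
instance so that no locking is required.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for an immutable Program: its state is const, so
// concurrent reads from many threads are safe without locks.
class ToyProgram {
 public:
  explicit ToyProgram(int scale) : scale_(scale) {}
  int scale() const { return scale_; }

 private:
  const int scale_;
};

// Hypothetical stand-in for a Method: it holds mutable per-instance state, so
// a single instance must only be used by one thread at a time.
class ToyMethod {
 public:
  explicit ToyMethod(const ToyProgram& program) : program_(program) {}

  // Mutates only this instance's state and reads the shared program.
  int execute(int input) {
    ++calls_;
    last_output_ = input * program_.scale();
    return last_output_;
  }

  int calls() const { return calls_; }

 private:
  const ToyProgram& program_;
  int calls_ = 0;
  int last_output_ = 0;
};

// Runs one ToyMethod per thread against a single shared ToyProgram and
// returns each thread's final output.
std::vector<int> run_independent_methods(int num_threads, int iterations) {
  const ToyProgram program(/*scale=*/3);
  std::vector<int> outputs(static_cast<std::size_t>(num_threads), 0);
  std::vector<std::thread> workers;
  workers.reserve(outputs.size());

  for (int t = 0; t < num_threads; ++t) {
    // Each worker writes only to its own slot in `outputs`, so no lock is
    // needed there either.
    workers.emplace_back([&program, &outputs, t, iterations]() {
      ToyMethod method(program);  // Owned by this thread, never shared.
      int result = 0;
      for (int i = 0; i < iterations; ++i) {
        result = method.execute(t + i);
      }
      outputs[static_cast<std::size_t>(t)] = result;
    });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }
  return outputs;
}
```

Calling `run_independent_methods(4, 5)` produces one output per thread. If the
workers shared a single `ToyMethod` instead, the caller would need to wrap
`execute()` in an external mutex, mirroring the rule that access to a single
`Method` instance must be serialized.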

## Further Reading

For more details about the ExecuTorch runtime, please see:

* [Detailed Runtime APIs Tutorial](running-a-model-cpp-tutorial.md)
* [Simplified Runtime APIs Tutorial](extension-module.md)
* [Runtime Build and Cross Compilation](runtime-build-and-cross-compilation.md)
* [Runtime Platform Abstraction Layer](runtime-platform-abstraction-layer.md)
* [Runtime Profiling](runtime-profiling.md)
* [Backends and Delegates](compiler-delegate-and-partitioner.md)
* [Backend Delegate Implementation](runtime-backend-delegate-implementation-and-linking.md)
* [Kernel Library Overview](kernel-library-overview.md)