# Kernel Registration
## Overview

At the last stage of [ExecuTorch model exporting](./export-overview.md), we lower the operators in the dialect to the _out variants_ of the [core ATen operators](./ir-ops-set-definition.md). Then we serialize these operator names into the model artifact. During runtime execution, for each operator name we will need to find the actual _kernels_, i.e., the C++ functions that do the heavy-lifting calculations and return results.

## Kernel Libraries
### First-party kernel libraries:

**[Portable kernel library](https://github.com/pytorch/executorch/tree/main/kernels/portable)** is the in-house default kernel library that covers most of the core ATen operators. It’s easy to use/read and is written in portable C++17. However, it’s not optimized for performance, because it isn’t specialized for any particular target. Therefore we provide kernel registration APIs for ExecuTorch users to easily register their own optimized kernels.

**[Optimized kernel library](https://github.com/pytorch/executorch/tree/main/kernels/optimized)** specializes in performance for some of the operators, leveraging existing third-party libraries such as [EigenBLAS](https://gitlab.com/libeigen/eigen). It works best alongside the portable kernel library, striking a good balance between portability and performance. One example of combining these two libraries can be found [here](https://github.com/pytorch/executorch/blob/main/configurations/CMakeLists.txt).

**[Quantized kernel library](https://github.com/pytorch/executorch/tree/main/kernels/quantized)** implements operators for quantization and dequantization. These are outside the core ATen operator set but are vital to most production use cases.

### Custom kernel libraries:

**Custom kernels implementing core ATen ops**. Even though we don't have an internal example of custom kernels for core ATen ops, the optimized kernel library can be viewed as a good example. We have an optimized [`add.out`](https://github.com/pytorch/executorch/blob/main/kernels/optimized/cpu/op_add.cpp) and a portable [`add.out`](https://github.com/pytorch/executorch/blob/main/kernels/portable/cpu/op_add.cpp). When a user combines these two libraries, we provide APIs to choose which kernel to use for `add.out`. To author and use custom kernels implementing core ATen ops, the [YAML based approach](#yaml-entry-api-for-core-aten-op-out-variant) is recommended, because it provides full-fledged support for:
  1. combining kernel libraries and defining fallback kernels;
  2. using selective build to minimize the kernel size.

A **[Custom operator](https://github.com/pytorch/executorch/tree/main/extension/llm/custom_ops)** is any operator that an ExecuTorch user defines outside of PyTorch's [`native_functions.yaml`](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml).

## Operator & Kernel Contract

All the kernels mentioned above, whether in-house or customized, should comply with the following requirements:

* Match the calling convention derived from the operator schema (see the sketch after this list). The kernel registration API will generate headers for the custom kernels as references.
* Satisfy the dtype constraints defined in the edge dialect. For tensors with certain dtypes as arguments, the result of a custom kernel needs to match the expected dtypes. The constraints are available in edge dialect ops.
* Give correct results. We will provide a testing framework to automatically test the custom kernels.
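
To illustrate the calling-convention requirement, here is a minimal sketch (not the exact generated header) of what a kernel implementing the core ATen op `add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)` is expected to look like. The exact context argument and type spellings can vary across ExecuTorch versions, so treat the generated headers as the source of truth:

```c++
// Sketch only: signature derived from the add.out schema; names/namespaces are
// illustrative and the generated header is authoritative.
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

Tensor& add_out(
    KernelRuntimeContext& ctx, // runtime context (presence/name may vary by version)
    const Tensor& self,
    const Tensor& other,
    const Scalar& alpha,
    Tensor& out) {
  // ... heavy-lifting calculation that writes the result into `out` ...
  return out; // out-variant kernels modify and return `out`
}

} // namespace native
} // namespace executor
} // namespace torch
```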


## APIs

These are the APIs available to register kernels/custom kernels/custom ops into ExecuTorch:

* [YAML Entry API](#yaml-entry-api-high-level-architecture)
  - [for core ATen op with custom kernels](#yaml-entry-api-for-core-aten-op-out-variant)
  - [for custom ops](#yaml-entry-api-for-custom-ops)
  - [CMake Macros](#cmake-macros)
* C++ API
  - [for custom ops](#c-api-for-custom-ops)
  - [CMake Example](#compile-and-link-the-custom-kernel)

If it's not clear which API to use, please see [Best Practices](#custom-ops-api-best-practices).


### YAML Entry API High Level Architecture

![](./_static/img/kernel-library-custom-aten-kernel.png)

ExecuTorch users are asked to provide:

1. the custom kernel library with C++ implementations

2. a YAML file associated with the library that describes what operators are being implemented by this library. For partial kernels, the yaml file also contains information on the dtypes and dim orders supported by the kernel. More details in the API section.


### YAML Entry API Workflow

At build time, the yaml files associated with kernel libraries will be passed to the _kernel resolver_ along with the model op info (see the selective build doc). The outcome is a mapping from combinations of operator name and tensor metadata to kernel symbols. The codegen tools then use this mapping to generate C++ bindings that connect the kernels to the ExecuTorch runtime. ExecuTorch users need to link this generated library into their application to use these kernels.

At static object initialization time, kernels will be registered into the ExecuTorch kernel registry.

At runtime initialization stage, ExecuTorch will use the operator name and argument metadata as a key to look up the kernels. For example, with “aten::add.out” and inputs being float tensors with dim order (0, 1, 2, 3), ExecuTorch will go into the kernel registry and look up a kernel that matches the name and the input metadata.
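
Conceptually, the lookup key is the operator name plus the dtype and dim order of each tensor argument. The sketch below is purely illustrative and is not the actual registry API; the real data structures live inside the generated code and the runtime.

```c++
// Conceptual sketch only -- not the actual ExecuTorch registry types.
#include <cstdint>
#include <string>
#include <vector>

struct TensorMeta {
  int32_t dtype;                   // e.g., an enum value meaning Float
  std::vector<uint8_t> dim_order;  // e.g., {0, 1, 2, 3}
};

struct KernelKey {
  std::string op_name;             // e.g., "aten::add.out"
  std::vector<TensorMeta> args;    // metadata of each tensor argument
};

// At init time the runtime builds a key like this per operator instruction and
// asks the registry for a matching kernel; an entry registered with
// `arg_meta: null` acts as a catch-all for any input metadata.
```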

### YAML Entry API for Core ATen Op Out Variant

Top level attributes:

* `op` (if the operator appears in `native_functions.yaml`) or `func` for a custom operator. The value for this key needs to be the full operator name (including the overload name) for the `op` key, or a full operator schema (namespace, operator name, operator overload name and schema string) if we are describing a custom operator. For schema syntax please refer to this [instruction](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md).
* `kernels`: defines kernel information. It consists of `arg_meta` and `kernel_name`, which are bound together to describe "for input tensors with these metadata, use this kernel".
* `type_alias` (optional): we are giving aliases to possible dtype options. `T0: [Double, Float]` means `T0` can be one of `Double` or `Float`.
* `dim_order_alias` (optional): similar to `type_alias`, we are giving names to possible dim order options.

Attributes under `kernels`:

* `arg_meta`: a list of "tensor arg name" entries. The values for these keys are the dtype and dim order aliases that are implemented by the corresponding `kernel_name`. A `null` value means the kernel will be used for all types of input.
* `kernel_name`: the expected name of the C++ function that will implement this operator. You can put whatever you want here, but you should follow the convention of replacing the `.` in the overload name with an underscore, and lowercasing all characters. In this example, `add.out` uses the C++ function named `add_out`. `add.Scalar_out` would become `add_scalar_out`, with a lowercase `s`. We support namespaces for kernels, but note that we will be inserting a `native::` into the last level of the namespace. So `custom::add_out` in the `kernel_name` will point to `custom::native::add_out`, as shown in the sketch below.
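
For instance, a hedged sketch of where the definition would live if an entry used `kernel_name: custom::add_out` (the argument list must still match the generated header; the `add.out` signature here is illustrative):

```c++
// Sketch: `kernel_name: custom::add_out` in YAML binds to `custom::native::add_out`,
// so the definition goes into the `native` sub-namespace of `custom`.
#include <executorch/runtime/kernel/kernel_includes.h>

namespace custom {
namespace native {

// Type names come from kernel_includes.h; exact spellings may vary by version.
torch::executor::Tensor& add_out(
    torch::executor::KernelRuntimeContext& ctx,
    const torch::executor::Tensor& self,
    const torch::executor::Tensor& other,
    const torch::executor::Scalar& alpha,
    torch::executor::Tensor& out) {
  // ... implementation writes into `out` ...
  return out;
}

} // namespace native
} // namespace custom
```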

Some examples of operator entry:
```yaml
- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::add_out
```
An out variant of a core ATen operator with a default kernel

ATen operator with a dtype/dim order specialized kernel (works for `Double` dtype and dim order needs to be (0, 1, 2, 3))
```yaml
- op: add.out
  type_alias:
    T0: [Double]
  dim_order_alias:
    D0: [[0, 1, 2, 3]]
  kernels:
    - arg_meta:
        self: [T0, D0]
        other: [T0, D0]
        out: [T0, D0]
      kernel_name: torch::executor::add_out

```
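
Because the kernel resolver only dispatches to this kernel for the declared metadata, a partial kernel's body may rely on it. Below is a minimal, hedged sketch of such a kernel under that assumption (real kernels should still validate sizes; `alpha` handling is omitted):

```c++
// Sketch of a partial kernel that may assume Double inputs in (0, 1, 2, 3)
// dim order, because it is only dispatched to for that metadata.
#include <cstddef>
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

Tensor& add_out(
    KernelRuntimeContext& ctx,
    const Tensor& self,
    const Tensor& other,
    const Scalar& alpha,
    Tensor& out) {
  (void)ctx;
  (void)alpha; // alpha handling omitted for brevity
  const double* a = self.const_data_ptr<double>();
  const double* b = other.const_data_ptr<double>();
  double* o = out.mutable_data_ptr<double>();
  const size_t n = static_cast<size_t>(out.numel());
  for (size_t i = 0; i < n; ++i) {
    o[i] = a[i] + b[i];
  }
  return out;
}

} // namespace native
} // namespace executor
} // namespace torch
```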


### YAML Entry API for Custom Ops

As mentioned above, this option provides more support in terms of selective build and features such as merging operator libraries.

First we need to specify the operator schema as well as a `kernels` section. So instead of `op` we use `func` with the operator schema. As an example, here’s a yaml entry for a custom op:
```yaml
- func: allclose.out(Tensor self, Tensor other, float rtol=1e-05, float atol=1e-08, bool equal_nan=False, bool dummy_param=False, *, Tensor(a!) out) -> Tensor(a!)
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::allclose_out
```
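
As a rough sketch, the C++ function this entry binds to would be expected to look like the following (the generated header is authoritative, for example regarding whether a runtime context argument is present):

```c++
// Sketch: the schema's `float` arguments map to C++ double, and the
// keyword-only `Tensor(a!) out` becomes the trailing `Tensor& out`.
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

Tensor& allclose_out(
    const Tensor& self,
    const Tensor& other,
    double rtol,
    double atol,
    bool equal_nan,
    bool dummy_param,
    Tensor& out) {
  // ... compare `self` and `other`, write the boolean result into `out` ...
  return out;
}

} // namespace native
} // namespace executor
} // namespace torch
```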
The `kernels` section is the same as the one defined for core ATen ops. For the operator schema, we are reusing the DSL defined in this [README.md](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md), with a few differences:


#### Out variants only

ExecuTorch only supports out-style operators, where:

* The caller provides the output Tensor or Tensor list in the final position with the name `out`.
* The C++ function modifies and returns the same `out` argument.
    * If the return type in the YAML file is `()` (which maps to void), the C++ function should still modify `out` but does not need to return anything.
* The `out` argument must be keyword-only, which means it needs to follow the `*` marker, like in the `allclose.out` example above.
* Conventionally, these out operators are named using the pattern `<name>.out` or `<name>.<overload>_out`.

Since all output values are returned via an `out` parameter, ExecuTorch ignores the actual C++ function return value. But, to be consistent, functions should always return `out` when the return type is non-`void`.


#### Can only return `Tensor` or `()`

ExecuTorch only supports operators that return a single `Tensor`, or the unit type `()` (which maps to `void`). It does not support returning any other types, including lists, optionals, tuples, or scalars like `bool`.


#### Supported argument types

ExecuTorch does not support all of the argument types that core PyTorch supports. Here's a list of the argument types we currently support, with a mapping sketch after the list:
* Tensor
* int
* bool
* float
* str
* Scalar
* ScalarType
* MemoryFormat
* Device
* Optional<Type>
* List<Type>
* List<Optional<Type>>
* Optional<List<Type>>
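
To make the mapping concrete, here is a hedged sketch of how such schema types typically surface in a kernel signature. `my_op.out` and its arguments are made up for illustration, and the exact C++ type spellings (`optional`, `ArrayRef`, etc.) come from the generated header:

```c++
// Illustrative mapping for a hypothetical schema:
//   my_op.out(Tensor self, int dim, float eps, bool flag, Scalar alpha,
//             Tensor? bias, int[] sizes, *, Tensor(a!) out) -> Tensor(a!)
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torch {
namespace executor {
namespace native {

Tensor& my_op_out(
    const Tensor& self,            // Tensor
    int64_t dim,                   // int
    double eps,                    // float
    bool flag,                     // bool
    const Scalar& alpha,           // Scalar
    const optional<Tensor>& bias,  // Optional<Tensor>
    ArrayRef<int64_t> sizes,       // List<int>
    Tensor& out) {                 // keyword-only out Tensor
  // ... implementation writes into `out` ...
  return out;
}

} // namespace native
} // namespace executor
} // namespace torch
```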

#### CMake Macros

We provide build-time macros to help users build their kernel registration library. The macros take the yaml file describing the kernel library as well as the model operator metadata, and package the generated C++ bindings into a C++ library. These macros are available in CMake.


`generate_bindings_for_kernels(FUNCTIONS_YAML functions_yaml CUSTOM_OPS_YAML custom_ops_yaml)` takes a yaml file for core ATen op out variants as well as a yaml file for custom ops, and generates C++ bindings for kernel registration. It also depends on the selective build artifact generated by `gen_selected_ops()`; see the selective build doc for more information. Then `gen_operators_lib` packages those bindings into a C++ library. As an example:
```cmake
# SELECT_OPS_LIST: aten::add.out,aten::mm.out
gen_selected_ops("" "${SELECT_OPS_LIST}" "")

# Look for functions.yaml associated with portable libs and generate C++ bindings
generate_bindings_for_kernels(FUNCTIONS_YAML ${EXECUTORCH_ROOT}/kernels/portable/functions.yaml)

# Package the generated bindings into a C++ library called "generated_lib",
# where ${_kernel_lib} is the portable kernel library and executorch is a dependency
gen_operators_lib("generated_lib" KERNEL_LIBS ${_kernel_lib} DEPS executorch)

# Link "generated_lib" into the application:
target_link_libraries(executorch_binary generated_lib)

```

We also provide the ability to merge two yaml files, given a precedence. `merge_yaml(FUNCTIONS_YAML functions_yaml FALLBACK_YAML fallback_yaml OUTPUT_DIR out_dir)` merges `functions_yaml` and `fallback_yaml` into a single yaml file. If there are duplicate entries in `functions_yaml` and `fallback_yaml`, this macro will always take the one in `functions_yaml`.

Example:

```yaml
# functions.yaml
- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::opt_add_out
```

And the fallback:

```yaml
# fallback.yaml
- op: add.out
  kernels:
    - arg_meta: null
      kernel_name: torch::executor::add_out
```

The merged yaml will have the entry from functions.yaml.

### C++ API for Custom Ops

Unlike the YAML entry API, the C++ API uses only the C++ macros `EXECUTORCH_LIBRARY` and `WRAP_TO_ATEN` for kernel registration, and it comes without selective build support. This makes the API faster in terms of development speed, since users don't have to do YAML authoring or build system tweaking.

Please refer to [Custom Ops Best Practices](#custom-ops-api-best-practices) on which API to use.

Similar to [`TORCH_LIBRARY`](https://pytorch.org/cppdocs/library.html#library_8h_1a0bd5fb09d25dfb58e750d712fc5afb84) in PyTorch, `EXECUTORCH_LIBRARY` takes the operator name and the C++ function name and registers them into the ExecuTorch runtime.

#### Prepare custom kernel implementation

Define your custom operator schema for both the functional variant (used in AOT compilation) and the out variant (used in the ExecuTorch runtime). The schema needs to follow PyTorch ATen convention (see `native_functions.yaml`). For example:

```yaml
custom_linear(Tensor weight, Tensor input, Tensor? bias) -> Tensor
custom_linear.out(Tensor weight, Tensor input, Tensor? bias, *, Tensor(a!) out) -> Tensor(a!)
```

Then write your custom kernel according to the schema using ExecuTorch types, along with APIs to register to the ExecuTorch runtime:


```c++
// custom_linear.h/custom_linear.cpp
#include <executorch/runtime/kernel/kernel_includes.h>

Tensor& custom_linear_out(const Tensor& weight, const Tensor& input, optional<Tensor> bias, Tensor& out) {
   // calculation
   return out;
}
```
#### Use a C++ macro to register it into ExecuTorch

Append the following line to the example above:
```c++
// custom_linear.h/custom_linear.cpp
// opset namespace myop
EXECUTORCH_LIBRARY(myop, "custom_linear.out", custom_linear_out);
```

Now we need to write a wrapper for this op so it shows up in PyTorch; don’t worry, we don’t need to rewrite the kernel. Create a separate .cpp file for this purpose:

```c++
// custom_linear_pytorch.cpp
#include "custom_linear.h"
#include <torch/library.h>

at::Tensor custom_linear(const at::Tensor& weight, const at::Tensor& input, std::optional<at::Tensor> bias) {
    // initialize out
    at::Tensor out = at::empty({weight.size(1), input.size(1)});
    // wrap the kernel in custom_linear.cpp into an ATen kernel
    WRAP_TO_ATEN(custom_linear_out, 3)(weight, input, bias, out);
    return out;
}

// standard API to register ops into PyTorch
TORCH_LIBRARY(myop, m) {
    m.def("custom_linear(Tensor weight, Tensor input, Tensor? bias) -> Tensor", custom_linear);
    m.def("custom_linear.out(Tensor weight, Tensor input, Tensor? bias, *, Tensor(a!) out) -> Tensor(a!)", WRAP_TO_ATEN(custom_linear_out, 3));
}
```

#### Compile and link the custom kernel

Link it into the ExecuTorch runtime: in the `CMakeLists.txt` that builds the binary/application, we need to add custom_linear.h/cpp into the binary target. We can also build it as a dynamically loaded library (.so or .dylib) and link that instead.

Here's an example of how to do it:

```cmake
# For target_link_options_shared_lib
include(${EXECUTORCH_ROOT}/build/Utils.cmake)

# Add a custom op library
add_library(custom_op_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/custom_op.cpp)

# Include the header
target_include_directories(custom_op_lib PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)

# Link the ExecuTorch library
target_link_libraries(custom_op_lib PUBLIC executorch)

# Define a binary target
add_executable(custom_op_runner main.cpp)

# Link this library with --whole-archive. !! IMPORTANT !! This prevents the linker from stripping the operators.
target_link_options_shared_lib(custom_op_lib)

# Link the custom op lib
target_link_libraries(custom_op_runner PUBLIC custom_op_lib)

```

Link it into the PyTorch runtime: We need to package custom_linear.h, custom_linear.cpp and custom_linear_pytorch.cpp into a dynamically loaded library (.so or .dylib) and load it into our python environment. One way of doing this is:

```python
import torch
torch.ops.load_library("libcustom_linear.so/dylib")

# Now we have access to the custom op, backed by kernel implemented in custom_linear.cpp.
op = torch.ops.myop.custom_linear.default
```

### Custom Ops API Best Practices

Given that we have two kernel registration APIs for custom ops, which one should we use? Here are some pros and cons for each API:

* C++ API:
  - Pros:
    * Only C++ code changes are needed
    * Resembles PyTorch custom ops C++ API
    * Low maintenance cost
  - Cons:
    * No selective build support
    * No centralized bookkeeping

* Yaml entry API:
  - Pros:
    * Has selective build support
    * Provides a centralized place for custom ops
      - It shows what ops are being registered and what kernels are bound to these ops, for an application
  - Cons:
    * User needs to create and maintain yaml files
    * Relatively inflexible to change the op definition

Overall, if we are building an application that uses custom ops, it's recommended to use the C++ API during the development phase, since it's low-cost to use and flexible to change. Once the application moves to the production phase, where the custom op definitions and the build system are quite stable and binary size matters, it is recommended to use the Yaml entry API.