xref: /aosp_15_r20/external/pigweed/docs/blog/05-coroutines.rst (revision 61c4878ac05f98d0ceed94b57d316916de578985)
1.. _docs-blog-05-coroutines:
2
3=============================================================
4Pigweed Eng Blog #5: C++20 coroutines without heap allocation
5=============================================================
6*Published on 2024-10-07 by Taylor Cramer*
7
8Pigweed now provides support for coroutines on embedded
9targets!
10
11.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc
12   :language: cpp
13   :linenos:
14   :start-after: [pw_async2-examples-coro-blinky-loop]
15   :end-before: [pw_async2-examples-coro-blinky-loop]
16
17`Click here for more examples. <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/main/pw_async2/examples>`_
18
19Notably, Pigweed's :cpp:class:`pw::async2::Coro` API does not require heap
20allocation. This post details how we accompilished this, what challenges
21we encountered, and how you can get started using coroutines on embedded.
22
23If you're:
24
25* a C++ embedded developer looking for an ergonomic way to write async code
26* a Rust ``async`` / ``await`` user curious about C++
27* a C++ committee member who wants to make embedded developers happy
28
29read on!
30
31--------------------------------
32Why use coroutines for embedded?
33--------------------------------
34Coroutines allow for multiple functions to run concurrently.
35Compared to other concurrency options, coroutines have:
36
37* No need for thread stacks
38* No OS/RTOS-dependent context switching
39* No complex, manually written state machines
40
41The `coroutines API introduced in C++20
42<https://en.cppreference.com/w/cpp/language/coroutines>`_
43allows you to write
44`straight-line code <https://en.wikipedia.org/wiki/Straight-line_program>`_
45that can yield to the caller and resume at some later point. This can be used
46similarly to ``async`` / ``await`` from languages like Rust, C#, JavaScript,
47Python, and Swift.
48
49When a typical blocking function waits for a long-running operation to
50complete, the whole thread (and any resources associated with it) become
51blocked until the operation is done. By contrast, coroutine functions can
52yield to their caller and only resume once an asynchronous operation
53completes. The caller, often some kind of scheduler, is then free to run
54other code while the coroutine function waits for its result.
55
56This allows for executing many operations concurrently without paying the
57performance cost of threading or the cognitive cost of callbacks or manual
58state machines.
59
60---------------------------------------------
61Why C++ coroutines require dynamic allocation
62---------------------------------------------
63Unfortunately for embedded developers, the C++ coroutine API uses heap
64allocation by default. *Heap allocation* (e.g. ``new`` and ``delete`` or
65``malloc`` and ``free``) isn't available on many embedded platforms.
66Fortunately, heap allocation is not necessary to use C++ coroutines.
67
68However, C++20 coroutines do require *dynamic allocation*
69(memory that is allocated at runtime). Pigweed's
70:cpp:class:`pw::async2::Coro` API allocates memory using an
71:cpp:class:`pw::Allocator`. This avoids the need for a platform-provided
72heap allocator, but users may still encounter some of the costs associated
73with dynamic allocation, such as:
74
75* Heap fragmentation
76* Runtime memory management size and performance overhead
77* Runtime out-of-memory conditions
78* The need to "right-size" a heap
79
80This dynamic allocation is done in order to store the "coroutine frame", an
81object which holds the coroutine's state.
82
83The coroutine frame
84===================
85Despite not using a native thread stack, C++20 coroutines do need somewhere to
86store the "stack" of paused functions. For example, in our ``Blinky`` function
87from above:
88
89.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc
90   :language: cpp
91   :linenos:
92   :start-after: [pw_async2-examples-coro-blinky-loop]
93   :end-before: [pw_async2-examples-coro-blinky-loop]
94
95When we reach a ``co_await`` statement, the function is paused and control
96is returned to the scheduler until ``500ms`` passes. When paused, we still
97need to store the function's state, including the loop variable
98``i``, the function arguments, and a pointer to where we were in the function
99when it paused.
100
101This coroutine state (hereafter referred to as a "coroutine frame") typically
102requires much less memory than a full thread stack, as it can allocate exactly
103the amount of memory the function needs for variables that are "live" across
104``co_await`` points.
105
106.. admonition:: Layout of coroutine frames
107
108   The actual structure of this object is a compiler implementation
109   detail, but you can read more about how Clang structures it in
110   `The structure of coroutine frames <https://clang.llvm.org/docs/DebuggingCoroutines.html#coroutine-frame>`_.
111
112Static vs. dynamic allocation of the coroutine frame
113====================================================
114When a coroutine is first created, the C++ coroutine API dynamically allocates
115space to store the coroutine frame and returns a pointer to it called a
116`coroutine handle <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle>`_.
117By contrast, `Rust's async functions
118<https://rust-lang.github.io/async-book/01_getting_started/04_async_await_primer.html>`_
119directly return coroutine frames inline, allowing for static allocation.
120Why did C++ make the choice to dynamically allocate? Some key reasons are that
121returning a pointer to dynamically allocated memory keeps the coroutine
122function's return value small, movable, and most importantly ABI-stable.
123
124ABI stability
125=============
126ABI stability is a particular challenge when designing coroutines. The coroutine
127frame contains the stack of the coroutine function, so the coroutine frame's
128size is dependent upon the number and size of the variables used in the function
129implementation.
130
131For example, two functions with this declaration can have differently sized
132coroutine frames:
133
134.. code-block:: c++
135
136   Coro<int> GimmeANumber();
137
138An implementation like this would have a very small coroutine frame:
139
140.. code-block:: c++
141
142   Coro<int> GimmeANumber() {
143     co_return 5;
144   }
145
146While this function would need a very large coroutine frame in order
147to store its temporary state:
148
149.. code-block:: c++
150
151   Coro<int> GimmeANumber() {
152     ABigValue a = co_await DownloadAllOfReddit();
153     AnotherOne b = co_await DownloadAllOfHackerNews();
154     co_return a.size() + b.size();
155   }
156
157If we were to inline the coroutine frame in the resulting ``Coro<int>`` object,
158that would mean that ``Coro<int>`` would have to be a different size depending
159on the function implementation. The two functions would therefore have different
160ABIs, and we would no longer be able to refer to ``Coro<int>`` as a single
161statically sized type.
162
163This would also make it impossible to call a coroutine function from a different
164translation unit without knowing the details of its implementation. Without
165additional compiler features, inlined coroutine frames would require that public
166coroutine functions be implemented in headers (similar to templated or
167``constexpr`` functions).
168
169Similarly, inlined coroutine frames would prevent coroutine functions from
170being ``virtual``, as different overrides would have different ABIs.
171
172Rust bites the bullet
173=====================
174Rust inlines the coroutine frame and runs headfirst into these challenges.
175The coroutine-style object returned by a Rust ``async fn`` is:
176
177* Immovable once the coroutine has started. This is necessary in order to
178  ensure that pointers to variables in the coroutine stack are not invalidated.
179* Dependent upon the implementation of the ``async`` function. Adding a new
180  variable in the body of the ``async`` function will require more "stack"
181  space, so the size and structure of the returned state machine will change
182  as the implementation of the ``async`` function changes.
183* Not usable with dynamic dispatch (Rust's ``virtual``) without an
184  intermediate layer which dynamically allocates the coroutine frame
185  (see the `async_trait <https://docs.rs/async-trait/latest/async_trait/>`_
186  library for one implementation of this pattern).
187* Challenging to compile. Since the size of the coroutine object is
188  dependent on the function implementation, code which uses the coroutine
189  function can't be compiled until the compiler has first determined the
190  size of the coroutine object.
191
192However, this tradeoff also means that Rust can statically allocate space
193for coroutines, avoiding all the pitfalls of dynamic allocation. The current
194C++ coroutine API doesn't allow for this.
195
196---------------------
197What Pigweed provides
198---------------------
199Despite the need for dynamic allocation, it is still possible to use coroutines
200on embedded devices. Today, Pigweed is using coroutines in the :ref:`showcase-sense`
201showcase for the Raspberry Pi RP2040 and (new!) RP2350 microcontrollers.
202
203In order to accomplish this, we needed the Pigweed :cpp:class:`pw::async2::Coro`
204API to:
205
206* Avoid heap allocation.
207* Allow recovering from allocation failure without using exceptions.
208* Allow using custom, context-specific
209  :cpp:class:`pw::Allocator` implementations on a per-coroutine basis.
210
211If you want to skip ahead, you can `see the Coro implementation here.
212<https://cs.opensource.google/pigweed/pigweed/+/main:pw_async2/public/pw_async2/coro.h>`_
213
214--------------------
215Eliminating the heap
216--------------------
217As I said before, the C++ coroutine API defaults to heap allocation.
218Fortunately for us embedded devs, the coroutine ``promise_type`` can customize
219the behavior of this allocation by implementing the following functions:
220
221* ``static void* operator new(std::size_t, Args...) noexcept``
222* ``static MyReturnObjectType get_return_object_on_allocation_failure()``
223* ``static void operator delete(void*)``
224
225Allocating memory
226=================
227Implementing ``operator new`` lets us control how coroutine memory is allocated.
228Our custom ``operator new`` takes two parameters: a ``std::size_t`` and an
229``Args...`` parameter pack. Let's explore how these are handled.
230
231The ``size`` parameter
232----------------------
233Unlike regular ``operator new`` or `placement new
234<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_,
235the ``promise_type`` ``operator new`` function allocates space not just for
236the ``promise_type`` itself, but also for the compiler-generated coroutine
237stack. The size of this coroutine stack is determined by the compiler and is
238optimization-dependent, so it is not statically observable and is only available
239at runtime.
240
241The ``Args...`` parameter pack
242------------------------------
243The arguments to coroutine functions are also passed into the ``operator new``
244of the corresponding ``promise_type``. This allows ``operator new`` to "intercept"
245arguments passed to the coroutine function.
246
247This means that if we have some
248coroutine function:
249
250.. code-block:: c++
251
252   MyCoroutineType some_func(int a)
253
254and ``MyCoroutineType::promise_type`` has an ``operator new`` like:
255
256.. code-block:: c++
257
258   static void* operator new(std::size_t n, int a) noexcept
259
260then the argument ``a`` will be copied into the invocation of ``operator new``.
261
262:cpp:class:`pw::async2::Coro` uses this tool to allow the caller of a coroutine
263function to specify which memory allocator to use for the coroutine frame.
264
265Different coroutines may want to use memory provided by different allocators
266with various memory locations or allocation strategies. When allocating a
267regular type ``T``, we'd typically achieve this by pre-allocating ``sizeof(T)``
268with our custom allocator and then passing the resulting memory into the
269`placement new
270<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_
271function of ``T``.
272
273However, unlike regular types, coroutine constructors or ``operator new`` are not
274directly available to the caller. Instead, the coroutine types are constructed
275by the compiler when the coroutine function is called.
276
277Furthermore, C++20 doesn't provide a way to know ahead of time how much memory
278needs to be allocated to store the coroutine state—that information is only
279available once our custom ``operator new`` is invoked with a ``size`` value.
280
281Instead, :cpp:class:`pw::async2::Coro` allows custom per-coroutine ``Allocator``
282values to be provided by intercepting an ``Allocator`` argument to the coroutine
283function. One can pass an allocator as the first argument of all coroutine
284functions. We can then use an ``Args...`` variadic to discard all other
285arguments to the coroutine function:
286
287.. code-block:: c++
288
289   template<typename... Args>
290   static void* operator new(std::size_t n,
291                             Allocator& alloc,
292                             const Args&...) noexcept {
293      return alloc.Allocate(n);
294   }
295
296The user's coroutine functions will then look like this:
297
298.. code-block:: c++
299
300   MyCoroutineType some_func(Allocator&, int a) { ... }
301
302The ``operator new`` implementation will then receive the caller-provided ``Allocator``
303reference and can use it to allocate the coroutine state.
304
305.. admonition:: ``CoroContext``
306
307   The allocator argument to :cpp:class:`pw::async2::Coro` is actually packaged inside a
308   :cpp:class:`pw::async2::CoroContext` in order to allow for greater evolvability, but
309   today this type is a simple wrapper over a :cpp:class:`pw::Allocator`.
310
311Handling allocation failure
312===========================
313Many embedded systems both disable exceptions and wish to recover gracefully
314from allocation failure. If this is the case, ``promise_type::operator new`` must
315be marked ``noexcept`` and return ``nullptr`` on allocation failure.
316
317Note that our coroutine function, however, doesn't have an obvious way to
318return ``nullptr``:
319
320.. code-block:: c++
321
322   MyCoroutineType some_func(Allocator&, int a) { ... }
323
324It *must* produce a value of type ``MyCoroutineType``.
325
326This is made possible by
327``promise_type::get_return_object_on_allocation_failure``. When ``operator new``
328returns ``nullptr``, C++ will invoke ``get_return_object_on_allocation_failure``
329to produce an "empty" ``MyCoroutineType``. Coroutine return types must therefore
330decide on a sentinel representation to use in order to indicate allocation
331failure.
332
333Fortunately, most coroutine return types are simple wrappers around
334``std::coroutine_handle<my_promise_type>``, which is nullable, so we can use the
335``nullptr`` representation to indicate allocation failure.
336
337The ``delete`` function is missing critical pieces. What now?
338=============================================================
339Just as we can control allocation with our custom ``promise_type::operator new``,
340we can control deallocation with ``promise_type::operator delete``!
341
342But unlike ``operator new``, ``delete`` cannot intercept function arguments,
343nor can it access properties of the coroutine frame. By the time ``delete``
344is called, the coroutine frame has already been destroyed. Other parts of
345C++ use the ``destroying_delete`` API to allow accessing an object as part
346of its deletion, but the coroutine ``promise_type`` doesn't include such an
347API.
348
349This means that ``delete`` has no way to get access to the ``Allocator``
350instance from which the memory was allocated. Pigweed's
351:cpp:class:`pw::Allocator` API does not provide a way to free memory from only
352a pointer to the allocated memory; a reference to the ``Allocator`` object is
353required (particular ``Allocator`` instances may support this, but the generic
354interface does not). :cpp:class:`pw::async2::Coro` solves this by specifying
355``operator delete`` to be a no-op. Instead, the coroutine memory is freed by
356the coroutine handle wrapper type (``Coro``) *after* running
357``operator delete``:
358
359.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h
360   :language: cpp
361   :linenos:
362   :start-after: [pw_async2-coro-release]
363   :end-before: [pw_async2-coro-release]
364
365.. admonition:: Promise handle address
366
367   The standard doesn't appear to clarify whether or not this usage is
368   supported. I couldn't find any documentation specifying that the promise
369   handle address must point to the *start* of the allocated memory (rather
370   than, say, the allocated memory plus eight bytes or something).
371
372   In practice, this seems to work just fine: the Pigweed API works in both
373   GCC and Clang, even with sanitizers enabled.
374
375   In the future, a guarantee here would be super useful!
376
377----------------------------------------------------------------
378Still to go: improvements needed from C++ standard and compilers
379----------------------------------------------------------------
380One concern that may hamper adoption of coroutines in embedded applications is
381the total size of coroutine frames. While these frames are frequently much
382smaller than the stack sizes of threads, there may be many more of them.
383
384Especially for users who would otherwise write manual state-machine-based or
385callback-based code, the current size overhead of coroutines may be a limiting
386factor. Some opportunities for improvement here are discussed below.
387
388Reusing stack space of expired temporaries
389==========================================
390Currently, Clang does not use stack space of temporaries after their
391storage duration ends. This means that all variables and temporaries
392in a coroutine function must have dedicated storage space in the coroutine frame.
393This is unnecessary: many variables are not alive at the same time, and others
394may not ever be alive across a ``co_await`` point! In the former case, storage
395for these variables could be overlapped, and in the latter, they need not be
396stored in the coroutine frame at all.
397
398This results in having a coroutine frame of 88 bytes:
399
400.. code-block:: c++
401
402   Coro<Status> SimplestCoroWithAwait(CoroContext& cx) {
403     co_return co_await SimplestCoro(cx);
404   }
405
406And this code having a coroutine frame of 408 bytes:
407
408.. code-block:: c++
409
410   Coro<Status> SimplestCoroWithTwoAwaits(CoroContext& cx) {
411     co_await SimplestCoro(cx);
412     co_await SimplestCoro(cx);
413     co_await SimplestCoro(cx);
414     co_await SimplestCoro(cx);
415     co_await SimplestCoro(cx);
416     co_await SimplestCoro(cx);
417     co_await SimplestCoro(cx);
418     co_await SimplestCoro(cx);
419     co_await SimplestCoro(cx);
420     co_await SimplestCoro(cx);
421     co_return co_await SimplestCoro(cx);
422   }
423
424Despite both requiring the same amount of temporary storage.
425
426Discussion on this topic continues in
427`this LLVM issue regarding the missing lifetime markers on temporaries <https://github.com/llvm/llvm-project/issues/43598>`_.
428
429A similar issue in Rust was solved thanks to tmandry's efforts culminating
430in `this PR <https://github.com/rust-lang/rust/pull/60187>`_.
431
432Passing values in and out of coroutines
433=======================================
434Once a coroutine function has been invoked, it can only be advanced
435(and completed) via a series of calls to
436`resume <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle/resume>`_.
437Annoyingly, ``resume`` does not accept any arguments and returns ``void``,
438so it does not offer a built-in way to pass arguments into our coroutine
439function, nor does it give us a way to return values from our coroutine function.
440
441:cpp:class:`pw::async2::Coro` solves this by storing a pointer to input
442and output storage inside the ``promise_type`` before invoking ``resume``.
443
444.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h
445   :language: cpp
446   :linenos:
447   :start-after: [pw_async2-coro-resume]
448   :end-before: [pw_async2-coro-resume]
449
450This means that every coroutine frame must additionally include space for this
451input/output pointer. This wasted overhead seems fairly easy to eliminate if
452C++ evolved to allow ``promise_type`` to have associated ``resume_arg_type``
453and ``return_arg_type`` values.
454
455Avoiding dynamic allocation for statically known coroutines
456===========================================================
457Although the steps described in this blog will allow you to avoid heap
458allocation, they still rely on *dynamic* allocation via a
459:cpp:class:`pw::Allocator`. This means that our application can potentially
460fail at runtime by running out of memory for our coroutines.
461
462For code which only instantiates a single known coroutine or fixed set of
463known coroutines defined within the same translation unit, this is an
464unnecessary cost. Dynamic allocation creates both runtime inefficiency,
465developer uncertainty, and risk: it's no longer possible to statically ensure
466that your program will be able to run on-device.
467
468It would be fantastic if future C++ standards would make it possible to access
469the coroutine frame size of specific, known coroutines so that space for them
470could be statically allocated.
471
472.. admonition:: Heap allocation elision optimization (HALO)
473
474   `Heap allocation elision optimization <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html>`_
475   can omit the calls to ``new`` and ``delete`` entirely in some cases.
476   However, this relies on the inlining of several independent functions
477   to the top-level scope which encompasses the full lifetime of the
478   coroutine. This isn't possible in the vast majority of the relevant
479   embedded use cases where it is common to store a coroutine as part of a
480   long-lived (often ``static``) object.
481
482   Furthermore, relying on such an optimization would be quite fragile, as
483   the compiler's choice to inline (or not) is dependent upon a variety of
484   heuristics outside the programmer's control.
485
486   Note, too, that when using custom allocator implementations one has
487   to take care or elision will not be possible: the allocation and
488   deallocation functions must be tagged with
489   `the appropriate attributes <https://discourse.llvm.org/t/optimize-away-memory-allocations/65587/4>`_
490   to tell the compiler that they are safe to optimize away.
491
492-----------------------------------
493Try out Pigweed's coroutine API now
494-----------------------------------
495To experiment with Pigweed's :cpp:class:`pw::async2::Coro` API, you can get
496started by `cloning our Quickstart repo <https://cs.opensource.google/pigweed/quickstart/bazel>`_
497for a bare-bones example
498or trying out the :ref:`Sense tutorial <showcase-sense-tutorial-intro>`
499for a full tour of a Pigweed product codebase.
500