docs/blog/05-coroutines.rst

.. _docs-blog-05-coroutines:

=============================================================
Pigweed Eng Blog #5: C++20 coroutines without heap allocation
=============================================================
*Published on 2024-10-07 by Taylor Cramer*

Pigweed now provides support for coroutines on embedded
targets!

.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc
   :language: cpp
   :linenos:
   :start-after: [pw_async2-examples-coro-blinky-loop]
   :end-before: [pw_async2-examples-coro-blinky-loop]

`Click here for more examples. <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/main/pw_async2/examples>`_

Notably, Pigweed's :cpp:class:`pw::async2::Coro` API does not require heap
allocation. This post details how we accompilished this, what challenges
we encountered, and how you can get started using coroutines on embedded.

If you're:

* a C++ embedded developer looking for an ergonomic way to write async code
* a Rust ``async`` / ``await`` user curious about C++
* a C++ committee member who wants to make embedded developers happy

read on!

--------------------------------
Why use coroutines for embedded?
--------------------------------
Coroutines allow for multiple functions to run concurrently.
Compared to other concurrency options, coroutines have:

* No need for thread stacks
* No OS/RTOS-dependent context switching
* No complex, manually written state machines

The `coroutines API introduced in C++20
<https://en.cppreference.com/w/cpp/language/coroutines>`_
allows you to write
`straight-line code <https://en.wikipedia.org/wiki/Straight-line_program>`_
that can yield to the caller and resume at some later point. This can be used
similarly to ``async`` / ``await`` from languages like Rust, C#, JavaScript,
Python, and Swift.

When a typical blocking function waits for a long-running operation to
complete, the whole thread (and any resources associated with it) become
blocked until the operation is done. By contrast, coroutine functions can
yield to their caller and only resume once an asynchronous operation
completes. The caller, often some kind of scheduler, is then free to run
other code while the coroutine function waits for its result.

This allows for executing many operations concurrently without paying the
performance cost of threading or the cognitive cost of callbacks or manual
state machines.

---------------------------------------------
Why C++ coroutines require dynamic allocation
---------------------------------------------
Unfortunately for embedded developers, the C++ coroutine API uses heap
allocation by default. *Heap allocation* (e.g. ``new`` and ``delete`` or
``malloc`` and ``free``) isn't available on many embedded platforms.
Fortunately, heap allocation is not necessary to use C++ coroutines.

However, C++20 coroutines do require *dynamic allocation*
(memory that is allocated at runtime). Pigweed's
:cpp:class:`pw::async2::Coro` API allocates memory using an
:cpp:class:`pw::Allocator`. This avoids the need for a platform-provided
heap allocator, but users may still encounter some of the costs associated
with dynamic allocation, such as:

* Heap fragmentation
* Runtime memory management size and performance overhead
* Runtime out-of-memory conditions
* The need to "right-size" a heap

This dynamic allocation is done in order to store the "coroutine frame", an
object which holds the coroutine's state.

The coroutine frame
===================
Despite not using a native thread stack, C++20 coroutines do need somewhere to
store the "stack" of paused functions. For example, in our ``Blinky`` function
from above:

.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc
   :language: cpp
   :linenos:
   :start-after: [pw_async2-examples-coro-blinky-loop]
   :end-before: [pw_async2-examples-coro-blinky-loop]

When we reach a ``co_await`` statement, the function is paused and control
is returned to the scheduler until ``500ms`` passes. When paused, we still
need to store the function's state, including the loop variable
``i``, the function arguments, and a pointer to where we were in the function
when it paused.

This coroutine state (hereafter referred to as a "coroutine frame") typically
requires much less memory than a full thread stack, as it can allocate exactly
the amount of memory the function needs for variables that are "live" across
``co_await`` points.

.. admonition:: Layout of coroutine frames

   The actual structure of this object is a compiler implementation
   detail, but you can read more about how Clang structures it in
   `The structure of coroutine frames <https://clang.llvm.org/docs/DebuggingCoroutines.html#coroutine-frame>`_.

Static vs. dynamic allocation of the coroutine frame
====================================================
When a coroutine is first created, the C++ coroutine API dynamically allocates
space to store the coroutine frame and returns a pointer to it called a
`coroutine handle <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle>`_.
By contrast, `Rust's async functions
<https://rust-lang.github.io/async-book/01_getting_started/04_async_await_primer.html>`_
directly return coroutine frames inline, allowing for static allocation.
Why did C++ make the choice to dynamically allocate? Some key reasons are that
returning a pointer to dynamically allocated memory keeps the coroutine
function's return value small, movable, and most importantly ABI-stable.

ABI stability
=============
ABI stability is a particular challenge when designing coroutines. The coroutine
frame contains the stack of the coroutine function, so the coroutine frame's
size is dependent upon the number and size of the variables used in the function
implementation.

For example, two functions with this declaration can have differently sized
coroutine frames:

.. code-block:: c++

   Coro<int> GimmeANumber();

An implementation like this would have a very small coroutine frame:

.. code-block:: c++

   Coro<int> GimmeANumber() {
     co_return 5;
   }

While this function would need a very large coroutine frame in order
to store its temporary state:

.. code-block:: c++

   Coro<int> GimmeANumber() {
     ABigValue a = co_await DownloadAllOfReddit();
     AnotherOne b = co_await DownloadAllOfHackerNews();
     co_return a.size() + b.size();
   }

If we were to inline the coroutine frame in the resulting ``Coro<int>`` object,
that would mean that ``Coro<int>`` would have to be a different size depending
on the function implementation. The two functions would therefore have different
ABIs, and we would no longer be able to refer to ``Coro<int>`` as a single
statically sized type.

This would also make it impossible to call a coroutine function from a different
translation unit without knowing the details of its implementation. Without
additional compiler features, inlined coroutine frames would require that public
coroutine functions be implemented in headers (similar to templated or
``constexpr`` functions).

Similarly, inlined coroutine frames would prevent coroutine functions from
being ``virtual``, as different overrides would have different ABIs.

Rust bites the bullet
=====================
Rust inlines the coroutine frame and runs headfirst into these challenges.
The coroutine-style object returned by a Rust ``async fn`` is:

* Immovable once the coroutine has started. This is necessary in order to
  ensure that pointers to variables in the coroutine stack are not invalidated.
* Dependent upon the implementation of the ``async`` function. Adding a new
  variable in the body of the ``async`` function will require more "stack"
  space, so the size and structure of the returned state machine will change
  as the implementation of the ``async`` function changes.
* Not usable with dynamic dispatch (Rust's ``virtual``) without an
  intermediate layer which dynamically allocates the coroutine frame
  (see the `async_trait <https://docs.rs/async-trait/latest/async_trait/>`_
  library for one implementation of this pattern).
* Challenging to compile. Since the size of the coroutine object is
  dependent on the function implementation, code which uses the coroutine
  function can't be compiled until the compiler has first determined the
  size of the coroutine object.

However, this tradeoff also means that Rust can statically allocate space
for coroutines, avoiding all the pitfalls of dynamic allocation. The current
C++ coroutine API doesn't allow for this.

---------------------
What Pigweed provides
---------------------
Despite the need for dynamic allocation, it is still possible to use coroutines
on embedded devices. Today, Pigweed is using coroutines in the :ref:`showcase-sense`
showcase for the Raspberry Pi RP2040 and (new!) RP2350 microcontrollers.

In order to accomplish this, we needed the Pigweed :cpp:class:`pw::async2::Coro`
API to:

* Avoid heap allocation.
* Allow recovering from allocation failure without using exceptions.
* Allow using custom, context-specific
  :cpp:class:`pw::Allocator` implementations on a per-coroutine basis.

If you want to skip ahead, you can `see the Coro implementation here.
<https://cs.opensource.google/pigweed/pigweed/+/main:pw_async2/public/pw_async2/coro.h>`_

--------------------
Eliminating the heap
--------------------
As I said before, the C++ coroutine API defaults to heap allocation.
Fortunately for us embedded devs, the coroutine ``promise_type`` can customize
the behavior of this allocation by implementing the following functions:

* ``static void* operator new(std::size_t, Args...) noexcept``
* ``static MyReturnObjectType get_return_object_on_allocation_failure()``
* ``static void operator delete(void*)``

Allocating memory
=================
Implementing ``operator new`` lets us control how coroutine memory is allocated.
Our custom ``operator new`` takes two parameters: a ``std::size_t`` and an
``Args...`` parameter pack. Let's explore how these are handled.

The ``size`` parameter
----------------------
Unlike regular ``operator new`` or `placement new
<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_,
the ``promise_type`` ``operator new`` function allocates space not just for
the ``promise_type`` itself, but also for the compiler-generated coroutine
stack. The size of this coroutine stack is determined by the compiler and is
optimization-dependent, so it is not statically observable and is only available
at runtime.

The ``Args...`` parameter pack
------------------------------
The arguments to coroutine functions are also passed into the ``operator new``
of the corresponding ``promise_type``. This allows ``operator new`` to "intercept"
arguments passed to the coroutine function.

This means that if we have some
coroutine function:

.. code-block:: c++

   MyCoroutineType some_func(int a)

and ``MyCoroutineType::promise_type`` has an ``operator new`` like:

.. code-block:: c++

   static void* operator new(std::size_t n, int a) noexcept

then the argument ``a`` will be copied into the invocation of ``operator new``.

:cpp:class:`pw::async2::Coro` uses this tool to allow the caller of a coroutine
function to specify which memory allocator to use for the coroutine frame.

Different coroutines may want to use memory provided by different allocators
with various memory locations or allocation strategies. When allocating a
regular type ``T``, we'd typically achieve this by pre-allocating ``sizeof(T)``
with our custom allocator and then passing the resulting memory into the
`placement new
<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_
function of ``T``.

However, unlike regular types, coroutine constructors or ``operator new`` are not
directly available to the caller. Instead, the coroutine types are constructed
by the compiler when the coroutine function is called.

Furthermore, C++20 doesn't provide a way to know ahead of time how much memory
needs to be allocated to store the coroutine state—that information is only
available once our custom ``operator new`` is invoked with a ``size`` value.

Instead, :cpp:class:`pw::async2::Coro` allows custom per-coroutine ``Allocator``
values to be provided by intercepting an ``Allocator`` argument to the coroutine
function. One can pass an allocator as the first argument of all coroutine
functions. We can then use an ``Args...`` variadic to discard all other
arguments to the coroutine function:

.. code-block:: c++

   template<typename... Args>
   static void* operator new(std::size_t n,
                             Allocator& alloc,
                             const Args&...) noexcept {
      return alloc.Allocate(n);
   }

The user's coroutine functions will then look like this:

.. code-block:: c++

   MyCoroutineType some_func(Allocator&, int a) { ... }

The ``operator new`` implementation will then receive the caller-provided ``Allocator``
reference and can use it to allocate the coroutine state.

.. admonition:: ``CoroContext``

   The allocator argument to :cpp:class:`pw::async2::Coro` is actually packaged inside a
   :cpp:class:`pw::async2::CoroContext` in order to allow for greater evolvability, but
   today this type is a simple wrapper over a :cpp:class:`pw::Allocator`.

Handling allocation failure
===========================
Many embedded systems both disable exceptions and wish to recover gracefully
from allocation failure. If this is the case, ``promise_type::operator new`` must
be marked ``noexcept`` and return ``nullptr`` on allocation failure.

Note that our coroutine function, however, doesn't have an obvious way to
return ``nullptr``:

.. code-block:: c++

   MyCoroutineType some_func(Allocator&, int a) { ... }

It *must* produce a value of type ``MyCoroutineType``.

This is made possible by
``promise_type::get_return_object_on_allocation_failure``. When ``operator new``
returns ``nullptr``, C++ will invoke ``get_return_object_on_allocation_failure``
to produce an "empty" ``MyCoroutineType``. Coroutine return types must therefore
decide on a sentinel representation to use in order to indicate allocation
failure.

Fortunately, most coroutine return types are simple wrappers around
``std::coroutine_handle<my_promise_type>``, which is nullable, so we can use the
``nullptr`` representation to indicate allocation failure.

The ``delete`` function is missing critical pieces. What now?
=============================================================
Just as we can control allocation with our custom ``promise_type::operator new``,
we can control deallocation with ``promise_type::operator delete``!

But unlike ``operator new``, ``delete`` cannot intercept function arguments,
nor can it access properties of the coroutine frame. By the time ``delete``
is called, the coroutine frame has already been destroyed. Other parts of
C++ use the ``destroying_delete`` API to allow accessing an object as part
of its deletion, but the coroutine ``promise_type`` doesn't include such an
API.

This means that ``delete`` has no way to get access to the ``Allocator``
instance from which the memory was allocated. Pigweed's
:cpp:class:`pw::Allocator` API does not provide a way to free memory from only
a pointer to the allocated memory; a reference to the ``Allocator`` object is
required (particular ``Allocator`` instances may support this, but the generic
interface does not). :cpp:class:`pw::async2::Coro` solves this by specifying
``operator delete`` to be a no-op. Instead, the coroutine memory is freed by
the coroutine handle wrapper type (``Coro``) *after* running
``operator delete``:

.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h
   :language: cpp
   :linenos:
   :start-after: [pw_async2-coro-release]
   :end-before: [pw_async2-coro-release]

.. admonition:: Promise handle address

   The standard doesn't appear to clarify whether or not this usage is
   supported. I couldn't find any documentation specifying that the promise
   handle address must point to the *start* of the allocated memory (rather
   than, say, the allocated memory plus eight bytes or something).

   In practice, this seems to work just fine: the Pigweed API works in both
   GCC and Clang, even with sanitizers enabled.

   In the future, a guarantee here would be super useful!

----------------------------------------------------------------
Still to go: improvements needed from C++ standard and compilers
----------------------------------------------------------------
One concern that may hamper adoption of coroutines in embedded applications is
the total size of coroutine frames. While these frames are frequently much
smaller than the stack sizes of threads, there may be many more of them.

Especially for users who would otherwise write manual state-machine-based or
callback-based code, the current size overhead of coroutines may be a limiting
factor. Some opportunities for improvement here are discussed below.

Reusing stack space of expired temporaries
==========================================
Currently, Clang does not use stack space of temporaries after their
storage duration ends. This means that all variables and temporaries
in a coroutine function must have dedicated storage space in the coroutine frame.
This is unnecessary: many variables are not alive at the same time, and others
may not ever be alive across a ``co_await`` point! In the former case, storage
for these variables could be overlapped, and in the latter, they need not be
stored in the coroutine frame at all.

This results in having a coroutine frame of 88 bytes:

.. code-block:: c++

   Coro<Status> SimplestCoroWithAwait(CoroContext& cx) {
     co_return co_await SimplestCoro(cx);
   }

And this code having a coroutine frame of 408 bytes:

.. code-block:: c++

   Coro<Status> SimplestCoroWithTwoAwaits(CoroContext& cx) {
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_await SimplestCoro(cx);
     co_return co_await SimplestCoro(cx);
   }

Despite both requiring the same amount of temporary storage.

Discussion on this topic continues in
`this LLVM issue regarding the missing lifetime markers on temporaries <https://github.com/llvm/llvm-project/issues/43598>`_.

A similar issue in Rust was solved thanks to tmandry's efforts culminating
in `this PR <https://github.com/rust-lang/rust/pull/60187>`_.

Passing values in and out of coroutines
=======================================
Once a coroutine function has been invoked, it can only be advanced
(and completed) via a series of calls to
`resume <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle/resume>`_.
Annoyingly, ``resume`` does not accept any arguments and returns ``void``,
so it does not offer a built-in way to pass arguments into our coroutine
function, nor does it give us a way to return values from our coroutine function.

:cpp:class:`pw::async2::Coro` solves this by storing a pointer to input
and output storage inside the ``promise_type`` before invoking ``resume``.

.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h
   :language: cpp
   :linenos:
   :start-after: [pw_async2-coro-resume]
   :end-before: [pw_async2-coro-resume]

This means that every coroutine frame must additionally include space for this
input/output pointer. This wasted overhead seems fairly easy to eliminate if
C++ evolved to allow ``promise_type`` to have associated ``resume_arg_type``
and ``return_arg_type`` values.

Avoiding dynamic allocation for statically known coroutines
===========================================================
Although the steps described in this blog will allow you to avoid heap
allocation, they still rely on *dynamic* allocation via a
:cpp:class:`pw::Allocator`. This means that our application can potentially
fail at runtime by running out of memory for our coroutines.

For code which only instantiates a single known coroutine or fixed set of
known coroutines defined within the same translation unit, this is an
unnecessary cost. Dynamic allocation creates both runtime inefficiency,
developer uncertainty, and risk: it's no longer possible to statically ensure
that your program will be able to run on-device.

It would be fantastic if future C++ standards would make it possible to access
the coroutine frame size of specific, known coroutines so that space for them
could be statically allocated.

.. admonition:: Heap allocation elision optimization (HALO)

   `Heap allocation elision optimization <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html>`_
   can omit the calls to ``new`` and ``delete`` entirely in some cases.
   However, this relies on the inlining of several independent functions
   to the top-level scope which encompasses the full lifetime of the
   coroutine. This isn't possible in the vast majority of the relevant
   embedded use cases where it is common to store a coroutine as part of a
   long-lived (often ``static``) object.

   Furthermore, relying on such an optimization would be quite fragile, as
   the compiler's choice to inline (or not) is dependent upon a variety of
   heuristics outside the programmer's control.

   Note, too, that when using custom allocator implementations one has
   to take care or elision will not be possible: the allocation and
   deallocation functions must be tagged with
   `the appropriate attributes <https://discourse.llvm.org/t/optimize-away-memory-allocations/65587/4>`_
   to tell the compiler that they are safe to optimize away.

-----------------------------------
Try out Pigweed's coroutine API now
-----------------------------------
To experiment with Pigweed's :cpp:class:`pw::async2::Coro` API, you can get
started by `cloning our Quickstart repo <https://cs.opensource.google/pigweed/quickstart/bazel>`_
for a bare-bones example
or trying out the :ref:`Sense tutorial <showcase-sense-tutorial-intro>`
for a full tour of a Pigweed product codebase.