1.. _docs-blog-05-coroutines: 2 3============================================================= 4Pigweed Eng Blog #5: C++20 coroutines without heap allocation 5============================================================= 6*Published on 2024-10-07 by Taylor Cramer* 7 8Pigweed now provides support for coroutines on embedded 9targets! 10 11.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc 12 :language: cpp 13 :linenos: 14 :start-after: [pw_async2-examples-coro-blinky-loop] 15 :end-before: [pw_async2-examples-coro-blinky-loop] 16 17`Click here for more examples. <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/main/pw_async2/examples>`_ 18 19Notably, Pigweed's :cpp:class:`pw::async2::Coro` API does not require heap 20allocation. This post details how we accompilished this, what challenges 21we encountered, and how you can get started using coroutines on embedded. 22 23If you're: 24 25* a C++ embedded developer looking for an ergonomic way to write async code 26* a Rust ``async`` / ``await`` user curious about C++ 27* a C++ committee member who wants to make embedded developers happy 28 29read on! 30 31-------------------------------- 32Why use coroutines for embedded? 33-------------------------------- 34Coroutines allow for multiple functions to run concurrently. 35Compared to other concurrency options, coroutines have: 36 37* No need for thread stacks 38* No OS/RTOS-dependent context switching 39* No complex, manually written state machines 40 41The `coroutines API introduced in C++20 42<https://en.cppreference.com/w/cpp/language/coroutines>`_ 43allows you to write 44`straight-line code <https://en.wikipedia.org/wiki/Straight-line_program>`_ 45that can yield to the caller and resume at some later point. This can be used 46similarly to ``async`` / ``await`` from languages like Rust, C#, JavaScript, 47Python, and Swift. 48 49When a typical blocking function waits for a long-running operation to 50complete, the whole thread (and any resources associated with it) become 51blocked until the operation is done. By contrast, coroutine functions can 52yield to their caller and only resume once an asynchronous operation 53completes. The caller, often some kind of scheduler, is then free to run 54other code while the coroutine function waits for its result. 55 56This allows for executing many operations concurrently without paying the 57performance cost of threading or the cognitive cost of callbacks or manual 58state machines. 59 60--------------------------------------------- 61Why C++ coroutines require dynamic allocation 62--------------------------------------------- 63Unfortunately for embedded developers, the C++ coroutine API uses heap 64allocation by default. *Heap allocation* (e.g. ``new`` and ``delete`` or 65``malloc`` and ``free``) isn't available on many embedded platforms. 66Fortunately, heap allocation is not necessary to use C++ coroutines. 67 68However, C++20 coroutines do require *dynamic allocation* 69(memory that is allocated at runtime). Pigweed's 70:cpp:class:`pw::async2::Coro` API allocates memory using an 71:cpp:class:`pw::Allocator`. This avoids the need for a platform-provided 72heap allocator, but users may still encounter some of the costs associated 73with dynamic allocation, such as: 74 75* Heap fragmentation 76* Runtime memory management size and performance overhead 77* Runtime out-of-memory conditions 78* The need to "right-size" a heap 79 80This dynamic allocation is done in order to store the "coroutine frame", an 81object which holds the coroutine's state. 82 83The coroutine frame 84=================== 85Despite not using a native thread stack, C++20 coroutines do need somewhere to 86store the "stack" of paused functions. For example, in our ``Blinky`` function 87from above: 88 89.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc 90 :language: cpp 91 :linenos: 92 :start-after: [pw_async2-examples-coro-blinky-loop] 93 :end-before: [pw_async2-examples-coro-blinky-loop] 94 95When we reach a ``co_await`` statement, the function is paused and control 96is returned to the scheduler until ``500ms`` passes. When paused, we still 97need to store the function's state, including the loop variable 98``i``, the function arguments, and a pointer to where we were in the function 99when it paused. 100 101This coroutine state (hereafter referred to as a "coroutine frame") typically 102requires much less memory than a full thread stack, as it can allocate exactly 103the amount of memory the function needs for variables that are "live" across 104``co_await`` points. 105 106.. admonition:: Layout of coroutine frames 107 108 The actual structure of this object is a compiler implementation 109 detail, but you can read more about how Clang structures it in 110 `The structure of coroutine frames <https://clang.llvm.org/docs/DebuggingCoroutines.html#coroutine-frame>`_. 111 112Static vs. dynamic allocation of the coroutine frame 113==================================================== 114When a coroutine is first created, the C++ coroutine API dynamically allocates 115space to store the coroutine frame and returns a pointer to it called a 116`coroutine handle <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle>`_. 117By contrast, `Rust's async functions 118<https://rust-lang.github.io/async-book/01_getting_started/04_async_await_primer.html>`_ 119directly return coroutine frames inline, allowing for static allocation. 120Why did C++ make the choice to dynamically allocate? Some key reasons are that 121returning a pointer to dynamically allocated memory keeps the coroutine 122function's return value small, movable, and most importantly ABI-stable. 123 124ABI stability 125============= 126ABI stability is a particular challenge when designing coroutines. The coroutine 127frame contains the stack of the coroutine function, so the coroutine frame's 128size is dependent upon the number and size of the variables used in the function 129implementation. 130 131For example, two functions with this declaration can have differently sized 132coroutine frames: 133 134.. code-block:: c++ 135 136 Coro<int> GimmeANumber(); 137 138An implementation like this would have a very small coroutine frame: 139 140.. code-block:: c++ 141 142 Coro<int> GimmeANumber() { 143 co_return 5; 144 } 145 146While this function would need a very large coroutine frame in order 147to store its temporary state: 148 149.. code-block:: c++ 150 151 Coro<int> GimmeANumber() { 152 ABigValue a = co_await DownloadAllOfReddit(); 153 AnotherOne b = co_await DownloadAllOfHackerNews(); 154 co_return a.size() + b.size(); 155 } 156 157If we were to inline the coroutine frame in the resulting ``Coro<int>`` object, 158that would mean that ``Coro<int>`` would have to be a different size depending 159on the function implementation. The two functions would therefore have different 160ABIs, and we would no longer be able to refer to ``Coro<int>`` as a single 161statically sized type. 162 163This would also make it impossible to call a coroutine function from a different 164translation unit without knowing the details of its implementation. Without 165additional compiler features, inlined coroutine frames would require that public 166coroutine functions be implemented in headers (similar to templated or 167``constexpr`` functions). 168 169Similarly, inlined coroutine frames would prevent coroutine functions from 170being ``virtual``, as different overrides would have different ABIs. 171 172Rust bites the bullet 173===================== 174Rust inlines the coroutine frame and runs headfirst into these challenges. 175The coroutine-style object returned by a Rust ``async fn`` is: 176 177* Immovable once the coroutine has started. This is necessary in order to 178 ensure that pointers to variables in the coroutine stack are not invalidated. 179* Dependent upon the implementation of the ``async`` function. Adding a new 180 variable in the body of the ``async`` function will require more "stack" 181 space, so the size and structure of the returned state machine will change 182 as the implementation of the ``async`` function changes. 183* Not usable with dynamic dispatch (Rust's ``virtual``) without an 184 intermediate layer which dynamically allocates the coroutine frame 185 (see the `async_trait <https://docs.rs/async-trait/latest/async_trait/>`_ 186 library for one implementation of this pattern). 187* Challenging to compile. Since the size of the coroutine object is 188 dependent on the function implementation, code which uses the coroutine 189 function can't be compiled until the compiler has first determined the 190 size of the coroutine object. 191 192However, this tradeoff also means that Rust can statically allocate space 193for coroutines, avoiding all the pitfalls of dynamic allocation. The current 194C++ coroutine API doesn't allow for this. 195 196--------------------- 197What Pigweed provides 198--------------------- 199Despite the need for dynamic allocation, it is still possible to use coroutines 200on embedded devices. Today, Pigweed is using coroutines in the :ref:`showcase-sense` 201showcase for the Raspberry Pi RP2040 and (new!) RP2350 microcontrollers. 202 203In order to accomplish this, we needed the Pigweed :cpp:class:`pw::async2::Coro` 204API to: 205 206* Avoid heap allocation. 207* Allow recovering from allocation failure without using exceptions. 208* Allow using custom, context-specific 209 :cpp:class:`pw::Allocator` implementations on a per-coroutine basis. 210 211If you want to skip ahead, you can `see the Coro implementation here. 212<https://cs.opensource.google/pigweed/pigweed/+/main:pw_async2/public/pw_async2/coro.h>`_ 213 214-------------------- 215Eliminating the heap 216-------------------- 217As I said before, the C++ coroutine API defaults to heap allocation. 218Fortunately for us embedded devs, the coroutine ``promise_type`` can customize 219the behavior of this allocation by implementing the following functions: 220 221* ``static void* operator new(std::size_t, Args...) noexcept`` 222* ``static MyReturnObjectType get_return_object_on_allocation_failure()`` 223* ``static void operator delete(void*)`` 224 225Allocating memory 226================= 227Implementing ``operator new`` lets us control how coroutine memory is allocated. 228Our custom ``operator new`` takes two parameters: a ``std::size_t`` and an 229``Args...`` parameter pack. Let's explore how these are handled. 230 231The ``size`` parameter 232---------------------- 233Unlike regular ``operator new`` or `placement new 234<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_, 235the ``promise_type`` ``operator new`` function allocates space not just for 236the ``promise_type`` itself, but also for the compiler-generated coroutine 237stack. The size of this coroutine stack is determined by the compiler and is 238optimization-dependent, so it is not statically observable and is only available 239at runtime. 240 241The ``Args...`` parameter pack 242------------------------------ 243The arguments to coroutine functions are also passed into the ``operator new`` 244of the corresponding ``promise_type``. This allows ``operator new`` to "intercept" 245arguments passed to the coroutine function. 246 247This means that if we have some 248coroutine function: 249 250.. code-block:: c++ 251 252 MyCoroutineType some_func(int a) 253 254and ``MyCoroutineType::promise_type`` has an ``operator new`` like: 255 256.. code-block:: c++ 257 258 static void* operator new(std::size_t n, int a) noexcept 259 260then the argument ``a`` will be copied into the invocation of ``operator new``. 261 262:cpp:class:`pw::async2::Coro` uses this tool to allow the caller of a coroutine 263function to specify which memory allocator to use for the coroutine frame. 264 265Different coroutines may want to use memory provided by different allocators 266with various memory locations or allocation strategies. When allocating a 267regular type ``T``, we'd typically achieve this by pre-allocating ``sizeof(T)`` 268with our custom allocator and then passing the resulting memory into the 269`placement new 270<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_ 271function of ``T``. 272 273However, unlike regular types, coroutine constructors or ``operator new`` are not 274directly available to the caller. Instead, the coroutine types are constructed 275by the compiler when the coroutine function is called. 276 277Furthermore, C++20 doesn't provide a way to know ahead of time how much memory 278needs to be allocated to store the coroutine state—that information is only 279available once our custom ``operator new`` is invoked with a ``size`` value. 280 281Instead, :cpp:class:`pw::async2::Coro` allows custom per-coroutine ``Allocator`` 282values to be provided by intercepting an ``Allocator`` argument to the coroutine 283function. One can pass an allocator as the first argument of all coroutine 284functions. We can then use an ``Args...`` variadic to discard all other 285arguments to the coroutine function: 286 287.. code-block:: c++ 288 289 template<typename... Args> 290 static void* operator new(std::size_t n, 291 Allocator& alloc, 292 const Args&...) noexcept { 293 return alloc.Allocate(n); 294 } 295 296The user's coroutine functions will then look like this: 297 298.. code-block:: c++ 299 300 MyCoroutineType some_func(Allocator&, int a) { ... } 301 302The ``operator new`` implementation will then receive the caller-provided ``Allocator`` 303reference and can use it to allocate the coroutine state. 304 305.. admonition:: ``CoroContext`` 306 307 The allocator argument to :cpp:class:`pw::async2::Coro` is actually packaged inside a 308 :cpp:class:`pw::async2::CoroContext` in order to allow for greater evolvability, but 309 today this type is a simple wrapper over a :cpp:class:`pw::Allocator`. 310 311Handling allocation failure 312=========================== 313Many embedded systems both disable exceptions and wish to recover gracefully 314from allocation failure. If this is the case, ``promise_type::operator new`` must 315be marked ``noexcept`` and return ``nullptr`` on allocation failure. 316 317Note that our coroutine function, however, doesn't have an obvious way to 318return ``nullptr``: 319 320.. code-block:: c++ 321 322 MyCoroutineType some_func(Allocator&, int a) { ... } 323 324It *must* produce a value of type ``MyCoroutineType``. 325 326This is made possible by 327``promise_type::get_return_object_on_allocation_failure``. When ``operator new`` 328returns ``nullptr``, C++ will invoke ``get_return_object_on_allocation_failure`` 329to produce an "empty" ``MyCoroutineType``. Coroutine return types must therefore 330decide on a sentinel representation to use in order to indicate allocation 331failure. 332 333Fortunately, most coroutine return types are simple wrappers around 334``std::coroutine_handle<my_promise_type>``, which is nullable, so we can use the 335``nullptr`` representation to indicate allocation failure. 336 337The ``delete`` function is missing critical pieces. What now? 338============================================================= 339Just as we can control allocation with our custom ``promise_type::operator new``, 340we can control deallocation with ``promise_type::operator delete``! 341 342But unlike ``operator new``, ``delete`` cannot intercept function arguments, 343nor can it access properties of the coroutine frame. By the time ``delete`` 344is called, the coroutine frame has already been destroyed. Other parts of 345C++ use the ``destroying_delete`` API to allow accessing an object as part 346of its deletion, but the coroutine ``promise_type`` doesn't include such an 347API. 348 349This means that ``delete`` has no way to get access to the ``Allocator`` 350instance from which the memory was allocated. Pigweed's 351:cpp:class:`pw::Allocator` API does not provide a way to free memory from only 352a pointer to the allocated memory; a reference to the ``Allocator`` object is 353required (particular ``Allocator`` instances may support this, but the generic 354interface does not). :cpp:class:`pw::async2::Coro` solves this by specifying 355``operator delete`` to be a no-op. Instead, the coroutine memory is freed by 356the coroutine handle wrapper type (``Coro``) *after* running 357``operator delete``: 358 359.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h 360 :language: cpp 361 :linenos: 362 :start-after: [pw_async2-coro-release] 363 :end-before: [pw_async2-coro-release] 364 365.. admonition:: Promise handle address 366 367 The standard doesn't appear to clarify whether or not this usage is 368 supported. I couldn't find any documentation specifying that the promise 369 handle address must point to the *start* of the allocated memory (rather 370 than, say, the allocated memory plus eight bytes or something). 371 372 In practice, this seems to work just fine: the Pigweed API works in both 373 GCC and Clang, even with sanitizers enabled. 374 375 In the future, a guarantee here would be super useful! 376 377---------------------------------------------------------------- 378Still to go: improvements needed from C++ standard and compilers 379---------------------------------------------------------------- 380One concern that may hamper adoption of coroutines in embedded applications is 381the total size of coroutine frames. While these frames are frequently much 382smaller than the stack sizes of threads, there may be many more of them. 383 384Especially for users who would otherwise write manual state-machine-based or 385callback-based code, the current size overhead of coroutines may be a limiting 386factor. Some opportunities for improvement here are discussed below. 387 388Reusing stack space of expired temporaries 389========================================== 390Currently, Clang does not use stack space of temporaries after their 391storage duration ends. This means that all variables and temporaries 392in a coroutine function must have dedicated storage space in the coroutine frame. 393This is unnecessary: many variables are not alive at the same time, and others 394may not ever be alive across a ``co_await`` point! In the former case, storage 395for these variables could be overlapped, and in the latter, they need not be 396stored in the coroutine frame at all. 397 398This results in having a coroutine frame of 88 bytes: 399 400.. code-block:: c++ 401 402 Coro<Status> SimplestCoroWithAwait(CoroContext& cx) { 403 co_return co_await SimplestCoro(cx); 404 } 405 406And this code having a coroutine frame of 408 bytes: 407 408.. code-block:: c++ 409 410 Coro<Status> SimplestCoroWithTwoAwaits(CoroContext& cx) { 411 co_await SimplestCoro(cx); 412 co_await SimplestCoro(cx); 413 co_await SimplestCoro(cx); 414 co_await SimplestCoro(cx); 415 co_await SimplestCoro(cx); 416 co_await SimplestCoro(cx); 417 co_await SimplestCoro(cx); 418 co_await SimplestCoro(cx); 419 co_await SimplestCoro(cx); 420 co_await SimplestCoro(cx); 421 co_return co_await SimplestCoro(cx); 422 } 423 424Despite both requiring the same amount of temporary storage. 425 426Discussion on this topic continues in 427`this LLVM issue regarding the missing lifetime markers on temporaries <https://github.com/llvm/llvm-project/issues/43598>`_. 428 429A similar issue in Rust was solved thanks to tmandry's efforts culminating 430in `this PR <https://github.com/rust-lang/rust/pull/60187>`_. 431 432Passing values in and out of coroutines 433======================================= 434Once a coroutine function has been invoked, it can only be advanced 435(and completed) via a series of calls to 436`resume <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle/resume>`_. 437Annoyingly, ``resume`` does not accept any arguments and returns ``void``, 438so it does not offer a built-in way to pass arguments into our coroutine 439function, nor does it give us a way to return values from our coroutine function. 440 441:cpp:class:`pw::async2::Coro` solves this by storing a pointer to input 442and output storage inside the ``promise_type`` before invoking ``resume``. 443 444.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h 445 :language: cpp 446 :linenos: 447 :start-after: [pw_async2-coro-resume] 448 :end-before: [pw_async2-coro-resume] 449 450This means that every coroutine frame must additionally include space for this 451input/output pointer. This wasted overhead seems fairly easy to eliminate if 452C++ evolved to allow ``promise_type`` to have associated ``resume_arg_type`` 453and ``return_arg_type`` values. 454 455Avoiding dynamic allocation for statically known coroutines 456=========================================================== 457Although the steps described in this blog will allow you to avoid heap 458allocation, they still rely on *dynamic* allocation via a 459:cpp:class:`pw::Allocator`. This means that our application can potentially 460fail at runtime by running out of memory for our coroutines. 461 462For code which only instantiates a single known coroutine or fixed set of 463known coroutines defined within the same translation unit, this is an 464unnecessary cost. Dynamic allocation creates both runtime inefficiency, 465developer uncertainty, and risk: it's no longer possible to statically ensure 466that your program will be able to run on-device. 467 468It would be fantastic if future C++ standards would make it possible to access 469the coroutine frame size of specific, known coroutines so that space for them 470could be statically allocated. 471 472.. admonition:: Heap allocation elision optimization (HALO) 473 474 `Heap allocation elision optimization <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html>`_ 475 can omit the calls to ``new`` and ``delete`` entirely in some cases. 476 However, this relies on the inlining of several independent functions 477 to the top-level scope which encompasses the full lifetime of the 478 coroutine. This isn't possible in the vast majority of the relevant 479 embedded use cases where it is common to store a coroutine as part of a 480 long-lived (often ``static``) object. 481 482 Furthermore, relying on such an optimization would be quite fragile, as 483 the compiler's choice to inline (or not) is dependent upon a variety of 484 heuristics outside the programmer's control. 485 486 Note, too, that when using custom allocator implementations one has 487 to take care or elision will not be possible: the allocation and 488 deallocation functions must be tagged with 489 `the appropriate attributes <https://discourse.llvm.org/t/optimize-away-memory-allocations/65587/4>`_ 490 to tell the compiler that they are safe to optimize away. 491 492----------------------------------- 493Try out Pigweed's coroutine API now 494----------------------------------- 495To experiment with Pigweed's :cpp:class:`pw::async2::Coro` API, you can get 496started by `cloning our Quickstart repo <https://cs.opensource.google/pigweed/quickstart/bazel>`_ 497for a bare-bones example 498or trying out the :ref:`Sense tutorial <showcase-sense-tutorial-intro>` 499for a full tour of a Pigweed product codebase. 500