.. _docs-size-optimizations:

==================
Size optimizations
==================
This page contains recommendations for optimizing the size of embedded software
including its memory and code footprints.

These recommendations are subject to change as the C++ standard and compilers
evolve, and as the authors continue to gain more knowledge and experience in
this area. If you disagree with recommendations, please discuss them with the
Pigweed team, as we're always looking to improve the guide or correct any
inaccuracies.

---------------------------------
Compile Time Constant Expressions
---------------------------------
Using `constexpr <https://en.cppreference.com/w/cpp/language/constexpr>`_
and, as of C++20,
`consteval <https://en.cppreference.com/w/cpp/language/consteval>`_ lets you
evaluate more functions and variables at compile time rather than only at run
time. This often results not only in smaller binaries but also in more
efficient, faster execution.
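
As a minimal sketch of the idea, the following builds a small lookup table
during compilation. The names ``MakeSquareTable`` and ``kSquares`` are
hypothetical and chosen only for illustration; C++17 is assumed.

.. code-block:: cpp

   #include <array>
   #include <cstddef>
   #include <cstdint>

   // Hypothetical helper: fills a 16-entry table of squares. Because it is
   // constexpr, the compiler can run it while compiling.
   constexpr std::array<std::uint16_t, 16> MakeSquareTable() {
     std::array<std::uint16_t, 16> table{};
     for (std::size_t i = 0; i < table.size(); ++i) {
       table[i] = static_cast<std::uint16_t>(i * i);
     }
     return table;
   }

   // A constexpr variable must be initialized by a constant expression, so
   // no code runs at startup to fill the table.
   inline constexpr auto kSquares = MakeSquareTable();

   static_assert(kSquares[12] == 144, "evaluated at compile time");

Whether the table ends up in read-only memory depends on your toolchain and
linker script.
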

We highly encourage using this aspect of C++. However, there is one caveat: be
careful when marking functions ``constexpr`` in APIs which cannot be easily
changed in the future, unless you can prove that the computation can be done at
compile time for all time and on all platforms. This is because there is no
"mutable" escape hatch for ``constexpr``.

See the :doc:`embedded_cpp_guide` for more detail.

---------
Templates
---------
The compiler implements templates by generating a separate version of the
function for each set of types it is instantiated with. This can increase code
size significantly.

Be careful when instantiating non-trivial template functions with multiple
types.

Consider splitting templated interfaces into multiple layers so that more of the
implementation can be shared between different instantiations. A more advanced
form is to share common logic internally through a single instantiation that
uses a default sentinel template argument value, such as ``pw::Vector``'s
``size_t kMaxSize = vector_impl::kGeneric`` or ``pw::span``'s
``size_t Extent = dynamic_extent``.
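
The layering idea can be sketched as follows. This is an illustration only and
not how ``pw::Vector`` is implemented; ``internal::CopyBounded`` and
``FixedBuffer`` are hypothetical names. The non-template function holds the
logic and is shared by every buffer size, while the template layer only
forwards to it.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>

   namespace internal {

   // Non-template core: one copy of this logic serves every buffer size.
   // Returns the number of bytes actually copied.
   constexpr std::size_t CopyBounded(const std::uint8_t* source,
                                     std::size_t source_size,
                                     std::uint8_t* destination,
                                     std::size_t destination_size) {
     const std::size_t count =
         source_size < destination_size ? source_size : destination_size;
     for (std::size_t i = 0; i < count; ++i) {
       destination[i] = source[i];
     }
     return count;
   }

   }  // namespace internal

   // Thin template layer: each instantiation adds only storage and a
   // one-line forwarding call.
   template <std::size_t kSize>
   struct FixedBuffer {
     std::uint8_t data[kSize] = {};

     constexpr std::size_t Assign(const std::uint8_t* source,
                                  std::size_t source_size) {
       return internal::CopyBounded(source, source_size, data, kSize);
     }
   };
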

-----------------
Virtual Functions
-----------------
Virtual functions provide for runtime polymorphism. Unless runtime polymorphism
is required, virtual functions should be avoided. Virtual functions require a
virtual table and a pointer to it in each instance, which increases memory
usage and requires extra instructions at each call site. Virtual functions can
also inhibit compiler optimizations, since the compiler may not be able to tell
which functions will actually be invoked. This can prevent linker garbage
collection, resulting in unused functions being linked into a binary.

When runtime polymorphism is required, virtual functions should be considered.
C alternatives, such as a struct of function pointers, could be used instead,
but these approaches may offer no performance advantage while sacrificing
flexibility and ease of use.

Only use virtual functions when runtime polymorphism is needed. Lastly, try to
avoid templated virtual interfaces, which can compound the cost by
instantiating many virtual tables.

Devirtualization
================
When you do use virtual functions, try to keep devirtualization in mind. You can
make it easier on the compiler and linker by declaring classes as ``final`` to
improve the odds. This can help significantly depending on your toolchain.

If you're interested in more details,
`this is an interesting deep dive <https://quuxplusone.github.io/blog/2021/02/15/devirtualization/>`_.
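
A minimal sketch of marking a class ``final`` follows. The ``Sensor`` and
``FixedSensor`` names are hypothetical. Whether the calls are actually
devirtualized depends on your compiler and optimization settings.

.. code-block:: cpp

   // Hypothetical interface. The protected non-virtual destructor prevents
   // deletion through a base pointer without adding a virtual destructor.
   class Sensor {
    public:
     virtual int Read() const = 0;

    protected:
     ~Sensor() = default;
   };

   // No class can derive from FixedSensor, so a call through a FixedSensor
   // reference can only ever reach FixedSensor::Read().
   class FixedSensor final : public Sensor {
    public:
     int Read() const override { return 42; }
   };

   // The static type is final, so the compiler is free to call (and inline)
   // FixedSensor::Read() directly instead of going through the vtable.
   int ReadTwice(const FixedSensor& sensor) {
     return sensor.Read() + sensor.Read();
   }
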

---------------------------------------------------------
Initialization, Constructors, Finalizers, and Destructors
---------------------------------------------------------
Constructors
============
Where possible, consider making your constructors ``constexpr`` to reduce their
costs. This also makes global instances eligible for placement in the ``.data``
section or, if they are all zeros, in the ``.bss`` section.
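
For illustration, here is a sketch of a class with a ``constexpr``
constructor and two global instances. The ``Counter`` class and the variable
names are hypothetical, and the final section placement is up to your
toolchain.

.. code-block:: cpp

   #include <cstdint>

   // Hypothetical class whose constructor can run at compile time.
   class Counter {
    public:
     constexpr explicit Counter(std::uint32_t start) : value_(start) {}

     constexpr std::uint32_t value() const { return value_; }
     void Increment() { ++value_; }

    private:
     std::uint32_t value_;
   };

   // Constant-initialized with a non-zero value: a candidate for .data, with
   // no constructor call needed during startup.
   Counter g_boot_counter(1);

   // Constant-initialized to all zeros: a candidate for .bss.
   Counter g_error_counter(0);
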

Static Destructors And Finalizers
=================================
For many embedded projects, cleaning up after the program is not a requirement,
meaning the exit functions including any finalizers registered through
``atexit``, ``at_quick_exit``, and static destructors can all be removed to
reduce the size.

The exact mechanics for disabling static destructors depend on your toolchain.

See the `Ignored Finalizer and Destructor Registration`_ section below for
further details regarding disabling registration of functions to be run at exit
via ``atexit`` and ``at_quick_exit``.

Clang
-----
With modern versions of Clang you can simply use ``-fno-c++-static-destructors``
and you are done.

GCC with newlib-nano
--------------------
With GCC this is more complicated. For example, with GCC for ARM Cortex M
devices using ``newlib-nano`` you are forced to tackle the problem in two
stages.

First, there are the destructors for the static global objects. These can be
placed in the ``.fini_array`` and ``.fini`` input sections through the use of
the ``-fno-use-cxa-atexit`` GCC flag, assuming ``newlib-nano`` was configured
with ``HAVE_INITFINI_ARRAY_SUPPORT``. The two input sections can then be
explicitly discarded in the linker script through the use of the special
``/DISCARD/`` output section:

.. code-block:: text

   /DISCARD/ : {
     /* The finalizers are never invoked when the target shuts down and ergo
      * can be discarded. These include C++ global static destructors and C
      * designated finalizers. */
     *(.fini_array);
     *(.fini);
   }

Second, there are the destructors for the scoped static objects, frequently
referred to as Meyers singletons. With the Itanium ABI these use
``__cxa_atexit`` to register destruction on the fly. However, if
``-fno-use-cxa-atexit`` is used with GCC and ``newlib-nano`` these will appear
as ``__tcf_`` prefixed symbols, for example ``__tcf_0``.

There's `an interesting proposal (P1247R0) <http://wg21.link/p1247r0>`_ to
add ``[[no_destroy]]`` attributes to C++ which would be tempting to use here.
Alas this is not an option yet. As mentioned in the proposal, one way to remove
the destructors from these scoped statics is to wrap them in a templated
wrapper which uses placement new.

.. code-block:: cpp

   #include <new>          // Placement new.
   #include <type_traits>  // std::aligned_storage_t.
   #include <utility>      // std::forward.

   template <class T>
   class NoDestroy {
    public:
     // Constructs the wrapped object in place; its destructor is never run.
     template <class... Ts>
     NoDestroy(Ts&&... ts) {
       new (&static_) T(std::forward<Ts>(ts)...);
     }

     T& get() { return reinterpret_cast<T&>(static_); }

    private:
     std::aligned_storage_t<sizeof(T), alignof(T)> static_;
   };

This can then be used as follows to instantiate scoped statics where the
destructor will never be invoked and ergo will not be linked in.

.. code-block:: cpp

   Foo& GetFoo() {
     static NoDestroy<Foo> foo(foo_args);
     return foo.get();
   }

-------
Strings
-------

Tokenization
============
Instead of directly using strings and printf, consider using
:ref:`module-pw_tokenizer` to replace strings and printf-style formatted strings
with binary tokens during compilation. This can reduce the code size, memory
usage, I/O traffic, and even CPU utilization by replacing snprintf calls with
simple tokenization code.

Be careful when using string arguments with tokenization, as these still result
in a string in your binary which is appended to your token at run time.

String Formatting
=================
The formatted output family of printf functions in ``<cstdio>`` is quite
expensive from a code size point of view and often relies on malloc. Instead,
where tokenization cannot be used, consider using :ref:`module-pw_string`'s
utilities.

Removing all printf functions often saves more than 5KiB of code size on ARM
Cortex M devices using ``newlib-nano``.

Logging & Asserting
===================
Using tokenized backends for logging and asserting such as
:ref:`module-pw_log_tokenized` coupled with :ref:`module-pw_assert_log` can
drastically reduce the costs. However, even with this approach there remains a
call site cost which can add up due to arguments and included metadata.

Try to avoid string arguments and reduce unnecessary extra arguments where
possible. Also consider adjusting log levels to compile out debug or even info
logs as code stabilizes and matures.

Future Plans
------------
Going forward, Pigweed is evaluating extra configuration options to do things
such as dropping log arguments for certain log levels and modules to give users
finer grained control in trading off diagnostic value and the size cost.

----------------------------------
Threading and Synchronization Cost
----------------------------------

Lighter-weight Signaling Primitives
===================================
Consider using ``pw::sync::ThreadNotification`` instead of semaphores as they
can be implemented using more efficient RTOS specific signaling primitives. For
example, on FreeRTOS they can be backed by direct task notifications which are
more than 10x smaller than semaphores while also being faster.

Threads and their stack sizes
=============================
Although synchronous APIs are incredibly portable and often easier to reason
about, it is easy to forget the large stack cost this design paradigm comes
with. We highly recommend watermarking your stacks to reduce wasted memory.

Our snapshot integrations for RTOSes such as :ref:`module-pw_thread_freertos`
and :ref:`module-pw_thread_embos` come with built-in support to report stack
watermarks for threads if enabled in the kernel.

In addition, consider using asynchronous design patterns such as Active Objects
which can use :ref:`module-pw_work_queue` or similar asynchronous dispatch work
queues to effectively permit the sharing of stack allocations.

Buffer Sizing
=============
We'd be remiss not to mention the sizing of the various buffers that may exist
in your application. You could consider watermarking them with
:ref:`module-pw_metric`. You may also be able to adjust their servicing interval
and priority, but do not forget to take the ingress burst sizes and scheduling
jitter into account.

----------------------------
Standard C and C++ libraries
----------------------------
Toolchains are typically distributed with their preferred standard C library and
standard C++ library of choice for the target platform.

Although you do not always have a choice in what standard C library and what
standard C++ library is used or even how it's compiled, stay vigilant for common
sources of bloat.

Assert
======
The standard C library provides the ``assert`` function or macro which may be
used internally even if your application does not invoke it directly.
Although this can be disabled through ``NDEBUG``, there typically is not a
portable way of replacing the ``assert(condition)`` implementation without
configuring and recompiling your standard C library.

However, you can consider replacing the implementation at link time with a
cheaper implementation. For example ``newlib-nano``, which comes with the
``GNU Arm Embedded Toolchain``, often has an expensive ``__assert_func``
implementation which uses ``fiprintf`` to print to ``stderr`` before invoking
``abort()``. This can be replaced with a simple ``PW_CRASH`` invocation which
can save several kilobytes in case ``fiprintf`` isn't used elsewhere.

One option to remove this bloat is to use ``--wrap`` at link time to replace
these implementations. As an example, in GN you could replace it with the
following ``BUILD.gn`` file:

.. code-block:: text

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_build/target_types.gni")

   # Wraps the function called by newlib's implementation of assert from
   # assert.h.
   #
   # When using this, we suggest injecting :wrapped_newlib_assert via
   # pw_build_LINK_DEPS.
   config("wrap_newlib_assert") {
     ldflags = [ "-Wl,--wrap=__assert_func" ]
   }

   # Implements the function called by newlib's implementation of assert from
   # assert.h which invokes __assert_func unless NDEBUG is defined.
   pw_source_set("wrapped_newlib_assert") {
     sources = [ "wrapped_newlib_assert.cc" ]
     deps = [
       "$dir_pw_assert:check",
       "$dir_pw_preprocessor",
     ]
   }

And a ``wrapped_newlib_assert.cc`` source file implementing the wrapped assert
function:

.. code-block:: cpp

   #include "pw_assert/check.h"
   #include "pw_preprocessor/compiler.h"

   // This is defined by <cassert>
   extern "C" PW_NO_RETURN void __wrap___assert_func(const char*,
                                                     int,
                                                     const char*,
                                                     const char*) {
     PW_CRASH("libc assert() failure");
   }

Ignored Finalizer and Destructor Registration
=============================================
Even if no cleanup is done during shutdown for your target, shutdown functions
such as ``atexit``, ``at_quick_exit``, and ``__cxa_atexit`` can sometimes not be
linked out. This may be due to vendor code or perhaps using scoped statics, also
known as Meyers singletons.

The registration of these destructors and finalizers may include locks, malloc,
and more depending on your standard C library and its configuration.

One option to remove this bloat is to use ``--wrap`` at link time to replace
these implementations with ones which do nothing. As an example, in GN you could
replace them with the following ``BUILD.gn`` file:

.. code-block:: text

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_build/target_types.gni")

   config("wrap_atexit") {
     ldflags = [
       "-Wl,--wrap=atexit",
       "-Wl,--wrap=at_quick_exit",
       "-Wl,--wrap=__cxa_atexit",
     ]
   }

   # Implements atexit, at_quick_exit, and __cxa_atexit from stdlib.h with noop
   # versions for targets which do not cleanup during exit and quick_exit.
   #
   # This removes any dependencies which may exist in your existing libc.
   # Although this removes the ability for things such as Meyers singletons,
   # i.e. non-global statics, to register destruction functions, it does not
   # permit them to be garbage collected by the linker.
   pw_source_set("wrapped_noop_atexit") {
     sources = [ "wrapped_noop_atexit.cc" ]
   }

And a ``wrapped_noop_atexit.cc`` source file implementing the noop functions:

.. code-block:: cpp

   // These two are defined by <cstdlib>.
   extern "C" int __wrap_atexit(void (*)(void)) { return 0; }
   extern "C" int __wrap_at_quick_exit(void (*)(void)) { return 0; }

   // This function is part of the Itanium C++ ABI, there is no header which
   // provides this.
   extern "C" int __wrap___cxa_atexit(void (*)(void*), void*, void*) { return 0; }

Unexpected Bloat in Disabled STL Exceptions
===========================================
The GCC
`manual <https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_exceptions.html>`_
recommends using ``-fno-exceptions`` along with ``-fno-unwind-tables`` to
disable exceptions and any associated overhead. This should replace all throw
statements with calls to ``abort()``.

However, what we've noticed with GCC and ``libstdc++`` is that there is a
risk that the STL will still throw exceptions when the application is compiled
with ``-fno-exceptions``, and there is no way for you to catch them. This can
occur because libraries such as ``libstdc++.a`` may not have been compiled with
``-fno-exceptions`` even though your application is linked against them. In
theory, this is not unsafe because the unhandled exception will invoke
``abort()`` via ``std::terminate()``.

See
`this <https://blog.mozilla.org/nnethercote/2011/01/18/the-dangers-of-fno-exceptions/>`_
for more information.

Unfortunately there can be significant overhead surrounding these throw call
sites in the ``std::__throw_*`` helper functions. Implementations such as
``std::__throw_out_of_range_fmt(const char*, ...)``, together with their
snprintf and ergo malloc dependencies, can very quickly add up to many
kilobytes of unnecessary overhead.

One option to remove this bloat while also making sure that the exceptions will
actually result in an effective ``abort()`` is to use ``--wrap`` at link time to
replace these implementations with ones which simply call ``PW_CRASH``.

As an example, in GN you could replace them with the following ``BUILD.gn``
file. Note that the mangled names must be used:

.. code-block:: text

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_build/target_types.gni")

   # Wraps the std::__throw_* functions called by GNU ISO C++ Library regardless
   # of whether "-fno-exceptions" is specified.
   #
   # When using this, we suggest injecting :wrapped_libstdc++_functexcept via
   # pw_build_LINK_DEPS.
   config("wrap_libstdc++_functexcept") {
     ldflags = [
       "-Wl,--wrap=_ZSt21__throw_bad_exceptionv",
       "-Wl,--wrap=_ZSt17__throw_bad_allocv",
       "-Wl,--wrap=_ZSt16__throw_bad_castv",
       "-Wl,--wrap=_ZSt18__throw_bad_typeidv",
       "-Wl,--wrap=_ZSt19__throw_logic_errorPKc",
       "-Wl,--wrap=_ZSt20__throw_domain_errorPKc",
       "-Wl,--wrap=_ZSt24__throw_invalid_argumentPKc",
       "-Wl,--wrap=_ZSt20__throw_length_errorPKc",
       "-Wl,--wrap=_ZSt20__throw_out_of_rangePKc",
       "-Wl,--wrap=_ZSt24__throw_out_of_range_fmtPKcz",
       "-Wl,--wrap=_ZSt21__throw_runtime_errorPKc",
       "-Wl,--wrap=_ZSt19__throw_range_errorPKc",
       "-Wl,--wrap=_ZSt22__throw_overflow_errorPKc",
       "-Wl,--wrap=_ZSt23__throw_underflow_errorPKc",
       "-Wl,--wrap=_ZSt19__throw_ios_failurePKc",
       "-Wl,--wrap=_ZSt19__throw_ios_failurePKci",
       "-Wl,--wrap=_ZSt20__throw_system_errori",
       "-Wl,--wrap=_ZSt20__throw_future_errori",
       "-Wl,--wrap=_ZSt25__throw_bad_function_callv",
     ]
   }

   # Implements the std::__throw_* functions called by GNU ISO C++ Library
   # regardless of whether "-fno-exceptions" is specified with PW_CRASH.
   pw_source_set("wrapped_libstdc++_functexcept") {
     sources = [ "wrapped_libstdc++_functexcept.cc" ]
     deps = [
       "$dir_pw_assert:check",
       "$dir_pw_preprocessor",
     ]
   }

And a ``wrapped_libstdc++_functexcept.cc`` source file implementing each
wrapped and mangled ``std::__throw_*`` function:

.. code-block:: cpp

   #include "pw_assert/check.h"
   #include "pw_preprocessor/compiler.h"

   // These are all wrapped implementations of the throw functions provided by
   // libstdc++'s bits/functexcept.h which are not needed when "-fno-exceptions"
   // is used.

   // std::__throw_bad_exception(void)
   extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_bad_exceptionv() {
     PW_CRASH("std::throw_bad_exception");
   }

   // std::__throw_bad_alloc(void)
   extern "C" PW_NO_RETURN void __wrap__ZSt17__throw_bad_allocv() {
     PW_CRASH("std::throw_bad_alloc");
   }

   // std::__throw_bad_cast(void)
   extern "C" PW_NO_RETURN void __wrap__ZSt16__throw_bad_castv() {
     PW_CRASH("std::throw_bad_cast");
   }

   // std::__throw_bad_typeid(void)
   extern "C" PW_NO_RETURN void __wrap__ZSt18__throw_bad_typeidv() {
     PW_CRASH("std::throw_bad_typeid");
   }

   // std::__throw_logic_error(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_logic_errorPKc(const char*) {
     PW_CRASH("std::throw_logic_error");
   }

   // std::__throw_domain_error(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_domain_errorPKc(const char*) {
     PW_CRASH("std::throw_domain_error");
   }

   // std::__throw_invalid_argument(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_invalid_argumentPKc(
       const char*) {
     PW_CRASH("std::throw_invalid_argument");
   }

   // std::__throw_length_error(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_length_errorPKc(const char*) {
     PW_CRASH("std::throw_length_error");
   }

   // std::__throw_out_of_range(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_out_of_rangePKc(const char*) {
     PW_CRASH("std::throw_out_of_range");
   }

   // std::__throw_out_of_range_fmt(const char*, ...)
   extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_out_of_range_fmtPKcz(
       const char*, ...) {
     PW_CRASH("std::throw_out_of_range");
   }

   // std::__throw_runtime_error(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_runtime_errorPKc(
       const char*) {
     PW_CRASH("std::throw_runtime_error");
   }

   // std::__throw_range_error(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_range_errorPKc(const char*) {
     PW_CRASH("std::throw_range_error");
   }

   // std::__throw_overflow_error(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt22__throw_overflow_errorPKc(
       const char*) {
     PW_CRASH("std::throw_overflow_error");
   }

   // std::__throw_underflow_error(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt23__throw_underflow_errorPKc(
       const char*) {
     PW_CRASH("std::throw_underflow_error");
   }

   // std::__throw_ios_failure(const char*)
   extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKc(const char*) {
     PW_CRASH("std::throw_ios_failure");
   }

   // std::__throw_ios_failure(const char*, int)
   extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKci(const char*,
                                                                     int) {
     PW_CRASH("std::throw_ios_failure");
   }

   // std::__throw_system_error(int)
   extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_system_errori(int) {
     PW_CRASH("std::throw_system_error");
   }

   // std::__throw_future_error(int)
   extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_future_errori(int) {
     PW_CRASH("std::throw_future_error");
   }

   // std::__throw_bad_function_call(void)
   extern "C" PW_NO_RETURN void __wrap__ZSt25__throw_bad_function_callv() {
     PW_CRASH("std::throw_bad_function_call");
   }

---------------------------------
Compiler and Linker Optimizations
---------------------------------

Compiler Optimization Options
=============================
Don't forget to configure your compiler to optimize for size if needed. With
Clang this is ``-Oz`` and with GCC this can be done via ``-Os``. The GN
toolchains provided through :ref:`module-pw_toolchain` which are optimized for
size are suffixed with ``*_size_optimized``.

Garbage collect function and data sections
==========================================
By default, all functions in an object file are placed within the same linker
"section" (e.g. ``.text``). With Clang and GCC you can use
``-ffunction-sections`` and ``-fdata-sections`` to use a unique "section" for
each function and data object (e.g. ``.text.do_foo_function``). This permits
you to pass ``--gc-sections`` to the linker to cull any sections which are not
referenced.

To see what sections were garbage collected you can pass ``--print-gc-sections``
to the linker so it prints out what was removed.

The GN toolchains provided through :ref:`module-pw_toolchain` are configured to
do this by default.

Function Inlining
=================
Don't forget to expose trivial functions such as member accessors as inline
definitions in the header. The compiler and linker can make the trade-off on
whether the function should actually be inlined or not based on your
optimization settings; however, this at least gives them the option. Note that
LTO can inline functions which are not defined in headers.

We stand by the
`Google style guide <https://google.github.io/styleguide/cppguide.html#Inline_Functions>`_
to recommend considering this for simple functions which are 10 lines or less.
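
As a small sketch, a header might expose its trivial accessors like this. The
``Point`` class is hypothetical; the larger function stands in for code that
would normally live in a separate ``.cc`` file.

.. code-block:: cpp

   // point.h (hypothetical header)
   class Point {
    public:
     constexpr Point(int x, int y) : x_(x), y_(y) {}

     // Defined in the class body, so these are implicitly inline and visible
     // to every translation unit which includes the header.
     constexpr int x() const { return x_; }
     constexpr int y() const { return y_; }

     // Larger function: only declared here.
     int ManhattanDistance(const Point& other) const;

    private:
     int x_;
     int y_;
   };

   // point.cc (hypothetical source file): without LTO, callers in other
   // translation units cannot inline this.
   int Point::ManhattanDistance(const Point& other) const {
     const int dx = x_ > other.x_ ? x_ - other.x_ : other.x_ - x_;
     const int dy = y_ > other.y_ ? y_ - other.y_ : other.y_ - y_;
     return dx + dy;
   }
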

Link Time Optimization (LTO)
============================
**Summary: LTO can decrease your binary size, at a cost: LTO makes debugging
harder, interacts poorly with linker scripts, and makes crash reports less
informative. We advise only enabling LTO when absolutely necessary.**

Link time optimization (LTO) moves some optimizations from the individual
compile steps to the final link step, to enable optimizing across translation
unit boundaries.

LTO can both increase performance and reduce binary size for embedded projects.
This appears to be a strict improvement, and one might think enabling LTO at
all times is the best approach. However, this is not the case; in practice, LTO
is a trade-off.

**LTO benefits**

* **Reduces binary size** - When compiling with size-shrinking flags like
  ``-Oz``, some function call overhead can be eliminated, and code paths might
  be eliminated by the optimizer after inlining. This can include critical
  abstraction removal like devirtualization.
* **Improves performance** - When code is inlined, the optimizer can better
  reduce the number of instructions. When code is smaller, the instruction
  cache has a better hit ratio, leading to better performance. In some cases,
  entire function calls are eliminated.

**LTO costs**

* **LTO interacts poorly with linker scripts** - Production embedded projects
  often have complicated linker scripts to control the physical layout of code
  and data on the device. For example, you may want to put performance critical
  audio codec functions into the fast tightly coupled memory (TCM) region.
  However, LTO can interact with linker script requirements in strange ways,
  like inappropriately inlining code that was manually placed into other
  functions in the wrong region, leading to hard-to-understand bugs.
* **Debugging LTO binaries is harder** - LTO increases the differences between
  the machine code and the source code. This makes stepping through source code
  in a debugger confusing, since the instruction pointer can hop around in
  unexpected ways.
* **Crash reports for LTO binaries can be misleading** - Just as with
  debugging, LTO'd binaries can produce confusing stacks in crash reports.
* **LTO significantly increases build times** - The compilation model is
  different when LTO is enabled, since individual translation unit compilations
  (``.cc`` --> ``.o``) now produce LLVM or GCC IR instead of native machine
  code; machine code is only generated at the link phase. This makes the final
  link step take significantly longer. Since any source change results in a
  link step, developer velocity is reduced due to the slow compile time.

How to enable LTO
-----------------
On GCC and Clang, LTO is enabled by passing ``-flto`` to both the compiler
and the linker. On GCC, ``-fdevirtualize-at-ltrans`` enables more aggressive
devirtualization.

Our recommendation
------------------
* Disable LTO unless absolutely necessary; e.g. due to lack of space.
* When enabling LTO, carefully and thoroughly test the resulting binary.
* Check that crash reports are still useful under LTO for your product.

Disabling Scoped Static Initialization Locks
============================================
C++11 requires that scoped static objects, i.e. Meyers singletons, are
initialized in a thread-safe manner. Unfortunately this is rarely the case on
embedded targets. For example, with GCC on an ARM Cortex M device, if you test
for this you will discover that the program instead crashes, as reentrant
initialization is detected through the use of guard variables.

With GCC and Clang, ``-fno-threadsafe-statics`` can be used to remove the global
lock which often does not work for embedded targets. Note that this leaves the
guard variables in place which ensure that reentrant initialization continues to
crash.

Be careful when using this option in case you are relying on threadsafe
initialization of statics and the global locks were functional for your target.

Triaging Unexpectedly Linked In Functions
=========================================
Lastly, as a tip, if you cannot figure out why a function is being linked in you
can consider:

* Using ``--wrap`` with the linker to remove the implementation, resulting in a
  link failure which typically calls out which calling function can no longer be
  linked.
* With GCC, you can use ``-fcallgraph-info`` to visualize or otherwise inspect
  the callgraph to figure out who is calling what.
* Sometimes symbolizing the address can resolve what a function is for. For
  example, if you are using ``newlib-nano`` along with ``-fno-use-cxa-atexit``,
  scoped static destructors are prefixed ``__tcf_*``. To figure out which
  object these destructor functions are associated with, you can use
  ``llvm-symbolizer`` or ``addr2line``, which will often print out the related
  object's name.

Sorting input sections by alignment
===================================

Linker scripts often contain input section wildcard patterns to specify which
input sections should be placed in each output section. For example, say a
linker script contains a sections command like the following:

.. code-block:: text

   .text : { *(.init*) *(.text*) }

By default, the GCC and Clang linkers will place the input sections matched by
each wildcard pattern in the order they are seen at link time. The linker will
insert padding bytes as necessary to satisfy the alignment requirements of each
section.

The GCC and Clang linkers allow one to first sort the sections matched by each
wildcard pattern by alignment with the ``SORT_BY_ALIGNMENT`` keyword, which can
reduce the number of padding bytes needed and save memory. This can be used to
enable alignment sort on a per-pattern basis like so:

.. code-block:: text

   .text : { *(SORT_BY_ALIGNMENT(.init*)) *(SORT_BY_ALIGNMENT(.text*)) }

This keyword can be applied globally to all wildcard matches in your linker
script by passing the ``--sort-section=alignment`` option to the linker.

See the `ld manual <https://sourceware.org/binutils/docs/ld/Input-Section-Wildcards.html>`_
for more information.
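
To make the padding savings concrete, the following sketch models the
arithmetic only; it is not how a linker is implemented. The ``Section`` struct,
``PaddingBytes``, and the example sizes are all hypothetical, and placement is
assumed to start at offset 0.

.. code-block:: cpp

   #include <array>
   #include <cstddef>

   // Hypothetical model of an input section: its size and required alignment.
   struct Section {
     std::size_t size;
     std::size_t alignment;
   };

   // Returns the total padding needed to place the sections in the given
   // order, starting at offset 0.
   template <std::size_t kCount>
   constexpr std::size_t PaddingBytes(
       const std::array<Section, kCount>& sections) {
     std::size_t offset = 0;
     std::size_t padding = 0;
     for (const Section& section : sections) {
       const std::size_t misalignment = offset % section.alignment;
       if (misalignment != 0) {
         padding += section.alignment - misalignment;
         offset += section.alignment - misalignment;
       }
       offset += section.size;
     }
     return padding;
   }

   // The same four sections, first in link order and then sorted by
   // descending alignment.
   inline constexpr std::array<Section, 4> kLinkOrder = {
       {{1, 1}, {8, 8}, {1, 1}, {8, 8}}};
   inline constexpr std::array<Section, 4> kSortedByAlignment = {
       {{8, 8}, {8, 8}, {1, 1}, {1, 1}}};

   static_assert(PaddingBytes(kLinkOrder) == 14);
   static_assert(PaddingBytes(kSortedByAlignment) == 0);

In this made-up case, interleaving 1-byte and 8-byte aligned sections costs 14
bytes of padding, while placing the most strictly aligned sections first costs
none.
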
715