# Android ELF TLS

App developers probably just want to read the
[quick ELF TLS status summary](../android-changes-for-ndk-developers.md#elf-tls-available-for-api-level-29)
instead.

This document covers the detailed design and implementation choices.

[TOC]

# Overview

ELF TLS is a system for automatically allocating thread-local variables with cooperation among the
compiler, linker, dynamic loader, and libc.

Thread-local variables are declared in C and C++ with a specifier, e.g.:

```cpp
thread_local int tls_var;
```

At run-time, TLS variables are allocated on a module-by-module basis, where a module is a shared
object or executable. At program startup, TLS for all initially-loaded modules comprises the "Static
TLS Block". TLS variables within the Static TLS Block exist at fixed offsets from an
architecture-specific thread pointer (TP) and can be accessed very efficiently -- typically just a
few instructions. TLS variables belonging to dlopen'ed shared objects, on the other hand, may be
allocated lazily, and accessing them typically requires a function call.

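As a small illustration of what "thread-local" means to the programmer, the sketch below increments
a `thread_local` counter from two threads and shows that each thread sees only its own copy. The
names `tls_counter`, `bump_in_current_thread`, and `bump_in_new_thread` are made up for this
example; how the variable is actually located (static block or lazily allocated block) is up to the
toolchain and loader.

```cpp
#include <thread>

// Each thread gets its own copy of this variable, initialized to 0.
thread_local int tls_counter = 0;

// Increments the calling thread's copy `n` times and returns its new value.
int bump_in_current_thread(int n) {
  for (int i = 0; i < n; ++i) ++tls_counter;
  return tls_counter;
}

// Runs bump_in_current_thread(n) on a fresh thread and returns what that
// thread observed. The caller's own copy of tls_counter is not touched.
int bump_in_new_thread(int n) {
  int result = -1;
  std::thread t([&] { result = bump_in_current_thread(n); });
  t.join();
  return result;
}
```

Calling `bump_in_current_thread(3)` and then `bump_in_new_thread(5)` leaves the caller's
`tls_counter` at 3, because the new thread started from its own zero-initialized copy.
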
# Thread-Specific Memory Layout

Ulrich Drepper's ELF TLS document specifies two ways of organizing memory pointed at by the
architecture-specific thread-pointer ([`__get_tls()`] in Bionic):

![TLS Variant 1 Layout](img/tls-variant1.png)

![TLS Variant 2 Layout](img/tls-variant2.png)

Variant 1 places the static TLS block after the TP, whereas variant 2 places it before the TP.
According to Drepper, variant 2 was motivated by backwards compatibility, and variant 1 was designed
for Itanium. The choice has effects on the toolchain, loader, and libc. In particular, when linking
an executable, the linker needs to know where an executable's TLS segment is relative to the TP so
it can correctly relocate TLS accesses. Both variants are incompatible with Bionic's current
thread-specific data layout, but variant 1 is more problematic than variant 2.

Each thread has a "Dynamic Thread Vector" (DTV) with a pointer to each module's TLS block (or NULL
if it hasn't been allocated yet). If the executable has a TLS segment, then it will always be module
1, and its storage will always be immediately after (or before) the TP. In variant 1, the TP is
expected to point immediately at the DTV pointer, whereas in variant 2, the DTV pointer's offset
from TP is implementation-defined.

The DTV's "generation" field is used to lazily update/reallocate the DTV when new modules are loaded
or unloaded.

[`__get_tls()`]: https://android.googlesource.com/platform/bionic/+/7245c082658182c15d2a423fe770388fec707cbc/libc/private/__get_tls.h

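To make the "after the TP" versus "before the TP" distinction concrete, here is a rough sketch of
how a linker might compute the TP-relative offset of a variable in the executable's TLS segment
under each variant. It is a simplification: it assumes the segment start is suitably aligned, takes
the TCB size as a parameter rather than fixing it per architecture, and the type and function names
(`TlsSegment`, `variant1_tpoff`, `variant2_tpoff`) are hypothetical.

```cpp
#include <cstddef>

// Rounds `value` up to a multiple of `align` (assumed to be a power of two).
constexpr std::size_t align_up(std::size_t value, std::size_t align) {
  return (value + align - 1) & ~(align - 1);
}

// Hypothetical summary of the executable's PT_TLS segment.
struct TlsSegment {
  std::size_t size;   // p_memsz
  std::size_t align;  // p_align, a power of two
};

// Variant 1: the executable's block follows the TCB, so offsets are positive.
constexpr std::ptrdiff_t variant1_tpoff(TlsSegment seg, std::size_t tcb_size,
                                        std::size_t var_offset) {
  return static_cast<std::ptrdiff_t>(align_up(tcb_size, seg.align) + var_offset);
}

// Variant 2: the executable's block ends at the TP, so offsets are negative.
constexpr std::ptrdiff_t variant2_tpoff(TlsSegment seg, std::size_t var_offset) {
  return static_cast<std::ptrdiff_t>(var_offset) -
         static_cast<std::ptrdiff_t>(align_up(seg.size, seg.align));
}
```

For a 40-byte segment aligned to 16 bytes and a variable at segment offset 8, this gives TP+24
under variant 1 with a 16-byte TCB, and TP-40 under variant 2.
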
# Access Models

When a C/C++ file references a TLS variable, the toolchain generates instructions to find its
address using a TLS "access model". The access models trade generality against efficiency. The four
models are:

 * GD: General Dynamic (aka Global Dynamic)
 * LD: Local Dynamic
 * IE: Initial Exec
 * LE: Local Exec

A TLS variable may be in a different module than the reference.

## General Dynamic (or Global Dynamic) (GD)

A GD access can refer to a TLS variable anywhere. To access a variable `tls_var` using the
"traditional" non-TLSDESC design described in Drepper's TLS document, the compiler emits a
call to a `__tls_get_addr` function provided by libc.

For example, if we have this C code in a shared object:

```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```

The toolchain generates code like this:

```cpp
struct TlsIndex {
  long module; // starts counting at 1
  long offset;
};

char* get_tls_var() {
  static TlsIndex tls_var_idx = { // allocated in the .got
    R_TLS_DTPMOD(tls_var), // dynamic TP module ID
    R_TLS_DTPOFF(tls_var), // dynamic TP offset
  };
  return __tls_get_addr(&tls_var_idx);
}
```

`R_TLS_DTPMOD` is a dynamic relocation to the index of the module containing `tls_var`, and
`R_TLS_DTPOFF` is a dynamic relocation to the offset of `tls_var` within its module's `PT_TLS`
segment.

`__tls_get_addr` looks up `TlsIndex::module`'s entry in the DTV and adds `TlsIndex::offset` to
the module's TLS block. Before it can do this, it ensures that the module's TLS block is allocated.
A simple approach is to allocate memory lazily:

1. If the current thread's DTV generation count is less than the current global TLS generation, then
   `__tls_get_addr` may reallocate the DTV or free blocks for unloaded modules.

2. If the DTV's entry for the given module is `NULL`, then `__tls_get_addr` allocates the module's
   memory.

If an allocation fails, `__tls_get_addr` calls `abort` (like emutls).

musl, on the other hand, preallocates TLS memory in `pthread_create` and in `dlopen`, and each can
report out-of-memory.

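The two lazy-allocation steps above can be modeled with a toy, user-space version of
`__tls_get_addr`. This is only a sketch under several assumptions, not Bionic's implementation:
every `Toy*`/`toy_*` name is invented, TLS blocks are zero-initialized only, module sizes are
nonzero, modules are never unloaded, and there is no locking around the global module list. The
per-thread DTV is itself stored in a `thread_local` purely for convenience.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct ToyTlsIndex {
  std::size_t module;  // module IDs start counting at 1
  std::size_t offset;
};

struct ToyModule {
  std::size_t size;  // size of the module's TLS block (assumed nonzero)
};

struct ToyGlobals {
  std::size_t generation = 0;      // bumped whenever a module is added
  std::vector<ToyModule> modules;  // modules[id - 1]
};

struct ToyDtv {
  std::size_t generation = 0;
  std::vector<char*> blocks;  // blocks[id - 1], nullptr until first use
};

ToyGlobals g_toy_tls;
thread_local ToyDtv t_toy_dtv;

// Registers a new TLS module and returns its 1-based module ID.
std::size_t toy_register_module(std::size_t size) {
  g_toy_tls.modules.push_back(ToyModule{size});
  ++g_toy_tls.generation;
  return g_toy_tls.modules.size();
}

char* toy_tls_get_addr(const ToyTlsIndex* ti) {
  ToyDtv& dtv = t_toy_dtv;
  // Step 1: the DTV is out of date, so grow it to cover new modules.
  if (dtv.generation < g_toy_tls.generation) {
    dtv.blocks.resize(g_toy_tls.modules.size(), nullptr);
    dtv.generation = g_toy_tls.generation;
  }
  // Step 2: allocate the module's block on first access from this thread.
  char*& block = dtv.blocks[ti->module - 1];
  if (block == nullptr) {
    block = static_cast<char*>(std::calloc(1, g_toy_tls.modules[ti->module - 1].size));
    if (block == nullptr) std::abort();  // OOM aborts, as described above
  }
  return block + ti->offset;
}
```

Repeated calls with the same index return the same address for a given thread, and registering a
second module only costs that thread a DTV resize on its next access.
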
## Local Dynamic (LD)

LD is a specialization of GD that's useful when a function has references to two or more TLS
variables that are both part of the same module as the reference. Instead of a call to
`__tls_get_addr` for each variable, the compiler calls `__tls_get_addr` once to get the current
module's TLS block, then adds each variable's DTPOFF to the result.

For example, suppose we have this C code:

```cpp
static thread_local int x;
static thread_local int y;
int sum() {
  return x + y;
}
```

The toolchain generates code like this:

```cpp
int sum() {
  static TlsIndex tls_module_idx = { // allocated in the .got
    // a dynamic relocation against symbol 0 => current module ID
    R_TLS_DTPMOD(NULL),
    0,
  };
  char* base = __tls_get_addr(&tls_module_idx);
  // These R_TLS_DTPOFF() relocations are resolved at link-time.
  int* px = base + R_TLS_DTPOFF(x);
  int* py = base + R_TLS_DTPOFF(y);
  return *px + *py;
}
```

(XXX: LD might be important for C++ `thread_local` variables -- even a single `thread_local`
variable with a dynamic initializer has an associated TLS guard variable.)

## Initial Exec (IE)

If the variable is part of the Static TLS Block (i.e. the executable or an initially-loaded shared
object), then its offset from the TP is known at load-time. The variable can be accessed with a few
loads.

Example: a C file for an executable:

```cpp
// tls_var could be defined in the executable, or it could be defined
// in a shared object the executable links against.
extern thread_local char tls_var;
char* get_addr() { return &tls_var; }
```

Compiles to:

```cpp
// allocated in the .got, resolved at load-time with a dynamic reloc.
// Unlike DTPOFF, which is relative to the start of the module's block,
// TPOFF is directly relative to the thread pointer.
static long tls_var_gotoff = R_TLS_TPOFF(tls_var);

char* get_addr() {
  return (char*)__get_tls() + tls_var_gotoff;
}
```

## Local Exec (LE)

LE is a specialization of IE. If the variable is not just part of the Static TLS Block, but is also
part of the executable (and referenced from the executable), then a GOT access can be avoided. The
IE example compiles to:

```cpp
char* get_addr() {
  // R_TLS_TPOFF() is resolved at (static) link-time
  return (char*)__get_tls() + R_TLS_TPOFF(tls_var);
}
```

## Selecting an Access Model

The compiler selects an access model for each variable reference using these factors:
 * The absence of `-fpic` implies an executable, so use IE/LE.
 * Code compiled with `-fpic` could be in a shared object, so use GD/LD.
 * The per-file default can be overridden with `-ftls-model=<model>`.
 * Specifiers on the variable (`static`, `extern`, ELF visibility attributes).
 * A variable can be annotated with `__attribute__((tls_model(...)))`. Clang may still use a more
   efficient model than the one specified.

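As an illustration of the per-variable annotation, the following sketch uses the GCC/Clang
`tls_model` attribute. The variable and function names are invented for the example, and whether
the requested model survives to the final binary depends on the compiler and on linker relaxation.

```cpp
// No annotation: the compiler picks a model from -fpic, -ftls-model, and
// the variable's linkage and visibility.
thread_local int default_model_var = 1;

// Requests the Initial Exec model. In a shared object this is only safe if
// the object is loaded at startup (or the libc reserves surplus static TLS).
thread_local int ie_var __attribute__((tls_model("initial-exec"))) = 2;

// Both variables are used the same way in source; only the generated
// access sequence differs.
int sum_models() {
  return default_model_var + ie_var;
}
```

Comparing the assembly for `sum_models` with and without `-fpic` is one way to see which access
sequence the compiler chose for each variable.
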
# Shared Objects with Static TLS

Shared objects are sometimes compiled with `-ftls-model=initial-exec` (i.e. "static TLS") for better
performance. On Ubuntu, for example, `libc.so.6` and `libOpenGL.so.0` are compiled this way. Shared
objects using static TLS can't be loaded with `dlopen` unless libc has reserved enough surplus
memory in the static TLS block. glibc reserves a kilobyte or two (`TLS_STATIC_SURPLUS`) with the
intent that only a few core system libraries would use static TLS. Non-core libraries also sometimes
use it, which can break `dlopen` if the surplus area is exhausted. See:
 * https://bugzilla.redhat.com/show_bug.cgi?id=1124987
 * web search: [`"dlopen: cannot load any more object with static TLS"`][glibc-static-tls-error]

Neither Bionic nor musl currently allocates any surplus TLS memory.

In general, supporting surplus TLS memory probably requires maintaining a thread list so that
`dlopen` can initialize the new static TLS memory in all existing threads. A thread list could be
omitted if the loader only allowed zero-initialized TLS segments and didn't reclaim memory on
`dlclose`.

As long as a shared object is one of the initially-loaded modules, a better option is to use
TLSDESC.

[glibc-static-tls-error]: https://www.google.com/search?q=%22dlopen:+cannot+load+any+more+object+with+static+TLS%22

# TLS Descriptors (TLSDESC)

The code fragments above match the "traditional" TLS design from Drepper's document. For the GD and
LD models, there is a newer, more efficient design that uses "TLS descriptors". Each TLS variable
reference has a corresponding descriptor, which contains a resolver function address and an argument
to pass to the resolver.

For example, if we have this C code in a shared object:

```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```

The toolchain generates code like this:

```cpp
struct TlsDescriptor { // NB: arm32 reverses these fields
  long (*resolver)(long);
  long arg;
};

char* get_tls_var() {
  // allocated in the .got, uses a dynamic relocation
  static TlsDescriptor desc = R_TLS_DESC(tls_var);
  return (char*)__get_tls() + desc.resolver(desc.arg);
}
```

The dynamic loader fills in the TLS descriptors. For a reference to a variable allocated in the
Static TLS Block, it can use a simple resolver function:

```cpp
long static_tls_resolver(long arg) {
  return arg;
}
```

The loader writes `tls_var@TPOFF` into the descriptor's argument.

To support modules loaded with `dlopen`, the loader must use a resolver function that calls
`__tls_get_addr`. In principle, this simple implementation would work:

```cpp
long dynamic_tls_resolver(TlsIndex* arg) {
  return (long)__tls_get_addr(arg) - (long)__get_tls();
}
```

There are optimizations that complicate the design a little:
 * Unlike `__tls_get_addr`, the resolver function has a special calling convention that preserves
   almost all registers, reducing register pressure in the caller
   ([example](https://godbolt.org/g/gywcxk)).
 * In general, the resolver function must call `__tls_get_addr`, so it must save and restore all
   registers.
 * To keep the fast path fast, the resolver inlines the fast path of `__tls_get_addr`.
 * By storing the module's initial generation alongside the TlsIndex, the resolver function doesn't
   need to use an atomic or synchronized access of the global TLS generation counter.

The resolver must be written in assembly, but in C, the function looks like so:

```cpp
struct TlsDescDynamicArg {
  unsigned long first_generation;
  TlsIndex idx;
};

struct TlsDtv { // DTV == dynamic thread vector
  unsigned long generation;
  char* modules[];
};

long dynamic_tls_resolver(TlsDescDynamicArg* arg) {
  TlsDtv* dtv = __get_dtv();
  char* addr;
  if (dtv->generation >= arg->first_generation &&
      dtv->modules[arg->idx.module] != nullptr) {
    addr = dtv->modules[arg->idx.module] + arg->idx.offset;
  } else {
    addr = __tls_get_addr(&arg->idx);
  }
  return (long)addr - (long)__get_tls();
}
```

The loader needs to allocate a table of `TlsDescDynamicArg` objects for each TLS module with dynamic
TLSDESC relocations.

The static linker can still relax a TLSDESC-based access to an IE/LE access.

The traditional TLS design is implemented everywhere, but the TLSDESC design has less toolchain
support:
 * GCC and the BFD linker support both designs on all supported Android architectures (arm32, arm64,
   x86, x86-64).
 * GCC can select the design when it is invoked, using `-mtls-dialect=<dialect>` (`trad`-vs-`desc`
   on arm64, otherwise `gnu`-vs-`gnu2`). Clang always uses the default mode.
 * GCC and Clang default to TLSDESC on arm64 and the traditional design on other architectures.
 * Gold and LLD support for TLSDESC is spotty (except when targeting arm64).

# Linker Relaxations

The (static) linker frequently has more information about the location of a referenced TLS variable
than the compiler, so it can "relax" TLS accesses to more efficient models. For example, if an
object file compiled with `-fpic` is linked into an executable, the linker could relax GD accesses
to IE or LE. To relax a TLS access, the linker looks for an expected sequence of instructions and
static relocations, then replaces the sequence with a different one of equal size. It may need to
add or remove no-op instructions.

## Current Support for GD->LE Relaxations Across Linkers

Versions tested:
 * BFD and Gold linkers: version 2.30
 * LLD version 6.0.0 (upstream)

Linker support for GD->LE relaxation with `-mtls-dialect=gnu/trad` (traditional):

Architecture    | BFD | Gold | LLD
--------------- | --- | ---- | ---
arm32           | no  | no   | no
arm64 (unusual) | yes | yes  | no
x86             | yes | yes  | yes
x86_64          | yes | yes  | yes

Linker support for GD->LE relaxation with `-mtls-dialect=gnu2/desc` (TLSDESC):

Architecture          | BFD | Gold               | LLD
--------------------- | --- | ------------------ | ------------------
arm32 (experimental)  | yes | unsupported relocs | unsupported relocs
arm64                 | yes | yes                | yes
x86 (experimental)    | yes | yes                | unsupported relocs
x86_64 (experimental) | yes | yes                | unsupported relocs

arm32 linkers can't relax traditional TLS accesses. BFD can relax an arm32 TLSDESC access, but LLD
can't link code using TLSDESC at all, except on arm64, where it's used by default.

# dlsym

Calling `dlsym` on a TLS variable returns the address of the current thread's variable.

# Debugger Support

## gdb

gdb uses a libthread_db plugin library to retrieve thread-related information from a target. This
library is typically a shared object, but for Android, we link our own `libthread_db.a` into
gdbserver. We will need to implement at least 2 APIs in `libthread_db.a` to find TLS variables, and
gdb provides APIs for looking up symbols, reading or writing memory, and retrieving the current
thread pointer (e.g. `ps_get_thread_area`).
 * Reference: [gdb_proc_service.h]: APIs gdb provides to libthread_db
 * Reference: [Currently unimplemented TLS functions in Android's libthread_db][libthread_db.c]

[gdb_proc_service.h]: https://android.googlesource.com/toolchain/gdb/+/a7e49fd02c21a496095c828841f209eef8ae2985/gdb-8.0.1/gdb/gdb_proc_service.h#41
[libthread_db.c]: https://android.googlesource.com/platform/ndk/+/e1f0ad12fc317c0ca3183529cc9625d3f084d981/sources/android/libthread_db/libthread_db.c#115

## LLDB

LLDB more-or-less implemented Linux TLS debugging in [r192922][rL192922] ([D1944]) for x86 and
x86-64. [arm64 support came later][D5073]. However, the Linux TLS functionality no longer does
anything: the `GetThreadPointer` function is no longer implemented. Code for reading the thread
pointer was removed in [D10661] ([this function][r240543]). (arm32 was apparently never supported.)

[rL192922]: https://reviews.llvm.org/rL192922
[D1944]: https://reviews.llvm.org/D1944
[D5073]: https://reviews.llvm.org/D5073
[D10661]: https://reviews.llvm.org/D10661
[r240543]: https://github.com/llvm-mirror/lldb/commit/79246050b0f8d6b54acb5366f153d07f235d2780#diff-52dee3d148892cccfcdab28bc2165548L962

## Threading Library Metadata

Both debuggers need metadata from the threading library (`libc.so` / `libpthread.so`) to find TLS
variables. From [LLDB r192922][rL192922]'s commit message:

> ... All OSes use basically the same algorithm (a per-module lookup table) as detailed in Ulrich
> Drepper's TLS ELF ABI document, so we can easily write code to decode it ourselves. The only
> question therefore is the exact field layouts required. Happily, the implementors of libpthread
> expose the structure of the DTV via metadata exported as symbols from the .so itself, designed
> exactly for this kind of thing. So this patch simply reads that metadata in, and re-implements
> libthread_db's algorithm itself. We thereby get cross-platform TLS lookup without either requiring
> third-party libraries, while still being independent of the version of libpthread being used.

LLDB uses these variables:

Name                              | Notes
--------------------------------- | ---------------------------------------------------------------------------------------
`_thread_db_pthread_dtvp`         | Offset from TP to DTV pointer (0 for variant 1, implementation-defined for variant 2)
`_thread_db_dtv_dtv`              | Size of a DTV slot (typically/always sizeof(void*))
`_thread_db_dtv_t_pointer_val`    | Offset within a DTV slot to the pointer to the allocated TLS block (typically/always 0)
`_thread_db_link_map_l_tls_modid` | Offset of a `link_map` field containing the module's 1-based TLS module ID

The metadata variables are local symbols in glibc's `libpthread.so` symbol table (but not its
dynamic symbol table). Debuggers can access them, but applications can't.

The debugger lookup process is straightforward:
 * Find the `link_map` object and module-relative offset for a TLS variable.
 * Use `_thread_db_link_map_l_tls_modid` to find the TLS variable's module ID.
 * Read the target thread pointer.
 * Use `_thread_db_pthread_dtvp` to find the thread's DTV.
 * Use `_thread_db_dtv_dtv` and `_thread_db_dtv_t_pointer_val` to find the desired module's block
   within the DTV.
 * Add the module-relative offset to the module pointer.

This process doesn't appear robust in the face of lazy DTV initialization -- presumably it could
read past the end of an out-of-date DTV or access an unloaded module. To be robust, it needs to
compare a module's initial generation count against the DTV's generation count. (XXX: Does gdb have
these sorts of problems with glibc's libpthread?)

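The lookup steps can be sketched against a simulated inferior. In the sketch below, target memory
is a hypothetical `ToyMemory` map from address to pointer-sized word, `ThreadDbMetadata` holds the
four values a debugger would read from the `_thread_db_*` symbols, and the DTV is assumed to be
indexed directly by the 1-based module ID (i.e. slot 0 is not a module). Like the process described
above, it performs no generation check, so it shares the same robustness problem.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for reading pointer-sized words from the inferior.
using ToyMemory = std::map<std::uint64_t, std::uint64_t>;

// Values a debugger would obtain from the `_thread_db_*` metadata symbols.
struct ThreadDbMetadata {
  std::uint64_t pthread_dtvp;          // offset from TP to the DTV pointer
  std::uint64_t dtv_slot_size;         // size of one DTV slot
  std::uint64_t dtv_pointer_val;       // offset of the block pointer in a slot
  std::uint64_t link_map_l_tls_modid;  // offset of the module ID in link_map
};

// Returns the address of a TLS variable for the thread whose TP is `tp`.
std::uint64_t find_tls_var(const ToyMemory& mem, const ThreadDbMetadata& md,
                           std::uint64_t tp, std::uint64_t link_map,
                           std::uint64_t var_offset) {
  // Module ID of the object that defines the variable.
  std::uint64_t modid = mem.at(link_map + md.link_map_l_tls_modid);
  // The thread's DTV, found via the thread pointer.
  std::uint64_t dtv = mem.at(tp + md.pthread_dtvp);
  // The module's TLS block pointer within the DTV (assumed indexed by ID).
  std::uint64_t block = mem.at(dtv + modid * md.dtv_slot_size + md.dtv_pointer_val);
  return block + var_offset;
}
```

A real debugger would replace `ToyMemory::at` with reads through ptrace or a remote protocol and
would need to handle a NULL block pointer for a module the thread has not touched yet.
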
## Reading the Thread Pointer with Ptrace

There are ptrace interfaces for reading the thread pointer for each of arm32, arm64, x86, and x86-64
(XXX: check 32-vs-64-bit for inferiors, debuggers, and kernels):
 * arm32: `PTRACE_GET_THREAD_AREA`
 * arm64: `PTRACE_GETREGSET`, `NT_ARM_TLS`
 * x86_32: `PTRACE_GET_THREAD_AREA`
 * x86_64: use `PTRACE_PEEKUSER` to read the `{fs,gs}_base` fields of `user_regs_struct`

# C/C++ Specifiers

C/C++ TLS variables are declared with a specifier:

Specifier       | Notes
--------------- | -----------------------------------------------------------------------------------------------------------------------------
`__thread`      |  - non-standard, but ubiquitous in GCC and Clang<br/> - cannot have dynamic initialization or destruction
`_Thread_local` |  - a keyword standardized in C11<br/> - cannot have dynamic initialization or destruction
`thread_local`  |  - C11: a macro for `_Thread_local` via `threads.h`<br/> - C++11: a keyword, allows dynamic initialization and/or destruction

The dynamic initialization and destruction of C++ `thread_local` variables is layered on top of ELF
TLS (or emutls), so this design document mostly ignores it. Like emutls, ELF TLS variables either
have a static initializer or are zero-initialized.

Aside: Because a `__thread` variable cannot have dynamic initialization, `__thread` is more
efficient in C++ than `thread_local` when the compiler cannot see the definition of a declared TLS
variable. The compiler assumes the variable could have a dynamic initializer and generates code, at
each access, to call a function to initialize the variable.

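The difference between the specifiers can be seen in a short C++ sketch. The names below
(`g_init_calls`, `make_initial_value`, `dynamic_tls`, `constant_tls`) are invented for the example.
The `thread_local` variable has a dynamic initializer that runs once in each thread that uses it,
whereas the `__thread` variable is restricted to a constant initializer.

```cpp
#include <atomic>
#include <thread>

// Counts how many times the dynamic initializer has run, across all threads.
std::atomic<int> g_init_calls{0};

int make_initial_value() {
  g_init_calls.fetch_add(1);
  return 42;
}

// C++11 thread_local with a dynamic initializer: the compiler emits a guard
// and an initialization call that runs once per thread using the variable.
thread_local int dynamic_tls = make_initial_value();

// __thread only accepts a constant initializer, so no guard is needed.
// (`__thread int bad = make_initial_value();` would be rejected.)
__thread int constant_tls = 42;

int read_dynamic_tls() { return dynamic_tls; }
int read_constant_tls() { return constant_tls; }
```

Each additional thread that calls `read_dynamic_tls()` increments `g_init_calls` by one, while
`read_constant_tls()` never triggers any initialization code.
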
# Graceful Failure on Old Platforms

ELF TLS isn't implemented on older Android platforms, so dynamic executables and shared objects
using it generally won't work on them. Ideally, the older platforms would reject these binaries
rather than experience memory corruption at run-time.

Static executables aren't a problem--the necessary runtime support is part of the executable, so TLS
just works.

XXX: Shared objects are less of a problem.
 * On arm32, x86, and x86_64, the loader [should reject a TLS relocation]. (XXX: I haven't verified
   this.)
 * On arm64, the primary TLS relocation (R_AARCH64_TLSDESC) is [confused with an obsolete
   R_AARCH64_TLS_DTPREL32 relocation][R_AARCH64_TLS_DTPREL32] and is [quietly ignored].
 * Android P [added compatibility checks] for TLS symbols and `DT_TLSDESC_{GOT|PLT}` entries.

XXX: A dynamic executable using ELF TLS would have a PT_TLS segment and no other distinguishing
marks, so running it on an older platform would result in memory corruption. Should we add something
to these executables that only newer platforms recognize? (e.g. maybe an entry in .dynamic, a
reference to a symbol only a new libc.so has...)

[should reject a TLS relocation]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2852
[R_AARCH64_TLS_DTPREL32]: https://android-review.googlesource.com/c/platform/bionic/+/723696
[quietly ignored]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2784
[added compatibility checks]: https://android-review.googlesource.com/c/platform/bionic/+/648760

## Loader/libc Communication

The loader exposes a list of TLS modules ([`struct TlsModules`][TlsModules]) to `libc.so` using the
`__libc_shared_globals` variable (see `tls_modules()` in [linker_tls.cpp][tls_modules-linker] and
[elf_tls.cpp][tls_modules-libc]). `__tls_get_addr` in libc.so acquires the `TlsModules::mutex` and
iterates its module list to lazily allocate and free TLS blocks.

[TlsModules]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.h#53
[tls_modules-linker]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/linker/linker_tls.cpp#45
[tls_modules-libc]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.cpp#49

## TLS Allocator

bionic currently allocates a `pthread_internal_t` object and static TLS in a single mmap'ed
region, along with a thread's stack if it needs one allocated. It doesn't place TLS memory on a
preallocated stack (either the main thread's stack or one provided with `pthread_attr_setstack`).

The DTV and blocks for dlopen'ed modules are instead allocated using the Bionic loader's
`LinkerMemoryAllocator`, adapted to avoid the STL and to provide `memalign`.
The implementation tries to achieve async-signal safety by blocking signals and
acquiring a lock.

There are three "entry points" to dynamically locate a TLS variable's address:
 * libc.so: `__tls_get_addr`
 * loader: TLSDESC dynamic resolver
 * loader: dlsym

The loader's entry points need to call `__tls_get_addr`, which needs to allocate memory. Currently,
the implementation uses a [special function pointer] to call libc.so's `__tls_get_addr` from the loader.
(This should probably be removed.)

The implementation currently allows for arbitrarily-large TLS variable alignment. IIRC, different
implementations (glibc, musl, FreeBSD) vary in their level of respect for TLS alignment. It looks
like the Bionic loader ignores segments' alignment and aligns loaded libraries to 256 KiB. See
`ReserveAligned`.

[special function pointer]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/private/bionic_globals.h#52

## Async-Signal Safety

The implementation's `__tls_get_addr` might be async-signal safe. Making it AS-safe is a good idea if
it's feasible. musl's function is AS-safe, but glibc's isn't (or wasn't). Google had a patch to make
glibc AS-safe back in 2012-2013. See:
 * https://sourceware.org/glibc/wiki/TLSandSignals
 * https://sourceware.org/ml/libc-alpha/2012-06/msg00335.html
 * https://sourceware.org/ml/libc-alpha/2013-09/msg00563.html

## Out-of-Memory Handling (abort)

The implementation lazily allocates TLS memory for dlopen'ed modules (see `__tls_get_addr`), and an
out-of-memory error on a TLS access aborts the process. musl, on the other hand, preallocates TLS
memory on `pthread_create` and `dlopen`, so either function can return out-of-memory. Both functions
probably need to acquire the same lock.

Maybe Bionic should do the same as musl? Perhaps musl's robustness argument doesn't hold for Bionic,
though, because Bionic (at least the linker) probably already aborts on OOM. musl doesn't support
`dlclose`/unloading, so it might have an easier time.

On the other hand, maybe lazy allocation is a feature, because not all threads will use a dlopen'ed
solib's TLS variables. Drepper makes this argument in his TLS document:

> In addition the run-time support should avoid creating the thread-local storage if it is not
> necessary. For instance, a loaded module might only be used by one thread of the many which make
> up the process. It would be a waste of memory and time to allocate the storage for all threads. A
> lazy method is wanted. This is not much extra burden since the requirement to handle dynamically
> loaded objects already requires recognizing storage which is not yet allocated. This is the only
> alternative to stopping all threads and allocating storage for all threads before letting them run
> again.

FWIW: emutls also aborts on out-of-memory.

## ELF TLS Not Usable in libc Itself

The dynamic loader currently can't use ELF TLS, so any part of libc linked into the loader (i.e.
most of it) also can't use ELF TLS. It might be possible to lift this restriction, perhaps with
specialized `__tls_get_addr` and TLSDESC resolver functions.

# Open Issues

## Bionic Memory Layout Conflicts with Common TLS Layout

Bionic already allocates thread-specific data in a way that conflicts with TLS variants 1 and 2:
![Bionic TLS Layout in Android P](img/bionic-tls-layout-in-p.png)

TLS variant 1 allocates everything after the TP to ELF TLS (except the first two words), and variant
2 allocates everything before the TP. Bionic currently allocates memory before and after the TP to
the `pthread_internal_t` struct.

The `bionic_tls.h` header is marked with a warning:

```cpp
/** WARNING WARNING WARNING
 **
 ** This header file is *NOT* part of the public Bionic ABI/API
 ** and should not be used/included by user-serviceable parts of
 ** the system (e.g. applications).
 **
 ** It is only provided here for the benefit of the system dynamic
 ** linker and the OpenGL sub-system (which needs to access the
 ** pre-allocated slot directly for performance reason).
 **/
```

There are issues with rearranging this memory:

 * `TLS_SLOT_STACK_GUARD` is used for `-fstack-protector`. The location (word #5) was initially used
   by GCC on x86 (and x86-64), where it is compatible with x86's TLS variant 2. We [modified Clang
   to use this slot for arm64 in 2016][D18632], though, and the slot isn't compatible with ARM's
   variant 1 layout. This change shipped in NDK r14, and the NDK's build systems (ndk-build and the
   CMake toolchain file) enable `-fstack-protector-strong` by default.

 * `TLS_SLOT_TSAN` is used for more than just TSAN -- it's also used by [HWASAN and
   Scudo](https://reviews.llvm.org/D53906#1285002).

 * The Go runtime allocates a thread-local "g" variable on Android by creating a pthread key and
   searching for its TP-relative offset, which it assumes is nonnegative:
    * On arm32/arm64, it creates a pthread key, sets it to a magic value, then scans forward from
      the thread pointer looking for it. [The scan count was bumped to 384 to fix a reported
      breakage happening with Android N.](https://go-review.googlesource.com/c/go/+/38636) (XXX: I
      suspect the actual platform breakage happened with Android M's [lock-free pthread key
      work][bionic-lockfree-keys].)
    * On x86/x86-64, it uses a fixed offset from the thread pointer (TP+0xf8 or TP+0x1d0) and
      creates pthread keys until one of them hits the fixed offset.
    * CLs:
       * arm32: https://codereview.appspot.com/106380043
       * arm64: https://go-review.googlesource.com/c/go/+/17245
       * x86: https://go-review.googlesource.com/c/go/+/16678
       * x86-64: https://go-review.googlesource.com/c/go/+/15991
    * Moving the pthread keys before the thread pointer breaks Go-based apps.
    * It's unclear how many Android apps use Go. There are at least two with 1,000,000+ installs.
    * [Some motivation for Go's design][golang-post], [runtime/HACKING.md][go-hacking]
    * [On x86/x86-64 Darwin, Go uses a TLS slot reserved for both Go and Wine][go-darwin-x86] (On
      [arm32][go-darwin-arm32]/[arm64][go-darwin-arm64] Darwin, Go scans for pthread keys like it
      does on Android.)

 * Android's "native bridge" system allows the Zygote to load an app solib of a non-native ABI. (For
   example, it could be used to load an arm32 solib into an x86 Zygote.) The solib is translated
   into the host architecture. TLS accesses in the app solib (whether ELF TLS, Bionic slots, or
   `pthread_internal_t` fields) become host accesses. Laying out TLS memory differently across
   architectures could complicate this translation.

 * A `pthread_t` is practically just a `pthread_internal_t*`, and some apps directly access the
   `pthread_internal_t::tid` field. Past examples: http://b/17389248, [aosp/107467]. Reorganizing
   the initial `pthread_internal_t` fields could break those apps.

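The forward scan that Go performs on arm32/arm64 can be illustrated on a simulated buffer. This is
not Go's code: the magic value, the `find_slot_offset` name, and the use of a plain array in place
of real thread memory are all assumptions made for the sketch. Only the 384-word scan limit comes
from the description above.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kMagic = 0x23581321;  // arbitrary marker value
constexpr std::size_t kMaxScanWords = 384;     // scan limit described above

// Scans forward from `tp` for a word equal to `magic`. Returns the word
// index relative to `tp`, or -1 if it is not found within the scan limit.
// `available_words` bounds the scan so the toy never reads out of range.
std::ptrdiff_t find_slot_offset(const std::uintptr_t* tp,
                                std::size_t available_words,
                                std::uintptr_t magic) {
  std::size_t limit = available_words < kMaxScanWords ? available_words : kMaxScanWords;
  for (std::size_t i = 0; i < limit; ++i) {
    if (tp[i] == magic) return static_cast<std::ptrdiff_t>(i);
  }
  return -1;
}
```

If the slot holding the magic value sits at a lower address than the simulated thread pointer, the
scan returns -1, which mirrors why moving the pthread keys before the TP breaks this scheme.
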
638It seems easy to fix the incompatibility for variant 2 (x86 and x86_64) by splitting out the Bionic
639slots into a new data structure. Variant 1 is a harder problem.
640
641The TLS prototype used a patched LLD that uses a variant 1 TLS layout with a 16-word TCB
642on all architectures.
643
644Aside: gcc's arm64ilp32 target uses a 32-bit unsigned offset for a TLS IE access
645(https://godbolt.org/z/_NIXjF). If Android ever supports this target, and in a configuration with
646variant 2 TLS, we might need to change the compiler to emit a sign-extending load.

[D18632]: https://reviews.llvm.org/D18632
[bionic-lockfree-keys]: https://android-review.googlesource.com/c/platform/bionic/+/134202
[golang-post]: https://groups.google.com/forum/#!msg/golang-nuts/EhndTzcPJxQ/i-w7kAMfBQAJ
[go-hacking]: https://github.com/golang/go/blob/master/src/runtime/HACKING.md
[go-darwin-x86]: https://github.com/golang/go/issues/23617
[go-darwin-arm32]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm.c
[go-darwin-arm64]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm64.c
[aosp/107467]: https://android-review.googlesource.com/c/platform/bionic/+/107467

### Workaround: Use Variant 2 on arm32/arm64

Pros: simplifies Bionic

Cons:
 * arm64: requires either subtle reinterpretation of a TLS relocation or addition of a new
   relocation
 * arm64: a new TLS relocation reduces compiler/assembler compatibility with non-Android toolchains

The point of variant 2 was backwards-compatibility, and ARM Android needs to remain
backwards-compatible, so we could use variant 2 for ARM. Problems:

 * When linking an executable, the static linker needs to know how TLS is allocated because it
   writes TP-relative offsets for IE/LE-model accesses. Clang doesn't tell the linker to target
   Android, so it could pass a `--tls-variant2` flag to configure lld.

 * On arm64, there are different sets of static LE relocations accommodating different ranges of
   offsets from TP:

   Size | TP offset range   | Static LE relocation types
   ---- | ----------------- | ---------------------------------------
   12   | 0 <= x < 2^12     | `R_AARCH64_TLSLE_ADD_TPREL_LO12`
   "    | "                 | `R_AARCH64_TLSLE_LDST8_TPREL_LO12`
   "    | "                 | `R_AARCH64_TLSLE_LDST16_TPREL_LO12`
   "    | "                 | `R_AARCH64_TLSLE_LDST32_TPREL_LO12`
   "    | "                 | `R_AARCH64_TLSLE_LDST64_TPREL_LO12`
   "    | "                 | `R_AARCH64_TLSLE_LDST128_TPREL_LO12`
   16   | -2^16 <= x < 2^16 | `R_AARCH64_TLSLE_MOVW_TPREL_G0`
   24   | 0 <= x < 2^24     | `R_AARCH64_TLSLE_ADD_TPREL_HI12`
   "    | "                 | `R_AARCH64_TLSLE_ADD_TPREL_LO12_NC`
   "    | "                 | `R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC`
   "    | "                 | `R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC`
   "    | "                 | `R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC`
   "    | "                 | `R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC`
   "    | "                 | `R_AARCH64_TLSLE_LDST128_TPREL_LO12_NC`
   32   | -2^32 <= x < 2^32 | `R_AARCH64_TLSLE_MOVW_TPREL_G1`
   "    | "                 | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC`
   48   | -2^48 <= x < 2^48 | `R_AARCH64_TLSLE_MOVW_TPREL_G2`
   "    | "                 | `R_AARCH64_TLSLE_MOVW_TPREL_G1_NC`
   "    | "                 | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC`

   GCC for arm64 defaults to the 24-bit model and has an `-mtls-size=SIZE` option for setting other
   supported sizes. (It supports 12, 24, 32, and 48.) Clang has only implemented the 24-bit model,
   but that could change. (Clang [briefly used][D44355] load/store relocations, but the change was
   reverted because no linker supported them: [BFD], [Gold], [LLD].)

   The 16-, 32-, and 48-bit models use a `movn` or `movz` instruction to set the highest 16 bits to
   a positive or negative value, then `movk` to set the remaining 16-bit chunks. In principle, these
   relocations should be able to accommodate a negative TP offset.

   The 24-bit model uses `add` to set the high 12 bits, then places the low 12 bits into another
   `add` or a load/store instruction.

Maybe we could modify the `R_AARCH64_TLSLE_ADD_TPREL_HI12` relocation to allow a negative TP offset
by converting the relocated `add` instruction to a `sub`. Alternately, we could add a new
`R_AARCH64_TLSLE_SUB_TPREL_HI12` relocation, and Clang would use a different TLS LE instruction
sequence when targeting Android/arm64.

 * LLD's arm64 relaxations from GD and IE to LE would need to use `movn` instead of `movk` for
   Android.

 * Binaries linked with the flag crash on non-Bionic, and binaries without the flag crash on Bionic.
   We might want to mark the binaries somehow to indicate the non-standard TLS ABI. Suggestion:
    * Use an `--android-tls-variant2` flag (or `--bionic-tls-variant2`, we're trying to make [Bionic
      run on the host](http://b/31559095))
    * Add a `PT_ANDROID_TLS_TPOFF` segment?
    * Add a [`.note.gnu.property`](https://reviews.llvm.org/D53906#1283425) with a
      "`GNU_PROPERTY_TLS_TPOFF`" property value?

[D44355]: https://reviews.llvm.org/D44355
[BFD]: https://sourceware.org/bugzilla/show_bug.cgi?id=22970
[Gold]: https://sourceware.org/bugzilla/show_bug.cgi?id=22969
[LLD]: https://bugs.llvm.org/show_bug.cgi?id=36727
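
As a rough illustration of why the 24-bit LE model described above cannot express a negative TP offset, the sketch below splits an offset into the two 12-bit immediates carried by the `HI12` and `LO12_NC` relocations. `TprelImm`, `FitsTprel24`, and `SplitTprel24` are hypothetical names for this example only, not linker APIs.

```cpp
#include <cassert>
#include <cstdint>

// The two unsigned 12-bit immediates used by the 24-bit LE sequence.
struct TprelImm {
  std::uint32_t hi12;  // goes into the first `add` (shifted left by 12)
  std::uint32_t lo12;  // goes into the second `add` or a load/store
};

// True if the offset lies in the 24-bit model's range, 0 <= x < 2^24.
inline bool FitsTprel24(std::int64_t tp_offset) {
  return tp_offset >= 0 && tp_offset < (std::int64_t{1} << 24);
}

// Splits an in-range TP offset into its high and low 12-bit halves.
inline TprelImm SplitTprel24(std::int64_t tp_offset) {
  assert(FitsTprel24(tp_offset));
  const auto x = static_cast<std::uint32_t>(tp_offset);
  return {(x >> 12) & 0xfff, x & 0xfff};
}
```

Both immediates are unsigned, so an offset such as -16 fails `FitsTprel24`. That is the gap that converting `add` to `sub`, or a new `SUB` relocation, would have to fill.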

### Workaround: Reserve an Extra-Large TCB on ARM

Pros: Minimal linker change, no change to TLS relocations.

Cons: The reserved amount becomes an arbitrary but immutable part of the Android ABI.

Add an lld option: `--android-tls[-tcb=SIZE]`

As with the first workaround, we'd probably want to mark the binary to indicate the non-standard
TP-to-TLS-segment offset.

Reservation amount:
 * We would reserve at least 6 words to cover the stack guard
 * Reserving 16 covers all the existing Bionic slots and gives a little room for expansion. (If we
   ever needed more than 16 slots, we could allocate the space before TP.)
 * 16 isn't enough for the pthread keys, so the Go runtime is still a problem.
 * Reserving 138 words is enough for existing slots and pthread keys.
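
To show how the reserved size would feed into the offsets the linker writes, here is a minimal sketch of the TP-to-TLS-segment offset under a variant 1 layout. It assumes the executable's TLS segment starts at the first suitably aligned address after the TCB and ignores any `p_vaddr` misalignment; `AlignUp` and `ExeTlsOffsetFromTp` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `value` up to a multiple of `align`, which must be a power of two.
inline std::size_t AlignUp(std::size_t value, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (value + align - 1) & ~(align - 1);
}

// Offset in bytes from TP to the executable's TLS segment when the TCB
// occupies `tcb_words` words and the segment needs `tls_align` alignment.
inline std::size_t ExeTlsOffsetFromTp(std::size_t tcb_words, std::size_t word_size,
                                      std::size_t tls_align) {
  return AlignUp(tcb_words * word_size, tls_align);
}
```

With 8-byte words and 16-byte alignment, a 2-word TCB gives an offset of 16 bytes, while a 16-word TCB gives 128 bytes. Whichever value is chosen ends up baked into every LE access in every linked executable, which is why it would become part of the ABI.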

### Workaround: Use Variant 1 Everywhere with an Extra-Large TCB

Pros:
 * memory layout is the same on all architectures, avoids native bridge complications
 * x86/x86-64 relocations probably handle positive offsets without issue

Cons:
 * The reserved amount is still arbitrary.

### Workaround: No LE Model in Android Executables

Pros:
 * Keeps options open. We can allow LE later if we want.
 * Bionic's existing memory layout doesn't change, and arm32 and 32-bit x86 have the same layout
 * Fixes everything but static executables

Cons:
 * more intrusive toolchain changes (affects both Clang and LLD)
 * statically-linked executables still need another workaround
 * somewhat larger/slower executables (they must use IE, not LE)

The layout conflict is apparently only a problem because an executable assumes that its TLS segment
is located at a statically-known offset from the TP (i.e. it uses the LE model). An initially-loaded
shared object can still use the efficient IE access model, but its TLS segment offset is known at
load-time, not link-time. If we can guarantee that Android's executables also use the IE model, not
LE, then the Bionic loader can place the executable's TLS segment at any offset from the TP, leaving
the existing thread-specific memory layout untouched.
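
For reference, GCC and Clang already let source code request a TLS model per variable via the `tls_model` attribute (or for a whole translation unit via `-ftls-model=initial-exec`). The sketch below uses that attribute. `ie_counter` and `AddToIeCounter` are hypothetical names, and the attribute only affects what the compiler emits: a stock linker may still relax the access to LE in an executable, which is exactly what this workaround would have to prevent.

```cpp
#include <cassert>

// Ask the compiler to use the initial-exec model for this variable, so the
// TP-relative offset is loaded from the GOT rather than encoded in the code.
thread_local int ie_counter __attribute__((tls_model("initial-exec"))) = 0;

// Adds `delta` to the calling thread's copy and returns the new value.
inline int AddToIeCounter(int delta) {
  assert(delta >= 0);  // this sketch only counts upward
  ie_counter += delta;
  return ie_counter;
}
```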

This workaround doesn't help with statically-linked executables, but they're probably less of a
problem, because the linker and `libc.a` are usually packaged together.

A likely problem: LD is normally relaxed to LE, not to IE. We'd either have to disable LD usage in
the compiler (bad for performance) or add LD->IE relaxation. This relaxation requires that IE code
sequences be no larger than LD code sequences, which may not be the case on some architectures.
(XXX: In some past testing, it looked feasible for TLSDESC but not the traditional design.)

To implement:
 * Clang would need to stop generating LE accesses.
 * LLD would need to relax GD and LD to IE instead of LE.
 * LLD should abort if it sees a TLS LE relocation.
 * LLD must not statically resolve an executable's IE relocation in the GOT. (It might assume that
   it knows its value.)
 * Perhaps LLD should mark executables specially, because a normal ELF linker's output would quietly
   trample on `pthread_internal_t`. We need something like `DF_STATIC_TLS`, but instead of
   indicating IE in an solib, we want to indicate the lack of LE in an executable.

### (Non-)workaround for Go: Allocate a Slot with Go's Magic Values

The Go runtime allocates its thread-local "g" variable by searching for a hard-coded magic constant
(`0x23581321` for arm32 and `0x23581321345589` for arm64). As long as it finds its constant at a
small positive offset from TP (within the first 384 words), it will think it has found the pthread
key it allocated.
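
The search described above can be modeled roughly as follows. This is a simplified sketch over a plain array standing in for the memory after TP, not Go's actual runtime code; `FindMagicSlot` and `kScanWords` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint64_t kGoMagicArm64 = 0x23581321345589;  // value quoted above
constexpr std::size_t kScanWords = 384;                    // scan window in words

// Returns the index of the first word equal to `magic` within the scan
// window, or -1 if no word in the window matches. `words` models the memory
// starting at TP and holds `n` words.
inline std::ptrdiff_t FindMagicSlot(const std::uint64_t* words, std::size_t n,
                                    std::uint64_t magic) {
  assert(words != nullptr);
  const std::size_t limit = n < kScanWords ? n : kScanWords;
  for (std::size_t i = 0; i < limit; ++i) {
    if (words[i] == magic) return static_cast<std::ptrdiff_t>(i);
  }
  return -1;
}
```

Anything holding the magic value inside the window counts as a match, which is why a reserved slot can fool the scan, and anything placed beyond word 384 is never found.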

As a temporary compatibility hack, we might try to keep these programs running by reserving a TLS
slot with this magic value. This hack doesn't appear to work, however. The runtime finds its pthread
key, but apps segfault. Perhaps the Go runtime expects its "g" variable to be zero-initialized ([one
example][go-tlsg-zero]). With this hack, it's never zero, but with its current allocation strategy,
it is typically zero. After [Bionic's pthread key system was rewritten to be
lock-free][bionic-lockfree-keys] for Android M, though, it's not guaranteed, because a key could be
recycled.

[go-tlsg-zero]: https://go.googlesource.com/go/+/5bc1fd42f6d185b8ff0201db09fb82886978908b/src/runtime/asm_arm64.s#980

### Workaround for Go: place pthread keys after the executable's TLS

Most Android executables do not use any `thread_local` variables. In the prototype, with the
AOSP hikey960 build, only `/system/bin/netd` had a TLS segment, and it was only 32 bytes. As long as
`/system/bin/app_process{32,64}` limits its use of TLS memory, then the pthread keys could be
allocated after `app_process`' TLS segment, and Go will still find them.

Go scans 384 words from the thread pointer. If there are at most 16 Bionic slots and 130 pthread
keys (2 words per key), then `app_process` can use at most 108 words of TLS memory.
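
The 108-word figure follows directly from the numbers in the previous paragraph. A small sketch of the arithmetic, with hypothetical constant and function names:

```cpp
#include <cassert>
#include <cstddef>

// Figures quoted in the text above.
constexpr std::size_t kGoScanWords = 384;  // words Go scans upward from TP
constexpr std::size_t kBionicSlots = 16;   // assumed upper bound on Bionic slots
constexpr std::size_t kPthreadKeys = 130;  // number of pthread keys
constexpr std::size_t kWordsPerKey = 2;    // words occupied by each key

// Words of TLS memory left for app_process if every pthread key must stay
// inside the window that Go scans.
inline std::size_t AppProcessTlsBudgetWords() {
  const std::size_t reserved = kBionicSlots + kPthreadKeys * kWordsPerKey;
  assert(reserved <= kGoScanWords);  // otherwise the keys cannot fit at all
  return kGoScanWords - reserved;
}
```

That is 384 - (16 + 130 * 2) = 108 words. Any growth in the slot count or key count would shrink the budget by the same amount.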

Drawback: In principle, this might make pthread key accesses slower, because Bionic can't assume
that pthread keys are at a fixed offset from the thread pointer anymore. It must load an offset from
somewhere (a global variable, another TLS slot, ...). `__get_thread()` already uses a TLS slot to
find `pthread_internal_t`, though, rather than assume a fixed offset. (XXX: I think it could be
optimized.)

## TODO: Memory Layout Querying APIs (Proposed)

 * https://sourceware.org/glibc/wiki/ThreadPropertiesAPI
 * http://b/30609580

## TODO: Sanitizers

XXX: Maybe a sanitizer would want to intercept allocations of TLS memory, and that could be hard if
the loader is allocating it.
 * It looks like glibc's ld.so re-relocates itself after loading a program, so a program's symbols
   can interpose calls in the loader: https://sourceware.org/ml/libc-alpha/2014-01/msg00501.html

## TODO: Other

Missing:
 * `dlsym` of a TLS variable
 * debugger support

# References

General (and x86/x86-64):
 * Ulrich Drepper's TLS document, ["ELF Handling For Thread-Local Storage."][drepper] Describes the
   overall ELF TLS design and ABI details for x86 and x86-64 (as well as several other architectures
   that Android doesn't target).
 * Alexandre Oliva's TLSDESC proposal with details for x86 and x86-64: ["Thread-Local Storage
   Descriptors for IA32 and AMD64/EM64T."][tlsdesc-x86]
 * [x86 and x86-64 SystemV psABIs][psabi-x86].

arm32:
 * Alexandre Oliva's TLSDESC proposal for arm32: ["Thread-Local Storage Descriptors for the ARM
   platform."][tlsdesc-arm]
 * ["Addenda to, and Errata in, the ABI for the ARM® Architecture."][arm-addenda] Section 3,
   "Addendum: Thread Local Storage" has details for arm32 non-TLSDESC ELF TLS.
 * ["Run-time ABI for the ARM® Architecture."][arm-rtabi] Documents `__aeabi_read_tp`.
 * ["ELF for the ARM® Architecture."][arm-elf] Lists TLS relocations (traditional and TLSDESC).

arm64:
 * [2015 LLVM bugtracker comment][llvm22408] with an excerpt from an unnamed ARM draft specification
   describing arm64 code sequences necessary for linker relaxation
 * ["ELF for the ARM® 64-bit Architecture (AArch64)."][arm64-elf] Lists TLS relocations (traditional
   and TLSDESC).

[drepper]: https://www.akkadia.org/drepper/tls.pdf
[tlsdesc-x86]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
[psabi-x86]: https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI
[tlsdesc-arm]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
[arm-addenda]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0045e/IHI0045E_ABI_addenda.pdf
[arm-rtabi]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf
[arm-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
[llvm22408]: https://bugs.llvm.org/show_bug.cgi?id=22408#c10
[arm64-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0056b/IHI0056B_aaelf64.pdf