xref: /aosp_15_r20/external/perfetto/docs/design-docs/heapprofd-design.md (revision 6dbdd20afdafa5e3ca9b8809fa73465d530080dc)
1*6dbdd20aSAndroid Build Coastguard Worker# heapprofd: Android Heap Profiler
2*6dbdd20aSAndroid Build Coastguard Worker
3*6dbdd20aSAndroid Build Coastguard Worker_**Status:** COMPLETED **·** fmayer, primiano **·** 2018-06-15_
4*6dbdd20aSAndroid Build Coastguard Worker_**Updated:** 2020-04-20_
5*6dbdd20aSAndroid Build Coastguard Worker
6*6dbdd20aSAndroid Build Coastguard Worker## Objective
7*6dbdd20aSAndroid Build Coastguard WorkerProvide a low-overhead native heap profiling mechanism, with C++ and Java callstack attribution, usable by all processes on an Android system. This includes Java and native services. The mechanism is capable of exporting heap dumps into traces in order to be able to correlate heap information with other activity on the system. This feature was added in the Android 10 release.
8*6dbdd20aSAndroid Build Coastguard Worker
9*6dbdd20aSAndroid Build Coastguard Worker## Overview
10*6dbdd20aSAndroid Build Coastguard Worker![](/docs/images/heapprofd-design/Architecture.png)
11*6dbdd20aSAndroid Build Coastguard Worker
12*6dbdd20aSAndroid Build Coastguard WorkerImplement an out-of-process heap profiler. Do the minimal amount of processing in-line of malloc, and then delegate to a central component for further processing. This introduces a new daemon _heapprofd_.
13*6dbdd20aSAndroid Build Coastguard Worker
14*6dbdd20aSAndroid Build Coastguard WorkerWhen tracing is enabled, either via a system property or a signal delivered to an existing process, a given percentage of malloc calls copies the current call stack into a shared memory buffer that is received by heapprofd. heapprofd uses libunwindstack asynchronously for stack unwinding and symbolization. This information is used to build bookkeeping tables to track live allocations and is ultimately dumped into the Perfetto trace.
15*6dbdd20aSAndroid Build Coastguard Worker
16*6dbdd20aSAndroid Build Coastguard WorkerAll data referenced in this design document was collected on a Pixel 2 on Android P.
17*6dbdd20aSAndroid Build Coastguard Worker
18*6dbdd20aSAndroid Build Coastguard Worker
19*6dbdd20aSAndroid Build Coastguard Worker### Requirements
20*6dbdd20aSAndroid Build Coastguard WorkerThese are the properties that must be fulfilled by a heap profiler for Android:
21*6dbdd20aSAndroid Build Coastguard Worker
22*6dbdd20aSAndroid Build Coastguard Worker**No setup:** A heap profile can be taken with a single command.
23*6dbdd20aSAndroid Build Coastguard Worker
24*6dbdd20aSAndroid Build Coastguard Worker**Profile running apps:** The system can be used to enable profiling of already running apps to get information of memory usage _since the profiling was enabled_ without requiring a restart. This can be useful to track down memory leaks.
25*6dbdd20aSAndroid Build Coastguard Worker
26*6dbdd20aSAndroid Build Coastguard Worker**Attribute to Java methods:** The system can be used to track the native memory usage of Java applications. Allocations on the Java heap are outside of the scope of this work.
27*6dbdd20aSAndroid Build Coastguard Worker
28*6dbdd20aSAndroid Build Coastguard Worker**Zero overhead when disabled:** the system must not incur a performance overhead when it is not enabled.
29*6dbdd20aSAndroid Build Coastguard Worker
30*6dbdd20aSAndroid Build Coastguard Worker**Profile whole system:** the system must be capable of handling the load of profiling all processes running. The sample rate is adjusted to limit the amount of data.
31*6dbdd20aSAndroid Build Coastguard Worker
32*6dbdd20aSAndroid Build Coastguard Worker**Negligible in-process memory overhead:** the system must not hold bookkeeping data in the process in order not to inflate higher-level metrics like PSS.
33*6dbdd20aSAndroid Build Coastguard Worker
34*6dbdd20aSAndroid Build Coastguard Worker**Bounded performance impact:** the device must still be usable for all use-cases.
35*6dbdd20aSAndroid Build Coastguard Worker
36*6dbdd20aSAndroid Build Coastguard Worker
37*6dbdd20aSAndroid Build Coastguard Worker## Detailed Design
38*6dbdd20aSAndroid Build Coastguard Worker
39*6dbdd20aSAndroid Build Coastguard Worker### Enabling profiling
40*6dbdd20aSAndroid Build Coastguard Worker
41*6dbdd20aSAndroid Build Coastguard Worker#### Use case 1: profiling future allocations from a running process
42*6dbdd20aSAndroid Build Coastguard WorkerOne of the real-time signals ([`BIONIC_SIGNAL_PROFILER`](https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/platform/bionic/reserved_signals.h?q=symbol:BIONIC_SIGNAL_PROFILER)) is reserved in libc as a triggering mechanism. In this scenario:
43*6dbdd20aSAndroid Build Coastguard Worker
44*6dbdd20aSAndroid Build Coastguard Worker*   heapprofd sends a RT signal to the target process
45*6dbdd20aSAndroid Build Coastguard Worker*   Upon receipt of the signal, bionic reacts by installing a temporary malloc hook, which in turn spawns a thread to dynamically load libheapprofd.so in the process context. This means heapprofd will not work for statically linked binaries, as they lack the ability to `dlopen`. We can not spawn the thread directly from the signal handler, as `pthread_create` is not async-safe.
46*6dbdd20aSAndroid Build Coastguard Worker*   The initializer in libheapprofd.so is called to take care of the rest (see [client operation](#client-operation-and-in-process-hooks) below)
47*6dbdd20aSAndroid Build Coastguard Worker
48*6dbdd20aSAndroid Build Coastguard Worker
49*6dbdd20aSAndroid Build Coastguard Worker#### Use case 2: profiling a single process from startup
50*6dbdd20aSAndroid Build Coastguard Worker*   heapprofd sets a property of the form libc.debug.heapprofd.argv0 (argv0 being the first argument in `/proc/self/cmdline`, up to the first ":")
51*6dbdd20aSAndroid Build Coastguard Worker*   Native processes: when bionic is initialized checks for the presence of the property and, if found and matches the process name, loads the libheapprofd.so.
52*6dbdd20aSAndroid Build Coastguard Worker*   Managed java processes: zygote calls `mallopt(M_SET_ZYGOTE_CHILD, ...)` in `PreApplicationInit`. In this, Bionic checks for the presence of the property and, if found and matches the process name, loads the libheapprofd.so and continues as above.
53*6dbdd20aSAndroid Build Coastguard Worker
54*6dbdd20aSAndroid Build Coastguard Worker#### Use case 3: profiling the whole system
55*6dbdd20aSAndroid Build Coastguard WorkerA system property `libc.heapprofd.enable` can be set to enable heap profiling on startup. When this property is set every process on startup will load the libheapprofd.so library. The rest is identical to the case above.
56*6dbdd20aSAndroid Build Coastguard Worker
57*6dbdd20aSAndroid Build Coastguard Worker
58*6dbdd20aSAndroid Build Coastguard Worker### Disabling profiling
59*6dbdd20aSAndroid Build Coastguard WorkerDisabling profiling happens simply by virtue of shutting down the sockets from the heapprofd end. Upon send() failure the client will uninstall the hooks (see appendix: thread-safe hooks setup / teardown)
60*6dbdd20aSAndroid Build Coastguard Worker
61*6dbdd20aSAndroid Build Coastguard Worker
62*6dbdd20aSAndroid Build Coastguard Worker### Client operation and in process hooks
63*6dbdd20aSAndroid Build Coastguard WorkerUpon initialization of libheapprofd.so:
64*6dbdd20aSAndroid Build Coastguard Worker
65*6dbdd20aSAndroid Build Coastguard Worker*   The client establishes a connection to the heapprofd daemon through a connected UNIX socket.
66*6dbdd20aSAndroid Build Coastguard Worker*   Upon connection, the daemon will send a packet to the client specifying the profiling configuration (sampling rate; sample all threads / only specific threads; tuning of sampling heuristics). It will also send an FD for the SharedMemoryBuffer used to send samples  (see [wire protocol](heapprofd-wire-protocol.md)).
67*6dbdd20aSAndroid Build Coastguard Worker*   The malloc hooks are installed.
68*6dbdd20aSAndroid Build Coastguard Worker
69*6dbdd20aSAndroid Build Coastguard WorkerUpon each `*alloc()/posix_memalign()` call, the client library will perform some minimal bookkeeping. If the sampling	 rate is hit, it will copy the raw stack, together with a header specifying register state, tid, a global sequence number of the operation, and size of the allocation to the shared memory buffer. It will then send on a control socket to wake up the service.
70*6dbdd20aSAndroid Build Coastguard Worker
71*6dbdd20aSAndroid Build Coastguard WorkerUpon each `free()` call, the client will append the freed address into a global (process-wide) append-only buffer (the buffer is to avoid the overhead of a send() for each free). This buffer of free()s is sent to the heapprofd daemon when the fixed-size buffer is full or after a preset number of operations. This also includes a global sequence number for the operation.
72*6dbdd20aSAndroid Build Coastguard Worker
73*6dbdd20aSAndroid Build Coastguard WorkerIf the send() fails because the heapprofd has shut down the socket, voluntarily (graceful disabling) or involuntarily (has crashed) the client will teardown the hooks and disabling any profiling operation.
74*6dbdd20aSAndroid Build Coastguard Worker
75*6dbdd20aSAndroid Build Coastguard Worker
76*6dbdd20aSAndroid Build Coastguard Worker### Service operation
77*6dbdd20aSAndroid Build Coastguard Worker![](/docs/images/heapprofd-design/shmem-detail.png)
78*6dbdd20aSAndroid Build Coastguard Worker
79*6dbdd20aSAndroid Build Coastguard WorkerThe unwinder thread read the client's shared memory buffers and handle the samples received. The result of the unwinding is then enqueued using a PostTask for the main thread to do the accounting. A queue-based model between the threads is chosen because it makes synchronization easier. No synchronization is needed at all in the main thread, as the bookkeeping data will only be accessed by it.
80*6dbdd20aSAndroid Build Coastguard Worker
81*6dbdd20aSAndroid Build Coastguard WorkerIf the sample is a malloc, the stack is unwound and the resulting data is handled in the main thread. The main thread ignores mallocs with sequence numbers lower than the one already processed for this address. If the sample is a free, it is added to a buffer. As soon as all mallocs with a sequence number lower than the free have been handled, it is processed.
82*6dbdd20aSAndroid Build Coastguard Worker
83*6dbdd20aSAndroid Build Coastguard Worker
84*6dbdd20aSAndroid Build Coastguard Worker#### Unwinding
85*6dbdd20aSAndroid Build Coastguard Workerlibunwindstack is used for unwinding. A new Memory class is implemented that overlays the copied stack over the process memory (which is accessed using FDMemory). FDMemory uses read on `/proc/self/mem` file descriptors sent by the target application.
86*6dbdd20aSAndroid Build Coastguard Worker
87*6dbdd20aSAndroid Build Coastguard Worker```
88*6dbdd20aSAndroid Build Coastguard Workerclass StackMemory : public unwindstack::MemoryRemote {
89*6dbdd20aSAndroid Build Coastguard Worker public:
90*6dbdd20aSAndroid Build Coastguard Worker  ...
91*6dbdd20aSAndroid Build Coastguard Worker  size_t Read(uint64_t addr, void* dst, size_t size) override {
92*6dbdd20aSAndroid Build Coastguard Worker    if (addr >= sp_ && addr + size <= stack_end_ && addr + size > sp_) {
93*6dbdd20aSAndroid Build Coastguard Worker      size_t offset = static_cast<size_t>(addr - sp_);
94*6dbdd20aSAndroid Build Coastguard Worker      memcpy(dst, stack_ + offset, size);
95*6dbdd20aSAndroid Build Coastguard Worker      return size;
96*6dbdd20aSAndroid Build Coastguard Worker    }
97*6dbdd20aSAndroid Build Coastguard Worker
98*6dbdd20aSAndroid Build Coastguard Worker    return mem_->Read(addr, dst, size);
99*6dbdd20aSAndroid Build Coastguard Worker  }
100*6dbdd20aSAndroid Build Coastguard Worker
101*6dbdd20aSAndroid Build Coastguard Worker private:
102*6dbdd20aSAndroid Build Coastguard Worker  uint64_t sp_;
103*6dbdd20aSAndroid Build Coastguard Worker  uint8_t* stack_;
104*6dbdd20aSAndroid Build Coastguard Worker  size_t size_;
105*6dbdd20aSAndroid Build Coastguard Worker};
106*6dbdd20aSAndroid Build Coastguard Worker```
107*6dbdd20aSAndroid Build Coastguard Worker
108*6dbdd20aSAndroid Build Coastguard WorkerThis allows unwinding to work both for native code and all three execution modes of ART. Native libraries are mapped into the process memory, and ephemeral debug information written by ART is also accessible through the process memory. There is a chance that ART will garbage collect the information before the unwinding is done, in which case we will miss stack frames. As this is a sampling approach anyway, that loss of accuracy is acceptable.
109*6dbdd20aSAndroid Build Coastguard Worker
110*6dbdd20aSAndroid Build Coastguard WorkerRemote unwinding also enables us to use _global caching_ (`Elf::SetCachingEnabled(true)`) in libunwindstack. This prevents debug information being used by different processes to be loaded and decompressed multiple times.
111*6dbdd20aSAndroid Build Coastguard Worker
112*6dbdd20aSAndroid Build Coastguard WorkerWe add an `FDMaps` objects to parse maps from `/proc/self/maps` sent by the target process. We keep `FDMaps` object cached per process that is being profiled. This both saves the overhead of text-parsing `/proc/[pid]/maps` as well as keeps various objects needed for unwinding (e.g. decompressed minidebuginfo). In case an unwind fails with `ERROR_INVALID_MAP` we reparse the maps object. We will  do changes to libunwindstack to create a more general version of [`LocalUpdatableMaps`](https://cs.android.com/android/platform/superproject/main/+/main:system/unwinding/libunwindstack/Maps.cpp?q=symbol:LocalUpdatableMaps) that is also applicable for remote processes.
113*6dbdd20aSAndroid Build Coastguard Worker
114*6dbdd20aSAndroid Build Coastguard Worker
115*6dbdd20aSAndroid Build Coastguard Worker#### Advantages of remote unwinding
116*6dbdd20aSAndroid Build Coastguard Worker
117*6dbdd20aSAndroid Build Coastguard Worker**Crash-proofness:** Crashing bugs in the bookkeeping logic or libunwindstack do not result in user-visible crashes but only lack of profiling data. It will result in the connections to heapprofd being broken and profiling gracefully stopping on the client-side.
118*6dbdd20aSAndroid Build Coastguard Worker
119*6dbdd20aSAndroid Build Coastguard Worker**Performance:** copying the stack has much more consistent and higher performance than unwinding, which can take multiple milliseconds. See graph above.
120*6dbdd20aSAndroid Build Coastguard Worker
121*6dbdd20aSAndroid Build Coastguard Worker**Does not inflate higher-level metrics:** higher-level metrics such as PSS are not inflated by the book-keeping cost.
122*6dbdd20aSAndroid Build Coastguard Worker
123*6dbdd20aSAndroid Build Coastguard Worker**Compression:** bookkeeping of unwound frames can be more efficient if it is shared between multiple processes. E.g. common sequences of frames (in libc, ART, etc) can be deduped.
124*6dbdd20aSAndroid Build Coastguard Worker
125*6dbdd20aSAndroid Build Coastguard Worker
126*6dbdd20aSAndroid Build Coastguard Worker#### Disadvantages of remote unwinding
127*6dbdd20aSAndroid Build Coastguard Worker**Complexity:** the system has higher complexity than unwinding and symbolizing synchronously.
128*6dbdd20aSAndroid Build Coastguard Worker
129*6dbdd20aSAndroid Build Coastguard Worker#### Bookkeeping
130*6dbdd20aSAndroid Build Coastguard WorkerThe data is stored as a tree where each element has a back-pointer to its parent. This deduplicates repeated stack frames.  String interning is applied for method names and library names.
131*6dbdd20aSAndroid Build Coastguard Worker
132*6dbdd20aSAndroid Build Coastguard WorkerDetails will change to adapt to data collected during the implementation.
133*6dbdd20aSAndroid Build Coastguard Worker
134*6dbdd20aSAndroid Build Coastguard Worker
135*6dbdd20aSAndroid Build Coastguard Worker### Wire protocol
136*6dbdd20aSAndroid Build Coastguard WorkerIn early versions of heapprofd, we used a `SOCK_STREAM` socket to send callstacks to the service. We now use a shared memory based [wire protocol](heapprofd-wire-protocol.md) described in detail separately.
137*6dbdd20aSAndroid Build Coastguard Worker
138*6dbdd20aSAndroid Build Coastguard Worker### Failure modes
139*6dbdd20aSAndroid Build Coastguard Worker**heapprofd unwinding cannot keep up:** The shared memory buffer will reject new samples. If `block_client` is set, the client will retry until there is space in the shared memory buffer.
140*6dbdd20aSAndroid Build Coastguard Worker
141*6dbdd20aSAndroid Build Coastguard Worker**heapprofd crashes:** Writing on the control socket will fail, and the client will be torn down.
142*6dbdd20aSAndroid Build Coastguard Worker
143*6dbdd20aSAndroid Build Coastguard Worker**Writing in client fails:** If the write fails with any error code except `EINTR`, the connection is closed, and profiling is torn down.
144*6dbdd20aSAndroid Build Coastguard Worker
145*6dbdd20aSAndroid Build Coastguard Worker
146*6dbdd20aSAndroid Build Coastguard Worker### Fork handling
147*6dbdd20aSAndroid Build Coastguard WorkerAfter a process forks, we need to clean up the state that was initialized by the parent process and uninstall the malloc hooks. We do not intend to currently support following forks, see [Alternatives Considered](#alternatives-considered) for possible implementation thereof.
148*6dbdd20aSAndroid Build Coastguard Worker
149*6dbdd20aSAndroid Build Coastguard Worker## Performance considerations
150*6dbdd20aSAndroid Build Coastguard Worker
151*6dbdd20aSAndroid Build Coastguard Worker### Remote unwinding
152*6dbdd20aSAndroid Build Coastguard Worker_**Note:** This data was collected when heapprofd used a socket to communicate from client to service. We now use a shared-memory buffer, so we should be even lower overhead._
153*6dbdd20aSAndroid Build Coastguard Worker
154*6dbdd20aSAndroid Build Coastguard WorkerRemote unwinding is used to reduce the performance impact on the applications that are being profiled. After the stack has been sent, the application can resume its operation while the remote daemon unwinds the stack and does unwinding. As sending the stack is, on average, a faster operation than unwinding the stack, this results in a performance gain.
155*6dbdd20aSAndroid Build Coastguard Worker
156*6dbdd20aSAndroid Build Coastguard Worker
157*6dbdd20aSAndroid Build Coastguard Worker<table>
158*6dbdd20aSAndroid Build Coastguard Worker  <tr>
159*6dbdd20aSAndroid Build Coastguard Worker   <td>
160*6dbdd20aSAndroid Build Coastguard Worker
161*6dbdd20aSAndroid Build Coastguard Worker![](/docs/images/heapprofd-design/Android-Heap0.png)
162*6dbdd20aSAndroid Build Coastguard Worker
163*6dbdd20aSAndroid Build Coastguard Worker   </td>
164*6dbdd20aSAndroid Build Coastguard Worker   <td>
165*6dbdd20aSAndroid Build Coastguard Worker
166*6dbdd20aSAndroid Build Coastguard Worker![](/docs/images/heapprofd-design/Android-Heap1.png)
167*6dbdd20aSAndroid Build Coastguard Worker
168*6dbdd20aSAndroid Build Coastguard Worker   </td>
169*6dbdd20aSAndroid Build Coastguard Worker  </tr>
170*6dbdd20aSAndroid Build Coastguard Worker</table>
171*6dbdd20aSAndroid Build Coastguard Worker
172*6dbdd20aSAndroid Build Coastguard Worker**Mean unwind:** 413us
173*6dbdd20aSAndroid Build Coastguard Worker**Mean send:** 50us
174*6dbdd20aSAndroid Build Coastguard Worker**Median unwind:** 193us
175*6dbdd20aSAndroid Build Coastguard Worker**Median send:** 14us
176*6dbdd20aSAndroid Build Coastguard Worker**90 percentile unwind:** 715us
177*6dbdd20aSAndroid Build Coastguard Worker**90 percentile send:** 40us
178*6dbdd20aSAndroid Build Coastguard Worker
179*6dbdd20aSAndroid Build Coastguard Worker
180*6dbdd20aSAndroid Build Coastguard Worker### Sampling
181*6dbdd20aSAndroid Build Coastguard WorkerUnwinding the stack on every `malloc` call has a high cost that is not always worth paying. Thus malloc calls are sampled client-side using Poisson sampling with a probability proportional to their allocation size (i.e. larger allocations are more likely to be considered than small ones). All memory allocated since the last malloc considered is attributed to this allocation.
182*6dbdd20aSAndroid Build Coastguard Worker
183*6dbdd20aSAndroid Build Coastguard WorkerThe sampling rate is configurable as part of the initial handshake. A sampling rate == 1 will degenerate into the fully-accurate high-overhead mode.
184*6dbdd20aSAndroid Build Coastguard Worker
185*6dbdd20aSAndroid Build Coastguard WorkerSee [Sampling for Memory Profiles](/docs/design-docs/heapprofd-sampling) for
186*6dbdd20aSAndroid Build Coastguard Workermore details.
187*6dbdd20aSAndroid Build Coastguard Worker
188*6dbdd20aSAndroid Build Coastguard WorkerPrior art: [crbug.com/812262](http://crbug.com/812262), [crbug.com/803276](http://crbug.com/803276).
189*6dbdd20aSAndroid Build Coastguard Worker
190*6dbdd20aSAndroid Build Coastguard Worker## Implementation Plan
191*6dbdd20aSAndroid Build Coastguard Worker### Implement prototype [done]
192*6dbdd20aSAndroid Build Coastguard WorkerImplement a prototype of the system described above that works with SELinux `setenforce 0` and running as root on walleye.
193*6dbdd20aSAndroid Build Coastguard Worker
194*6dbdd20aSAndroid Build Coastguard Worker### Implement Benchmark [done]
195*6dbdd20aSAndroid Build Coastguard WorkerImplement a program that executes malloc / free calls from ground truth data. Profile this program using heapprofd, then compare results to ground truth data. Use this to iterate on sampling heuristics.
196*6dbdd20aSAndroid Build Coastguard Worker
197*6dbdd20aSAndroid Build Coastguard Worker### Productionize [done]
198*6dbdd20aSAndroid Build Coastguard WorkerDo security changes required to run heapprofd with `setenforce 1` and as non-root.
199*6dbdd20aSAndroid Build Coastguard Worker
200*6dbdd20aSAndroid Build Coastguard Worker
201*6dbdd20aSAndroid Build Coastguard Worker## Testing Plan
202*6dbdd20aSAndroid Build Coastguard Worker
203*6dbdd20aSAndroid Build Coastguard Worker* Employ fuzzing on the shared memory buffer. [done]
204*6dbdd20aSAndroid Build Coastguard Worker* Unit-tests for components. [done]
205*6dbdd20aSAndroid Build Coastguard Worker* CTS. [done]
206*6dbdd20aSAndroid Build Coastguard Worker
207*6dbdd20aSAndroid Build Coastguard Worker
208*6dbdd20aSAndroid Build Coastguard Worker## Background
209*6dbdd20aSAndroid Build Coastguard Worker
210*6dbdd20aSAndroid Build Coastguard Worker### ART modes of execution
211*6dbdd20aSAndroid Build Coastguard WorkerART (Android RunTime, the Android Java Runtime) has three different modes of execution.
212*6dbdd20aSAndroid Build Coastguard Worker
213*6dbdd20aSAndroid Build Coastguard Worker**Interpreted:** Java byte-code is interpreted during execution. Instrumentation in ART allows to get dexpc (~offset from dex file) for the code being executed.
214*6dbdd20aSAndroid Build Coastguard Worker
215*6dbdd20aSAndroid Build Coastguard Worker**JIT-compiled:** Java byte-code is compiled to native code during run-time. Both the code and the ELF information only live in process memory. The debug information is stored in a global variable, currently only if the app is debuggable or a global system property (`dalvik.vm.minidebuginfo`) is set. This is because the current implementation incurs a memory overhead that is too high to default enable.
216*6dbdd20aSAndroid Build Coastguard Worker
217*6dbdd20aSAndroid Build Coastguard Worker**AOT (ahead of time) compiled:** Java code is compiled into native code before run-time. This produces an .oat file, which is essentially an .so. Both code and ELF information is stored on disk. During execution, like shared native libraries, it is memory mapped into process memory.
218*6dbdd20aSAndroid Build Coastguard Worker
219*6dbdd20aSAndroid Build Coastguard Worker### Stack Unwinding
220*6dbdd20aSAndroid Build Coastguard WorkerStack unwinding is the process of determining the chain of return addresses from the raw bytes of the stack. These are the addresses we want to attribute the allocated memory to.
221*6dbdd20aSAndroid Build Coastguard Worker
222*6dbdd20aSAndroid Build Coastguard WorkerThe most efficient way of stack unwinding is using frame pointers. This is unreliable on Android as we do not control build parameters for vendor libraries or OEM builds and due to issues on ARM32. Thus, our stack unwinding relies on libunwindstack which uses DWARF information from the library ELF files to determine return addresses. This can significantly slower, with unwinding of a stack taking between 100μs and ~100 ms ([data from simpleperf](https://gist.github.com/fmayer/a3a5a352196f9037f34241f8fb09004d)).
223*6dbdd20aSAndroid Build Coastguard Worker
224*6dbdd20aSAndroid Build Coastguard Worker[libunwindstack](https://cs.android.com/android/platform/superproject/main/+/main:system/unwinding/libunwindstack/) is Android's replacement for [libunwind](https://www.nongnu.org/libunwind/). It has a modern C++ object-oriented API surface and support for Android specific features allowing it to unwind mixed native and Java applications using information emitted by ART depending on execution mode. It also supports symbolization for native code and all three execution modes or ART.
225*6dbdd20aSAndroid Build Coastguard Worker
226*6dbdd20aSAndroid Build Coastguard Worker### Symbolization
227*6dbdd20aSAndroid Build Coastguard WorkerSymbolization is the process of determining function name and line number from a code address. For builds by Google, we can get symbolized binaries (i.e. binaries with an ELF section that can be used for symbolization) from go/ab or https://ci.android.com (e.g. https://ci.android.com/builds/submitted/6410994/aosp_cf_x86_phone-userdebug/latest/aosp_cf_x86_phone-symbols-6410994.zip).
228*6dbdd20aSAndroid Build Coastguard Worker
229*6dbdd20aSAndroid Build Coastguard WorkerFor other builds, symbolization requires debug info contained within the binary. This information is often compressed. Symbolization of JIT-ed code requires information contained in process memory.
230*6dbdd20aSAndroid Build Coastguard Worker
231*6dbdd20aSAndroid Build Coastguard Worker### Perfetto
232*6dbdd20aSAndroid Build Coastguard Worker[Perfetto](https://perfetto.dev) is an open-source, highly efficient and expandable platform-wide tracing system that allows collection of performance data from kernel, apps and services. It aims to become the next-gen performance tracing mechanism for both Android and Chrome.
233*6dbdd20aSAndroid Build Coastguard Worker
234*6dbdd20aSAndroid Build Coastguard Worker
235*6dbdd20aSAndroid Build Coastguard Worker## Related Work
236*6dbdd20aSAndroid Build Coastguard Worker
237*6dbdd20aSAndroid Build Coastguard Worker### simpleperf
238*6dbdd20aSAndroid Build Coastguard WorkerEven though [simpleperf](https://cs.android.com/android/platform/superproject/main/+/main:system/extras/simpleperf/doc/README.md) is a CPU rather than memory profiler, it is similar in nature to the work proposed here in that it supports offline unwinding. The kernel is asked to provide copies of stack traces at regular intervals, which are dumped onto disk. The dumped information is then used to unwind the stacks after the profiling is complete.
239*6dbdd20aSAndroid Build Coastguard Worker
240*6dbdd20aSAndroid Build Coastguard Worker
241*6dbdd20aSAndroid Build Coastguard Worker### malloc-debug
242*6dbdd20aSAndroid Build Coastguard Worker[malloc-debug](https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/malloc_debug/) instruments bionic's allocation functions to detect common memory problems like buffer overflows, double frees, etc. This is similar to the project described in this document as it uses the same mechanism to instrument the libc allocation functions. Unlike heapprofd, it does not provide the user with heap dumps.
243*6dbdd20aSAndroid Build Coastguard Worker
244*6dbdd20aSAndroid Build Coastguard Worker
245*6dbdd20aSAndroid Build Coastguard Worker### Feature Matrix
246*6dbdd20aSAndroid Build Coastguard Worker|              | use after free detection | Java object graph attribution | native memory attribution | Android | out-of-process |
247*6dbdd20aSAndroid Build Coastguard Worker|--------------|--------------------------|-------------------------------|---------------------------|---------|----------------|
248*6dbdd20aSAndroid Build Coastguard Worker| heapprofd    | no                       | no                            | yes                       | yes     | yes            |
249*6dbdd20aSAndroid Build Coastguard Worker| malloc-debug | yes                      | no                            | yes                       | yes     | no             |
250*6dbdd20aSAndroid Build Coastguard Worker
251*6dbdd20aSAndroid Build Coastguard Worker## Alternatives Considered
252*6dbdd20aSAndroid Build Coastguard Worker
253*6dbdd20aSAndroid Build Coastguard Worker### Copy-on-write stack
254*6dbdd20aSAndroid Build Coastguard WorkerThe lower frames of the stack are unlikely to change between the client sending and the server unwinding the stack information. We wanted to exploit this fact by marking the stack pages as copy-on-write by [`vmsplice(2)`](http://man7.org/linux/man-pages/man2/vmsplice.2.html)-ing them into a pipe. Unfortunately, the vmsplice system call does not mark pages as copy-on-write, but is conceptually a mmap into the pipe buffer, which causes the daemon to see changes to the stack that happen after the vmsplice and hence corrupt the unwinder.
255*6dbdd20aSAndroid Build Coastguard Worker
256*6dbdd20aSAndroid Build Coastguard Worker
257*6dbdd20aSAndroid Build Coastguard Worker### Profiling across fork(2)
258*6dbdd20aSAndroid Build Coastguard WorkerIf we want to enable profiling for the newly forked process, we need to establish new connections to heapprofd and create a new connection pool. This is to prevent messages from the parent and child process from being interleaved.
259*6dbdd20aSAndroid Build Coastguard Worker
260*6dbdd20aSAndroid Build Coastguard WorkerFor non-zygote processes, we could use [`pthread_atfork(3)`](http://man7.org/linux/man-pages/man3/pthread_atfork.3.html) to establish new connections.
261*6dbdd20aSAndroid Build Coastguard Worker
262*6dbdd20aSAndroid Build Coastguard WorkerFor zygote processes, [`FileDescriptorInfo::ReopenOrDetach`](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/fd_utils.cpp?q=%22void%20FileDescriptorInfo::ReopenOrDetach%22), which is called after `fork(2)`– and thus after the `pthread_atfork` handlers – detaches all sockets, i.e. replaces them with file descriptors to `/dev/null`. If the socket is not contained within [`kPathWhiteList`](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/fd_utils.cpp?q=symbol:kPathWhitelist), zygote crashes instead. Thus using only a `pthread_atfork` handler is not feasible, as the connections established within will immediately get disconnected in zygote children.
263*6dbdd20aSAndroid Build Coastguard Worker
264*6dbdd20aSAndroid Build Coastguard WorkerAfter forking, zygote calls [`PreApplicationInit`](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/com_android_internal_os_Zygote.cpp?q=symbol:PreApplicationInit), which is currently used by malloc\_debug to detect whether it is in the root zygote or in a child process by setting `gMallocLeakZygoteChild`. It also calls [Java callbacks](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/com_android_internal_os_Zygote.cpp?q=CallStaticVoidMethod.*gCallPostForkChildHooks), but there does not seem to currently exist a way to dynamically register native callbacks.
265*6dbdd20aSAndroid Build Coastguard Worker
266*6dbdd20aSAndroid Build Coastguard WorkerNaive lazy initialization (i.e. closing the socket in the atfork handler, and then reconnecting on the first call to malloc) is problematic, as the code in zygote between fork and `ReopenOrDetach` might call `malloc`, thus leading to the connection to be established, which then gets closed by `ReopenOrDetach` again.
267*6dbdd20aSAndroid Build Coastguard Worker
268*6dbdd20aSAndroid Build Coastguard WorkerTo solve this, we can take an approach similar to `gMallocLeakZygoteChild`. Before forking, zygote will be modified to set `gheapprofdInZygoteFork` to true, and after the fork handling is finished it will be set to false. This way we can make sure we delay the lazy initialization until the fork is fully complete. `pthread_atfork` is used to close the file descriptors after fork in the child.
269*6dbdd20aSAndroid Build Coastguard Worker
270*6dbdd20aSAndroid Build Coastguard Worker
271*6dbdd20aSAndroid Build Coastguard Worker### Profiling app from startup by externally detecting startup
272*6dbdd20aSAndroid Build Coastguard WorkerThis option relies on the ability of the tracing system to detect app startup (which we'll need regardless for perf profiling).
273*6dbdd20aSAndroid Build Coastguard Worker
274*6dbdd20aSAndroid Build Coastguard Worker**Advantages**
275*6dbdd20aSAndroid Build Coastguard Worker*   one case fewer to handle from the libc viewpoint
276*6dbdd20aSAndroid Build Coastguard Worker
277*6dbdd20aSAndroid Build Coastguard Worker**Disadvantages**
278*6dbdd20aSAndroid Build Coastguard Worker*   less accurate, will miss the first X ms of startup
279*6dbdd20aSAndroid Build Coastguard Worker*   a mechanism that watches ftrace events to detect startup is non-trivial.
280*6dbdd20aSAndroid Build Coastguard Worker
281*6dbdd20aSAndroid Build Coastguard Worker### Delayed Unwinding
282*6dbdd20aSAndroid Build Coastguard WorkerAnticipating that many allocations are short-lived, we can delay the unwinding of stacks by a fixed time. This is a memory vs CPU usage trade-off, as these stacks have to be held in memory until they are either unwound or freed.
283*6dbdd20aSAndroid Build Coastguard Worker
284*6dbdd20aSAndroid Build Coastguard WorkerThis graph shows that 20 % of allocations are freed within 900 sampled allocations (at 1 %, so 500000 total) from the same process.
285*6dbdd20aSAndroid Build Coastguard Worker
286*6dbdd20aSAndroid Build Coastguard Worker
287*6dbdd20aSAndroid Build Coastguard Worker<table>
288*6dbdd20aSAndroid Build Coastguard Worker  <tr>
289*6dbdd20aSAndroid Build Coastguard Worker   <td>
290*6dbdd20aSAndroid Build Coastguard Worker
291*6dbdd20aSAndroid Build Coastguard Worker![](/docs/images/heapprofd-design/Android-Heap2.png)
292*6dbdd20aSAndroid Build Coastguard Worker
293*6dbdd20aSAndroid Build Coastguard Worker<p>
294*6dbdd20aSAndroid Build Coastguard Worker<strong>Mean:</strong> 7000 allocations
295*6dbdd20aSAndroid Build Coastguard Worker   </td>
296*6dbdd20aSAndroid Build Coastguard Worker   <td>
297*6dbdd20aSAndroid Build Coastguard Worker
298*6dbdd20aSAndroid Build Coastguard Worker![](/docs/images/heapprofd-design/Android-Heap3.png)
299*6dbdd20aSAndroid Build Coastguard Worker
300*6dbdd20aSAndroid Build Coastguard Worker<p>
301*6dbdd20aSAndroid Build Coastguard Worker<strong>Mean:</strong> 8950 bytes
302*6dbdd20aSAndroid Build Coastguard Worker   </td>
303*6dbdd20aSAndroid Build Coastguard Worker  </tr>
304*6dbdd20aSAndroid Build Coastguard Worker</table>
305*6dbdd20aSAndroid Build Coastguard Worker
306*6dbdd20aSAndroid Build Coastguard Worker
307*6dbdd20aSAndroid Build Coastguard WorkerSo, at 1 % sampling rate, for at a cost of ~8 megabytes (900 \* 8950) per process, we can reduce the number of unwinds by around 20 %. This will not allow us to get an accurate number of "allocated space", so this idea was rejected.
308