# heapprofd: Android Heap Profiler

_**Status:** COMPLETED **·** fmayer, primiano **·** 2018-06-15_
_**Updated:** 2020-04-20_

## Objective
Provide a low-overhead native heap profiling mechanism, with C++ and Java callstack attribution, usable by all processes on an Android system. This includes Java and native services. The mechanism is capable of exporting heap dumps into traces in order to be able to correlate heap information with other activity on the system. This feature was added in the Android 10 release.

## Overview

Implement an out-of-process heap profiler. Do the minimal amount of processing inline in malloc, and then delegate to a central component for further processing. This introduces a new daemon _heapprofd_.

When tracing is enabled, either via a system property or a signal delivered to an existing process, a given percentage of malloc calls copy the current call stack into a shared memory buffer that is received by heapprofd. heapprofd uses libunwindstack asynchronously for stack unwinding and symbolization. This information is used to build bookkeeping tables to track live allocations and is ultimately dumped into the Perfetto trace.

All data referenced in this design document was collected on a Pixel 2 on Android P.

### Requirements
These are the properties that must be fulfilled by a heap profiler for Android:

**No setup:** A heap profile can be taken with a single command.

**Profile running apps:** The system can be used to enable profiling of already running apps to get information about memory usage _since the profiling was enabled_ without requiring a restart. This can be useful to track down memory leaks.

**Attribute to Java methods:** The system can be used to track the native memory usage of Java applications. Allocations on the Java heap are outside of the scope of this work.

**Zero overhead when disabled:** The system must not incur a performance overhead when it is not enabled.

**Profile whole system:** The system must be capable of handling the load of profiling all running processes. The sample rate is adjusted to limit the amount of data.

**Negligible in-process memory overhead:** The system must not hold bookkeeping data in the process, so as not to inflate higher-level metrics like PSS.

**Bounded performance impact:** The device must still be usable for all use-cases.

## Detailed Design

### Enabling profiling

#### Use case 1: profiling future allocations from a running process
One of the real-time signals ([`BIONIC_SIGNAL_PROFILER`](https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/platform/bionic/reserved_signals.h?q=symbol:BIONIC_SIGNAL_PROFILER)) is reserved in libc as a triggering mechanism. In this scenario:

* heapprofd sends an RT signal to the target process.
* Upon receipt of the signal, bionic reacts by installing a temporary malloc hook, which in turn spawns a thread to dynamically load libheapprofd.so in the process context. This means heapprofd will not work for statically linked binaries, as they lack the ability to `dlopen`. We cannot spawn the thread directly from the signal handler, as `pthread_create` is not async-signal-safe.
* The initializer in libheapprofd.so is called to take care of the rest (see [client operation](#client-operation-and-in-process-hooks) below).


#### Use case 2: profiling a single process from startup
* heapprofd sets a property of the form libc.debug.heapprofd.argv0 (argv0 being the first argument in `/proc/self/cmdline`, up to the first ":").
* Native processes: when bionic is initialized, it checks for the presence of the property and, if it is found and matches the process name, loads libheapprofd.so.
* Managed Java processes: zygote calls `mallopt(M_SET_ZYGOTE_CHILD, ...)` in `PreApplicationInit`. In this call, bionic checks for the presence of the property and, if it is found and matches the process name, loads libheapprofd.so and continues as above.

#### Use case 3: profiling the whole system
A system property `libc.heapprofd.enable` can be set to enable heap profiling on startup. When this property is set, every process will load the libheapprofd.so library on startup. The rest is identical to the case above.


### Disabling profiling
Profiling is disabled by shutting down the sockets from the heapprofd end.
Upon send() failure the client will uninstall the hooks (see appendix: thread-safe hooks setup / teardown).


### Client operation and in process hooks
Upon initialization of libheapprofd.so:

* The client establishes a connection to the heapprofd daemon through a connected UNIX socket.
* Upon connection, the daemon will send a packet to the client specifying the profiling configuration (sampling rate; sample all threads / only specific threads; tuning of sampling heuristics). It will also send an FD for the SharedMemoryBuffer used to send samples (see [wire protocol](heapprofd-wire-protocol.md)).
* The malloc hooks are installed.

Upon each `*alloc()/posix_memalign()` call, the client library will perform some minimal bookkeeping. If the sampling rate is hit, it will copy the raw stack, together with a header specifying register state, tid, a global sequence number of the operation, and size of the allocation, to the shared memory buffer. It will then send on a control socket to wake up the service.

Upon each `free()` call, the client will append the freed address to a global (process-wide) append-only buffer (the buffer avoids the overhead of a send() for each free). This buffer of free()s is sent to the heapprofd daemon when the fixed-size buffer is full or after a preset number of operations. The buffer also records a global sequence number for each operation.
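
The batching of `free()` calls described above can be sketched roughly as follows. This is a minimal illustration, not heapprofd's actual client code: `FreeEntry`, `FreeBatch`, the flush callback, and the capacity of four entries are hypothetical names and values chosen for the example. A real malloc hook would also have to avoid allocating memory or blocking on this path, which `std::function` and `std::mutex` do not guarantee.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical record for one free(): the freed address plus the global
// sequence number of the operation.
struct FreeEntry {
  uint64_t sequence_number;
  uint64_t addr;
};

// Hypothetical process-wide batch of free() records. The flush callback
// stands in for sending the batch to the daemon.
class FreeBatch {
 public:
  using FlushFn = std::function<void(const FreeEntry*, size_t)>;

  explicit FreeBatch(FlushFn flush) : flush_(std::move(flush)) {}

  // Called from the free() hook. Flushes once the fixed-size buffer is full.
  void RecordFree(uint64_t addr, uint64_t sequence_number) {
    std::lock_guard<std::mutex> lock(mutex_);
    entries_[size_++] = FreeEntry{sequence_number, addr};
    if (size_ == entries_.size()) FlushLocked();
  }

  // Sends any pending entries, e.g. after a preset number of operations.
  void Flush() {
    std::lock_guard<std::mutex> lock(mutex_);
    FlushLocked();
  }

 private:
  // Hands all pending entries to the callback and empties the buffer.
  void FlushLocked() {
    if (size_ == 0) return;
    flush_(entries_.data(), size_);
    size_ = 0;
  }

  static constexpr size_t kCapacity = 4;  // deliberately tiny for the example
  std::mutex mutex_;
  std::array<FreeEntry, kCapacity> entries_{};
  size_t size_ = 0;
  FlushFn flush_;
};
```

Flushing only when the buffer fills bounds the number of sends to one per `kCapacity` frees, while `Flush()` covers the case where a batch is sent after a preset number of operations.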

If the send() fails because heapprofd has shut down the socket, either voluntarily (graceful disabling) or involuntarily (it has crashed), the client will tear down the hooks and disable any profiling operation.


### Service operation

The unwinder threads read the client's shared memory buffers and handle the samples received. The result of the unwinding is then enqueued using a PostTask for the main thread to do the accounting. A queue-based model between the threads is chosen because it makes synchronization easier. No synchronization is needed at all in the main thread, as the bookkeeping data will only be accessed by it.

If the sample is a malloc, the stack is unwound and the resulting data is handled in the main thread. The main thread ignores mallocs with sequence numbers lower than the one already processed for this address. If the sample is a free, it is added to a buffer. As soon as all mallocs with a sequence number lower than the free have been handled, it is processed.


#### Unwinding
libunwindstack is used for unwinding. A new Memory class is implemented that overlays the copied stack over the process memory (which is accessed using FDMemory). FDMemory uses read on `/proc/self/mem` file descriptors sent by the target application.

```
class StackMemory : public unwindstack::MemoryRemote {
 public:
  ...
  size_t Read(uint64_t addr, void* dst, size_t size) override {
    // Serve reads that fall entirely inside the copied stack from the copy.
    if (addr >= sp_ && addr + size <= sp_ + size_ && addr + size > sp_) {
      size_t offset = static_cast<size_t>(addr - sp_);
      memcpy(dst, stack_ + offset, size);
      return size;
    }

    // Everything else is read from the target's process memory.
    return mem_->Read(addr, dst, size);
  }

 private:
  unwindstack::Memory* mem_;  // process memory of the target (FDMemory)
  uint64_t sp_;               // stack pointer at the time of the sample
  uint8_t* stack_;            // copy of the raw stack, starting at sp_
  size_t size_;               // number of copied stack bytes
};
```

This allows unwinding to work both for native code and all three execution modes of ART. Native libraries are mapped into the process memory, and ephemeral debug information written by ART is also accessible through the process memory. There is a chance that ART will garbage collect the information before the unwinding is done, in which case we will miss stack frames. As this is a sampling approach anyway, that loss of accuracy is acceptable.
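
Returning to the `StackMemory` overlay shown above, its read path can be exercised without libunwindstack. The sketch below is a simplified stand-in under stated assumptions: `Memory`, `ZeroMemory`, and `StackOverlay` are hypothetical classes written for this example (they are not the libunwindstack types), and the fallback returns zero bytes where a real implementation would read the target's process memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the unwinder's memory interface.
class Memory {
 public:
  virtual ~Memory() = default;
  virtual size_t Read(uint64_t addr, void* dst, size_t size) = 0;
};

// Hypothetical fallback that models process memory as all-zero bytes.
class ZeroMemory : public Memory {
 public:
  size_t Read(uint64_t, void* dst, size_t size) override {
    std::memset(dst, 0, size);
    return size;
  }
};

// Overlays a copied stack range [sp, sp + size) on top of a fallback memory.
class StackOverlay : public Memory {
 public:
  StackOverlay(Memory* fallback, uint64_t sp, const uint8_t* stack, size_t size)
      : fallback_(fallback), sp_(sp), stack_end_(sp + size), stack_(stack) {}

  size_t Read(uint64_t addr, void* dst, size_t size) override {
    // Use the copy only if the whole read lies inside the copied range.
    // Comparing size against the remaining bytes avoids overflow in addr + size.
    if (size > 0 && addr >= sp_ && addr < stack_end_ &&
        size <= stack_end_ - addr) {
      std::memcpy(dst, stack_ + (addr - sp_), size);
      return size;
    }
    return fallback_->Read(addr, dst, size);
  }

 private:
  Memory* fallback_;
  uint64_t sp_;
  uint64_t stack_end_;
  const uint8_t* stack_;
};
```

Reads that fall only partly inside the copied range are passed to the fallback unchanged, mirroring the bounds check in `StackMemory::Read`.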

Remote unwinding also enables us to use _global caching_ (`Elf::SetCachingEnabled(true)`) in libunwindstack. This prevents debug information used by different processes from being loaded and decompressed multiple times.

We add an `FDMaps` object to parse maps from `/proc/self/maps` sent by the target process. We keep one `FDMaps` object cached per process that is being profiled. This both saves the overhead of text-parsing `/proc/[pid]/maps` and keeps various objects needed for unwinding (e.g. decompressed minidebuginfo). In case an unwind fails with `ERROR_INVALID_MAP` we reparse the maps object. We will change libunwindstack to create a more general version of [`LocalUpdatableMaps`](https://cs.android.com/android/platform/superproject/main/+/main:system/unwinding/libunwindstack/Maps.cpp?q=symbol:LocalUpdatableMaps) that is also applicable for remote processes.


#### Advantages of remote unwinding

**Crash-proofness:** Crashing bugs in the bookkeeping logic or libunwindstack do not result in user-visible crashes but only in a lack of profiling data. They result in the connections to heapprofd being broken and profiling gracefully stopping on the client side.

**Performance:** Copying the stack has much more consistent and higher performance than unwinding, which can take multiple milliseconds. See the measurements in the Performance considerations section.

**Does not inflate higher-level metrics:** Higher-level metrics such as PSS are not inflated by the bookkeeping cost.

**Compression:** Bookkeeping of unwound frames can be more efficient if it is shared between multiple processes. E.g. common sequences of frames (in libc, ART, etc.) can be deduplicated.


#### Disadvantages of remote unwinding
**Complexity:** The system has higher complexity than unwinding and symbolizing synchronously.

#### Bookkeeping
The data is stored as a tree where each element has a back-pointer to its parent. This deduplicates repeated stack frames. String interning is applied for method names and library names.

Details will change to adapt to data collected during the implementation.


### Wire protocol
In early versions of heapprofd, we used a `SOCK_STREAM` socket to send callstacks to the service. We now use a shared-memory-based [wire protocol](heapprofd-wire-protocol.md), described in detail separately.

### Failure modes
**heapprofd unwinding cannot keep up:** The shared memory buffer will reject new samples.
If `block_client` is set, the client will retry until there is space in the shared memory buffer.

**heapprofd crashes:** Writing on the control socket will fail, and the client will be torn down.

**Writing in client fails:** If the write fails with any error code except `EINTR`, the connection is closed, and profiling is torn down.


### Fork handling
After a process forks, we need to clean up the state that was initialized by the parent process and uninstall the malloc hooks. We do not currently intend to support following forks; see [Alternatives Considered](#alternatives-considered) for a possible implementation.

## Performance considerations

### Remote unwinding
_**Note:** This data was collected when heapprofd used a socket to communicate from client to service. We now use a shared-memory buffer, so the overhead should be even lower._

Remote unwinding is used to reduce the performance impact on the applications that are being profiled. After the stack has been sent, the application can resume its operation while the remote daemon unwinds the stack. As sending the stack is, on average, a faster operation than unwinding the stack, this results in a performance gain.

|                 | unwind | send  |
|-----------------|--------|-------|
| Mean            | 413 μs | 50 μs |
| Median          | 193 μs | 14 μs |
| 90th percentile | 715 μs | 40 μs |


### Sampling
Unwinding the stack on every `malloc` call has a high cost that is not always worth paying. Thus malloc calls are sampled client-side using Poisson sampling with a probability proportional to their allocation size (i.e. larger allocations are more likely to be considered than small ones). All memory allocated since the last malloc considered is attributed to this allocation.

The sampling rate is configurable as part of the initial handshake.
A sampling rate of 1 degenerates into the fully accurate, high-overhead mode.

See [Sampling for Memory Profiles](/docs/design-docs/heapprofd-sampling) for more details.

Prior art: [crbug.com/812262](http://crbug.com/812262), [crbug.com/803276](http://crbug.com/803276).

## Implementation Plan
### Implement prototype [done]
Implement a prototype of the system described above that works with SELinux `setenforce 0` and running as root on walleye.

### Implement Benchmark [done]
Implement a program that executes malloc / free calls from ground truth data. Profile this program using heapprofd, then compare results to ground truth data. Use this to iterate on sampling heuristics.

### Productionize [done]
Make the security changes required to run heapprofd with `setenforce 1` and as non-root.


## Testing Plan

* Employ fuzzing on the shared memory buffer. [done]
* Unit-tests for components. [done]
* CTS.
[done]


## Background

### ART modes of execution
ART (Android RunTime, the Android Java Runtime) has three different modes of execution.

**Interpreted:** Java byte-code is interpreted during execution. Instrumentation in ART allows us to get the dexpc (~offset from the dex file) for the code being executed.

**JIT-compiled:** Java byte-code is compiled to native code during run-time. Both the code and the ELF information only live in process memory. The debug information is stored in a global variable, currently only if the app is debuggable or a global system property (`dalvik.vm.minidebuginfo`) is set. This is because the current implementation incurs a memory overhead that is too high to enable by default.

**AOT (ahead of time) compiled:** Java code is compiled into native code before run-time. This produces an .oat file, which is essentially an .so. Both code and ELF information are stored on disk. During execution, like shared native libraries, it is memory mapped into process memory.

### Stack Unwinding
Stack unwinding is the process of determining the chain of return addresses from the raw bytes of the stack. These are the addresses we want to attribute the allocated memory to.

The most efficient way of stack unwinding is using frame pointers. This is unreliable on Android as we do not control build parameters for vendor libraries or OEM builds, and due to issues on ARM32. Thus, our stack unwinding relies on libunwindstack, which uses DWARF information from the library ELF files to determine return addresses. This can be significantly slower, with unwinding of a stack taking between 100μs and ~100 ms ([data from simpleperf](https://gist.github.com/fmayer/a3a5a352196f9037f34241f8fb09004d)).

[libunwindstack](https://cs.android.com/android/platform/superproject/main/+/main:system/unwinding/libunwindstack/) is Android's replacement for [libunwind](https://www.nongnu.org/libunwind/). It has a modern C++ object-oriented API surface and support for Android-specific features, allowing it to unwind mixed native and Java applications using information emitted by ART depending on execution mode. It also supports symbolization for native code and all three execution modes of ART.

### Symbolization
Symbolization is the process of determining function name and line number from a code address. For builds by Google, we can get symbolized binaries (i.e. binaries with an ELF section that can be used for symbolization) from go/ab or https://ci.android.com (e.g. https://ci.android.com/builds/submitted/6410994/aosp_cf_x86_phone-userdebug/latest/aosp_cf_x86_phone-symbols-6410994.zip).

For other builds, symbolization requires debug info contained within the binary. This information is often compressed.
Symbolization of JIT-ed code requires information contained in process memory.

### Perfetto
[Perfetto](https://perfetto.dev) is an open-source, highly efficient and expandable platform-wide tracing system that allows collection of performance data from kernel, apps and services. It aims to become the next-gen performance tracing mechanism for both Android and Chrome.


## Related Work

### simpleperf
Even though [simpleperf](https://cs.android.com/android/platform/superproject/main/+/main:system/extras/simpleperf/doc/README.md) is a CPU profiler rather than a memory profiler, it is similar in nature to the work proposed here in that it supports offline unwinding. The kernel is asked to provide copies of stack traces at regular intervals, which are dumped onto disk. The dumped information is then used to unwind the stacks after the profiling is complete.


### malloc-debug
[malloc-debug](https://cs.android.com/android/platform/superproject/main/+/main:bionic/libc/malloc_debug/) instruments bionic's allocation functions to detect common memory problems like buffer overflows, double frees, etc. This is similar to the project described in this document as it uses the same mechanism to instrument the libc allocation functions. Unlike heapprofd, it does not provide the user with heap dumps.


### Feature Matrix
|              | use after free detection | Java object graph attribution | native memory attribution | Android | out-of-process |
|--------------|--------------------------|-------------------------------|---------------------------|---------|----------------|
| heapprofd    | no                       | no                            | yes                       | yes     | yes            |
| malloc-debug | yes                      | no                            | yes                       | yes     | no             |

## Alternatives Considered

### Copy-on-write stack
The lower frames of the stack are unlikely to change between the client sending and the server unwinding the stack information. We wanted to exploit this fact by marking the stack pages as copy-on-write by [`vmsplice(2)`](http://man7.org/linux/man-pages/man2/vmsplice.2.html)-ing them into a pipe. Unfortunately, the vmsplice system call does not mark pages as copy-on-write, but is conceptually an mmap into the pipe buffer. This causes the daemon to see changes to the stack that happen after the vmsplice, which corrupts the unwinding.


### Profiling across fork(2)
If we want to enable profiling for the newly forked process, we need to establish new connections to heapprofd and create a new connection pool. This is to prevent messages from the parent and child process from being interleaved.

For non-zygote processes, we could use [`pthread_atfork(3)`](http://man7.org/linux/man-pages/man3/pthread_atfork.3.html) to establish new connections.

For zygote processes, [`FileDescriptorInfo::ReopenOrDetach`](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/fd_utils.cpp?q=%22void%20FileDescriptorInfo::ReopenOrDetach%22), which is called after `fork(2)` (and thus after the `pthread_atfork` handlers), detaches all sockets, i.e. replaces them with file descriptors to `/dev/null`. If the socket is not contained within [`kPathWhitelist`](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/fd_utils.cpp?q=symbol:kPathWhitelist), zygote crashes instead. Thus using only a `pthread_atfork` handler is not feasible, as the connections established within it will immediately get disconnected in zygote children.

After forking, zygote calls [`PreApplicationInit`](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/com_android_internal_os_Zygote.cpp?q=symbol:PreApplicationInit), which is currently used by malloc\_debug to detect whether it is in the root zygote or in a child process by setting `gMallocLeakZygoteChild`. It also calls [Java callbacks](https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/jni/com_android_internal_os_Zygote.cpp?q=CallStaticVoidMethod.*gCallPostForkChildHooks), but there does not currently seem to be a way to dynamically register native callbacks.

Naive lazy initialization (i.e.
closing the socket in the atfork handler, and then reconnecting on the first call to malloc) is problematic, as the code in zygote between fork and `ReopenOrDetach` might call `malloc`, causing the connection to be established, only to be closed again by `ReopenOrDetach`.

To solve this, we can take an approach similar to `gMallocLeakZygoteChild`. Before forking, zygote will be modified to set `gheapprofdInZygoteFork` to true, and after the fork handling is finished it will be set to false. This way we can make sure we delay the lazy initialization until the fork is fully complete. `pthread_atfork` is used to close the file descriptors after fork in the child.


### Profiling app from startup by externally detecting startup
This option relies on the ability of the tracing system to detect app startup (which we'll need regardless for perf profiling).

**Advantages**
* One case fewer to handle from the libc viewpoint.

**Disadvantages**
* Less accurate: it will miss the first X ms of startup.
* A mechanism that watches ftrace events to detect startup is non-trivial.

### Delayed Unwinding
Anticipating that many allocations are short-lived, we can delay the unwinding of stacks by a fixed time.
This is a memory vs CPU usage trade-off, as these stacks have to be held in memory until they are either unwound or freed.

This graph shows that 20 % of allocations are freed within 900 sampled allocations (at 1 %, so 500000 total) from the same process.


<table>
  <tr>
    <td>
<p><strong>Mean:</strong> 7000 allocations</p>
    </td>
    <td>
<p><strong>Mean:</strong> 8950 bytes</p>
    </td>
  </tr>
</table>


So, at a 1 % sampling rate, at a cost of ~8 megabytes (900 \* 8950 bytes) per process, we can reduce the number of unwinds by around 20 %. This will not allow us to get an accurate number of "allocated space", so this idea was rejected.