# Batch Trace Processor
This document describes the overall design of Batch Trace Processor and
aids in integrating it into other systems.

## Motivation
The Perfetto trace processor is the de-facto way to perform analysis on a
single trace. Using the
[trace processor Python API](/docs/analysis/trace-processor#python-api),
traces can be queried interactively, plots made from those results etc.

While queries on a single trace are useful when debugging a specific problem
in that trace or in the very early stages of understanding a domain, it soon
becomes limiting. One trace is unlikely to be representative of the entire
population and it's easy to overfit queries i.e. spend a lot of effort on
breaking down a problem in that trace while neglecting other, more common
issues in the population.

Because of this, what we actually want is to be able to query many traces
(usually on the order of 250-10000+) and identify the patterns which show
up in a significant fraction of them. This ensures that time is being spent
on issues which are affecting user experience instead of just a random
problem which happened to show up in one trace.

One low-effort option for solving this problem is simply to ask people to use
utilities like [Executors](https://docs.python.org/3/library/concurrent.futures.html#executor-objects)
with the Python API to load multiple traces and query them in parallel.
Unfortunately, there are several downsides to this approach:
* Every user has to reinvent the wheel every time they want to query multiple
  traces. Over time, there would likely be a proliferation of slightly modified
  code copied from place to place.
* While parallelising queries on multiple traces on a single machine is
  straightforward, one day we may want to shard trace processing across
  multiple machines. Once this happens, the complexity of the code would rise
  significantly, to the point where a central implementation becomes a
  necessity. Because of this, it's better to have the API in place before
  engineers start building their own custom solutions.
* A big aim for the Perfetto team these days is to make trace analysis more
  accessible, to reduce the number of places where we need to be in the loop.
  Having a well supported API for an important usecase like bulk trace
  analysis directly helps with this.

While we've discussed querying traces so far, the experience of loading traces
from different sources should be just as good. This has historically been a big
reason why the Python API has not gained as much adoption as we would have
liked.

Especially internally in Google, we should not be relying on engineers knowing
where traces live on the network filesystem and the directory layout. Instead,
they should simply be able to specify the data source (i.e. lab, testing
population) and some parameters (e.g. build id, date, kernel version) that
traces should match; traces meeting these criteria should then be found and
loaded.

Putting all this together, we want to build a library which can:
* Interactively query ~1000+ traces in O(s) (for simple queries)
* Expose the full SQL expressiveness of trace processor
* Load traces from many sources with minimal ceremony. This should include
  Google-internal sources: e.g. lab runs and internal testing populations
* Integrate with data analysis libraries for easy charting and visualization
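As a rough illustration of the experience we are aiming for, usage might look
something like the sketch below. The import path, class name, method name and
URI scheme are assumptions used for illustration, not a finalized interface.

```python
# Illustrative sketch only: the import path, class name, method name and URI
# scheme are assumed for the purposes of this example.
from perfetto.batch_trace_processor.api import BatchTraceProcessor

# A trace URI describing both the source of the traces and the parameters
# they should match (hypothetical scheme and parameters).
uri = 'lab:?build_id=12345&date=2023-06-01'

with BatchTraceProcessor(uri) as btp:
  # Runs the query against every matching trace in parallel and flattens
  # the per-trace results into a single table (e.g. a pandas DataFrame).
  df = btp.query_and_flatten('SELECT dur FROM slice WHERE name = "measure"')
  print(df.describe())
```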
## Design Highlights
In this section, we briefly discuss some of the most impactful design decisions
taken when building batch trace processor and the reasons behind them.

### Language
The choice of language is pretty straightforward. Python is already the go-to
language for data analysis in a wide variety of domains and our problem is not
unique enough to warrant making a different decision. Moreover, another point
in favour is the existence of the Python API for trace processor. This further
eases the implementation as we do not have to start from scratch.

The main downside of choosing Python is performance, but given that all the
data crunching happens in C++ inside TP, this is not a big factor.

### Trace URIs and Resolvers
[Trace URIs](/docs/analysis/batch-trace-processor#trace-uris)
are an elegant solution to the problem of loading traces from a diverse range
of public and internal sources. As with web URIs, the idea with trace URIs is
to describe both the protocol (i.e. the source) from which traces should be
fetched and the arguments (i.e. query parameters) which the traces should
match.

Batch trace processor should integrate tightly with trace URIs and their
resolvers. Users should be able to pass either just the URI (which is really
just a string, for maximum flexibility) or a resolver object which can yield a
list of trace file paths.

To handle URI strings, there should be some mechanism for "registering"
resolvers to make them eligible to resolve a certain "protocol". By default, we
should provide a resolver which handles the local filesystem. We should ensure
that the resolver design is such that resolvers can be closed source while the
rest of batch trace processor is open.

Along with the job of yielding a list of traces, resolvers should also be
responsible for creating metadata for each trace: these are different pieces of
information about the trace that the user might be interested in, e.g. OS
version, device name, collection date etc. The metadata can then be used when
"flattening" results across many traces as discussed below.
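To make the resolver contract more concrete, a minimal sketch is shown below.
The class name, protocol string, method names and paths are assumptions chosen
for illustration rather than an actual interface.

```python
# Minimal sketch of a resolver, assuming a resolver's job is to yield trace
# file paths together with per-trace metadata. All names are illustrative.
class LabResolver:
  # Hypothetical protocol this resolver would be registered under, so that
  # URIs of the form 'lab:?build_id=...' are dispatched to it.
  PROTOCOL = 'lab'

  def __init__(self, build_id=None, device=None):
    self.build_id = build_id
    self.device = device

  def resolve(self):
    # A real implementation would query the internal trace storage for traces
    # matching the parameters. The metadata dictionaries can later be attached
    # as extra columns when flattening query results.
    return [
        ('/path/to/trace_1.pftrace', {'build_id': self.build_id,
                                      'device': 'deviceA'}),
        ('/path/to/trace_2.pftrace', {'build_id': self.build_id,
                                      'device': 'deviceB'}),
    ]
```

Batch trace processor could then accept either the URI string (dispatching on
the registered protocol) or a resolver instance constructed directly by the
user.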
### Persisting loaded traces
Optimizing the loading of traces is critical for the O(s) query performance we
want out of batch trace processor. Traces are often accessed over the network,
meaning fetching their contents has a high latency. Traces also take at least a
few seconds to parse, eating into the O(s) budget before we even account for
the running time of queries.

To address this issue, we made the decision to keep all traces fully loaded in
memory in trace processor instances. That way, instead of loading them for
every query/set of queries, we can issue queries directly.

For the moment, we restrict the loading and querying of traces to a single
machine. While querying n traces is "embarrassingly parallel" and shards
perfectly across multiple machines, introducing distributed systems to any
solution simply makes everything more complicated. The move to multiple
machines is explored further in the "Future plans" section.

### Flattening query results
The naive way to return the result of querying n traces is a list of n
elements, with each element being the result for a single trace. However, after
performing several case-study performance investigations using batch trace
processor, it became clear that this obvious answer was not the most convenient
for the end user.

Instead, a pattern which proved very useful was to "flatten" the results into a
single table containing the results from all the traces. However, simply
flattening causes us to lose the information about which trace a row originated
from. We can deal with this by allowing resolvers to silently add columns with
the metadata for each trace.

So suppose we query three traces with:

```sql
SELECT ts, dur FROM slice
```

Then the flattening operation might do something like this behind the scenes:
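Conceptually, the per-trace results are concatenated into one table, with the
resolver-provided metadata attached as extra columns so the originating trace
can still be identified. A sketch of the idea using pandas is shown below; the
metadata column name and all values are purely hypothetical.

```python
# Sketch of the flattening concept; the metadata column name and the values
# are hypothetical and only illustrate the shape of the output.
import pandas as pd

per_trace_results = [
    # (result of the query on one trace, metadata attached by the resolver)
    (pd.DataFrame({'ts': [100, 200], 'dur': [10, 20]}), {'device': 'deviceA'}),
    (pd.DataFrame({'ts': [150], 'dur': [5]}), {'device': 'deviceB'}),
    (pd.DataFrame({'ts': [300], 'dur': [7]}), {'device': 'deviceC'}),
]

# Attach the metadata as columns and concatenate everything into one table:
# the result has columns ts, dur and device, with one row per slice across
# all three traces.
flattened = pd.concat(
    [df.assign(**metadata) for df, metadata in per_trace_results],
    ignore_index=True)
print(flattened)
```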
## Integration points
Batch trace processor needs to be open source yet allow deep integration with
Google-internal tooling. Because of this, there are various integration points
built into the design to allow closed-source components to be slotted in place
of the default, open source ones.

The first point is the formalization of the idea of "platform" code. Ever since
the beginning of the Python API, there has been a need for code internally to
run slightly differently from the open source code. For example, the
Google-internal Python distribution does not use pip, instead packaging
dependencies into a single binary. The notion of a "platform" loosely existed
to abstract these sorts of differences, but this was very ad-hoc. As part of
the batch trace processor implementation, this has been retroactively
formalized.

Resolvers are another big point of pluggability. By allowing registration of a
"protocol" for each internal trace source (e.g. lab, testing population), we
allow for trace loading to be neatly abstracted.

Finally, for batch trace processor specifically, we abstract the creation of
thread pools for loading traces and running queries. The parallelism and memory
available to programs running internally often does not correspond 1:1 with the
CPUs/memory available on the system: internal APIs need to be consulted to find
out this information.

## Future plans
One common problem when running batch trace processor is that we are
constrained by a single machine and so can only load O(1000) traces. For rare
problems, there might only be a handful of traces matching a given pattern even
in such a large sample.

A way around this would be to build a "no trace limit" mode. The idea here is
that you would develop queries as usual with batch trace processor operating on
O(1000) traces with O(s) performance. Once the queries are relatively
finalized, we could then "switch" the mode of batch trace processor to operate
closer to a "MapReduce"-style pipeline which operates over O(10000)+ traces,
loading O(n cpus) traces at any one time.

This allows us to retain the quick iteration speed while developing queries
while also allowing for large scale analysis without needing to move code to a
pipeline model. However, this approach does not really resolve the root cause
of the problem, which is that we are restricted to a single machine.

The "ideal" solution here is, as mentioned above, to shard batch trace
processor across >1 machine. When querying traces, each trace is entirely
independent of any other, so parallelising across multiple machines yields very
close to perfect gains in performance at little cost.

This would, however, be quite a complex undertaking. We would need to design
the API in such a way that it allows for pluggable integration with various
compute platforms (e.g. GCP, Google-internal, your custom infra). Even
restricting to just Google infra and leaving others open for contribution,
internal infra's ideal workload does not match the approach of "have a bunch of
machines tied to one user waiting for their input". There would need to be
significant research and design work before going here, but it would likely be
worthwhile.