# Batch Trace Processor
This document describes the overall design of Batch Trace Processor and
aids in integrating it into other systems.

![BTP Overview](/docs/images/perfetto-btp-overview.svg)

## Motivation
The Perfetto trace processor is the de-facto way to perform analysis on a
single trace. Using the
[trace processor Python API](/docs/analysis/trace-processor#python-api),
traces can be queried interactively, plots can be made from the results, etc.

While queries on a single trace are useful when debugging a specific problem
in that trace or in the very early stages of understanding a domain, they soon
become limiting. One trace is unlikely to be representative
of the entire population, and it's easy to overfit queries, i.e. spend a
lot of effort on breaking down a problem in that trace while neglecting
other, more common issues in the population.

Because of this, what we actually want is to be able to query many traces
(usually on the order of 250-10000+) and identify the patterns which show
up in a significant fraction of them. This ensures that time is being spent
on issues which are affecting user experience instead of just a random
problem which happened to show up in one trace.

One low-effort option for solving this problem is simply to ask people to use
utilities like [Executors](https://docs.python.org/3/library/concurrent.futures.html#executor-objects)
with the Python API to load multiple traces and query them in parallel.
Unfortunately, there are several downsides to this approach:
* Every user has to reinvent the wheel every time they want to query multiple
  traces. Over time, there would likely be a proliferation of slightly modified
  copies of the same code.
* While parallelising queries on multiple traces on a single machine is
  straightforward, one day we may want to shard trace processing
  across multiple machines. Once this happens, the complexity of the code would
  rise significantly, to the point where a central implementation becomes a
  necessity. Because of this, it's better to have the API first, before engineers
  start building their own custom solutions.
* A big aim for the Perfetto team these days is to make trace analysis more
  accessible, to reduce the number of places where we need to be in the loop.
  Having a well-supported API for an important use case like bulk trace analysis
  directly helps with this.

While we've discussed querying traces so far, the experience of loading traces
from different sources should be just as good. This has historically been a big
reason why the Python API has not gained as much adoption as we would have
liked.

Especially internally in Google, we should not be relying on engineers
knowing where traces live on the network filesystem and the directory layout.
Instead, they should simply be able to specify the data source (i.e.
lab, testing population) and some parameters (e.g. build id, date, kernel
version) that traces should match; traces meeting these criteria
should then be found and loaded.

Putting all this together, we want to build a library which can:
* Interactively query ~1000+ traces in O(s) (for simple queries)
* Expose the full SQL expressiveness of trace processor
* Load traces from many sources with minimal ceremony. This should include
  Google-internal sources: e.g. lab runs and internal testing populations
* Integrate with data analysis libraries for easy charting and visualization

## Design Highlights
In this section, we briefly discuss some of the most impactful design decisions
taken when building batch trace processor and the reasons behind them.

### Language
The choice of language is pretty straightforward. Python is already the go-to
language for data analysis in a wide variety of domains, and our problem
is not unique enough to warrant a different decision. Moreover, another
point in its favour is the existence of the Python API for trace processor. This
further eases the implementation as we do not have to start from scratch.

The main downside of choosing Python is performance, but given that all
the data crunching happens in C++ inside trace processor, this is not a big
factor.

### Trace URIs and Resolvers
[Trace URIs](/docs/analysis/batch-trace-processor#trace-uris)
are an elegant solution to the problem of loading traces from a diverse range
of public and internal sources. As with web URIs, the idea with trace URIs is
to describe both the protocol (i.e. the source) from which traces should be
fetched and the arguments (i.e. query parameters) which the traces should match.

Batch trace processor should integrate tightly with trace URIs and their
resolvers. Users should be able to pass either just the URI (which is really
just a string, for maximum flexibility) or a resolver object which can yield a
list of trace file paths.

To handle URI strings, there should be some mechanism for "registering"
resolvers to make them eligible to resolve a certain "protocol". By default, we
should provide a resolver to handle the filesystem. We should ensure that the
resolver design is such that resolvers can be closed source while the rest of
batch trace processor is open.
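
A minimal sketch of what such a registration mechanism might look like; all
names here (`ResolverRegistry`, the `file` and `lab` protocols) are
illustrative, not the real Perfetto API:

```python
import glob
from typing import Callable, Dict, List


class ResolverRegistry:
    """Maps a URI "protocol" prefix to a function resolving trace paths."""

    def __init__(self) -> None:
        self._resolvers: Dict[str, Callable[[str], List[str]]] = {}

    def register(self, protocol: str,
                 resolver: Callable[[str], List[str]]) -> None:
        self._resolvers[protocol] = resolver

    def resolve(self, uri: str) -> List[str]:
        # Split 'file:/traces/*.pftrace' into 'file' and '/traces/*.pftrace'.
        protocol, _, args = uri.partition(':')
        if protocol not in self._resolvers:
            raise ValueError(f'No resolver registered for {protocol!r}')
        return self._resolvers[protocol](args)


registry = ResolverRegistry()
# Default, open-source resolver: glob over the local filesystem.
registry.register('file', lambda pattern: sorted(glob.glob(pattern)))
# A closed-source resolver can be slotted in without touching the core.
registry.register('lab', lambda run_id: [f'/net/lab/{run_id}/trace.pftrace'])
```

A call like `registry.resolve('lab:1234')` then hides the internal storage
layout from the user entirely.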

Along with the job of yielding a list of traces, resolvers should also be
responsible for creating metadata for each trace. These are different pieces
of information about the trace that the user might be interested in, e.g. OS
version, device name, collection date, etc. The metadata can then be used when
"flattening" results across many traces, as discussed below.
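
For example, a resolver might yield path/metadata pairs along these lines (the
`LabResolver` name and the metadata keys are made up for illustration):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TraceReference:
    """A trace path plus the metadata the resolver knows about it."""
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)


class LabResolver:
    """Hypothetical resolver for traces collected in a lab run."""

    def __init__(self, build_id: str):
        self.build_id = build_id

    def resolve(self) -> List[TraceReference]:
        # A real implementation would query the lab's storage backend;
        # here we fabricate two entries to show the shape of the result.
        return [
            TraceReference(
                path=f'/lab/{self.build_id}/trace-{i}.pftrace',
                metadata={'build_id': self.build_id,
                          'device_name': f'device-{i}'},
            )
            for i in range(2)
        ]


refs = LabResolver(build_id='ab12').resolve()
```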

### Persisting loaded traces
Optimizing the loading of traces is critical for the O(s) query performance
we want out of batch trace processor. Traces are often accessed
over the network, meaning fetching their contents has a high latency.
Traces also take at least a few seconds to parse, eating into the O(s)
budget before we even account for the running time of queries.

To address this issue, we took the decision to keep all traces fully loaded in
memory in trace processor instances. That way, instead of loading them on every
query/set of queries, we can issue queries directly.
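
The core of the idea can be sketched as follows; `FakeTraceProcessor` is a
stand-in for a real (C++-backed) trace processor instance, not the actual API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class FakeTraceProcessor:
    """Stand-in for a loaded trace processor instance."""

    def __init__(self, trace_path: str):
        # The expensive fetch + parse happens exactly once, at load time.
        self.trace_path = trace_path

    def query(self, sql: str) -> List[Dict[str, str]]:
        # A real instance would execute the SQL; return a placeholder row.
        return [{'trace': self.trace_path, 'sql': sql}]


# Load every trace up front and keep the instances resident in memory...
paths = ['a.pftrace', 'b.pftrace']
instances = [FakeTraceProcessor(p) for p in paths]


# ...so each subsequent query only pays query-execution cost, in parallel.
def query_all(sql: str):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda tp: tp.query(sql), instances))


results = query_all('SELECT ts, dur FROM slice')
```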

For the moment, we restrict the loading and querying of traces to a
single machine. While querying n traces is "embarrassingly parallel" and shards
perfectly across multiple machines, introducing distributed systems to any
solution simply makes everything more complicated. The move to multiple
machines is explored further in the "Future plans" section.

### Flattening query results
The naive way to return the result of querying n traces is a list
of n elements, with each element being the result for a single trace. However,
after performing several case-study performance investigations using BTP, it
became obvious that this obvious answer was not the most convenient for the
end user.

Instead, a pattern which proved very useful was to "flatten" the results into
a single table containing the results from all the traces. However,
simply flattening causes us to lose the information about which trace a row
originated from. We can deal with this by allowing resolvers to silently add
columns with the metadata for each trace.

So suppose we query three traces with:

```sql
SELECT ts, dur FROM slice
```

Then the flattening operation might do something like this behind the scenes:
![BTP Flattening](/docs/images/perfetto-btp-flattening.svg)
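
The same step can be sketched with plain Python rows (the real implementation
returns richer table objects; the metadata keys below are invented):

```python
from typing import Dict, List, Tuple

# Per-trace results for "SELECT ts, dur FROM slice", keyed by the
# (device_name, os_version) metadata the resolver attached to each trace.
per_trace_results: Dict[Tuple[str, str], List[Dict[str, int]]] = {
    ('device-a', 'os-1'): [{'ts': 1, 'dur': 10}],
    ('device-b', 'os-2'): [{'ts': 5, 'dur': 2}, {'ts': 9, 'dur': 1}],
}


def flatten(results):
    """Merge per-trace rows into one table, tagging each row with the
    metadata of the trace it came from."""
    flat = []
    for (device_name, os_version), rows in results.items():
        for row in rows:
            flat.append({**row,
                         'device_name': device_name,
                         'os_version': os_version})
    return flat


flattened = flatten(per_trace_results)
```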

## Integration points
Batch trace processor needs to be open source yet allow deep integration
with Google-internal tooling. Because of this, various integration
points are built into the design to allow closed components to be slotted in
place of the default, open source ones.

The first point is the formalization of the idea of "platform" code. Ever since
the beginning of the Python API, there was always a need for code internally to
run slightly differently to open source code. For example, the Google-internal
Python distribution does not use Pip, instead packaging dependencies into a
single binary. The notion of a "platform" loosely existed to abstract these
sorts of differences, but this was very ad-hoc. As part of the batch trace
processor implementation, this has been retroactively formalized.

Resolvers are another big point of pluggability. By allowing registration of
a "protocol" for each internal trace source (e.g. lab, testing population), we
allow for trace loading to be neatly abstracted.

Finally, for batch trace processor specifically, we abstract the creation of
thread pools for loading traces and running queries. The parallelism and memory
available to programs internally often do not correspond 1:1 with the
available CPUs/memory on the system: internal APIs need to be accessed to find
out this information.
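
A sketch of this abstraction point, with hypothetical class names: the
open-source default falls back to `os.cpu_count()`, while an internal platform
would consult its own scheduling APIs instead:

```python
import os
from concurrent.futures import ThreadPoolExecutor


class Platform:
    """Open-source default: trust the visible CPU count."""

    def create_query_pool(self) -> ThreadPoolExecutor:
        return ThreadPoolExecutor(max_workers=os.cpu_count() or 1)


class InternalPlatform(Platform):
    """Closed-source override: parallelism comes from internal APIs,
    not from the CPUs visible to the process."""

    def create_query_pool(self) -> ThreadPoolExecutor:
        return ThreadPoolExecutor(max_workers=self._allowed_parallelism())

    def _allowed_parallelism(self) -> int:
        return 4  # placeholder for a call into internal scheduling APIs
```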

## Future plans
One common problem when running batch trace processor is that we are
constrained by a single machine and so can only load O(1000) traces.
For rare problems, there might only be a handful of traces matching a given
pattern even in such a large sample.

A way around this would be to build a "no trace limit" mode. The idea here
is that you would develop queries as usual with batch trace processor
operating on O(1000) traces with O(s) performance. Once the queries are
relatively finalized, we could then "switch" the mode of batch trace processor
to operate closer to a "MapReduce"-style pipeline which operates over O(10000)+
traces, loading O(n cpus) traces at any one time.
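
The chunked execution loop at the heart of such a mode might look like this
(names invented; a real implementation would load and unload actual trace
processor instances per chunk):

```python
import os
from typing import Callable, List


def query_in_chunks(trace_paths: List[str],
                    run_query: Callable[[List[str]], List[dict]],
                    chunk_size: int = 0) -> List[dict]:
    """Run `run_query` over `trace_paths`, keeping only `chunk_size`
    traces loaded at any one time."""
    chunk_size = chunk_size or os.cpu_count() or 4
    all_rows = []
    for i in range(0, len(trace_paths), chunk_size):
        chunk = trace_paths[i:i + chunk_size]
        # Load, query, then discard this chunk before moving on.
        all_rows.extend(run_query(chunk))
    return all_rows


rows = query_in_chunks(
    [f'trace-{i}.pftrace' for i in range(10)],
    # Placeholder for "load these traces and run the finalized queries".
    run_query=lambda chunk: [{'trace': p} for p in chunk],
    chunk_size=4,
)
```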

This allows us to retain quick iteration speed while developing queries,
while also allowing for large-scale analysis without needing to move code
to a pipeline model. However, this approach does not really resolve the root
cause of the problem, which is that we are restricted to a single machine.

The "ideal" solution here is, as mentioned above, to shard batch trace processor
across >1 machine. When querying traces, each trace is entirely independent of
any other, so parallelising across multiple machines yields very close to
perfect gains in performance at little cost.

This would, however, be quite a complex undertaking. We would need to design
the API in such a way that it allows for pluggable integration with various
compute platforms (e.g. GCP, Google-internal, your custom infra). Even
restricting to just Google infra and leaving others open for contribution,
internal infra's ideal workload does not match the approach of "have a bunch of
machines tied to one user waiting for their input". There would need to be
significant research and design work before going down this path, but it would
likely be worthwhile.
194