xref: /aosp_15_r20/external/perfetto/docs/data-sources/cpu-scheduling.md (revision 6dbdd20afdafa5e3ca9b8809fa73465d530080dc)
1# CPU Scheduling events
2
3On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel
4[ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
5infrastructure.
6
7This allows to get fine grained scheduling events such as:
8
9* Which threads were scheduled on which CPU core at any point in time, with
10  nanosecond accuracy.
11* The reason why a running thread got descheduled (e.g. pre-emption, blocked on
12  a mutex, blocking syscall or any other wait queue).
13* The point in time when a thread became eligible to be executed, even if it was
14  not put immediately on any CPU run queue, together with the source thread that
15  made it executable.
16
17## UI
18
19The UI represents individual scheduling events as slices:
20
21![](/docs/images/cpu-zoomed.png "Detailed view of CPU run queues")
22
23Clicking on a CPU slice shows the relevant information in the details panel:
24
25![](/docs/images/cpu-sched-details.png "CPU scheduling details")
26
27Scrolling down, when expanding individual processes, the scheduling events also
28create one track for each thread, which allows to follow the evolution of the
29state of individual threads:
30
31![](/docs/images/thread-states.png "States of individual threads")
32
33## SQL
34
35At the SQL level, the scheduling data is exposed in the
36[`sched_slice`](/docs/analysis/sql-tables.autogen#sched_slice) table.
37
38```sql
39select ts, dur, cpu, end_state, priority, process.name, thread.name
40from sched_slice left join thread using(utid) left join process using(upid)
41```
42
43ts | dur | cpu | end_state | priority | process.name, | thread.name
44---|-----|-----|-----------|----------|---------------|------------
45261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd
46261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0
47261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2
48261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1
49261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
50261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2
51261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0
52261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
53261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0
54261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0
55
56## TraceConfig
57
58To collect this data, include the following data sources:
59
60```protobuf
61# Scheduling data from the kernel.
62data_sources: {
63  config {
64    name: "linux.ftrace"
65    ftrace_config {
66      compact_sched: {
67        enabled: true
68      }
69      ftrace_events: "sched/sched_switch"
70      # optional: precise thread lifetime tracking:
71      ftrace_events: "sched/sched_process_exit"
72      ftrace_events: "sched/sched_process_free"
73      ftrace_events: "task/task_newtask"
74      ftrace_events: "task/task_rename"
75    }
76  }
77}
78
79# Adds full process names and thread<>process relationships:
80data_sources: {
81  config {
82    name: "linux.process_stats"
83  }
84}
85```
86
87## Scheduling wakeups and latency analysis
88
89By further enabling the following in the TraceConfig, the ftrace data source
90will record also scheduling wake up events:
91
92```protobuf
93  ftrace_events: "sched/sched_wakeup_new"
94  ftrace_events: "sched/sched_waking"
95```
96
97While `sched_switch` events are emitted only when a thread is in the
98`R(unnable)` state AND is running on a CPU run queue, `sched_waking` events are
99emitted when any event causes a thread state to change.
100
101Consider the following example:
102
103```
104Thread A
105condition_variable.wait()
106                                     Thread B
107                                     condition_variable.notify()
108```
109
110When Thread A suspends on the wait() it will enter the state `S(sleeping)` and
111get removed from the CPU run queue. When Thread B notifies the variable, the
112kernel will transition Thread A into the `R(unnable)` state. Thread A at that
113point is eligible to be put back on a run queue. However this might not happen
114for some time because, for instance:
115
116* All CPUs might be busy running some other thread, and Thread A needs to wait
117  to get a run queue slot assigned (or the other threads have higher priority).
118* Some other CPUs other than the current one, but the scheduler load balancer
119  might take some time to move the thread on another CPU.
120
121Unless using real-time thread priorities, most Linux Kernel scheduler
122configurations are not strictly work-conserving. For instance the scheduler
123might prefer to wait some time in the hope that the thread running on the
124current CPU goes to idle, avoiding a cross-cpu migration which might be more
125costly both in terms of overhead and power.
126
127NOTE: `sched_waking` and `sched_wakeup` provide nearly the same information. The
128      difference lies in wakeup events across CPUs, which involve
129      inter-processor interrupts. The former is always emitted on the source (wakee)
130      CPU, the latter may be executed on either the source or the destination (waked) CPU
131      depending on several factors. `sched_waking` is usually sufficient for latency
132      analysis, unless you are looking into breaking down latency due to
133      the scheduler's wake up path, such as inter-processor signaling.
134
135When enabling `sched_waking` events, the following will appear in the UI when
136selecting a CPU slice:
137
138![](/docs/images/latency.png "Scheduling wake-up events in the UI")
139
140### Decoding `end_state`
141
142The [sched_slice](/docs/analysis/sql-tables.autogen#sched_slice) table contains
143information on scheduling activity of the system:
144
145```
146> select * from sched_slice limit 1
147id  type        ts          dur    cpu utid end_state priority
1480   sched_slice 70730062200 125364 0   1    S         130
149```
150
151Each row of the table shows when a given thread (`utid`) began running
152(`ts`), on which core it ran (`cpu`), for how long it ran (`dur`),
153and why it stopped running: `end_state`.
154
155`end_state` is encoded as one or more ascii characters. The UI uses
156the following translations to convert `end_state` into human readable
157text:
158
159| end_state  | Translation            |
160|------------|------------------------|
161| R          | Runnable               |
162| R+         | Runnable (Preempted)   |
163| S          | Sleeping               |
164| D          | Uninterruptible Sleep  |
165| T          | Stopped                |
166| t          | Traced                 |
167| X          | Exit (Dead)            |
168| Z          | Exit (Zombie)          |
169| x          | Task Dead              |
170| I          | Idle                   |
171| K          | Wake Kill              |
172| W          | Waking                 |
173| P          | Parked                 |
174| N          | No Load                |
175
176Not all combinations of characters are meaningful.
177
178If we do not know when the scheduling ended (for example because the
179trace ended while the thread was still running) `end_state` will be
180`NULL` and `dur` will be -1.
181
182