1# CPU Scheduling events 2 3On Android and Linux Perfetto can gather scheduler traces via the Linux Kernel 4[ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt) 5infrastructure. 6 7This allows to get fine grained scheduling events such as: 8 9* Which threads were scheduled on which CPU core at any point in time, with 10 nanosecond accuracy. 11* The reason why a running thread got descheduled (e.g. pre-emption, blocked on 12 a mutex, blocking syscall or any other wait queue). 13* The point in time when a thread became eligible to be executed, even if it was 14 not put immediately on any CPU run queue, together with the source thread that 15 made it executable. 16 17## UI 18 19The UI represents individual scheduling events as slices: 20 21 22 23Clicking on a CPU slice shows the relevant information in the details panel: 24 25 26 27Scrolling down, when expanding individual processes, the scheduling events also 28create one track for each thread, which allows to follow the evolution of the 29state of individual threads: 30 31 32 33## SQL 34 35At the SQL level, the scheduling data is exposed in the 36[`sched_slice`](/docs/analysis/sql-tables.autogen#sched_slice) table. 37 38```sql 39select ts, dur, cpu, end_state, priority, process.name, thread.name 40from sched_slice left join thread using(utid) left join process using(upid) 41``` 42 43ts | dur | cpu | end_state | priority | process.name, | thread.name 44---|-----|-----|-----------|----------|---------------|------------ 45261187012170995 | 247188 | 2 | S | 130 | /system/bin/logd | logd.klogd 46261187012418183 | 12812 | 2 | D | 120 | /system/bin/traced_probes | traced_probes0 47261187012421099 | 220000 | 4 | D | 120 | kthreadd | kworker/u16:2 48261187012430995 | 72396 | 2 | D | 120 | /system/bin/traced_probes | traced_probes1 49261187012454537 | 13958 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 50261187012460318 | 46354 | 3 | S | 120 | /system/bin/traced_probes | traced_probes2 51261187012468495 | 10625 | 0 | R | 120 | [NULL] | swapper/0 52261187012479120 | 6459 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 53261187012485579 | 7760 | 0 | R | 120 | [NULL] | swapper/0 54261187012493339 | 34896 | 0 | D | 120 | /system/bin/traced_probes | traced_probes0 55 56## TraceConfig 57 58To collect this data, include the following data sources: 59 60```protobuf 61# Scheduling data from the kernel. 62data_sources: { 63 config { 64 name: "linux.ftrace" 65 ftrace_config { 66 compact_sched: { 67 enabled: true 68 } 69 ftrace_events: "sched/sched_switch" 70 # optional: precise thread lifetime tracking: 71 ftrace_events: "sched/sched_process_exit" 72 ftrace_events: "sched/sched_process_free" 73 ftrace_events: "task/task_newtask" 74 ftrace_events: "task/task_rename" 75 } 76 } 77} 78 79# Adds full process names and thread<>process relationships: 80data_sources: { 81 config { 82 name: "linux.process_stats" 83 } 84} 85``` 86 87## Scheduling wakeups and latency analysis 88 89By further enabling the following in the TraceConfig, the ftrace data source 90will record also scheduling wake up events: 91 92```protobuf 93 ftrace_events: "sched/sched_wakeup_new" 94 ftrace_events: "sched/sched_waking" 95``` 96 97While `sched_switch` events are emitted only when a thread is in the 98`R(unnable)` state AND is running on a CPU run queue, `sched_waking` events are 99emitted when any event causes a thread state to change. 100 101Consider the following example: 102 103``` 104Thread A 105condition_variable.wait() 106 Thread B 107 condition_variable.notify() 108``` 109 110When Thread A suspends on the wait() it will enter the state `S(sleeping)` and 111get removed from the CPU run queue. When Thread B notifies the variable, the 112kernel will transition Thread A into the `R(unnable)` state. Thread A at that 113point is eligible to be put back on a run queue. However this might not happen 114for some time because, for instance: 115 116* All CPUs might be busy running some other thread, and Thread A needs to wait 117 to get a run queue slot assigned (or the other threads have higher priority). 118* Some other CPUs other than the current one, but the scheduler load balancer 119 might take some time to move the thread on another CPU. 120 121Unless using real-time thread priorities, most Linux Kernel scheduler 122configurations are not strictly work-conserving. For instance the scheduler 123might prefer to wait some time in the hope that the thread running on the 124current CPU goes to idle, avoiding a cross-cpu migration which might be more 125costly both in terms of overhead and power. 126 127NOTE: `sched_waking` and `sched_wakeup` provide nearly the same information. The 128 difference lies in wakeup events across CPUs, which involve 129 inter-processor interrupts. The former is always emitted on the source (wakee) 130 CPU, the latter may be executed on either the source or the destination (waked) CPU 131 depending on several factors. `sched_waking` is usually sufficient for latency 132 analysis, unless you are looking into breaking down latency due to 133 the scheduler's wake up path, such as inter-processor signaling. 134 135When enabling `sched_waking` events, the following will appear in the UI when 136selecting a CPU slice: 137 138 139 140### Decoding `end_state` 141 142The [sched_slice](/docs/analysis/sql-tables.autogen#sched_slice) table contains 143information on scheduling activity of the system: 144 145``` 146> select * from sched_slice limit 1 147id type ts dur cpu utid end_state priority 1480 sched_slice 70730062200 125364 0 1 S 130 149``` 150 151Each row of the table shows when a given thread (`utid`) began running 152(`ts`), on which core it ran (`cpu`), for how long it ran (`dur`), 153and why it stopped running: `end_state`. 154 155`end_state` is encoded as one or more ascii characters. The UI uses 156the following translations to convert `end_state` into human readable 157text: 158 159| end_state | Translation | 160|------------|------------------------| 161| R | Runnable | 162| R+ | Runnable (Preempted) | 163| S | Sleeping | 164| D | Uninterruptible Sleep | 165| T | Stopped | 166| t | Traced | 167| X | Exit (Dead) | 168| Z | Exit (Zombie) | 169| x | Task Dead | 170| I | Idle | 171| K | Wake Kill | 172| W | Waking | 173| P | Parked | 174| N | No Load | 175 176Not all combinations of characters are meaningful. 177 178If we do not know when the scheduling ended (for example because the 179trace ended while the thread was still running) `end_state` will be 180`NULL` and `dur` will be -1. 181 182