1SQLHIST(1) 2========== 3 4NAME 5---- 6sqlhist - Tool that uses SQL language to create / show creation of tracefs histograms and synthetic events. 7 8SYNOPSIS 9-------- 10*sqlhist* ['OPTIONS'] ['SQL-select-command'] 11 12DESCRIPTION 13----------- 14The sqlhist(1) will take an SQL like statement to create tracefs histograms and 15synthetic events that can perform various actions for various handling of the 16data. 17 18The tracefs file system interfaces with the Linux tracing infrastructure that 19has various dynamic and static events through out the kernel. Each of these 20events can have a "histogram" attached to it, where the fields of the event 21will define the buckets of the histogram. 22 23A synthetic event is a way to attach two separate events and use the fields 24and time stamps of those events to create a new dynamic event. This new 25dynamic event is call a synthetic event. The fields of each event can have 26simple calculations done on them where, for example, the delta between 27a field of one event to a field of the other event can be taken. This also 28works for the time stamps of the events where the time delta between the 29two events can also be extracted and placed into the synthetic event. 30 31Other actions can be done from the fields of the events. A snapshot can 32be taken of the kernel ring buffer a variable used in the synthetic 33event creating hits a max, or simply changes. 34 35The commands to create histograms and synthetic events are complex and 36not easy to remember. *sqlhist* is used to convert SQL syntax into the 37commands needed to create the histogram or synthetic event. 38 39The *SQL-select-command* is a SQL string defined by *tracefs_sql*(3). 40 41Note, this must be run as root (or sudo) as interacting with the tracefs 42directory requires root privilege, unless the *-t* option is given with 43a copy of the _tracefs_ directory and its events. 44 45The *sqlhist* is a simple program where its code actual exists in the 46*tracefs_sql*(3) man page. 47 48OPTIONS 49------- 50*-n* 'name':: 51 The name of the synthetic event to create. This event can then be 52 used like any other event, and enabled via *trace-cmd*(1). 53 54*-t* 'tracefs-dir':: 55 In order to test this out as non root user, a copy of the tracefs directory 56 can be used, and passing that directory with this option will allow 57 the program to work. Obviously, *-e* will not work as non-root because 58 it will not be able to execute. 59 60 # mkdir /tmp/tracing 61 # cp -r /sys/kernel/tracing/events /tmp/tracing 62 # exit 63 $ ./sqlhist -t /tmp/tracing ... 64 65*-e*:: 66 Not only display the commands to create the histogram, but also execute them. 67 This requires root privilege. 68 69*-f* 'file':: 70 Instead of reading the SQL commands from the command line, read them from 71 _file_. If _file_ is '-' then read from standard input. 72 73*-m* 'var':: 74 Do the given action when the variable _var_ hits a new maximum. This can 75 not be used with *-c*. 76 77*-c* 'var':: 78 Do the given action when the variable _var_ changes its value. This can 79 not be used with *-m*. 80 81*-s*:: 82 Perform a snapshot instead of calling the synthetic event. 83 84*-T*:: 85 Perform both a snapshot and trace the synthetic event. 86 87*-S* 'fields[,fields]':: 88 Save the given fields. The fields must be fields of the "end" event given 89 in the *SQL-select-command* 90 91*-B* 'instance':: 92 For simple statements that only produce a histogram, the instance given here 93 will be where the histogram will be created. This is ignored for full synthetic 94 event creation, as sythetic events have a global affect on all tracing instances, 95 where as, histograms only affect a single instance. 96 97EXAMPLES 98-------- 99 100Create the sqlhist executable: 101 102[source, c] 103-- 104 man tracefs_sql | sed -ne '/^EXAMPLE/,/FILES/ { /EXAMPLE/d ; /FILES/d ; p}' > sqlhist.c 105 gcc -o sqlhist sqlhist.c `pkg-config --cflags --libs libtracefs` 106-- 107 108As described above, for testing purposes, make a copy of the event directory: 109[source, c] 110-- 111 $ mkdir /tmp/tracing 112 $ sudo cp -r /sys/kernel/tracing/events /tmp/tracing/ 113 $ sudo chmod -R 0644 /tmp/tracing/ 114-- 115 116For an example of simple histogram output using the copy of the tracefs directory. 117[source, c] 118-- 119 $ ./sqlhist -t /tmp/tracing/ 'SELECT CAST(call_site as SYM-OFFSET), bytes_req, CAST(bytes_alloc AS _COUNTER_) FROM kmalloc' 120-- 121 122Produces the output: 123[source, c] 124-- 125 echo 'hist:keys=call_site.sym-offset,bytes_req:vals=bytes_alloc' > /sys/kernel/tracing/events/kmem/kmalloc/trigger 126-- 127 128Which could be used by root: 129[source, c] 130-- 131 # echo 'hist:keys=call_site.sym-offset,bytes_req:vals=bytes_alloc' > /sys/kernel/tracing/events/kmem/kmalloc/trigger 132 # cat /sys/kernel/tracing/events/kmem/kmalloc/hist 133# event histogram 134# 135# trigger info: hist:keys=call_site.sym-offset,bytes_req:vals=hitcount,bytes_alloc:sort=hitcount:size=2048 [active] 136# 137 138{ call_site: [ffffffff813f8d8a] load_elf_phdrs+0x4a/0xb0 , bytes_req: 728 } hitcount: 1 bytes_alloc: 1024 139{ call_site: [ffffffffc0c69e74] nf_ct_ext_add+0xd4/0x1d0 [nf_conntrack] , bytes_req: 128 } hitcount: 1 bytes_alloc: 128 140{ call_site: [ffffffff818355e6] dma_resv_get_fences+0xf6/0x440 , bytes_req: 8 } hitcount: 1 bytes_alloc: 8 141{ call_site: [ffffffffc06dc73f] intel_gt_get_buffer_pool+0x15f/0x290 [i915] , bytes_req: 424 } hitcount: 1 bytes_alloc: 512 142{ call_site: [ffffffff813f8d8a] load_elf_phdrs+0x4a/0xb0 , bytes_req: 616 } hitcount: 1 bytes_alloc: 1024 143{ call_site: [ffffffff8161a44c] __sg_alloc_table+0x11c/0x180 , bytes_req: 32 } hitcount: 1 bytes_alloc: 32 144{ call_site: [ffffffffc070749d] shmem_get_pages+0xad/0x5d0 [i915] , bytes_req: 16 } hitcount: 1 bytes_alloc: 16 145{ call_site: [ffffffffc07507f5] intel_framebuffer_create+0x25/0x60 [i915] , bytes_req: 408 } hitcount: 1 bytes_alloc: 512 146{ call_site: [ffffffffc06fc20f] eb_parse+0x34f/0x910 [i915] , bytes_req: 408 } hitcount: 1 bytes_alloc: 512 147{ call_site: [ffffffffc0700ebd] i915_gem_object_get_pages_internal+0x5d/0x270 [i915] , bytes_req: 16 } hitcount: 1 bytes_alloc: 16 148{ call_site: [ffffffffc0771188] intel_frontbuffer_get+0x38/0x220 [i915] , bytes_req: 400 } hitcount: 1 bytes_alloc: 512 149{ call_site: [ffffffff8161a44c] __sg_alloc_table+0x11c/0x180 , bytes_req: 128 } hitcount: 1 bytes_alloc: 128 150{ call_site: [ffffffff813f8f45] load_elf_binary+0x155/0x1680 , bytes_req: 28 } hitcount: 1 bytes_alloc: 32 151{ call_site: [ffffffffc07038c8] __assign_mmap_offset+0x208/0x3d0 [i915] , bytes_req: 288 } hitcount: 1 bytes_alloc: 512 152{ call_site: [ffffffff813737b2] alloc_bprm+0x32/0x2f0 , bytes_req: 416 } hitcount: 1 bytes_alloc: 512 153{ call_site: [ffffffff813f9027] load_elf_binary+0x237/0x1680 , bytes_req: 64 } hitcount: 1 bytes_alloc: 64 154{ call_site: [ffffffff8161a44c] __sg_alloc_table+0x11c/0x180 , bytes_req: 64 } hitcount: 1 bytes_alloc: 64 155{ call_site: [ffffffffc040ffe7] drm_vma_node_allow+0x27/0xe0 [drm] , bytes_req: 40 } hitcount: 2 bytes_alloc: 128 156{ call_site: [ffffffff813cda98] __do_sys_timerfd_create+0x58/0x1c0 , bytes_req: 336 } hitcount: 2 bytes_alloc: 1024 157{ call_site: [ffffffff818355e6] dma_resv_get_fences+0xf6/0x440 , bytes_req: 40 } hitcount: 2 bytes_alloc: 128 158{ call_site: [ffffffff8139b75a] single_open+0x2a/0xa0 , bytes_req: 32 } hitcount: 2 bytes_alloc: 64 159{ call_site: [ffffffff815df715] bio_kmalloc+0x25/0x80 , bytes_req: 136 } hitcount: 2 bytes_alloc: 384 160{ call_site: [ffffffffc071e5cd] i915_vma_work+0x1d/0x50 [i915] , bytes_req: 416 } hitcount: 3 bytes_alloc: 1536 161{ call_site: [ffffffff81390d0d] alloc_fdtable+0x4d/0x100 , bytes_req: 56 } hitcount: 3 bytes_alloc: 192 162{ call_site: [ffffffffc06ff65f] i915_gem_do_execbuffer+0x158f/0x2440 [i915] , bytes_req: 16 } hitcount: 4 bytes_alloc: 64 163{ call_site: [ffffffff8137713c] alloc_pipe_info+0x5c/0x230 , bytes_req: 384 } hitcount: 5 bytes_alloc: 2560 164{ call_site: [ffffffff813771b4] alloc_pipe_info+0xd4/0x230 , bytes_req: 640 } hitcount: 5 bytes_alloc: 5120 165{ call_site: [ffffffff81834cdb] dma_resv_list_alloc+0x1b/0x40 , bytes_req: 40 } hitcount: 6 bytes_alloc: 384 166{ call_site: [ffffffff81834cdb] dma_resv_list_alloc+0x1b/0x40 , bytes_req: 56 } hitcount: 9 bytes_alloc: 576 167{ call_site: [ffffffff8120086e] tracing_map_sort_entries+0x9e/0x3e0 , bytes_req: 24 } hitcount: 60 bytes_alloc: 1920 168 169Totals: 170 Hits: 122 171 Entries: 30 172 Dropped: 0 173-- 174 175Note, although the examples use uppercase for the SQL keywords, they do not have 176to be. 'SELECT' could also be 'select' or even 'sElEcT'. 177 178By using the full SQL language, synthetic events can be made and processed. 179For example, using *sqlhist* along with *trace-cmd*(1), wake up latency can 180be recorded by creating a synthetic event by attaching the _sched_waking_ 181and the _sched_switch_ events. 182 183[source, c] 184-- 185 # sqlhist -n wakeup_lat -e -T -m lat 'SELECT end.next_comm AS comm, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS lat FROM ' \ 186 'sched_waking AS start JOIN sched_switch AS end ON start.pid = end.next_pid WHERE end.next_prio < 100 && end.next_comm == "cyclictest"' 187 # trace-cmd start -e all -e wakeup_lat -R stacktrace 188 # cyclictest -l 1000 -p80 -i250 -a -t -q -m -d 0 -b 1000 --tracemark 189 # trace-cmd show -s | tail -30 190 <idle>-0 [002] dNh4 23454.902246: sched_wakeup: comm=cyclictest pid=12272 prio=120 target_cpu=002 191 <idle>-0 [005] ...1 23454.902246: cpu_idle: state=4294967295 cpu_id=5 192 <idle>-0 [007] d..1 23454.902246: cpu_idle: state=0 cpu_id=7 193 <idle>-0 [002] dNh1 23454.902247: hrtimer_expire_exit: hrtimer=0000000037956dc2 194 <idle>-0 [005] d..1 23454.902248: cpu_idle: state=0 cpu_id=5 195 <idle>-0 [002] dNh1 23454.902248: write_msr: 6e0, value 4866ce957272 196 <idle>-0 [006] ...1 23454.902248: cpu_idle: state=4294967295 cpu_id=6 197 <idle>-0 [002] dNh1 23454.902249: local_timer_exit: vector=236 198 <idle>-0 [006] d..1 23454.902250: cpu_idle: state=0 cpu_id=6 199 <idle>-0 [002] .N.1 23454.902250: cpu_idle: state=4294967295 cpu_id=2 200 <idle>-0 [002] dN.1 23454.902251: rcu_utilization: Start context switch 201 <idle>-0 [002] dN.1 23454.902252: rcu_utilization: End context switch 202 <idle>-0 [001] ...1 23454.902252: cpu_idle: state=4294967295 cpu_id=1 203 <idle>-0 [002] dN.3 23454.902253: prandom_u32: ret=3692516021 204 <idle>-0 [001] d..1 23454.902254: cpu_idle: state=0 cpu_id=1 205 <idle>-0 [002] d..2 23454.902254: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cyclictest next_pid=12275 next_prio=19 206 <idle>-0 [002] d..4 23454.902256: wakeup_lat: next_comm=cyclictest lat=17 207 <idle>-0 [002] d..5 23454.902258: <stack trace> 208 => trace_event_raw_event_synth 209 => action_trace 210 => event_hist_trigger 211 => event_triggers_call 212 => trace_event_buffer_commit 213 => trace_event_raw_event_sched_switch 214 => __traceiter_sched_switch 215 => __schedule 216 => schedule_idle 217 => do_idle 218 => cpu_startup_entry 219 => secondary_startup_64_no_verify 220-- 221 222Here's the options for *sqlhist* explained: 223 224 *-n wakeup_lat* :: 225 Name the synthetic event to use *wakeup_lat*. 226 227 *-e*:: 228 Execute the commands that are printed. 229 230 *-T*:: 231 Perform both a trace action and then a snapshot action (swap the buffer into the static 'snapshot' buffer). 232 233 *-m lat*:: 234 Trigger the actions whenever 'lat' hits a new maximum value. 235 236Now a breakdown of the SQL statement: 237[source, c] 238-- 239 'SELECT end.next_comm AS comm, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS lat FROM ' \ 240 'sched_waking AS start JOIN sched_switch AS end ON start.pid = end.next_pid WHERE end.next_prio < 100 && end.next_comm == "cyclictest"' 241-- 242 *end.next_comm AS comm*:: 243 Save the 'sched_switch' field *next_comm* and place it into the *comm* field of the 'wakeup_lat' synthetic event. 244 245 *(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) AS lat*:: 246 Take the delta of the time stamps from the 'sched_switch' event and the 'sched_waking' event. 247 As time stamps are usually recorded in nanoseconds, *TIMESTAMP* would give the full nanosecond time stamp, 248 but here, the *TIMESTAMP_USECS* will truncate it into microseconds. The value is saved in the 249 variable *lat*, which will also be recorded in the synthetic event. 250 251 *FROM 'sched_waking' AS start JOIN sched_switch AS end ON start.pid = end.next_pid*:: 252 Create the synthetic event by joining _sched_waking_ to _sched_switch_, matching 253 the _sched_waking_ 'pid' field with the _sched_switch_ 'next_pid' field. 254 Also make *start* an alias for _sched_waking_ and *end* an alias for _sched_switch_ 255 which then an use *start* and *end* as a subsitute for _sched_waking_ and _sched_switch_ 256 respectively through out the rest of the SQL statement. 257 258 *WHERE end.next_prio < 100 && end.next_comm == "cyclictest"*:: 259 Filter the logic where it executes only if _sched_waking_ 'next_prio' field 260 is less than 100. (Note, in the Kernel, priorities are inverse, and the real-time 261 priorities are represented from 0-100 where 0 is the highest priority). 262 Also only trace when the 'next_comm' (the task scheduling in) of the _sched_switch_ 263 event has the name "cyclictest". 264 265For the *trace-cmd*(3) command: 266[source, c] 267-- 268 trace-cmd start -e all -e wakeup_lat -R stacktrace 269-- 270 271 *trace-cmd start*:: 272 Enables tracing (does not record to a file). 273 274 *-e all*:: 275 Enable all events 276 277 *-e wakeup_lat -R stacktrace*:: 278 have the "wakeup_lat" event (our synthetic event) enable the *stacktrace* trigger, were 279 for every instance of the "wakeup_lat" event, a kernel stack trace will be recorded 280 in the ring buffer. 281 282After calling *cyclictest* (a real-time tool to measure wakeup latency), read the snapshot 283buffer. 284 285 *trace-cmd show -s*:: 286 *trace-cmd show* reads the kernel ring buffer, and the *-s* option will read the *snapshot* 287 buffer instead of the normal one. 288 289[source, c] 290-- 291 <idle>-0 [002] d..4 23454.902256: wakeup_lat: next_comm=cyclictest lat=17 292-- 293 We see on the "wakeup_lat" event happened on CPU 2, with a wake up latency 17 microseconds. 294 295This can be extracted into a *trace.dat* file that *trace-cmd*(3) can read and do further 296analysis, as well as *kernelshark*. 297 298[source, c] 299-- 300 # trace-cmd extract -s 301 # trace-cmd report --cpu 2 | tail -30 302 <idle>-0 [002] 23454.902238: prandom_u32: ret=1633425088 303 <idle>-0 [002] 23454.902239: sched_wakeup: cyclictest:12275 [19] CPU:002 304 <idle>-0 [002] 23454.902241: hrtimer_expire_exit: hrtimer=0xffffbbd68286fe60 305 <idle>-0 [002] 23454.902241: hrtimer_cancel: hrtimer=0xffffbbd6826efe70 306 <idle>-0 [002] 23454.902242: hrtimer_expire_entry: hrtimer=0xffffbbd6826efe70 now=23455294430750 function=hrtimer_wakeup/0x0 307 <idle>-0 [002] 23454.902243: sched_waking: comm=cyclictest pid=12272 prio=120 target_cpu=002 308 <idle>-0 [002] 23454.902244: prandom_u32: ret=1102749734 309 <idle>-0 [002] 23454.902246: sched_wakeup: cyclictest:12272 [120] CPU:002 310 <idle>-0 [002] 23454.902247: hrtimer_expire_exit: hrtimer=0xffffbbd6826efe70 311 <idle>-0 [002] 23454.902248: write_msr: 6e0, value 4866ce957272 312 <idle>-0 [002] 23454.902249: local_timer_exit: vector=236 313 <idle>-0 [002] 23454.902250: cpu_idle: state=4294967295 cpu_id=2 314 <idle>-0 [002] 23454.902251: rcu_utilization: Start context switch 315 <idle>-0 [002] 23454.902252: rcu_utilization: End context switch 316 <idle>-0 [002] 23454.902253: prandom_u32: ret=3692516021 317 <idle>-0 [002] 23454.902254: sched_switch: swapper/2:0 [120] R ==> cyclictest:12275 [19] 318 <idle>-0 [002] 23454.902256: wakeup_lat: next_comm=cyclictest lat=17 319 <idle>-0 [002] 23454.902258: kernel_stack: <stack trace > 320=> trace_event_raw_event_synth (ffffffff8121a0db) 321=> action_trace (ffffffff8121e9fb) 322=> event_hist_trigger (ffffffff8121ca8d) 323=> event_triggers_call (ffffffff81216c72) 324=> trace_event_buffer_commit (ffffffff811f7618) 325=> trace_event_raw_event_sched_switch (ffffffff8110fda4) 326=> __traceiter_sched_switch (ffffffff8110d449) 327=> __schedule (ffffffff81c02002) 328=> schedule_idle (ffffffff81c02c86) 329=> do_idle (ffffffff8111e898) 330=> cpu_startup_entry (ffffffff8111eba9) 331=> secondary_startup_64_no_verify (ffffffff81000107) 332-- 333 334BUGS 335---- 336 337As *sqlhist* is just example code from a man page, it is guaranteed to contain 338lots of bugs. For one thing, not all error paths are covered properly. 339 340SEE ALSO 341-------- 342trace-cmd(1), tracefs_sql(3) 343 344AUTHOR 345------ 346Written by Steven Rostedt, <[email protected]> 347 348RESOURCES 349--------- 350https://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git/ 351 352COPYING 353------- 354Copyright \(C) 2021 , Inc. Free use of this software is granted under 355the terms of the GNU Public License (GPL). 356 357