1libtracefs(3) 2============= 3 4NAME 5---- 6tracefs_sql - Create a synthetic event via an SQL statement 7 8SYNOPSIS 9-------- 10[verse] 11-- 12*#include <tracefs.h>* 13 14struct tracefs_synth pass:[*]*tracefs_sql*(struct tep_handle pass:[*]_tep_, const char pass:[*]_name_, 15 const char pass:[*]_sql_buffer_, char pass:[**]_err_); 16-- 17 18DESCRIPTION 19----------- 20Synthetic events are dynamically created events that attach two existing events 21together via one or more matching fields between the two events. It can be used 22to find the latency between the events, or to simply pass fields of the first event 23on to the second event to display as one event. 24 25The Linux kernel interface to create synthetic events is complex, and there needs 26to be a better way to create synthetic events that is easy and can be understood 27via existing technology. 28 29If you think of each event as a table, where the fields are the column of the table 30and each instance of the event as a row, you can understand how SQL can be used 31to attach two events together and form another event (table). Utilizing the 32SQL *SELECT* *FROM* *JOIN* *ON* [ *WHERE* ] syntax, a synthetic event can easily 33be created from two different events. 34 35For simple SQL queries to make a histogram instead of a synthetic event, see 36HISTOGRAMS below. 37 38*tracefs_sql*() takes in a _tep_ handler (See _tep_local_events_(3)) that is used to 39verify the events within the _sql_buffer_ expression. The _name_ is the name of the 40synthetic event to create. If _err_ points to an address of a string, it will be filled 41with a detailed message on any type of parsing error, including fields that do not belong 42to an event, or if the events or fields are not properly compared. 43 44The example program below is a fully functional parser where it will create a synthetic 45event from a SQL syntax passed in via the command line or a file. 46 47The SQL format is as follows: 48 49*SELECT* <fields> *FROM* <start-event> *JOIN* <end-event> *ON* <matching-fields> *WHERE* <filter> 50 51Note, although the examples show the SQL commands in uppercase, they are not required to 52be so. That is, you can use "SELECT" or "select" or "sElEct". 53 54For example: 55[source,c] 56-- 57SELECT syscalls.sys_enter_read.fd, syscalls.sys_exit_read.ret FROM syscalls.sys_enter_read 58 JOIN syscalls.sys_exit_read 59 ON syscalls.sys_enter_read.common_pid = syscalls.sys_exit_write.common_pid 60-- 61 62Will create a synthetic event that with the fields: 63 64 u64 fd; s64 ret; 65 66Because the function takes a _tep_ handle, and usually all event names are unique, you can 67leave off the system (group) name of the event, and *tracefs_sql*() will discover the 68system for you. 69 70That is, the above statement would work with: 71 72[source,c] 73-- 74SELECT sys_enter_read.fd, sys_exit_read.ret FROM sys_enter_read JOIN sys_exit_read 75 ON sys_enter_read.common_pid = sys_exit_write.common_pid 76-- 77 78The *AS* keyword can be used to name the fields as well as to give an alias to the 79events, such that the above can be simplified even more as: 80 81[source,c] 82-- 83SELECT start.fd, end.ret FROM sys_enter_read AS start JOIN sys_exit_read AS end ON start.common_pid = end.common_pid 84-- 85 86The above aliases _sys_enter_read_ as *start* and _sys_exit_read_ as *end* and uses 87those aliases to reference the event throughout the statement. 88 89Using the *AS* keyword in the selection portion of the SQL statement will define what 90those fields will be called in the synthetic event. 91 92[source,c] 93-- 94SELECT start.fd AS filed, end.ret AS return FROM sys_enter_read AS start JOIN sys_exit_read AS end 95 ON start.common_pid = end.common_pid 96-- 97 98The above labels the _fd_ of _start_ as *filed* and the _ret_ of _end_ as *return* where 99the synthetic event that is created will now have the fields: 100 101 u64 filed; s64 return; 102 103The fields can also be calculated with results passed to the synthetic event: 104 105[source,c] 106-- 107select start.truesize, end.len, (start.truesize - end.len) as diff from napi_gro_receive_entry as start 108 JOIN netif_receive_skb as end ON start.skbaddr = end.skbaddr 109-- 110 111Which would show the *truesize* of the _napi_gro_receive_entry_ event, the actual 112_len_ of the content, shown by the _netif_receive_skb_, and the delta between 113the two and expressed by the field *diff*. 114 115The code also supports recording the timestamps at either event, and performing calculations 116on them. For wakeup latency, you have: 117 118[source,c] 119-- 120select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 121 JOIN sched_switch as end ON start.pid = end.next_pid 122-- 123 124The above will create a synthetic event that records the _pid_ of the task being woken up, 125and the time difference between the _sched_waking_ event and the _sched_switch_ event. 126The *TIMESTAMP_USECS* will truncate the time down to microseconds as the timestamp usually 127recorded in the tracing buffer has nanosecond resolution. If you do not want that 128truncation, use *TIMESTAMP* instead of *TIMESTAMP_USECS*. 129 130Finally, the *WHERE* clause can be added, that will let you add filters on either or both events. 131 132[source,c] 133-- 134select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 135 JOIN sched_switch as end ON start.pid = end.next_pid 136 WHERE start.prio < 100 && (!(end.prev_pid < 1 || end.prev_prio > 100) || end.prev_pid == 0) 137-- 138 139*NOTE* 140 141Although both events can be used together in the *WHERE* clause, they must not be mixed outside 142the top most "&&" statements. You can not OR (||) the events together, where a filter of one 143event is OR'd to a filter of the other event. This does not make sense, as the synthetic event 144requires both events to take place to be recorded. If one is filtered out, then the synthetic 145event does not execute. 146 147[source,c] 148-- 149select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 150 JOIN sched_switch as end ON start.pid = end.next_pid 151 WHERE start.prio < 100 && end.prev_prio < 100 152-- 153 154The above is valid. 155 156Where as the below is not. 157 158[source,c] 159-- 160select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 161 JOIN sched_switch as end ON start.pid = end.next_pid 162 WHERE start.prio < 100 || end.prev_prio < 100 163-- 164 165 166KEYWORDS AS EVENT FIELDS 167------------------------ 168 169In some cases, an event may have a keyword. For example, regcache_drop_region has "from" 170as a field and the following will not work 171 172[source,c] 173-- 174 select from from regcache_drop_region 175-- 176 177In such cases, add a backslash to the conflicting field, and this will tell the parser 178that the "from" is a field and not a keyword: 179 180[source,c] 181-- 182 select \from from regcache_drop_region 183-- 184 185HISTOGRAMS 186---------- 187 188Simple SQL statements without the *JOIN* *ON* may also be used, which will create a histogram 189instead. When doing this, the struct tracefs_hist descriptor can be retrieved from the 190returned synthetic event descriptor via the *tracefs_synth_get_start_hist*(3). 191 192In order to utilize the histogram types (see xxx) the CAST command of SQL can be used. 193 194That is: 195 196[source,c] 197-- 198 select CAST(common_pid AS comm), CAST(id AS syscall) FROM sys_enter 199-- 200 201Which produces: 202 203[source,c] 204-- 205 # echo 'hist:keys=common_pid.execname,id.syscall' > events/raw_syscalls/sys_enter/trigger 206 207 # cat events/raw_syscalls/sys_enter/hist 208 209{ common_pid: bash [ 18248], id: sys_setpgid [109] } hitcount: 1 210{ common_pid: sendmail [ 1812], id: sys_read [ 0] } hitcount: 1 211{ common_pid: bash [ 18247], id: sys_getpid [ 39] } hitcount: 1 212{ common_pid: bash [ 18247], id: sys_dup2 [ 33] } hitcount: 1 213{ common_pid: gmain [ 13684], id: sys_inotify_add_watch [254] } hitcount: 1 214{ common_pid: cat [ 18247], id: sys_access [ 21] } hitcount: 1 215{ common_pid: bash [ 18248], id: sys_getpid [ 39] } hitcount: 1 216{ common_pid: cat [ 18247], id: sys_fadvise64 [221] } hitcount: 1 217{ common_pid: sendmail [ 1812], id: sys_openat [257] } hitcount: 1 218{ common_pid: less [ 18248], id: sys_munmap [ 11] } hitcount: 1 219{ common_pid: sendmail [ 1812], id: sys_close [ 3] } hitcount: 1 220{ common_pid: gmain [ 1534], id: sys_poll [ 7] } hitcount: 1 221{ common_pid: bash [ 18247], id: sys_execve [ 59] } hitcount: 1 222-- 223 224Note, string fields may not be cast. 225 226The possible types to cast to are: 227 228*HEX* - convert the value to use hex and not decimal 229 230*SYM* - convert a pointer to symbolic (kallsyms values) 231 232*SYM-OFFSET* - convert a pointer to symbolic and include the offset. 233 234*SYSCALL* - convert the number to the mapped system call name 235 236*EXECNAME* or *COMM* - can only be used with the common_pid field. Will show the task 237name of the process. 238 239*LOG* or *LOG2* - bucket the key values in a log 2 values (1, 2, 3-4, 5-8, 9-16, 17-32, ...) 240 241The above fields are not case sensitive, and "LOG2" works as good as "log". 242 243A special CAST to _COUNTER_ or __COUNTER__ will make the field a value and not 244a key. For example: 245 246[source,c] 247-- 248 SELECT common_pid, CAST(bytes_req AS _COUNTER_) FROM kmalloc 249-- 250 251Which will create 252 253[source,c] 254-- 255 echo 'hist:keys=common_pid:vals=bytes_req' > events/kmem/kmalloc/trigger 256 257 cat events/kmem/kmalloc/hist 258 259{ common_pid: 1812 } hitcount: 1 bytes_req: 32 260{ common_pid: 9111 } hitcount: 2 bytes_req: 272 261{ common_pid: 1768 } hitcount: 3 bytes_req: 1112 262{ common_pid: 0 } hitcount: 4 bytes_req: 512 263{ common_pid: 18297 } hitcount: 11 bytes_req: 2004 264-- 265 266RETURN VALUE 267------------ 268Returns 0 on success and -1 on failure. On failure, if _err_ is defined, it will be 269allocated to hold a detailed description of what went wrong if it the error was caused 270by a parsing error, or that an event, field does not exist or is not compatible with 271what it was combined with. 272 273CREATE A TOOL 274------------- 275 276The below example is a functional program that can be used to parse SQL commands into 277synthetic events. 278 279[source, c] 280-- 281 man tracefs_sql | sed -ne '/^EXAMPLE/,/FILES/ { /EXAMPLE/d ; /FILES/d ; p}' > sqlhist.c 282 gcc -o sqlhist sqlhist.c `pkg-config --cflags --libs libtracefs` 283-- 284 285Then you can run the above examples: 286 287[source, c] 288-- 289 sudo ./sqlhist 'select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start 290 JOIN sched_switch as end ON start.pid = end.next_pid 291 WHERE start.prio < 100 || end.prev_prio < 100' 292-- 293EXAMPLE 294------- 295[source,c] 296-- 297#include <stdio.h> 298#include <stdlib.h> 299#include <stdarg.h> 300#include <string.h> 301#include <errno.h> 302#include <unistd.h> 303#include <tracefs.h> 304 305static void usage(char **argv) 306{ 307 fprintf(stderr, "usage: %s [-ed][-n name][-s][-S fields][-m var][-c var][-T][-t dir][-f file | sql-command-line]\n" 308 " -n name - name of synthetic event 'Anonymous' if left off\n" 309 " -t dir - use dir instead of /sys/kernel/tracing\n" 310 " -e - execute the commands to create the synthetic event\n" 311 " -m - trigger the action when var is a new max.\n" 312 " -c - trigger the action when var changes.\n" 313 " -s - used with -m or -c to do a snapshot of the tracing buffer\n" 314 " -S - used with -m or -c to save fields of the end event (comma deliminated)\n" 315 " -T - used with -m or -c to do both a snapshot and a trace\n" 316 " -f file - read sql lines from file otherwise from the command line\n" 317 " if file is '-' then read from standard input.\n", 318 argv[0]); 319 exit(-1); 320} 321 322enum action { 323 ACTION_DEFAULT = 0, 324 ACTION_SNAPSHOT = (1 << 0), 325 ACTION_TRACE = (1 << 1), 326 ACTION_SAVE = (1 << 2), 327 ACTION_MAX = (1 << 3), 328 ACTION_CHANGE = (1 << 4), 329}; 330 331#define ACTIONS ((ACTION_MAX - 1)) 332 333static int do_sql(const char *instance_name, 334 const char *buffer, const char *name, const char *var, 335 const char *trace_dir, bool execute, int action, 336 char **save_fields) 337{ 338 struct tracefs_synth *synth; 339 struct tep_handle *tep; 340 struct trace_seq seq; 341 enum tracefs_synth_handler handler; 342 char *err; 343 int ret; 344 345 if ((action & ACTIONS) && !var) { 346 fprintf(stderr, "Error: -s, -S and -T not supported without -m or -c"); 347 exit(-1); 348 } 349 350 if (!name) 351 name = "Anonymous"; 352 353 trace_seq_init(&seq); 354 tep = tracefs_local_events(trace_dir); 355 if (!tep) { 356 if (!trace_dir) 357 trace_dir = "tracefs directory"; 358 perror(trace_dir); 359 exit(-1); 360 } 361 362 synth = tracefs_sql(tep, name, buffer, &err); 363 if (!synth) { 364 perror("Failed creating synthetic event!"); 365 if (err) 366 fprintf(stderr, "%s", err); 367 free(err); 368 exit(-1); 369 } 370 371 if (tracefs_synth_complete(synth)) { 372 if (var) { 373 if (action & ACTION_MAX) 374 handler = TRACEFS_SYNTH_HANDLE_MAX; 375 else 376 handler = TRACEFS_SYNTH_HANDLE_CHANGE; 377 378 if (action & ACTION_SAVE) { 379 ret = tracefs_synth_save(synth, handler, var, save_fields); 380 if (ret < 0) { 381 err = "adding save"; 382 goto failed_action; 383 } 384 } 385 if (action & ACTION_TRACE) { 386 /* 387 * By doing the trace before snapshot, it will be included 388 * in the snapshot. 389 */ 390 ret = tracefs_synth_trace(synth, handler, var); 391 if (ret < 0) { 392 err = "adding trace"; 393 goto failed_action; 394 } 395 } 396 if (action & ACTION_SNAPSHOT) { 397 ret = tracefs_synth_snapshot(synth, handler, var); 398 if (ret < 0) { 399 err = "adding snapshot"; 400 failed_action: 401 perror(err); 402 if (errno == ENODEV) 403 fprintf(stderr, "ERROR: '%s' is not a variable\n", 404 var); 405 exit(-1); 406 } 407 } 408 } 409 tracefs_synth_echo_cmd(&seq, synth); 410 if (execute) { 411 ret = tracefs_synth_create(synth); 412 if (ret < 0) { 413 fprintf(stderr, "%s\n", tracefs_error_last(NULL)); 414 exit(-1); 415 } 416 } 417 } else { 418 struct tracefs_instance *instance = NULL; 419 struct tracefs_hist *hist; 420 421 hist = tracefs_synth_get_start_hist(synth); 422 if (!hist) { 423 perror("get_start_hist"); 424 exit(-1); 425 } 426 if (instance_name) { 427 if (execute) 428 instance = tracefs_instance_create(instance_name); 429 else 430 instance = tracefs_instance_alloc(trace_dir, 431 instance_name); 432 if (!instance) { 433 perror("Failed to create instance"); 434 exit(-1); 435 } 436 } 437 tracefs_hist_echo_cmd(&seq, instance, hist, 0); 438 if (execute) { 439 ret = tracefs_hist_start(instance, hist); 440 if (ret < 0) { 441 fprintf(stderr, "%s\n", tracefs_error_last(instance)); 442 exit(-1); 443 } 444 } 445 } 446 447 tracefs_synth_free(synth); 448 449 trace_seq_do_printf(&seq); 450 trace_seq_destroy(&seq); 451 return 0; 452} 453 454int main (int argc, char **argv) 455{ 456 char *trace_dir = NULL; 457 char *buffer = NULL; 458 char buf[BUFSIZ]; 459 int buffer_size = 0; 460 const char *file = NULL; 461 const char *instance = NULL; 462 bool execute = false; 463 char **save_fields = NULL; 464 const char *name; 465 const char *var; 466 int action = 0; 467 char *tok; 468 FILE *fp; 469 size_t r; 470 int c; 471 int i; 472 473 for (;;) { 474 c = getopt(argc, argv, "ht:f:en:m:c:sS:TB:"); 475 if (c == -1) 476 break; 477 478 switch(c) { 479 case 'h': 480 usage(argv); 481 case 't': 482 trace_dir = optarg; 483 break; 484 case 'f': 485 file = optarg; 486 break; 487 case 'e': 488 execute = true; 489 break; 490 case 'm': 491 action |= ACTION_MAX; 492 var = optarg; 493 break; 494 case 'c': 495 action |= ACTION_CHANGE; 496 var = optarg; 497 break; 498 case 's': 499 action |= ACTION_SNAPSHOT; 500 break; 501 case 'S': 502 action |= ACTION_SAVE; 503 tok = strtok(optarg, ","); 504 while (tok) { 505 save_fields = tracefs_list_add(save_fields, tok); 506 tok = strtok(NULL, ","); 507 } 508 if (!save_fields) { 509 perror(optarg); 510 exit(-1); 511 } 512 break; 513 case 'T': 514 action |= ACTION_TRACE | ACTION_SNAPSHOT; 515 break; 516 case 'B': 517 instance = optarg; 518 break; 519 case 'n': 520 name = optarg; 521 break; 522 } 523 } 524 525 if ((action & (ACTION_MAX|ACTION_CHANGE)) == (ACTION_MAX|ACTION_CHANGE)) { 526 fprintf(stderr, "Can not use both -m and -c together\n"); 527 exit(-1); 528 } 529 if (file) { 530 if (!strcmp(file, "-")) 531 fp = stdin; 532 else 533 fp = fopen(file, "r"); 534 if (!fp) { 535 perror(file); 536 exit(-1); 537 } 538 while ((r = fread(buf, 1, BUFSIZ, fp)) > 0) { 539 buffer = realloc(buffer, buffer_size + r + 1); 540 strncpy(buffer + buffer_size, buf, r); 541 buffer_size += r; 542 } 543 fclose(fp); 544 if (buffer_size) 545 buffer[buffer_size] = '\0'; 546 } else if (argc == optind) { 547 usage(argv); 548 } else { 549 for (i = optind; i < argc; i++) { 550 r = strlen(argv[i]); 551 buffer = realloc(buffer, buffer_size + r + 2); 552 if (i != optind) 553 buffer[buffer_size++] = ' '; 554 strcpy(buffer + buffer_size, argv[i]); 555 buffer_size += r; 556 } 557 } 558 559 do_sql(instance, buffer, name, var, trace_dir, execute, action, save_fields); 560 free(buffer); 561 562 return 0; 563} 564-- 565 566FILES 567----- 568[verse] 569-- 570*tracefs.h* 571 Header file to include in order to have access to the library APIs. 572*-ltracefs* 573 Linker switch to add when building a program that uses the library. 574-- 575 576SEE ALSO 577-------- 578*sqlhist*(1), 579*libtracefs*(3), 580*libtraceevent*(3), 581*trace-cmd*(1), 582*tracefs_synth_init*(3), 583*tracefs_synth_add_match_field*(3), 584*tracefs_synth_add_compare_field*(3), 585*tracefs_synth_add_start_field*(3), 586*tracefs_synth_add_end_field*(3), 587*tracefs_synth_append_start_filter*(3), 588*tracefs_synth_append_end_filter*(3), 589*tracefs_synth_create*(3), 590*tracefs_synth_destroy*(3), 591*tracefs_synth_free*(3), 592*tracefs_synth_echo_cmd*(3), 593*tracefs_hist_alloc*(3), 594*tracefs_hist_alloc_2d*(3), 595*tracefs_hist_alloc_nd*(3), 596*tracefs_hist_free*(3), 597*tracefs_hist_add_key*(3), 598*tracefs_hist_add_value*(3), 599*tracefs_hist_add_name*(3), 600*tracefs_hist_start*(3), 601*tracefs_hist_destory*(3), 602*tracefs_hist_add_sort_key*(3), 603*tracefs_hist_sort_key_direction*(3) 604 605AUTHOR 606------ 607[verse] 608-- 609*Steven Rostedt* <[email protected]> 610*Tzvetomir Stoyanov* <[email protected]> 611*sameeruddin shaik* <[email protected]> 612-- 613REPORTING BUGS 614-------------- 615Report bugs to <[email protected]> 616 617LICENSE 618------- 619libtracefs is Free Software licensed under the GNU LGPL 2.1 620 621RESOURCES 622--------- 623https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/ 624 625COPYING 626------- 627Copyright \(C) 2020 VMware, Inc. Free use of this software is granted under 628the terms of the GNU Public License (GPL). 629