1Demonstrations of trace. 2 3 4trace probes functions you specify and displays trace messages if a particular 5condition is met. You can control the message format to display function 6arguments and return values. 7 8For example, suppose you want to trace all commands being exec'd across the 9system: 10 11# trace 'sys_execve "%s", arg1' 12PID COMM FUNC - 134402 bash sys_execve /usr/bin/man 144411 man sys_execve /usr/local/bin/less 154411 man sys_execve /usr/bin/less 164410 man sys_execve /usr/local/bin/nroff 174410 man sys_execve /usr/bin/nroff 184409 man sys_execve /usr/local/bin/tbl 194409 man sys_execve /usr/bin/tbl 204408 man sys_execve /usr/local/bin/preconv 214408 man sys_execve /usr/bin/preconv 224415 nroff sys_execve /usr/bin/locale 234416 nroff sys_execve /usr/bin/groff 244418 groff sys_execve /usr/bin/grotty 254417 groff sys_execve /usr/bin/troff 26^C 27 28The ::sys_execve syntax specifies that you want an entry probe (which is the 29default), in a kernel function (which is the default) called sys_execve. Next, 30the format string to print is simply "%s", which prints a string. Finally, the 31value to print is the first argument to the sys_execve function, which happens 32to be the command that is exec'd. The above trace was generated by executing 33"man ls" in a separate shell. As you see, man executes a number of additional 34programs to finally display the man page. 35 36Next, suppose you are looking for large reads across the system. Let's trace 37the read system call and inspect the third argument, which is the number of 38bytes to be read: 39 40# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' 41PID COMM FUNC - 424490 dd sys_read read 1048576 bytes 434490 dd sys_read read 1048576 bytes 444490 dd sys_read read 1048576 bytes 454490 dd sys_read read 1048576 bytes 46^C 47 48During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4". 49The individual reads are visible, with the custom format message printed for 50each read. The parenthesized expression "(arg3 > 20000)" is a filter that is 51evaluated for each invocation of the probe before printing anything. 52 53Event message filter is useful while you only interesting the specific event. 54Like the program open thousands file and you only want to see the "temp" file 55and print stack. 56 57# trace 'do_sys_open "%s", arg2@user' -UK -f temp 58PID TID COMM FUNC - 599557 9557 a.out do_sys_open temp.1 60 do_sys_open+0x1 [kernel] 61 do_syscall_64+0x5b [kernel] 62 entry_SYSCALL_64_after_hwframe+0x44 [kernel] 63 __open_nocancel+0x7 [libc-2.17.so] 64 __libc_start_main+0xf5 [libc-2.17.so] 659558 9558 a.out do_sys_open temp.2 66 do_sys_open+0x1 [kernel] 67 do_syscall_64+0x5b [kernel] 68 entry_SYSCALL_64_after_hwframe+0x44 [kernel] 69 __open_nocancel+0x7 [libc-2.17.so] 70 __libc_start_main+0xf5 [libc-2.17.so] 71 72Process name filter is porting from tools/opensnoop 73 74# trace 'do_sys_open "%s", arg2@user' -UK -n out 75PID TID COMM FUNC - 769557 9557 a.out do_sys_open temp.1 77 do_sys_open+0x1 [kernel] 78 do_syscall_64+0x5b [kernel] 79 entry_SYSCALL_64_after_hwframe+0x44 [kernel] 80 __open_nocancel+0x7 [libc-2.17.so] 81 __libc_start_main+0xf5 [libc-2.17.so] 82 83You can also trace user functions. For example, let's simulate the bashreadline 84script, which attaches to the readline function in bash and prints its return 85value, effectively snooping all bash shell input across the system: 86 87# trace 'r:bash:readline "%s", retval' 88PID COMM FUNC - 892740 bash readline echo hi! 902740 bash readline man ls 91^C 92 93The special retval keyword stands for the function's return value, and can 94be used only in a retprobe, specified by the 'r' prefix. The next component 95of the probe is the library that contains the desired function. It's OK to 96specify executables too, as long as they can be found in the PATH. Or, you 97can specify the full path to the executable (e.g. "/usr/bin/bash"). 98 99Sometimes it can be useful to see where in code the events happen. There are 100flags to print the kernel stack (-K), the user stack (-U) and optionally 101include the virtual address in the stacks as well (-a): 102 103# trace.py -U -a 'r::sys_futex "%d", retval' 104PID TID COMM FUNC - 105793922 793951 poller sys_futex 0 106 7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so] 107 627fef folly::FunctionScheduler::run()+0x46f [router] 108 7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21] 109 7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so] 110 7f6c7223fa7d clone+0x6d [libc-2.23.so] 111 112Multiple probes can be combined on the same command line. For example, let's 113trace failed read and write calls on the libc level, and include a time column: 114 115# trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \ 116 'r:c:write ((int)retval < 0) "write failed: %d", retval' -T 117TIME PID COMM FUNC - 11805:31:57 3388 bash write write failed: -1 11905:32:00 3388 bash write write failed: -1 120^C 121 122Note that the retval variable must be cast to int before comparing to zero. 123The reason is that the default type for argN and retval is an unsigned 64-bit 124integer, which can never be smaller than 0. 125 126trace has also some basic support for kernel tracepoints. For example, let's 127trace the block:block_rq_complete tracepoint and print out the number of sectors 128transferred: 129 130# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T 131TIME PID COMM FUNC - 13201:23:51 0 swapper/0 block_rq_complete sectors=8 13301:23:55 10017 kworker/u64: block_rq_complete sectors=1 13401:23:55 0 swapper/0 block_rq_complete sectors=8 135^C 136 137Suppose that you want to trace a system-call in a short-lived process, you can use 138the -s option to trace. The option is followed by list of libraries/executables to 139use for symbol resolution. 140# trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U 141Note: Kernel bpf will report stack map with ip/build_id 142PID TID COMM FUNC 1434175 4175 ping inet_pton 144 inet_pton+0x136340 [libc.so.6] 145 getaddrinfo+0xfb510 [libc.so.6] 146 _init+0x2a08 [ping] 147 148During the trace, 'ping -c1 google.com' was executed to obtain the above results 149 150To discover the tracepoint structure format (which you can refer to as the "args" 151pointer variable), use the tplist tool. For example: 152 153# tplist -v block:block_rq_complete 154block:block_rq_complete 155 dev_t dev; 156 sector_t sector; 157 unsigned int nr_sector; 158 int errors; 159 char rwbs[8]; 160 161This output tells you that you can use "args->dev", "args->sector", etc. in your 162predicate and trace arguments. 163 164 165More and more high-level libraries are instrumented with USDT probe support. 166These probes can be traced by trace just like kernel tracepoints. For example, 167trace new threads being created and their function name, include time column 168and on which CPU it happened: 169 170# trace 'u:pthread:pthread_create "%U", arg3' -T -C 171TIME CPU PID TID COMM FUNC - 17213:22:01 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] 17313:22:01 5 21360 21414 osqueryd pthread_create [unknown] [osqueryd] 17413:22:03 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] 17513:22:04 15 21360 21414 osqueryd pthread_create [unknown] [osqueryd] 17613:22:07 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount] 17713:22:07 4 21360 21414 osqueryd pthread_create [unknown] [osqueryd] 178^C 179 180The "%U" format specifier tells trace to resolve arg3 as a user-space symbol, 181if possible. Similarly, use "%K" for kernel symbols. 182 183Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's 184trace Ruby methods being called (this requires a version of Ruby built with 185the --enable-dtrace configure flag): 186 187# trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T 188TIME PID COMM FUNC - 18912:08:43 18420 irb method__entry IRB::Context.verbose? 19012:08:43 18420 irb method__entry RubyLex.ungetc 19112:08:43 18420 irb method__entry RuxyLex.debug? 192^C 193 194In the previous invocation, arg1 and arg2 are the class name and method name 195for the Ruby method being invoked. 196 197You can also trace exported functions from shared libraries, or an imported 198function on the actual executable: 199 200# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval' 201# tput -V 202 203PID TID COMM FUNC - 20421720 21720 tput curses_version Version=ncurses 6.0.20160709 205^C 206 207 208Occasionally, it can be useful to filter specific strings. For example, you 209might be interested in open() calls that open a specific file: 210 211# trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T 212TIME PID COMM FUNC - 21301:43:15 10938 cat open opening test.txt 21401:43:20 10939 cat open opening test.txt 215^C 216 217 218In the preceding example, as well as in many others, readability may be 219improved by providing the function's signature, which names the arguments and 220lets you access structure sub-fields, which is hard with the "arg1", "arg2" 221convention. For example: 222 223# trace 'p:c:open(char *filename) "opening %s", filename' 224PID TID COMM FUNC - 22517507 17507 cat open opening FAQ.txt 226^C 227 228# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' 229PID TID COMM FUNC - 230777 785 automount SyS_nanosleep sleep for 500000000 ns 231777 785 automount SyS_nanosleep sleep for 500000000 ns 232777 785 automount SyS_nanosleep sleep for 500000000 ns 233777 785 automount SyS_nanosleep sleep for 500000000 ns 234^C 235 236Remember to use the -I argument include the appropriate header file. We didn't 237need to do that here because `struct timespec` is used internally by the tool, 238so it always includes this header file. 239 240To aggregate amount of trace, you need specify -A with -M EVENTS. A typical 241example: 2421, if we find that the sys CPU utilization is higher by 'top' command 2432, then find that the timer interrupt is more normal by 'irqtop' command 2443, to confirm kernel timer setting frequence by 'funccount -i 1 clockevents_program_event' 2454, to trace timer setting by 'trace clockevents_program_event -K -A -M 1000' 246 2471294576 1294584 CPU 0/KVM clockevents_program_event 248 clockevents_program_event+0x1 [kernel] 249 hrtimer_start_range_ns+0x209 [kernel] 250 start_sw_timer+0x173 [kvm] 251 restart_apic_timer+0x6c [kvm] 252 kvm_set_msr_common+0x442 [kvm] 253 __kvm_set_msr+0xa2 [kvm] 254 kvm_emulate_wrmsr+0x36 [kvm] 255 vcpu_enter_guest+0x326 [kvm] 256 kvm_arch_vcpu_ioctl_run+0xcc [kvm] 257 kvm_vcpu_ioctl+0x22f [kvm] 258 do_vfs_ioctl+0xa1 [kernel] 259 ksys_ioctl+0x60 [kernel] 260 __x64_sys_ioctl+0x16 [kernel] 261 do_syscall_64+0x59 [kernel] 262 entry_SYSCALL_64_after_hwframe+0x44 [kernel] 263-->COUNT 271 264... 265So we can know that 271 timer setting in recent 1000(~27%). 266 267As a final example, let's trace open syscalls for a specific process. By 268default, tracing is system-wide, but the -p switch overrides this: 269 270# trace -p 2740 'do_sys_open "%s", arg2@user' -T 271TIME PID COMM FUNC - 27205:36:16 15872 ls do_sys_open /etc/ld.so.cache 27305:36:16 15872 ls do_sys_open /lib64/libselinux.so.1 27405:36:16 15872 ls do_sys_open /lib64/libcap.so.2 27505:36:16 15872 ls do_sys_open /lib64/libacl.so.1 27605:36:16 15872 ls do_sys_open /lib64/libc.so.6 27705:36:16 15872 ls do_sys_open /lib64/libpcre.so.1 27805:36:16 15872 ls do_sys_open /lib64/libdl.so.2 27905:36:16 15872 ls do_sys_open /lib64/libattr.so.1 28005:36:16 15872 ls do_sys_open /lib64/libpthread.so.0 28105:36:16 15872 ls do_sys_open /usr/lib/locale/locale-archive 28205:36:16 15872 ls do_sys_open /home/vagrant 283^C 284 285In this example, we traced the "ls ~" command as it was opening its shared 286libraries and then accessing the /home/vagrant directory listing. 287 288 289Lastly, if a high-frequency event is traced you may overflow the perf ring 290buffer. This shows as "Lost N samples": 291 292# trace sys_open 2935087 5087 pgrep sys_open 2945087 5087 pgrep sys_open 2955087 5087 pgrep sys_open 2965087 5087 pgrep sys_open 2975087 5087 pgrep sys_open 298Lost 764896 samples 299Lost 764896 samples 300Lost 764896 samples 301 302The perf ring buffer size can be changed with -b. The unit is size per-CPU buffer 303size and is measured in pages. The value must be a power of two and defaults to 30464 pages. 305 306# trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp 307PID TID COMM FUNC - 3081855611 1863183 worker sys_setsockopt found 309 310In this example we are catching setsockopt syscall to change IPv4 IP_TOS 311value only for the cases where new TOS value is equal to 108. we are using 312STRCMP helper in binary mode (--bin_cmp flag) to compare optval array 313against int value of 108 (parametr of setsockopt call) in hex representation 314(little endian format) 315 316For advanced users there is a possibility to insert the kprobes or uprobes 317after a certain offset, rather than the start of the function call 318This is useful for tracing register values at different places of the 319execution of a function. Lets consider the following example: 320 321int main() 322{ 323 int val = 0xdead; 324 printf("%d\n", val); 325 val = 0xbeef; 326 printf("%d\n", val); 327} 328 329After compiling the code with -O3 optimization the object code looks 330like the following (with GCC 10 and x86_64 architecture): 331 332objdump --disassemble=main --prefix-addresses a.out 333 3340000000000001060 <main> endbr64 3350000000000001064 <main+0x4> sub $0x8,%rsp 3360000000000001068 <main+0x8> mov $0xdead,%edx 337000000000000106d <main+0xd> mov $0x1,%edi 3380000000000001072 <main+0x12> xor %eax,%eax 3390000000000001074 <main+0x14> lea 0xf89(%rip),%rsi 340000000000000107b <main+0x1b> callq 0000000000001050 <__printf_chk@plt> 3410000000000001080 <main+0x20> mov $0xbeef,%edx 3420000000000001085 <main+0x25> lea 0xf78(%rip),%rsi 343000000000000108c <main+0x2c> xor %eax,%eax 344000000000000108e <main+0x2e> mov $0x1,%edi 3450000000000001093 <main+0x33> callq 0000000000001050 <__printf_chk@plt> 3460000000000001098 <main+0x38> xor %eax,%eax 347000000000000109a <main+0x3a> add $0x8,%rsp 348000000000000109e <main+0x3e> retq 349 350The 0xdead and later the 0xbeef values are moved into the edx register. 351As the disassembly shows the edx register contains the 0xdead value 352after the 0xd offset and 0xbeef after the 0x25 offset. To verify this 353with trace lets insert probes to those offsets. The following 354command inserts two uprobe one after the 0xd offset and another one 355after the 0x25 offset of the main function. The probe print the 356value of the edx register which will show us the correct values. 357 358trace 'p:/tmp/a.out:main+0xd "%x", ctx->dx' 'p:/tmp/a.out:main+0x25 "%x", ctx->dx' 359PID TID COMM FUNC - 36025754 25754 a.out main dead 36125754 25754 a.out main beef 362 363 364USAGE message: 365 366usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [--uid UID] [-v] 367 [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C] 368 [-c CGROUP_PATH] [-n NAME] [-f MSG_FILTER] [-B] 369 [-s SYM_FILE_LIST] [-K] [-U] [-a] [-I header] 370 probe [probe ...] 371 372Attach to functions and print trace messages. 373 374positional arguments: 375 probe probe specifier (see examples) 376 377optional arguments: 378 -h, --help show this help message and exit 379 -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES 380 number of pages to use for perf_events ring buffer 381 (default: 64) 382 -p PID, --pid PID id of the process to trace (optional) 383 -L TID, --tid TID id of the thread to trace (optional) 384 --uid UID id of the user to trace (optional) 385 -v, --verbose print resulting BPF program code before executing 386 -Z STRING_SIZE, --string-size STRING_SIZE 387 maximum size to read from strings 388 -S, --include-self do not filter trace's own pid from the trace 389 -M MAX_EVENTS, --max-events MAX_EVENTS 390 number of events to print before quitting 391 -t, --timestamp print timestamp column (offset from trace start) 392 -u, --unix-timestamp print UNIX timestamp instead of offset from trace 393 start, requires -t 394 -T, --time print time column 395 -C, --print_cpu print CPU id 396 -c CGROUP_PATH, --cgroup-path CGROUP_PATH 397 cgroup path 398 -n NAME, --name NAME only print process names containing this name 399 -f MSG_FILTER, --msg-filter MSG_FILTER 400 only print the msg of event containing this string 401 -B, --bin_cmp allow to use STRCMP with binary values 402 -s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST 403 comma separated list of symbol files to use for symbol 404 resolution 405 -K, --kernel-stack output kernel stack trace 406 -U, --user-stack output user stack trace 407 -a, --address print virtual address in stacks 408 -I header, --include header 409 additional header files to include in the BPF program 410 as either full path, or relative to current working 411 directory, or relative to default kernel header search 412 path 413 -A, --aggregate aggregate amount of each trace 414 415EXAMPLES: 416 417trace do_sys_open 418 Trace the open syscall and print a default trace message when entered 419trace kfree_skb+0x12 420 Trace the kfree_skb kernel function after the instruction on the 0x12 offset 421trace 'do_sys_open "%s", arg2@user' 422 Trace the open syscall and print the filename being opened @user is 423 added to arg2 in kprobes to ensure that char * should be copied from 424 the userspace stack to the bpf stack. If not specified, previous 425 behaviour is expected. 426 427trace 'do_sys_open "%s", arg2@user' -n main 428 Trace the open syscall and only print event that process names containing "main" 429trace 'do_sys_open "%s", arg2@user' --uid 1001 430 Trace the open syscall and only print event that processes with user ID 1001 431trace 'do_sys_open "%s", arg2@user' -f config 432 Trace the open syscall and print the filename being opened filtered by "config" 433trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' 434 Trace the read syscall and print a message for reads >20000 bytes 435trace 'r::do_sys_open "%llx", retval' 436 Trace the return from the open syscall and print the return value 437trace 'c:open (arg2 == 42) "%s %d", arg1, arg2' 438 Trace the open() call from libc only if the flags (arg2) argument is 42 439trace 'c:malloc "size = %d", arg1' 440 Trace malloc calls and print the size being allocated 441trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3' 442 Trace the write() call from libc to monitor writes to STDOUT 443trace 'r::__kmalloc (retval == 0) "kmalloc failed!"' 444 Trace returns from __kmalloc which returned a null pointer 445trace 'r:c:malloc (retval) "allocated = %x", retval' 446 Trace returns from malloc and print non-NULL allocated buffers 447trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' 448 Trace the block_rq_complete kernel tracepoint and print # of tx sectors 449trace 'u:pthread:pthread_create (arg4 != 0)' 450 Trace the USDT probe pthread_create when its 4th argument is non-zero 451trace 'u:pthread:libpthread:pthread_create (arg4 != 0)' 452 Ditto, but the provider name "libpthread" is specified. 453trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' 454 Trace the nanosleep syscall and print the sleep duration in ns 455trace -c /sys/fs/cgroup/system.slice/workload.service '__x64_sys_nanosleep' '__x64_sys_clone' 456 Trace nanosleep/clone syscall calls only under workload.service 457 cgroup hierarchy. 458trace -I 'linux/fs.h' \ 459 'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops' 460 Trace the uprobe_register inode mapping ops, and the symbol can be found 461 in /proc/kallsyms 462trace -I 'kernel/sched/sched.h' \ 463 'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining' 464 Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined 465 in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel 466 package. So this command needs to run at the kernel source tree root directory 467 so that the added header file can be found by the compiler. 468trace -I 'net/sock.h' \ 469 'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)' 470 Trace udpv6 sendmsg calls only if socket's destination port is equal 471 to 53 (DNS; 13568 in big endian order) 472trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users' 473 Trace the number of users accessing the file system of the current task 474trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U 475 Trace inet_pton system call and use the specified libraries/executables for 476 symbol resolution. 477