xref: /aosp_15_r20/external/bcc/tools/trace_example.txt (revision 387f9dfdfa2baef462e92476d413c7bc2470293e)
1Demonstrations of trace.
2
3
4trace probes functions you specify and displays trace messages if a particular
5condition is met. You can control the message format to display function
6arguments and return values.
7
8For example, suppose you want to trace all commands being exec'd across the
9system:
10
11# trace 'sys_execve "%s", arg1'
12PID    COMM         FUNC             -
134402   bash         sys_execve       /usr/bin/man
144411   man          sys_execve       /usr/local/bin/less
154411   man          sys_execve       /usr/bin/less
164410   man          sys_execve       /usr/local/bin/nroff
174410   man          sys_execve       /usr/bin/nroff
184409   man          sys_execve       /usr/local/bin/tbl
194409   man          sys_execve       /usr/bin/tbl
204408   man          sys_execve       /usr/local/bin/preconv
214408   man          sys_execve       /usr/bin/preconv
224415   nroff        sys_execve       /usr/bin/locale
234416   nroff        sys_execve       /usr/bin/groff
244418   groff        sys_execve       /usr/bin/grotty
254417   groff        sys_execve       /usr/bin/troff
26^C
27
28The ::sys_execve syntax specifies that you want an entry probe (which is the
29default), in a kernel function (which is the default) called sys_execve. Next,
30the format string to print is simply "%s", which prints a string. Finally, the
31value to print is the first argument to the sys_execve function, which happens
32to be the command that is exec'd. The above trace was generated by executing
33"man ls" in a separate shell. As you see, man executes a number of additional
34programs to finally display the man page.
35
36Next, suppose you are looking for large reads across the system. Let's trace
37the read system call and inspect the third argument, which is the number of
38bytes to be read:
39
40# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
41PID    COMM         FUNC             -
424490   dd           sys_read         read 1048576 bytes
434490   dd           sys_read         read 1048576 bytes
444490   dd           sys_read         read 1048576 bytes
454490   dd           sys_read         read 1048576 bytes
46^C
47
48During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4".
49The individual reads are visible, with the custom format message printed for
50each read. The parenthesized expression "(arg3 > 20000)" is a filter that is
51evaluated for each invocation of the probe before printing anything.
52
53Event message filter is useful while you only interesting the specific event.
54Like the program open thousands file and you only want to see the "temp" file
55and print stack.
56
57# trace 'do_sys_open "%s", arg2@user' -UK -f temp
58PID     TID     COMM            FUNC             -
599557    9557    a.out           do_sys_open      temp.1
60        do_sys_open+0x1 [kernel]
61        do_syscall_64+0x5b [kernel]
62        entry_SYSCALL_64_after_hwframe+0x44 [kernel]
63        __open_nocancel+0x7 [libc-2.17.so]
64        __libc_start_main+0xf5 [libc-2.17.so]
659558    9558    a.out           do_sys_open      temp.2
66        do_sys_open+0x1 [kernel]
67        do_syscall_64+0x5b [kernel]
68        entry_SYSCALL_64_after_hwframe+0x44 [kernel]
69        __open_nocancel+0x7 [libc-2.17.so]
70        __libc_start_main+0xf5 [libc-2.17.so]
71
72Process name filter is porting from tools/opensnoop
73
74# trace 'do_sys_open "%s", arg2@user' -UK -n out
75PID     TID     COMM            FUNC             -
769557    9557    a.out           do_sys_open      temp.1
77        do_sys_open+0x1 [kernel]
78        do_syscall_64+0x5b [kernel]
79        entry_SYSCALL_64_after_hwframe+0x44 [kernel]
80        __open_nocancel+0x7 [libc-2.17.so]
81        __libc_start_main+0xf5 [libc-2.17.so]
82
83You can also trace user functions. For example, let's simulate the bashreadline
84script, which attaches to the readline function in bash and prints its return
85value, effectively snooping all bash shell input across the system:
86
87# trace 'r:bash:readline "%s", retval'
88PID    COMM         FUNC             -
892740   bash         readline         echo hi!
902740   bash         readline         man ls
91^C
92
93The special retval keyword stands for the function's return value, and can
94be used only in a retprobe, specified by the 'r' prefix. The next component
95of the probe is the library that contains the desired function. It's OK to
96specify executables too, as long as they can be found in the PATH. Or, you
97can specify the full path to the executable (e.g. "/usr/bin/bash").
98
99Sometimes it can be useful to see where in code the events happen. There are
100flags to print the kernel stack (-K), the user stack (-U) and optionally
101include the virtual address in the stacks as well (-a):
102
103# trace.py -U -a 'r::sys_futex "%d", retval'
104PID     TID     COMM            FUNC             -
105793922  793951  poller          sys_futex        0
106        7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so]
107              627fef folly::FunctionScheduler::run()+0x46f [router]
108        7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21]
109        7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so]
110        7f6c7223fa7d clone+0x6d [libc-2.23.so]
111
112Multiple probes can be combined on the same command line. For example, let's
113trace failed read and write calls on the libc level, and include a time column:
114
115# trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \
116        'r:c:write ((int)retval < 0) "write failed: %d", retval' -T
117TIME     PID    COMM         FUNC             -
11805:31:57 3388   bash         write            write failed: -1
11905:32:00 3388   bash         write            write failed: -1
120^C
121
122Note that the retval variable must be cast to int before comparing to zero.
123The reason is that the default type for argN and retval is an unsigned 64-bit
124integer, which can never be smaller than 0.
125
126trace has also some basic support for kernel tracepoints. For example, let's
127trace the block:block_rq_complete tracepoint and print out the number of sectors
128transferred:
129
130# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T
131TIME     PID    COMM         FUNC             -
13201:23:51 0      swapper/0    block_rq_complete sectors=8
13301:23:55 10017  kworker/u64: block_rq_complete sectors=1
13401:23:55 0      swapper/0    block_rq_complete sectors=8
135^C
136
137Suppose that you want to trace a system-call in a short-lived process, you can use
138the -s option to trace. The option is followed by list of libraries/executables to
139use for symbol resolution.
140# trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
141Note: Kernel bpf will report stack map with ip/build_id
142PID     TID     COMM            FUNC
1434175    4175    ping            inet_pton
144        inet_pton+0x136340 [libc.so.6]
145        getaddrinfo+0xfb510 [libc.so.6]
146        _init+0x2a08 [ping]
147
148During the trace, 'ping -c1 google.com' was executed to obtain the above results
149
150To discover the tracepoint structure format (which you can refer to as the "args"
151pointer variable), use the tplist tool. For example:
152
153# tplist -v block:block_rq_complete
154block:block_rq_complete
155    dev_t dev;
156    sector_t sector;
157    unsigned int nr_sector;
158    int errors;
159    char rwbs[8];
160
161This output tells you that you can use "args->dev", "args->sector", etc. in your
162predicate and trace arguments.
163
164
165More and more high-level libraries are instrumented with USDT probe support.
166These probes can be traced by trace just like kernel tracepoints. For example,
167trace new threads being created and their function name, include time column
168and on which CPU it happened:
169
170# trace 'u:pthread:pthread_create "%U", arg3' -T -C
171TIME     CPU PID     TID     COMM            FUNC             -
17213:22:01 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
17313:22:01 5   21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
17413:22:03 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
17513:22:04 15  21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
17613:22:07 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
17713:22:07 4   21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
178^C
179
180The "%U" format specifier tells trace to resolve arg3 as a user-space symbol,
181if possible. Similarly, use "%K" for kernel symbols.
182
183Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's
184trace Ruby methods being called (this requires a version of Ruby built with
185the --enable-dtrace configure flag):
186
187# trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T
188TIME     PID    COMM         FUNC             -
18912:08:43 18420  irb          method__entry    IRB::Context.verbose?
19012:08:43 18420  irb          method__entry    RubyLex.ungetc
19112:08:43 18420  irb          method__entry    RuxyLex.debug?
192^C
193
194In the previous invocation, arg1 and arg2 are the class name and method name
195for the Ruby method being invoked.
196
197You can also trace exported functions from shared libraries, or an imported
198function on the actual executable:
199
200# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
201# tput -V
202
203PID    TID    COMM         FUNC             -
20421720  21720  tput         curses_version   Version=ncurses 6.0.20160709
205^C
206
207
208Occasionally, it can be useful to filter specific strings. For example, you
209might be interested in open() calls that open a specific file:
210
211# trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T
212TIME     PID    COMM         FUNC             -
21301:43:15 10938  cat          open             opening test.txt
21401:43:20 10939  cat          open             opening test.txt
215^C
216
217
218In the preceding example, as well as in many others, readability may be
219improved by providing the function's signature, which names the arguments and
220lets you access structure sub-fields, which is hard with the "arg1", "arg2"
221convention. For example:
222
223# trace 'p:c:open(char *filename) "opening %s", filename'
224PID    TID    COMM         FUNC             -
22517507  17507  cat          open             opening FAQ.txt
226^C
227
228# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
229PID    TID    COMM         FUNC             -
230777    785    automount    SyS_nanosleep    sleep for 500000000 ns
231777    785    automount    SyS_nanosleep    sleep for 500000000 ns
232777    785    automount    SyS_nanosleep    sleep for 500000000 ns
233777    785    automount    SyS_nanosleep    sleep for 500000000 ns
234^C
235
236Remember to use the -I argument include the appropriate header file. We didn't
237need to do that here because `struct timespec` is used internally by the tool,
238so it always includes this header file.
239
240To aggregate amount of trace, you need specify -A with -M EVENTS. A typical
241example:
2421, if we find that the sys CPU utilization is higher by 'top' command
2432, then find that the timer interrupt is more normal by 'irqtop' command
2443, to confirm kernel timer setting frequence by 'funccount -i 1 clockevents_program_event'
2454, to trace timer setting by 'trace clockevents_program_event -K -A -M 1000'
246
2471294576 1294584 CPU 0/KVM       clockevents_program_event
248        clockevents_program_event+0x1 [kernel]
249        hrtimer_start_range_ns+0x209 [kernel]
250        start_sw_timer+0x173 [kvm]
251        restart_apic_timer+0x6c [kvm]
252        kvm_set_msr_common+0x442 [kvm]
253        __kvm_set_msr+0xa2 [kvm]
254        kvm_emulate_wrmsr+0x36 [kvm]
255        vcpu_enter_guest+0x326 [kvm]
256        kvm_arch_vcpu_ioctl_run+0xcc [kvm]
257        kvm_vcpu_ioctl+0x22f [kvm]
258        do_vfs_ioctl+0xa1 [kernel]
259        ksys_ioctl+0x60 [kernel]
260        __x64_sys_ioctl+0x16 [kernel]
261        do_syscall_64+0x59 [kernel]
262        entry_SYSCALL_64_after_hwframe+0x44 [kernel]
263-->COUNT 271
264...
265So we can know that 271 timer setting in recent 1000(~27%).
266
267As a final example, let's trace open syscalls for a specific process. By
268default, tracing is system-wide, but the -p switch overrides this:
269
270# trace -p 2740 'do_sys_open "%s", arg2@user' -T
271TIME     PID    COMM         FUNC             -
27205:36:16 15872  ls           do_sys_open      /etc/ld.so.cache
27305:36:16 15872  ls           do_sys_open      /lib64/libselinux.so.1
27405:36:16 15872  ls           do_sys_open      /lib64/libcap.so.2
27505:36:16 15872  ls           do_sys_open      /lib64/libacl.so.1
27605:36:16 15872  ls           do_sys_open      /lib64/libc.so.6
27705:36:16 15872  ls           do_sys_open      /lib64/libpcre.so.1
27805:36:16 15872  ls           do_sys_open      /lib64/libdl.so.2
27905:36:16 15872  ls           do_sys_open      /lib64/libattr.so.1
28005:36:16 15872  ls           do_sys_open      /lib64/libpthread.so.0
28105:36:16 15872  ls           do_sys_open      /usr/lib/locale/locale-archive
28205:36:16 15872  ls           do_sys_open      /home/vagrant
283^C
284
285In this example, we traced the "ls ~" command as it was opening its shared
286libraries and then accessing the /home/vagrant directory listing.
287
288
289Lastly, if a high-frequency event is traced you may overflow the perf ring
290buffer. This shows as "Lost N samples":
291
292# trace sys_open
2935087   5087   pgrep        sys_open
2945087   5087   pgrep        sys_open
2955087   5087   pgrep        sys_open
2965087   5087   pgrep        sys_open
2975087   5087   pgrep        sys_open
298Lost 764896 samples
299Lost 764896 samples
300Lost 764896 samples
301
302The perf ring buffer size can be changed with -b. The unit is size per-CPU buffer
303size and is measured in pages. The value must be a power of two and defaults to
30464 pages.
305
306# trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp
307PID     TID     COMM            FUNC             -
3081855611 1863183 worker          sys_setsockopt   found
309
310In this example we are catching setsockopt syscall to change IPv4 IP_TOS
311value only for the cases where new TOS value is equal to 108. we are using
312STRCMP helper in binary mode (--bin_cmp flag) to compare optval array
313against int value of 108 (parametr of setsockopt call) in hex representation
314(little endian format)
315
316For advanced users there is a possibility to insert the kprobes or uprobes
317after a certain offset, rather than the start of the function call
318This is useful for tracing register values at different places of the
319execution of a function. Lets consider the following example:
320
321int main()
322{
323	int val = 0xdead;
324	printf("%d\n", val);
325	val = 0xbeef;
326	printf("%d\n", val);
327}
328
329After compiling the code with -O3 optimization the object code looks
330like the following (with GCC 10 and x86_64 architecture):
331
332objdump --disassemble=main --prefix-addresses a.out
333
3340000000000001060 <main> endbr64
3350000000000001064 <main+0x4> sub    $0x8,%rsp
3360000000000001068 <main+0x8> mov    $0xdead,%edx
337000000000000106d <main+0xd> mov    $0x1,%edi
3380000000000001072 <main+0x12> xor    %eax,%eax
3390000000000001074 <main+0x14> lea    0xf89(%rip),%rsi
340000000000000107b <main+0x1b> callq  0000000000001050 <__printf_chk@plt>
3410000000000001080 <main+0x20> mov    $0xbeef,%edx
3420000000000001085 <main+0x25> lea    0xf78(%rip),%rsi
343000000000000108c <main+0x2c> xor    %eax,%eax
344000000000000108e <main+0x2e> mov    $0x1,%edi
3450000000000001093 <main+0x33> callq  0000000000001050 <__printf_chk@plt>
3460000000000001098 <main+0x38> xor    %eax,%eax
347000000000000109a <main+0x3a> add    $0x8,%rsp
348000000000000109e <main+0x3e> retq
349
350The 0xdead and later the 0xbeef values are moved into the edx register.
351As the disassembly shows the edx register contains the 0xdead value
352after the 0xd offset and 0xbeef after the 0x25 offset. To verify this
353with trace lets insert probes to those offsets. The following
354command inserts two uprobe one after the 0xd offset and another one
355after the 0x25 offset of the main function. The probe print the
356value of the edx register which will show us the correct values.
357
358trace 'p:/tmp/a.out:main+0xd "%x", ctx->dx' 'p:/tmp/a.out:main+0x25 "%x", ctx->dx'
359PID     TID     COMM            FUNC             -
36025754   25754   a.out           main             dead
36125754   25754   a.out           main             beef
362
363
364USAGE message:
365
366usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [--uid UID] [-v]
367             [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C]
368             [-c CGROUP_PATH] [-n NAME] [-f MSG_FILTER] [-B]
369             [-s SYM_FILE_LIST] [-K] [-U] [-a] [-I header]
370             probe [probe ...]
371
372Attach to functions and print trace messages.
373
374positional arguments:
375  probe                 probe specifier (see examples)
376
377optional arguments:
378  -h, --help            show this help message and exit
379  -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
380                        number of pages to use for perf_events ring buffer
381                        (default: 64)
382  -p PID, --pid PID     id of the process to trace (optional)
383  -L TID, --tid TID     id of the thread to trace (optional)
384  --uid UID             id of the user to trace (optional)
385  -v, --verbose         print resulting BPF program code before executing
386  -Z STRING_SIZE, --string-size STRING_SIZE
387                        maximum size to read from strings
388  -S, --include-self    do not filter trace's own pid from the trace
389  -M MAX_EVENTS, --max-events MAX_EVENTS
390                        number of events to print before quitting
391  -t, --timestamp       print timestamp column (offset from trace start)
392  -u, --unix-timestamp  print UNIX timestamp instead of offset from trace
393                        start, requires -t
394  -T, --time            print time column
395  -C, --print_cpu       print CPU id
396  -c CGROUP_PATH, --cgroup-path CGROUP_PATH
397                        cgroup path
398  -n NAME, --name NAME  only print process names containing this name
399  -f MSG_FILTER, --msg-filter MSG_FILTER
400                        only print the msg of event containing this string
401  -B, --bin_cmp         allow to use STRCMP with binary values
402  -s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST
403                        comma separated list of symbol files to use for symbol
404                        resolution
405  -K, --kernel-stack    output kernel stack trace
406  -U, --user-stack      output user stack trace
407  -a, --address         print virtual address in stacks
408  -I header, --include header
409                        additional header files to include in the BPF program
410                        as either full path, or relative to current working
411                        directory, or relative to default kernel header search
412                        path
413  -A, --aggregate       aggregate amount of each trace
414
415EXAMPLES:
416
417trace do_sys_open
418        Trace the open syscall and print a default trace message when entered
419trace kfree_skb+0x12
420        Trace the kfree_skb kernel function after the instruction on the 0x12 offset
421trace 'do_sys_open "%s", arg2@user'
422        Trace the open syscall and print the filename being opened @user is
423        added to arg2 in kprobes to ensure that char * should be copied from
424        the userspace stack to the bpf stack. If not specified, previous
425        behaviour is expected.
426
427trace 'do_sys_open "%s", arg2@user' -n main
428        Trace the open syscall and only print event that process names containing "main"
429trace 'do_sys_open "%s", arg2@user' --uid 1001
430        Trace the open syscall and only print event that processes with user ID 1001
431trace 'do_sys_open "%s", arg2@user' -f config
432        Trace the open syscall and print the filename being opened filtered by "config"
433trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
434        Trace the read syscall and print a message for reads >20000 bytes
435trace 'r::do_sys_open "%llx", retval'
436        Trace the return from the open syscall and print the return value
437trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
438        Trace the open() call from libc only if the flags (arg2) argument is 42
439trace 'c:malloc "size = %d", arg1'
440        Trace malloc calls and print the size being allocated
441trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
442        Trace the write() call from libc to monitor writes to STDOUT
443trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
444        Trace returns from __kmalloc which returned a null pointer
445trace 'r:c:malloc (retval) "allocated = %x", retval'
446        Trace returns from malloc and print non-NULL allocated buffers
447trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
448        Trace the block_rq_complete kernel tracepoint and print # of tx sectors
449trace 'u:pthread:pthread_create (arg4 != 0)'
450        Trace the USDT probe pthread_create when its 4th argument is non-zero
451trace 'u:pthread:libpthread:pthread_create (arg4 != 0)'
452        Ditto, but the provider name "libpthread" is specified.
453trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
454        Trace the nanosleep syscall and print the sleep duration in ns
455trace -c /sys/fs/cgroup/system.slice/workload.service '__x64_sys_nanosleep' '__x64_sys_clone'
456        Trace nanosleep/clone syscall calls only under workload.service
457        cgroup hierarchy.
458trace -I 'linux/fs.h' \
459      'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
460        Trace the uprobe_register inode mapping ops, and the symbol can be found
461        in /proc/kallsyms
462trace -I 'kernel/sched/sched.h' \
463      'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
464        Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
465        in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel
466        package.  So this command needs to run at the kernel source tree root directory
467        so that the added header file can be found by the compiler.
468trace -I 'net/sock.h' \
469      'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
470        Trace udpv6 sendmsg calls only if socket's destination port is equal
471        to 53 (DNS; 13568 in big endian order)
472trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
473        Trace the number of users accessing the file system of the current task
474trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
475        Trace inet_pton system call and use the specified libraries/executables for
476        symbol resolution.
477