3.1.2.2. Using perf to do Basic Tracing

Profiling is a great tool for solving many problems or for getting a high-level view of what's going on with a workload or across the system. It is however by definition an approximation, as suggested by the most prominent word associated with it, 'sampling'. On the one hand, it allows a representative picture of what's going on in the system to be cheaply taken, but on the other hand, that cheapness limits its utility when that data suggests a need to 'dive down' more deeply to discover what's really going on. In such cases, the only way to see what's really going on is to be able to look at (or summarize more intelligently) the individual steps that go into the higher-level behavior exposed by the coarse-grained profiling data.

As a concrete example, we can trace all the events we think might be applicable to our workload:

     root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:*
      -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write
      wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
                

We can look at the raw trace output using 'perf script' with no arguments:

     root@crownbay:~# perf script

           perf  1262 [000] 11624.857082: sys_exit_read: 0x0
           perf  1262 [000] 11624.857193: sched_wakeup: comm=migration/0 pid=6 prio=0 success=1 target_cpu=000
           wget  1262 [001] 11624.858021: softirq_raise: vec=1 [action=TIMER]
           wget  1262 [001] 11624.858074: softirq_entry: vec=1 [action=TIMER]
           wget  1262 [001] 11624.858081: softirq_exit: vec=1 [action=TIMER]
           wget  1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200
           wget  1262 [001] 11624.858177: sys_exit_read: 0x200
           wget  1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308
           wget  1262 [001] 11624.858945: kfree_skb: skbaddr=0xeb248000 protocol=0 location=0xc15a5308
           wget  1262 [001] 11624.859020: softirq_raise: vec=1 [action=TIMER]
           wget  1262 [001] 11624.859076: softirq_entry: vec=1 [action=TIMER]
           wget  1262 [001] 11624.859083: softirq_exit: vec=1 [action=TIMER]
           wget  1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
           wget  1262 [001] 11624.859192: sys_exit_read: 0x1d7
           wget  1262 [001] 11624.859228: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
           wget  1262 [001] 11624.859233: sys_exit_read: 0x0
           wget  1262 [001] 11624.859573: sys_enter_read: fd: 0x0003, buf: 0xbf82c580, count: 0x0200
           wget  1262 [001] 11624.859584: sys_exit_read: 0x200
           wget  1262 [001] 11624.859864: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
           wget  1262 [001] 11624.859888: sys_exit_read: 0x400
           wget  1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
           wget  1262 [001] 11624.859944: sys_exit_read: 0x400
                

This gives us a detailed timestamped sequence of events that occurred within the workload with respect to those events.

In many ways, profiling can be viewed as a subset of tracing - theoretically, if you have a set of trace events that's sufficient to capture all the important aspects of a workload, you can derive any of the results or views that a profiling run can.

Another aspect of traditional profiling is that while powerful in many ways, it's limited by the granularity of the underlying data. Profiling tools offer various ways of sorting and presenting the sample data, which make it much more useful and amenable to user experimentation, but in the end it can't be used in an open-ended way to extract data that just isn't present as a consequence of the fact that conceptually, most of it has been thrown away.

Full-blown detailed tracing data does however offer the opportunity to manipulate and present the information collected during a tracing run in an infinite variety of ways.

Another way to look at it is that there are only so many ways that the 'primitive' counters can be used on their own to generate interesting output; to get anything more complicated than simple counts requires some amount of additional logic, which is typically very specific to the problem at hand. For example, if we wanted to make use of a 'counter' that maps to the value of the time difference between when a process was scheduled to run on a processor and the time it actually ran, we wouldn't expect such a counter to exist on its own, but we could derive one called say 'wakeup_latency' and use it to extract a useful view of that metric from trace data. Likewise, we really can't figure out from standard profiling tools how much data every process on the system reads and writes, along with how many of those reads and writes fail completely. If we have sufficient trace data, however, we could with the right tools easily extract and present that information, but we'd need something other than pre-canned profiling tools to do that.

Luckily, there is a general-purpose way to handle such needs, called 'programming languages'. Making programming languages easily available to apply to such problems given the specific format of data is called a 'programming language binding' for that data and language. Perf supports two programming language bindings, one for Python and one for Perl.

Tying it Together: Language bindings for manipulating and aggregating trace data are of course not a new idea. One of the first projects to do this was IBM's DProbes dpcc compiler, an ANSI C compiler which targeted a low-level assembly language running on an in-kernel interpreter on the target system. This is exactly analagous to what Sun's DTrace did, except that DTrace invented its own language for the purpose. Systemtap, heavily inspired by DTrace, also created its own one-off language, but rather than running the product on an in-kernel interpreter, created an elaborate compiler-based machinery to translate its language into kernel modules written in C.

Now that we have the trace data in perf.data, we can use 'perf script -g' to generate a skeleton script with handlers for the read/write entry/exit events we recorded:

     root@crownbay:~# perf script -g python
     generated Python script: perf-script.py
                

The skeleton script simply creates a python function for each event type in the perf.data file. The body of each function simply prints the event name along with its parameters. For example:

     def net__netif_rx(event_name, context, common_cpu,
            common_secs, common_nsecs, common_pid, common_comm,
            skbaddr, len, name):
                    print_header(event_name, common_cpu, common_secs, common_nsecs,
                            common_pid, common_comm)

		     print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name),
                

We can run that script directly to print all of the events contained in the perf.data file:

     root@crownbay:~# perf script -s perf-script.py

     in trace_begin
     syscalls__sys_exit_read     0 11624.857082795     1262 perf                  nr=3, ret=0
     sched__sched_wakeup      0 11624.857193498     1262 perf                  comm=migration/0, pid=6, prio=0,      success=1, target_cpu=0
     irq__softirq_raise       1 11624.858021635     1262 wget                  vec=TIMER
     irq__softirq_entry       1 11624.858074075     1262 wget                  vec=TIMER
     irq__softirq_exit        1 11624.858081389     1262 wget                  vec=TIMER
     syscalls__sys_enter_read     1 11624.858166434     1262 wget                  nr=3, fd=3, buf=3213019456,      count=512
     syscalls__sys_exit_read     1 11624.858177924     1262 wget                  nr=3, ret=512
     skb__kfree_skb           1 11624.858878188     1262 wget                  skbaddr=3945041280,           location=3243922184, protocol=0
     skb__kfree_skb           1 11624.858945608     1262 wget                  skbaddr=3945037824,      location=3243922184, protocol=0
     irq__softirq_raise       1 11624.859020942     1262 wget                  vec=TIMER
     irq__softirq_entry       1 11624.859076935     1262 wget                  vec=TIMER
     irq__softirq_exit        1 11624.859083469     1262 wget                  vec=TIMER
     syscalls__sys_enter_read     1 11624.859167565     1262 wget                  nr=3, fd=3, buf=3077701632,      count=1024
     syscalls__sys_exit_read     1 11624.859192533     1262 wget                  nr=3, ret=471
     syscalls__sys_enter_read     1 11624.859228072     1262 wget                  nr=3, fd=3, buf=3077701632,      count=1024
     syscalls__sys_exit_read     1 11624.859233707     1262 wget                  nr=3, ret=0
     syscalls__sys_enter_read     1 11624.859573008     1262 wget                  nr=3, fd=3, buf=3213018496,      count=512
     syscalls__sys_exit_read     1 11624.859584818     1262 wget                  nr=3, ret=512
     syscalls__sys_enter_read     1 11624.859864562     1262 wget                  nr=3, fd=3, buf=3077701632,      count=1024
     syscalls__sys_exit_read     1 11624.859888770     1262 wget                  nr=3, ret=1024
     syscalls__sys_enter_read     1 11624.859935140     1262 wget                  nr=3, fd=3, buf=3077701632,      count=1024
     syscalls__sys_exit_read     1 11624.859944032     1262 wget                  nr=3, ret=1024
                

That in itself isn't very useful; after all, we can accomplish pretty much the same thing by simply running 'perf script' without arguments in the same directory as the perf.data file.

We can however replace the print statements in the generated function bodies with whatever we want, and thereby make it infinitely more useful.

As a simple example, let's just replace the print statements in the function bodies with a simple function that does nothing but increment a per-event count. When the program is run against a perf.data file, each time a particular event is encountered, a tally is incremented for that event. For example:

     def net__netif_rx(event_name, context, common_cpu,
            common_secs, common_nsecs, common_pid, common_comm,
            skbaddr, len, name):
		          inc_counts(event_name)
                

Each event handler function in the generated code is modified to do this. For convenience, we define a common function called inc_counts() that each handler calls; inc_counts() simply tallies a count for each event using the 'counts' hash, which is a specialized hash function that does Perl-like autovivification, a capability that's extremely useful for kinds of multi-level aggregation commonly used in processing traces (see perf's documentation on the Python language binding for details):

     counts = autodict()

     def inc_counts(event_name):
            try:
                    counts[event_name] += 1
            except TypeError:
                    counts[event_name] = 1
                

Finally, at the end of the trace processing run, we want to print the result of all the per-event tallies. For that, we use the special 'trace_end()' function:

     def trace_end():
            for event_name, count in counts.iteritems():
                    print "%-40s %10s\n" % (event_name, count)
                

The end result is a summary of all the events recorded in the trace:

     skb__skb_copy_datagram_iovec                  13148
     irq__softirq_entry                             4796
     irq__irq_handler_exit                          3805
     irq__softirq_exit                              4795
     syscalls__sys_enter_write                      8990
     net__net_dev_xmit                               652
     skb__kfree_skb                                 4047
     sched__sched_wakeup                            1155
     irq__irq_handler_entry                         3804
     irq__softirq_raise                             4799
     net__net_dev_queue                              652
     syscalls__sys_enter_read                      17599
     net__netif_receive_skb                         1743
     syscalls__sys_exit_read                       17598
     net__netif_rx                                     2
     napi__napi_poll                                1877
     syscalls__sys_exit_write                       8990
                

Note that this is pretty much exactly the same information we get from 'perf stat', which goes a little way to support the idea mentioned previously that given the right kind of trace data, higher-level profiling-type summaries can be derived from it.

Documentation on using the 'perf script' python binding.