Profiling is a great tool for solving many problems or for getting a high-level view of what's going on with a workload or across the system. It is however by definition an approximation, as suggested by the most prominent word associated with it, 'sampling'. On the one hand, it allows a representative picture of what's going on in the system to be cheaply taken, but on the other hand, that cheapness limits its utility when that data suggests a need to 'dive down' more deeply to discover what's really going on. In such cases, the only way to see what's really going on is to be able to look at (or summarize more intelligently) the individual steps that go into the higher-level behavior exposed by the coarse-grained profiling data.
As a concrete example, we can trace all the events we think might be applicable to our workload:
root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:* -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
We can look at the raw trace output using 'perf script' with no arguments:
root@crownbay:~# perf script perf 1262 [000] 11624.857082: sys_exit_read: 0x0 perf 1262 [000] 11624.857193: sched_wakeup: comm=migration/0 pid=6 prio=0 success=1 target_cpu=000 wget 1262 [001] 11624.858021: softirq_raise: vec=1 [action=TIMER] wget 1262 [001] 11624.858074: softirq_entry: vec=1 [action=TIMER] wget 1262 [001] 11624.858081: softirq_exit: vec=1 [action=TIMER] wget 1262 [001] 11624.858166: sys_enter_read: fd: 0x0003, buf: 0xbf82c940, count: 0x0200 wget 1262 [001] 11624.858177: sys_exit_read: 0x200 wget 1262 [001] 11624.858878: kfree_skb: skbaddr=0xeb248d80 protocol=0 location=0xc15a5308 wget 1262 [001] 11624.858945: kfree_skb: skbaddr=0xeb248000 protocol=0 location=0xc15a5308 wget 1262 [001] 11624.859020: softirq_raise: vec=1 [action=TIMER] wget 1262 [001] 11624.859076: softirq_entry: vec=1 [action=TIMER] wget 1262 [001] 11624.859083: softirq_exit: vec=1 [action=TIMER] wget 1262 [001] 11624.859167: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 wget 1262 [001] 11624.859192: sys_exit_read: 0x1d7 wget 1262 [001] 11624.859228: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 wget 1262 [001] 11624.859233: sys_exit_read: 0x0 wget 1262 [001] 11624.859573: sys_enter_read: fd: 0x0003, buf: 0xbf82c580, count: 0x0200 wget 1262 [001] 11624.859584: sys_exit_read: 0x200 wget 1262 [001] 11624.859864: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 wget 1262 [001] 11624.859888: sys_exit_read: 0x400 wget 1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400 wget 1262 [001] 11624.859944: sys_exit_read: 0x400
This gives us a detailed timestamped sequence of events that occurred within the workload with respect to those events.
In many ways, profiling can be viewed as a subset of tracing - theoretically, if you have a set of trace events that's sufficient to capture all the important aspects of a workload, you can derive any of the results or views that a profiling run can.
Another aspect of traditional profiling is that while powerful in many ways, it's limited by the granularity of the underlying data. Profiling tools offer various ways of sorting and presenting the sample data, which make it much more useful and amenable to user experimentation, but in the end it can't be used in an open-ended way to extract data that just isn't present as a consequence of the fact that conceptually, most of it has been thrown away.
Full-blown detailed tracing data does however offer the opportunity to manipulate and present the information collected during a tracing run in an infinite variety of ways.
Another way to look at it is that there are only so many ways that the 'primitive' counters can be used on their own to generate interesting output; to get anything more complicated than simple counts requires some amount of additional logic, which is typically very specific to the problem at hand. For example, if we wanted to make use of a 'counter' that maps to the value of the time difference between when a process was scheduled to run on a processor and the time it actually ran, we wouldn't expect such a counter to exist on its own, but we could derive one called say 'wakeup_latency' and use it to extract a useful view of that metric from trace data. Likewise, we really can't figure out from standard profiling tools how much data every process on the system reads and writes, along with how many of those reads and writes fail completely. If we have sufficient trace data, however, we could with the right tools easily extract and present that information, but we'd need something other than pre-canned profiling tools to do that.
Luckily, there is a general-purpose way to handle such needs, called 'programming languages'. Making programming languages easily available to apply to such problems given the specific format of data is called a 'programming language binding' for that data and language. Perf supports two programming language bindings, one for Python and one for Perl.
Now that we have the trace data in perf.data, we can use 'perf script -g' to generate a skeleton script with handlers for the read/write entry/exit events we recorded:
root@crownbay:~# perf script -g python generated Python script: perf-script.py
The skeleton script simply creates a python function for each event type in the perf.data file. The body of each function simply prints the event name along with its parameters. For example:
def net__netif_rx(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, skbaddr, len, name): print_header(event_name, common_cpu, common_secs, common_nsecs, common_pid, common_comm) print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name),
We can run that script directly to print all of the events contained in the perf.data file:
root@crownbay:~# perf script -s perf-script.py in trace_begin syscalls__sys_exit_read 0 11624.857082795 1262 perf nr=3, ret=0 sched__sched_wakeup 0 11624.857193498 1262 perf comm=migration/0, pid=6, prio=0, success=1, target_cpu=0 irq__softirq_raise 1 11624.858021635 1262 wget vec=TIMER irq__softirq_entry 1 11624.858074075 1262 wget vec=TIMER irq__softirq_exit 1 11624.858081389 1262 wget vec=TIMER syscalls__sys_enter_read 1 11624.858166434 1262 wget nr=3, fd=3, buf=3213019456, count=512 syscalls__sys_exit_read 1 11624.858177924 1262 wget nr=3, ret=512 skb__kfree_skb 1 11624.858878188 1262 wget skbaddr=3945041280, location=3243922184, protocol=0 skb__kfree_skb 1 11624.858945608 1262 wget skbaddr=3945037824, location=3243922184, protocol=0 irq__softirq_raise 1 11624.859020942 1262 wget vec=TIMER irq__softirq_entry 1 11624.859076935 1262 wget vec=TIMER irq__softirq_exit 1 11624.859083469 1262 wget vec=TIMER syscalls__sys_enter_read 1 11624.859167565 1262 wget nr=3, fd=3, buf=3077701632, count=1024 syscalls__sys_exit_read 1 11624.859192533 1262 wget nr=3, ret=471 syscalls__sys_enter_read 1 11624.859228072 1262 wget nr=3, fd=3, buf=3077701632, count=1024 syscalls__sys_exit_read 1 11624.859233707 1262 wget nr=3, ret=0 syscalls__sys_enter_read 1 11624.859573008 1262 wget nr=3, fd=3, buf=3213018496, count=512 syscalls__sys_exit_read 1 11624.859584818 1262 wget nr=3, ret=512 syscalls__sys_enter_read 1 11624.859864562 1262 wget nr=3, fd=3, buf=3077701632, count=1024 syscalls__sys_exit_read 1 11624.859888770 1262 wget nr=3, ret=1024 syscalls__sys_enter_read 1 11624.859935140 1262 wget nr=3, fd=3, buf=3077701632, count=1024 syscalls__sys_exit_read 1 11624.859944032 1262 wget nr=3, ret=1024
That in itself isn't very useful; after all, we can accomplish pretty much the same thing by simply running 'perf script' without arguments in the same directory as the perf.data file.
We can however replace the print statements in the generated function bodies with whatever we want, and thereby make it infinitely more useful.
As a simple example, let's just replace the print statements in the function bodies with a simple function that does nothing but increment a per-event count. When the program is run against a perf.data file, each time a particular event is encountered, a tally is incremented for that event. For example:
def net__netif_rx(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, skbaddr, len, name): inc_counts(event_name)
Each event handler function in the generated code is modified to do this. For convenience, we define a common function called inc_counts() that each handler calls; inc_counts() simply tallies a count for each event using the 'counts' hash, which is a specialized hash function that does Perl-like autovivification, a capability that's extremely useful for kinds of multi-level aggregation commonly used in processing traces (see perf's documentation on the Python language binding for details):
counts = autodict() def inc_counts(event_name): try: counts[event_name] += 1 except TypeError: counts[event_name] = 1
Finally, at the end of the trace processing run, we want to print the result of all the per-event tallies. For that, we use the special 'trace_end()' function:
def trace_end(): for event_name, count in counts.iteritems(): print "%-40s %10s\n" % (event_name, count)
The end result is a summary of all the events recorded in the trace:
skb__skb_copy_datagram_iovec 13148 irq__softirq_entry 4796 irq__irq_handler_exit 3805 irq__softirq_exit 4795 syscalls__sys_enter_write 8990 net__net_dev_xmit 652 skb__kfree_skb 4047 sched__sched_wakeup 1155 irq__irq_handler_entry 3804 irq__softirq_raise 4799 net__net_dev_queue 652 syscalls__sys_enter_read 17599 net__netif_receive_skb 1743 syscalls__sys_exit_read 17598 net__netif_rx 2 napi__napi_poll 1877 syscalls__sys_exit_write 8990
Note that this is pretty much exactly the same information we get from 'perf stat', which goes a little way to support the idea mentioned previously that given the right kind of trace data, higher-level profiling-type summaries can be derived from it.
Documentation on using the 'perf script' python binding.