dtrace.conf 2012 videos

Last month was dtrace.conf 2012, the 2nd DTrace unconference. This is a meetup of DTrace practitioners and developers, where we discuss the latest uses, developments and future directions of the technology. It was great to see old friends of DTrace, and to put new faces to names. See the video list for the sessions, which, for fun, includes all the attendees riding a carousel.

I gave a short talk on DTrace-based visualizations, which is on YouTube:

The visualizations I discussed included:

My talk was followed by a couple more on DTrace-based visualizations: Theo Schlossnagle on Visualizations and USDT, and Richard Elling.

The conference ran very well, and was attended by engineers working on DTrace with a variety of operating systems, including FreeBSD, Mac OS X, Sony PlayStation, and Linux. Thanks to all who helped, especially Deirdré Straughan for organizing it and videoing the sessions.

For more about the conference, see Adam’s wrap-up.

Posted on May 8, 2012 at 4:06 pm by Brendan Gregg
In: DTrace

Subsecond Offset Heat Maps

“Wow, that’s weird!” My subsecond offset visualization type looked great, but others found it weird and unfamiliar. I developed it for inclusion in Joyent’s Cloud Analytics tool for workload characterization. Given that it was so unfamiliar, I had some explaining to do.

Voxer, a company that makes a walkie-talkie application for smart phones, had been seeing a performance issue with their Riak database. The issue appeared to be related to TCP listen drops – when SYNs are dropped because the application can’t keep up with the accept() queue. Voxer has millions of users and is growing fast, so I expected to see Riak hit 100% CPU usage when these drops occurred. The subsecond offset heat map (top on the right) told a different story, which led to an operating system kernel fix.
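The drops themselves can be confirmed from the command line. As a minimal sketch for an illumos-based system like this one: netstat -s reports the cumulative tcpListenDrop counter, and the DTrace mib provider (an assumption that it is available here) can count drops per second:

# netstat -s | grep tcpListenDrop
# dtrace -n 'mib:::tcpListenDrop { @ = count(); }
    tick-1s { printa("listen drops/s: %@d\n", @); trunc(@); }'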

Weird but wonderful, this heat map helped solve a hard problem, and I was left with some interesting screenshots to help explain this visualization type.

In this post, I’ll explain subsecond offset heat maps using the Voxer issue as a case study, then show various other interesting examples from a production cloud environment. This environment is a single datacenter that includes 200 physical servers and thousands of OS instances. The heat maps are all generated by Joyent Cloud Analytics, which uses DTrace to fetch the data.

Description

The subsecond offset heat map puts time on two axes. The x-axis shows the passage of time, with each column representing one second. The y-axis shows the time within a second, spanning from 0.0s to 1.0s (time offsets). The z-axis (color) shows the count of samples or events, quantized into x- and y-axis ranges (“buckets”), with the color darkness reflecting the event count (darker == more). This relationship is shown to the right.

I previously explained the use of quantized heat maps in section 11 of Visualizing Device Utilization. I use them to show event latency as well.

Time on Two Axes

Heat maps aren’t the weird part. What’s weird is putting time on more than one axis. Stephen Wolfram recently posted The Personal Analytics of My Life, which included an amazing scatter plot (on the left). This has time on both the x- and y-axes. I’ve included it as it may be an easier example to grasp at first glance, before moving on to the subsecond offset heat maps.

His plot is at a much longer time scale: the x-axis shows days, and the y-axis shows the offset within a day. Using similar terminology, this could be called a “subday-offset” or “24hr-offset” scatter plot. Each point on his plot shows when Wolfram sent an email, revealing his sleeping habits as the white gap in the morning.

Scatter plots are limited in the density of the points they can display, and don’t compress the data set (x & y coordinates are kept for each event). Heat maps solve both issues, allowing them to scale, which is especially important for the cloud computing uses that follow. These use the subsecond offset scale, but other ranges are possible as well (minute-offset, hour-offset, day-offset).

That’s No Artifact

The screenshot at the top of this page (click any for full-res) used a subsecond offset heat map for CPU thread samples – showing when applications were on-CPU during the second. The sampling was at 99 Hertz across all CPUs, to minimize overhead (instead of, say, 1000 Hz), and to avoid lockstep (with any power-of-10 Hz task). These CPU samples are then quantized into the buckets seen as pixels.
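Cloud Analytics uses DTrace to collect this data; as a minimal single-server sketch of equivalent collection (the 10 ms bucket size is an assumption, to roughly match the pixel resolution):

# dtrace -n 'profile-99 {
    @[execname] = lquantize((walltimestamp % 1000000000) / 1000000, 0, 1000, 10); }
    tick-1s { printa(@); trunc(@); }'

Each per-second printout is one column of the heat map: a distribution of the subsecond offsets (in milliseconds) at which each application was sampled on-CPU during that second.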

The heat map revealed that CPU usage dropped at the same time as the TCP listen drops. I was expecting the opposite.

By selecting Riak (as “beam.smp”, the Erlang VM it uses) and “Isolate selected”, only Riak is shown:

Left of center shows two columns, each with about 40% of the offsets colored white. Assuming no sampling issue, it means that the Riak database was entirely off-CPU for hundreds of consecutive milliseconds. This is similar to the white gaps showing when Wolfram was asleep — except that we aren’t expecting the Riak database to take naps! This was so bizarre that I first thought that something was wrong with the instrumentation, and that the white gaps were an artifact.

Application threads normally spend time off-CPU when blocked on I/O or waiting for work. What’s odd here is that for so long the number of running Riak threads is zero, when normally it varies more quickly. And this event coincided with TCP listen drops.

The Shoe That Fits

In Cloud Analytics, heat maps can be clicked to reveal details at that point. I clicked inside the white gap, which revealed that a process called “zoneadmd” was running; isolating it:

This fits the white gap closely, and a similar relationship was observed at other times as well. This pointed suspicion at zoneadmd, which other observability tools had missed. Some tools sampled the running processes every few seconds or minutes, and usually missed the short-lived zoneadmd completely. Even watching every second was difficult: Riak’s CPU usage dropped for two seconds, at a different rate than zoneadmd consumed (Riak is multi-threaded, so it can consume more CPU in the same interval than the single-threaded zoneadmd). The subsecond offset heat map showed the clearest correlation: the duration of these events was similar, and the starting and ending points were nearby.

If zoneadmd was somehow blocking Riak from executing, it would explain the off-CPU gap and also the TCP listen drops – as Riak wouldn’t be running to accept the connections.

Kernel Fix

Investigation on the server using DTrace quickly found that Riak was getting blocked as it waited for an address space lock (as_lock) during mmap()/munmap() calls from its bitcask storage engine. That lock was being held by zoneadmd for hundreds of milliseconds (see the Artifacts section later for a longer description). zoneadmd enforces multi-tenant memory limits, and every couple of minutes checked the size of Riak. It did this via kernel calls which scan memory pages while holding as_lock. This scan took time, as Riak was tens of Gbytes in size.
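As a hypothetical starting point for this kind of investigation, the mmap()/munmap() latency seen by Riak can be measured directly with DTrace (the beam.smp name and the syscall focus are assumptions based on the finding above; the actual investigation went further, into as_lock itself):

# dtrace -n 'syscall::mmap:entry,syscall::munmap:entry /execname == "beam.smp"/ { self->ts = timestamp; }
    syscall::mmap:return,syscall::munmap:return /self->ts/ {
    @[probefunc] = quantize(timestamp - self->ts); self->ts = 0; }'

Outliers of hundreds of milliseconds in these distributions (which are in nanoseconds) would match the off-CPU gaps seen in the heat map.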

We found other applications exhibiting the same behavior, including Riak’s “memsup” memory monitor. All of these were blocking Riak, and with Riak off-CPU and unable to accept() connections, the TCP backlog queue often hit its limit, resulting in TCP listen drops (tcpListenDrop). Jerry Jelinek of Joyent has been fixing these codepaths via kernel changes.

Patterns

The previous heat map included a “Distribution details” box at the bottom, summarizing the quantized bucket that I clicked on. It shows that “zoneadmd” and “ipmitool” were running, each sampled twice in the range 743 – 763 ms (consistent with them being single threaded and sampled at 99 Hertz).


ipmitool and zabbix_agentd

To check whether ipmitool was an issue, I isolated its on-CPU usage and found that it often did not coincide with Riak off-CPU time. While checking this, I found an interesting pattern caused by zabbix_agentd. These are shown on the right: ipmitool is highlighted in yellow, and zabbix_agentd in red.

Just based on the heat map, it would appear that zabbix_agentd is a single thread (or process) that wakes up every second to perform a small amount of work. It then sleeps for an entire second. Repeat. This causes the rising diagonal line, whose slope depends on how long zabbix_agentd works before sleeping for the next full second: steeper slopes (approaching vertical) reflect more work performed before the next sleep.
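To see why this draws a diagonal, consider a minimal sketch of such a monitor (collect_stats is a hypothetical stand-in for whatever work the agent does):

# while :; do collect_stats; sleep 1; done

Since the sleep is measured from when the work finishes, each wakeup occurs one second plus the work time after the previous one, so the subsecond offset of the work creeps later by roughly the work time in every column – the diagonal.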

zabbix_agentd is part of the Zabbix monitoring software. If it is supposed to perform work roughly every second, then it should be ok. But if it is supposed to perform work exactly once a second, such as reading system counters to calculate the statistics it is monitoring, then there could be problems.

Cloud Scale

The previous examples showed CPU thread samples on a single server (each title included “predicated by server hostname == …”). Cloud Analytics can show these for the entire cloud – which may be hundreds of systems (virtualized operating system instances). I’ll show this with a different heat map type: instead of CPU thread samples, which show the CPU usage of applications, I’ll show subsecond offsets of system calls (syscalls), which paints a different picture – one better reflecting the I/O behavior. Tracing syscalls can also reveal more processes than sampling can, since sampling may miss short-lived events.
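As with the CPU samples, a minimal single-server sketch for collecting syscall subsecond offsets (bucket size again an assumption) could be:

# dtrace -n 'syscall:::entry {
    @[execname] = lquantize((walltimestamp % 1000000000) / 1000000, 0, 1000, 10); }
    tick-1s { printa(@); trunc(@); }'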

The two images that follow show subsecond offsets for syscalls across an entire datacenter (200 physical servers, thousands of OS instances). On the left are syscalls by “httpd” (Apache web server), and on the right are those by the “ls” command:


httpd

ls

Neither of these may be very surprising. The httpd syscalls will arrive at random times based on the client workload, and combining them for dozens of busy web servers results in a heat map with random color intensities (which have been enhanced due to the rank-based default color map).

Sometimes the heat maps are surprising. The next two show zeus.flipper (web load balancing software), on the left for the entire cloud, and on the right for a single server:


zeus.flipper

zeus.flipper (single)

The cloud-wide heat map does show that there is a pattern present, which has been isolated for a single server on the right. It appears that multiple threads are present: many waking up more than once a second (the two large bands), and others waking up every two, five, and ten seconds.

Cloud Wide vs Single Server

Here are some other examples comparing an entire cloud vs a single server (click for full screenshot). These are also syscall subsecond offsets:


node.js

node.js (single)

Java

Java (single)

Python

Python (single)

I’ve just shown one single-server example for node.js, Java, and Python; however, each server can look quite different based on its use and workload. Applications such as zeus.flipper are more likely to look similar, as they serve the same function on every server.

Cloud Identification Chart

Some other cloud-wide examples, using syscall subsecond offsets:


awk

bash

kstat

munin-node

Perl

php-fpm

The munin-node heat map has several lines of dots, each dot two seconds apart. Can you guess what those might be?

Color Maps

The colors chosen for the heat map can be either rank-based or linear-based, which select color saturation differently. The type selected for the previous heat maps can be seen after “COLOR BY:” in the full screenshots (click the images).

This shows node.js processes across the entire cloud, to compare the color maps side-by-side:


node.js (rank)

node.js (linear)

The rank-based heat map highlights subtle variation well. The linearly colored heat map reflects reality. This is an extreme example; often the heat maps look much more similar. For a longer description of rank vs linear, see Dave Pacheco’s heat map coloring post, and the Saturation section in my Visualizing Device Utilization post.

Artifacts

The first example I showed featured the Riak database being blocked by zoneadmd. The blocking event was continuous, and lasted for almost a full second. It was shown twice in the first subsecond offset column due to the way the data is currently collected by Joyent Cloud Analytics – resulting in an “artifact”.

This is shown on the right. The time that a column of data is collected from the server does not occur at the 0.0 offset, but rather at some other offset during the second. This means that an activity that is in flight will suddenly jump to the next column, as has happened here (at the “3” mark). It also means that an activity at the top of the column can wrap and continue at the bottom of the same column (at the “2” mark), before the column switch occurs. I think this is fixable by recalculating offsets relative to the data collection time, so the switch happens at offset 0.0. (It hasn’t been fixed yet because it usually isn’t annoying, and didn’t noticeably interfere with the many other examples I’ve shown.)
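A minimal sketch of that recalculation, assuming offsets are kept in milliseconds and collect_ms is the subsecond offset at which the column is collected:

# adjusted_ms=$(( (offset_ms - collect_ms + 1000) % 1000 ))

This rebases each offset so the column boundary falls at adjusted offset 0 rather than somewhere mid-column, which prevents an in-flight activity from wrapping within a single column.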

On the left is a different type of artifact, one caused when data collection is delayed. To minimize overhead, data is aggregated in-kernel, then read at a gentle rate (usually once per second) by a user-land process. This problem occurs when the user-land process is delayed slightly for some reason, and the kernel aggregations include extra data (overlapping offsets) by the time they are read. Those offsets are then missing from the next column, on the right.

Thread Scheduling

I intended to include a “checkerboard” heat map of CPU samples, like those Robert Mustacchi showed in his Visualizing KVM post. This involves running two threads (or processes) that share one CPU, each performing the same CPU-bound workload. When each is highlighted in different colors it should look like a checkerboard, as the kernel scheduler evenly switches between running them.
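As a minimal sketch of such a setup: two identical CPU-bound shell loops, both bound to the same CPU (CPU id 0 here) with pbind(1M):

# while :; do :; done & pbind -b 0 $!
# while :; do :; done & pbind -b 0 $!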

Robert was testing on the Linux kernel under KVM, and used DTrace to inspect running threads from the SmartOS host (by checking the VM MMU context). I performed the experiment on SmartOS directly, which resulted in the following heat map:

This breaks my head. Instead of a neat checkerboard, this is messy – showing uneven runtimes for the identical threads. One column in particular is entirely red, which, if true (not a sampling or instrumentation error), means that the scheduler left the same thread running for an entire second while the other was in the ready-to-run state on the CPU dispatcher queue. This is much longer than the maximum runtime quantum set by the scheduler (110 ms for the FSS class). I confirmed this behavior using two different means (DTrace, and thread microstate accounting), and saw even worse instances – threads waiting for many seconds when they should have been running.

Jerry Jelinek has been wading into the scheduler code, finding evidence that this is a kernel bug (in code that hasn’t changed since Solaris) and developing the fix. Fortunately, not many of our customers have hit this since it requires CPUs running at saturation (which isn’t normal for us).

UPDATE (April 2nd)

Jerry has fixed the code, which was a bug with how thread priorities were updated in the scheduler. The following screenshot shows the same workload post-fix:

This looks much better. There are no longer any full seconds where one thread hogs the CPU, with the other thread waiting. Looking more closely, there appear to be cases where the thread has switched early – which is much better than switching late.

We also found that the bug was indeed hurting a customer along with a confluence of other factors.

Conclusion

The subsecond offset heat map provides new visibility for software execution time, which can be used for workload characterization and performance analysis. These heat maps are currently available in Joyent Cloud Analytics, from which I included screenshots taken in production environments.

Using these heat maps I identified two kernel scheduling issues, one of which was causing dropped TCP connections for a large scale cloud-based service. Kernel fixes are being developed for both. I also showed various applications running on single servers and the cloud, which produced fascinating patterns – providing a new way of understanding application runtime.

The examples I included here were based on sampled thread runtime, and traced system call execution times. Other event sources can be visualized in this way, and these could also be produced on other time frames: sub-minute, sub-hour, etc.

Acknowledgements

Thanks to the folks at Voxer for realizing (earlier than I did) that something more than just normal bursts of load was causing the tcpListenDrops, and for pushing for the real answer.

Posted on March 26, 2012 at 11:16 am by Brendan Gregg
In: Performance

Linux Kernel Performance: Flame Graphs

To get the most out of your systems, you want detailed insight into what the operating system kernel is doing. A typical approach is to sample stack traces; however, the data collected can be time consuming to read or navigate. Flame Graphs are a new way to visualize sampled stack traces, and can be applied to the Linux kernel for some useful (and stunning!) visualizations.

I’ve used these many times to help solve kernel and application issues. Since I posted the source to GitHub, others have been using it too (e.g., for node.js and IP scaling). I’ve recently been using them to investigate Linux kernel performance issues (under KVM), which I’ll demonstrate in this post using a couple of different profiling tools: perf_events and SystemTap.

Linux Perf Events

This flame graph shows a network workload for the 3.2.9-1 Linux kernel, running as a KVM instance:


click image for interactive SVG; larger PNG here

Flame Graphs show the sample population across the x-axis, and stack depth on the y-axis. Each function (stack frame) is drawn as a rectangle, with the width relative to the number of samples. See my previous post for the full description of how these work.

You can use the mouse to explore where kernel CPU time is spent, quickly quantifying code-paths and determining where performance tuning efforts are best spent. This example shows that most time was spent in the vp_notify() code-path, with 70.52% of all on-CPU samples in iowrite16().

This was generated using perf_events and the FlameGraph tools:

# perf record -a -g -F 1000 sleep 60
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# cat out.perf-folded | ./flamegraph.pl > perf-kernel.svg

The first command runs perf in sampling mode (polling) at 1000 Hertz (-F 1000; more on this later) across all CPUs (-a), capturing stack traces so that a call graph (-g) of function ancestry can be generated later. The samples are saved in a perf.data file:

# ls -lh perf.data
-rw-------. 1 root root 15M Mar 12 20:13 perf.data

This can be processed in a variety of ways. On recent versions, the perf report command launches an ncurses navigator for call graph inspection. Older versions of perf (or newer versions when their output is piped) print the call graph as a tree, annotated with percentages:

# perf report | more
# ========
# captured on: Wed Mar 14 00:09:59 2012
# hostname : fedora1
# os release : 3.2.9-1.fc16.x86_64
# perf version : 3.2.9-1.fc16.x86_64
# arch : x86_64
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : QEMU Virtual CPU version 0.14.1
# cpuid : GenuineIntel,6,2,3
# total memory : 1020560 kB
# cmdline : /usr/bin/perf record -a -g -F 1000 sleep 60
# event : name = cycles, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = ...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# ========
#
# Events: 60K cpu-clock
#
# Overhead          Command          Shared Object                               Symbol
# ........  ...............  .....................  ...................................
#
    72.18%            iperf  [kernel.kallsyms]      [k] iowrite16
                      |
                      --- iowrite16
                         |
                         |--99.53%-- vp_notify
                         |          virtqueue_kick
                         |          start_xmit
                         |          dev_hard_start_xmit
                         |          sch_direct_xmit
                         |          dev_queue_xmit
                         |          ip_finish_output
                         |          ip_output
                         |          ip_local_out
                         |          ip_queue_xmit
                         |          tcp_transmit_skb
                         |          tcp_write_xmit
                         |          |
                         |          |--98.16%-- tcp_push_one
                         |          |          tcp_sendmsg
                         |          |          inet_sendmsg
                         |          |          sock_aio_write
                         |          |          do_sync_write
                         |          |          vfs_write
                         |          |          sys_write
                         |          |          system_call
                         |          |          0x369e40e5cd
                         |          |
                         |           --1.84%-- __tcp_push_pending_frames
[...]

This tree starts with the on-CPU functions and works back through the ancestry (this is a “callee-based call graph”), which matches the flame graph when it is read top-down. (This behavior can be flipped by using the “caller” option to -g/--call-graph instead of the “callee” default, generating a tree that follows the flame graph when read bottom-up.) The hottest (most frequent) stack trace in the flame graph (@70.52%) can be seen in this perf call graph as the product of the top three nodes, which are relative rates (72.18% x 99.53% x 98.16% ≈ 70.52%). perf report can also be run with “-g graph” to show absolute overhead rates, in which case “70.52%” is directly displayed on the node.

The perf report tree (and the ncurses navigator) does an excellent job of presenting this information as text. However, with text there are limitations. The output often does not fit in one screen (you could say it doesn’t need to, if the bulk of the samples are identified on the first page). Also, identifying the hottest code paths requires reading the percentages. With the flame graph, all the data is on screen at once, and the hottest code-paths are immediately obvious as the widest functions.

For generating the flame graph, the perf script command (a newer addition to perf) was used to dump the stack samples, which are then aggregated by stackcollapse-perf.pl and folded into single lines per-stack. That output is then converted by flamegraph.pl into the SVG. I included a gratuitous “cat” command to make it clear that flamegraph.pl can process the output of a pipe, which could include Unix commands to filter or preprocess (grep, sed, awk).

The last two commands could be connected via a pipe:

# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf-kernel.svg

In practice I don’t do this, as I often re-run flamegraph.pl multiple times, and this one-liner would execute everything multiple times. The output of perf script can be many Mbytes, taking many seconds to process. By writing the output of stackcollapse-perf.pl to a file, you’ve cached the slowest step, and can also edit the file (vi) to delete unimportant stacks. The one-line-per-stack output of stackcollapse-perf.pl is also great food for grep(1). E.g.:

# perf script | ./stackcollapse-perf.pl > out.perf-folded
# grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
# grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
# egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg

Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (e.g., “-g folded”) that outputs folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that outputs the SVG directly (which wouldn’t be the first one; see perf-timechart), although that would miss the value of being able to grep the folded stacks (which I use frequently).

If you’ve never used perf_events before, you may want to test before production use (it has had kernel panic bugs in the past). My experience has been a good one (no panics).

SystemTap

SystemTap can also sample stack traces via the timer.profile probe, which fires at the system clock rate (CONFIG_HZ). Unlike perf, which dumps samples to a file for later aggregation and reporting, SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language. The examples here were generated on Fedora 16 (where it works much better than on Ubuntu/CentOS).

The commands for SystemTap version 1.6 are:

# stap -s 32 -D MAXTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
    probe end { foreach (i in s+) { print_stack(i);
    printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
    > out.stap-stacks
# ./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
# cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg

The six options used (-s 32, -D …) increase various SystemTap limits. The only ones really necessary for flame graphs are “-D MAXTRACE=100 -D MAXSTRINGLEN=4096”, so that stack traces aren’t truncated; the others became necessary when sampling for long periods (in this case, 60 seconds) on busy workloads, since you can get errors such as:

transport failures (fixed by increasing -s):
WARNING: There were 233 transport failures.

MAXMAPENTRIES:
ERROR: Array overflow, check MAXMAPENTRIES near identifier 's' at <input>:1:33

MAXACTION:
ERROR: MAXACTION exceeded near operator '{' at <input>:1:87

STP_OVERLOAD_THRESHOLD:
ERROR: probe overhead exceeded threshold

The “transport failures” warning is fixed by increasing the buffer size (-s); the other messages include the names of the tunables that need to be increased.

Also, be sure you have the fix for the #13714 kernel panic (which led to CVE-2012-0875), or the latest version of SystemTap.

On SystemTap v1.7 (latest):

# stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
    probe end { foreach (i in s+) { print_stack(i);
    printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
    > out.stap-stacks
# ./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
# cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg

The only difference is that MAXTRACE became MAXBACKTRACE.

The “-v” option is used to provide details on what SystemTap is doing. When this one-liner was run for the first time, it printed:

Pass 1: parsed user script and 82 library script(s) using 200364virt/23076res/2996shr kb, in 100usr/10sys/260real ms.
Pass 2: analyzed script: 3 probe(s), 3 function(s), 0 embed(s), 1 global(s) using 200892virt/23868res/3228shr kb, in 0usr/0sys/9real ms.
Pass 3: translated to C into "/tmp/stapllG8kv/stap_778fac70871457bfb540977b1ef376d3_2113_src.c" using 361936virt/48824res/16640shr kb, in 710usr/90sys/1843real ms.
Pass 4: compiled C into "stap_778fac70871457bfb540977b1ef376d3_2113.ko" in 7630usr/560sys/19155real ms.
Pass 5: starting run.

This provides timing details for each initialization stage. Compilation took over 18 seconds, during which the performance of the system dropped by 45%. Fortunately, this only occurs on the first invocation: SystemTap caches the compiled objects under ~/.systemtap, which subsequent executions use. I haven’t tried it, but I suspect it’s possible to compile on one machine (e.g., in a lab, to test for panics), then transfer the cached objects to the target for execution – avoiding the compilation step.

1000 Hertz

The above examples both used 1000 Hertz, so that I could show them both doing the same thing. Ideally, I’d sample at 997 Hertz (or something similar) to avoid sampling in lock-step with timed tasks (which can lead to over-sampling or under-sampling, misrepresenting what is actually happening). With perf_events, the frequency can be set with -F; for example, “-F 997”.
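For example, the earlier perf invocation becomes:

# perf record -a -g -F 997 sleep 60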

For SystemTap, sampling at 997 Hertz (or anything other than CONFIG_HZ) is currently difficult: the timer.hz(997) probe fires at the correct rate, but can’t read stack backtraces. It’s possible that it can be done via the perf probes based on CPU reference cycle counts (e.g., “probe perf.type(0).config(0).sample(N)”, where N = CPU_MHz * 1000000 / Sample_Hz). See #13820 for the status on this.
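To illustrate the arithmetic only (whether backtrace() works from this probe is exactly what #13820 tracks): for a 2.0 GHz CPU and a 997 Hertz target rate, N = 2000 * 1000000 / 997 ≈ 2006018, giving a probe of the form:

probe perf.type(0).config(0).sample(2006018) { s[backtrace()] <<< 1; }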

File System

As an example of a different workload, this shows the kernel CPU time while an ext4 file system was being archived:

This used perf_events (PNG version); the SystemTap version looks almost identical (SVG, PNG).

This shows how the file system was being read and where kernel CPU time was spent. Most of the kernel time is in sys_newfstatat() and sys_getdents() – metadata work as the file system is walked. sys_openat() is on the right, as files are opened to be read; they are then mmap()d (look to the right of sys_getdents(); these are in alphabetical order), and finally page faulted into user-space (see the page_fault() mountain on the left). The actual work of moving bytes then happens in user-land on the mmap’d segments (and is not shown in this kernel flame graph). Had the archiver used the read() syscall instead, this flame graph would look very different, and would have a large sys_read() component.

Short Lived Processes

For this flame graph, I executed a workload of short-lived processes to see where kernel time is spent creating them (PNG version):

Apart from performance analysis, this is also a great tool for learning the internals of the Linux kernel.
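As a trivial, hypothetical generator of such a short-lived process workload, run while profiling as before:

# while :; do /bin/true; done &
# perf record -a -g -F 1000 sleep 60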

oprofile

Before anyone asks, oprofile could also be used for stack sampling. I haven’t written a stackcollapse.pl version for oprofile yet.

Notes

All of the above flame graphs were generated on the Linux 3.2.9 kernel (Fedora 16 guest) running under KVM (Ubuntu host), with one virtual CPU. Some code paths and sample ratios will be very different on bare-metal: networking won’t be processed via the virtio-net driver, for a start. On systems with a high degree of idle time, the flame graph can be dominated by the idle task, which can be filtered using “grep -v cpu_idle” on the folded stacks. Note that by default the flame graph aggregates samples from multiple CPUs; with some shell scripting, you could aggregate samples from multiple hosts as well. It is sometimes useful, though, to generate separate flame graphs for individual CPUs: I’ve done this for mapped hardware interrupts, for example.

Conclusion

With the Flame Graph visualization, CPU time in the Linux kernel can be quickly understood and inspected. In this post, I showed Flame Graphs for different workloads: networking, file system I/O, and process execution. As an SVG in the browser, they can be navigated with the mouse to inspect element details, revealing percentages so that performance issues or tuning efforts can be quantified.

I used perf_events and SystemTap to sample stack traces, one task out of many that these powerful tools can perform. It shouldn’t be too hard to use oprofile to provide the data for Flame Graphs as well.

References

Thanks to those who are using flame graphs and putting them to use in new areas, and to the SystemTap engineers for answering questions and fixing bugs.

Posted on March 17, 2012 at 9:24 am by Brendan Gregg
In: Performance