<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>dtrace.org</title>
	<atom:link href="http://dtrace.org/blogs/feed/" rel="self" type="application/rss+xml" />
	<link>http://dtrace.org/blogs</link>
	<description></description>
	<lastBuildDate>Tue, 15 May 2012 18:38:07 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>mdb tab completion</title>
		<link>http://dtrace.org/blogs/rm/2012/05/15/mdb-tab-completion/</link>
		<comments>http://dtrace.org/blogs/rm/2012/05/15/mdb-tab-completion/#comments</comments>
		<pubDate>Tue, 15 May 2012 18:38:07 +0000</pubDate>
		<dc:creator>rm</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/rm/?p=301</guid>
		<description><![CDATA[Last October, the first illumos hack-a-thon took place. Out of that a lot of interesting things were done and have since been integrated into illumos. Two of the more interesting gems were Adam Leventhal and Matt Ahrens adding dtrace -x temporal and Eric Schrock adding the DTrace print() action. Already print() is in the ranks [...]]]></description>
			<content:encoded><![CDATA[<p>Last October, the first <a href="http://illumos.org">illumos</a> hack-a-thon took place. Out of that a lot of interesting things were done and have since been integrated into illumos. Two of the more interesting gems were <a href="http://dtrace.org/blogs/ahl">Adam Leventhal</a> and <a href="http://blog.delphix.com/matt/">Matt Ahrens</a> adding <a href="https://twitter.com/#!/ahl/status/195609092327342081">dtrace -x temporal</a> and <a href="http://dtrace.org/blogs/eschrock">Eric Schrock</a> adding the DTrace <a href="http://dtrace.org/blogs/eschrock/2011/10/26/your-mdb-fell-into-my-dtrace/">print()</a> action. Already print() is in the ranks of things where once you have it you really miss it when you don&#8217;t. During the hack-a-thon I had the chance to work with <a href="http://blog.delphix.com/mba/">Matt Amdur</a>. Together we worked on another one of those nice-to-haves that has finally landed in illumos: tab completion for mdb.</p>
<h3>md-what?</h3>
<p>For those who have never used it, mdb is the Modular Debugger that comes as a part of illumos and was originally developed for Solaris 8. mdb is primarily used for post-mortem debugging of userland programs and the kernel, and as a kernel debugger. mdb isn&#8217;t a source level debugger, but it works quite well on core dumps from userland, inspects and modifies live kernel state without stopping the system, and provides facilities for traditional debugging where a program is stopped, stepped over, and inspected.  mdb replaced <a href="https://en.wikipedia.org/wiki/Advanced_Debugger">adb</a> which came out of AT&#038;T. While mdb isn&#8217;t 100% compatible with adb, it does remind you that there&#8217;s &#8216;No algol 68 here&#8217;. For the full history, take a look at Mike Shapiro&#8217;s <a href="http://fingolfin.org/illumos/talks/mws-brown.pdf">talk</a> that he gave at the <a href="http://cs.brown.edu">Brown CS</a> <a href="http://cs.brown.edu/about/partners/symposia/ipp37/home.html">37th IPP Symposium</a>.</p>
<p>One of the more useful pieces of mdb is its module API, which allows other <a href="http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/mdb/common/modules/">modules</a> to deliver specifically tailored commands and walkers. This is used for everything from the <a href="http://dtrace.org/blogs/dap/2012/01/13/playing-with-nodev8-postmortem-debugging/">V8 JavaScript engine</a> to understanding <a href="http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/mdb/common/modules/genunix/cyclic.c">cyclics</a>. Between that, pipelines, and other niceties, there isn&#8217;t too much else you could want from your debugger.</p>
<h3>What&#8217;s involved</h3>
<p>The work that we&#8217;ve done falls into three parts:</p>
<ul>
<li>A tab completion engine.</li>
<li>Changes to the module API and several new functions to allow dcmds to implement their own tab completion.</li>
<li>Tab completion support for several dcmds.</li>
</ul>
<p>Thanks to CTF data in the kernel, we can tab complete everything from walker names, to types and their members. We went and added tab completion to the following dcmds:</p>
<ul>
<li>::walk</li>
<li>::sizeof</li>
<li>::list</li>
<li>::offsetof</li>
<li>::print</li>
<li>The dcmds themselves</li>
</ul>
<h3>Seeing is believing: Tab completion in action</h3>
<h5>Completing dcmds</h5>
<p><pre>
> ::pr[tab]
print
printf
project
prov_tab
prtconf
prtusb
</pre>
</p>
<h5>Completing walkers</h5>
<p><pre>
> ::walk ar[tab]
arc_buf_hdr_t
arc_buf_t
> ::walk arc_buf_
</pre>
</p>
<h5>Completing types</h5>
<p><pre>
> ::sizeof struct dt[tab]
struct dtrace_actdesc
struct dtrace_action
struct dtrace_aggbuffer
struct dtrace_aggdesc
struct dtrace_aggkey
struct dtrace_aggregation
struct dtrace_anon
struct dtrace_argdesc
struct dtrace_attribute
struct dtrace_bufdesc
struct dtrace_buffer
struct dtrace_conf
struct dtrace_cred
struct dtrace_difo
struct dtrace_diftype
struct dtrace_difv
struct dtrace_dstate
struct dtrace_dstate_percpu
struct dtrace_dynhash
struct dtrace_dynvar
struct dtrace_ecb
struct dtrace_ecbdesc
struct dtrace_enabling
struct dtrace_eprobedesc
struct dtrace_fmtdesc
struct dtrace_hash
...
</pre>
</p>
<h5>Completing members</h5>
<p><pre>
> ::offsetof zio_t io_v[tab]
io_vd
io_vdev_tree
io_vsd
io_vsd_ops
</pre>
</p>
<h5>Walking across types with ::print</h5>
<p><pre>
> p0::print proc_t p_zone->zone_n[tab]
zone_name
zone_ncpus
zone_ncpus_online
zone_netstack
zone_nlwps
zone_nlwps_ctl
zone_nlwps_lock
zone_nodename
zone_nprocs
zone_nprocs_ctl
zone_nprocs_kstat
zone_ntasks
</pre>
<p>In addition, just as you can walk across structure (.) and array ([]) dereferences in <b>::print</b> invocations, you can also do the same with tab completion.
</p>
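<p>As a rough illustration of what member completion involves, here is a minimal JavaScript sketch. It is not how mdb implements it: mdb is written in C and resolves types from CTF data, whereas the <code>types</code> table and the <code>completeMember()</code> function below are hypothetical stand-ins, with a handful of member names borrowed from the output above.</p>
<pre><code>// Toy type table standing in for CTF data (an assumption of this sketch).
// Each member maps to the type it leads to, or null when the sketch does
// not model that type.
const types = {
  proc_t: { p_zone: 'zone_t', p_pidp: null },
  zone_t: { zone_name: null, zone_ncpus: null, zone_netstack: null,
            zone_nlwps: null, zone_nodename: null, zone_id: null }
};

// Complete the last component of a member path such as 'p_zone->zone_n'.
// Earlier components are resolved through the table, walking across both
// '.' and '->' dereferences.
function completeMember(typeName, path) {
  const parts = path.split(/->|\./);
  const prefix = parts.pop();
  let current = typeName;
  for (const part of parts) {
    const members = types[current] || {};
    current = members[part];
    if (!current) {
      return [];              // unknown member or unmodeled type
    }
  }
  return Object.keys(types[current] || {})
    .filter(function (name) { return name.startsWith(prefix); })
    .sort();
}

console.log(completeMember('proc_t', 'p_zone->zone_n'));
</code></pre>
<p>The interesting part is the loop: each earlier path component switches the type being searched, which is what lets completion follow a pointer from one structure into another.</p>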
<h3>What&#8217;s next?</h3>
<p>
Now that the mdb tab completion <a href="https://github.com/illumos/illumos-gate/commit/3b6e0a598869dfc84461624e8699bf51738f68b3">change</a> is in illumos, there&#8217;s already some work underway to add completion backends to more dcmds, including:</p>
<ul>
<li>::printf</li>
<li>::help</li>
<li>::bp</li>
</ul>
<p>What else would you like to see? Let us know in a comment or better yet, go ahead and implement it yourself!</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/rm/2012/05/15/mdb-tab-completion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Debugging RangeError from a core dump</title>
		<link>http://dtrace.org/blogs/dap/2012/05/14/debugging-rangeerror-from-a-core-dump/</link>
		<comments>http://dtrace.org/blogs/dap/2012/05/14/debugging-rangeerror-from-a-core-dump/#comments</comments>
		<pubDate>Mon, 14 May 2012 21:43:27 +0000</pubDate>
		<dc:creator>dap</dc:creator>
				<category><![CDATA[joyent]]></category>
		<category><![CDATA[Node.js]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/dap/?p=1069</guid>
		<description><![CDATA[Last week, I tweeted: I had just run into this nasty Node.js error: $ node foo.js timers.js:96 if (!process.listeners('uncaughtException').length) throw e; ^ RangeError: Maximum call stack size exceeded What went wrong? It was reasonably obvious from the error message that the program blew its stack, which I assumed was likely the result of some errant [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, I tweeted:</p>
<p><a href="https://twitter.com/#!/dapsays/status/200719170252447746"><img src="http://dtrace.org/blogs/dap/files/2012/05/rangeerror-tweet.png" alt="" title="RangeError Tweet" width="530" height="136" class="aligncenter size-full wp-image-1085" /></a></p>
<p>I had just run into this nasty Node.js error:</p>
<pre><code>$ node foo.js 

timers.js:96
            if (!process.listeners('uncaughtException').length) throw e;
                                                                      ^
RangeError: Maximum call stack size exceeded
</code></pre>
<p>What went wrong?  It was reasonably obvious from the error message that the program blew its stack, which I assumed was likely the result of some errant recursive function, which was surprising, because I didn&#8217;t know I was using any recursive functions.  But given that the problem is too many function invocations on the stack, the obvious question is: what&#8217;s on the stack?  But the RangeError exception doesn&#8217;t include a stack trace.  Now what do I do?</p>
<p>I&#8217;d previously been playing around with the idea of having Node dump core via <a href="http://www.shrubbery.net/solaris9ab/SUNWaman/hman3c/abort.3c.html">abort(3C)</a> when an exception is thrown that is not caught.  The idea is that with a core dump, we could use <a href="http://dtrace.org/blogs/dap/2011/10/31/nodejs-v8-postmortem-debugging/">mdb_v8</a> to examine the stack at the time that the program crashed, including function arguments and any other heap objects we can find.  This would be much richer than just the stack trace that most fatal failures leave behind.</p>
<p>I ran my program with this modified Node, it dumped core as expected, and I opened the core file.  As expected, the stack was huge, but it was pretty clear what the pattern was:</p>
<pre><code>&gt; ::jsstack
...
80469c4 0x2bf66683 &lt;anonymous&gt; (as ReadStream._emitData) (3a84f56d)
80469dc 0x2bf66b7b &lt;anonymous&gt; (as ReadStream.resume) (3a84f5d9)
8046a14 0x2bf73766 emitNextRecord (3a889829)
8046a44 0x2bfaa133 gotRecord (3a8897f5)
8046a74 0x2bf74c18 handleLogLine (3a8898f9)
8046aa8 0x2bf7703f &lt;anonymous&gt; (as stream.on.leftover) (3a897151)
8046ae0 0x2bf2d2bc &lt;anonymous&gt; (as EventEmitter.emit) (3a83a579)
8046af8 0x2bf0db41 &lt;ArgumentsAdaptorFrame&gt;

8046b18 0x2bf66683 &lt;anonymous&gt; (as ReadStream._emitData) (3a84f56d)
8046b30 0x2bf66b7b &lt;anonymous&gt; (as ReadStream.resume) (3a84f5d9)
8046b68 0x2bf73766 emitNextRecord (3a889829)
8046b98 0x2bfaa133 gotRecord (3a8897f5)
8046bc8 0x2bf74c18 handleLogLine (3a8898f9)
8046bfc 0x2bf7703f &lt;anonymous&gt; (as stream.on.leftover) (3a897151)
8046c34 0x2bf2d2bc &lt;anonymous&gt; (as EventEmitter.emit) (3a83a579)
8046c4c 0x2bf0db41 &lt;ArgumentsAdaptorFrame&gt;

8046c6c 0x2bf66683 &lt;anonymous&gt; (as ReadStream._emitData) (3a84f56d)
...
</code></pre>
<p>Starting from the bottom, we see that Node is emitting a &#8220;data&#8221; event on a ReadStream object, which invokes my &#8220;leftover&#8221; callback, which calls a couple of my program&#8217;s internal functions, one of which calls resume() on the same read stream.  Then we do it all over again.</p>
<p>We can inspect the arguments to see what data is being emitted.  The program in question was the <a href="https://github.com/trentm/node-bunyan">bunyan</a> log reader, so the data turned out to be random log contents, but I was able to verify that multiple calls to my &#8220;data&#8221; callback were getting the exact same data.  This took me to the resume() function in Node.js, where I found the root cause:</p>
<pre><code>ReadStream.prototype.resume = function() {
  this.paused = false;

  if (this.buffer) {
    this._emitData(this.buffer);
    this.buffer = null;
  }
</code></pre>
<p>On resume, we emit any buffered data, and <em>then</em> remove it.  But if emitting the data causes us to call resume() again (as it did here), then we emit the same data again and end up in this infinite loop until we run out of stack space and crash.  I reported the <a href="https://github.com/joyent/node/issues/3258">issue</a> on Friday, and koichik immediately fixed the bug.  (Thanks!)</p>
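<p>To make the re-entrancy concrete, here is a minimal sketch that reproduces the pattern outside of Node core. The <code>ToyStream</code> constructor and its <code>ondata</code> hook are hypothetical stand-ins, not Node&#8217;s ReadStream, and clearing the buffer before emitting is just one plausible way to break the cycle, not necessarily the fix that landed.</p>
<pre><code>// Hypothetical stand-in for a stream that buffers data while paused.
function ToyStream(clearFirst) {
  this.buffer = 'log line';
  this.clearFirst = clearFirst;   // true: drop the buffer before emitting
  this.emitted = 0;
  this.ondata = null;
}

ToyStream.prototype.resume = function () {
  if (!this.buffer) {
    return;
  }
  const data = this.buffer;
  if (this.clearFirst) {
    this.buffer = null;           // a re-entrant resume() now sees no buffer
  }
  this.emitted++;
  this.ondata(data);
  this.buffer = null;
};

// Run a stream whose data handler calls resume() again.
function run(clearFirst) {
  const stream = new ToyStream(clearFirst);
  stream.ondata = function () { stream.resume(); };
  try {
    stream.resume();
    return 'emitted ' + stream.emitted + ' time(s)';
  } catch (e) {
    return e.name;                // unbounded recursion surfaces as RangeError
  }
}

console.log(run(false));
console.log(run(true));
</code></pre>
<p>With <code>clearFirst</code> set to false the handler re-enters resume() while the buffer is still set, so the same data is emitted until the stack runs out; with it set to true the data is emitted exactly once.</p>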
<p>This turned out to be a very minor bug in Node core, but the consequences for my program were pretty serious: fatal failure, with almost no information left behind to debug it.  If I hadn&#8217;t had this experimental V8 around, I probably would have resorted to commenting out half my code at a time, binary searching until I found the bad code.  I could also have tried the debugger, but I&#8217;d still have to have some idea where to set breakpoints, which boils down to the same search problem.  (Stepping through doesn&#8217;t work, since the problem is triggered by an asynchronous event.)  This makes me wonder: how do people debug RangeErrors today?</p>
<p>While the V8 change to dump core on an uncaught exception is actually <a href="http://www.mail-archive.com/v8-users@googlegroups.com/msg04948.html">quite simple</a>, I&#8217;ve been putting it off while I consider the best way to expose it without breaking other V8 users.  After this experience, I&#8217;m thinking it&#8217;s worth trying to do sooner rather than later!</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/dap/2012/05/14/debugging-rangeerror-from-a-core-dump/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>illumos Hardware Compatibility List</title>
		<link>http://dtrace.org/blogs/rm/2012/05/10/illumos-hardware-compatibility-list/</link>
		<comments>http://dtrace.org/blogs/rm/2012/05/10/illumos-hardware-compatibility-list/#comments</comments>
		<pubDate>Fri, 11 May 2012 01:00:04 +0000</pubDate>
		<dc:creator>rm</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/rm/?p=288</guid>
		<description><![CDATA[One of the challenges when using any Operating System is answering the question &#8216;Is my hardware supported?&#8217;. To track this down, you often have to scour Internet sites, hoping someone else has already asked the question, or do other, more horrible machinations &#8211; or ask someone like me. If you&#8217;re running on an illumos-based system [...]]]></description>
			<content:encoded><![CDATA[<p>One of the challenges when using any Operating System is answering the question &#8216;Is my hardware supported?&#8217;. To track this down, you often have to scour Internet sites, hoping someone else has already asked the question, or do other, more horrible machinations &#8211; or ask someone like me. If you&#8217;re running on an illumos-based system like <a href="http://www.smartos.org">SmartOS</a>, <a href="http://omnios.omniti.com/">OmniOS</a>, or <a href="http://openindiana.org/">OpenIndiana</a>, this just got a lot easier: I&#8217;ve created the list. Better yet, I&#8217;ve created a tool to automatically create the list.</p>
<h3>The List</h3>
<p>
illumos now has a hardware compatibility list (HCL) available at <a href="http://illumos.org/hcl">http://illumos.org/hcl</a>.</p>
<p>
This list contains all the PCI and PCI Express devices that should work. If your PCI device isn&#8217;t listed there, don&#8217;t fret, it may still work. This list is a first strike at the problem of hardware compatibility, so things like specific motherboards aren&#8217;t listed there.
</p>
<h3>How it&#8217;s generated</h3>
<p>
The great thing about this list is that it&#8217;s automatically generated from the source code in illumos itself. Each driver on the system has a manifest that specifies what PCI IDs it supports. We parse each of these manifests and look up the names using the <a href="http://pci-ids.ucw.cz/">PCI ID Database</a>, using a small library that I wrote. From there, we automatically generate the static web page that can be deployed. Thanks to <a href="https://twitter.com/#!/kadamwhite">K. Adam White</a> for his invaluable help to stop me from fumbling around too much with front end web code and the others who have already come in and improved it.
</p>
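<p>As a rough illustration of that pipeline, here is a minimal JavaScript sketch. It is not the actual tool&#8217;s code: it assumes that a driver manifest line carries aliases of the form <code>pciVVVV,DDDD</code> and that names come from a file in the pci.ids layout. The function names and the sample manifest line are hypothetical, and the data is a tiny illustrative excerpt.</p>
<pre><code>// Illustrative excerpt in the pci.ids layout: vendor lines start at column
// zero, device lines are indented with one tab.
const pciIds = [
  '8086  Intel Corporation',
  '\t100e  82540EM Gigabit Ethernet Controller',
  '\t10d3  82574L Gigabit Network Connection'
].join('\n');

// Hypothetical driver manifest line listing the PCI aliases it binds to.
const manifest = 'driver name=e1000g alias=pci8086,100e alias=pci8086,10d3';

// Build a lookup table keyed by 'vendor,device'.
function parsePciIds(text) {
  const names = {};
  let vendor = null;
  for (const line of text.split('\n')) {
    const v = /^([0-9a-f]{4})\s+(.+)$/.exec(line);
    const d = /^\t([0-9a-f]{4})\s+(.+)$/.exec(line);
    if (v) {
      vendor = { id: v[1], name: v[2] };
    } else if (d) {
      if (vendor) {
        names[vendor.id + ',' + d[1]] = { vendor: vendor.name, device: d[2] };
      }
    }
  }
  return names;
}

// Pull every PCI alias out of a manifest line, padded to four hex digits.
function extractAliases(line) {
  const aliases = [];
  const re = /alias=pci([0-9a-f]{1,4}),([0-9a-f]{1,4})/g;
  let m;
  while ((m = re.exec(line)) !== null) {
    aliases.push(m[1].padStart(4, '0') + ',' + m[2].padStart(4, '0'));
  }
  return aliases;
}

// Join the two: one row per device the driver claims to support.
function buildHcl(manifestLine, idText) {
  const names = parsePciIds(idText);
  const driver = /name=(\S+)/.exec(manifestLine)[1];
  return extractAliases(manifestLine).map(function (key) {
    const entry = names[key] || { vendor: 'unknown', device: 'unknown' };
    return { driver: driver, id: key, vendor: entry.vendor, device: entry.device };
  });
}

console.log(buildHcl(manifest, pciIds));
</code></pre>
<p>The rows that come out of a step like this are what would then be rendered into the static page.</p>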
<p>All the code is available on <a href="http://github.com/rmustacc/illumos-hcl">github</a>. The goal for all of this is to eventually be a part of the illumos-gate itself. If you have improvements or want to make the web page more visually appealing, we&#8217;d all welcome the contribution.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/rm/2012/05/10/illumos-hardware-compatibility-list/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>dtrace.conf 2012 videos</title>
		<link>http://dtrace.org/blogs/brendan/2012/05/08/dtrace-conf-2012-videos/</link>
		<comments>http://dtrace.org/blogs/brendan/2012/05/08/dtrace-conf-2012-videos/#comments</comments>
		<pubDate>Tue, 08 May 2012 23:06:51 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[DTrace]]></category>
		<category><![CDATA[video]]></category>
		<category><![CDATA[visualizations]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=3026</guid>
		<description><![CDATA[Last month was dtrace.conf 2012, the 2nd DTrace unconference. This is a meetup of DTrace practitioners and developers, where we discuss the latest uses, developments and future directions of the technology. It was great to see old friends of DTrace, and to put new faces to names. See the video list for the sessions, which, [...]]]></description>
			<content:encoded><![CDATA[<p>Last month was <a href="http://wiki.smartos.org/display/DOC/dtrace.conf">dtrace.conf 2012</a>, the 2nd <a href="http://dtrace.org/blogs/about/">DTrace</a> unconference.  This is a meetup of DTrace practitioners and developers, where we discuss the latest uses, developments and future directions of the technology.  It was great to see old friends of DTrace, and to put new faces to names.  See the <a href="http://wiki.smartos.org/display/DOC/dtrace.conf+Schedule">video list</a> for the sessions, which, for fun, includes all the attendees <a href="http://smartos.org/2012/04/05/a-carousel-of-dtrace/">riding a carousel</a>.</p>
<p>I gave a short talk on DTrace-based visualizations, which is on <a href="http://www.youtube.com/watch?v=XD5hdaWnQM4">youtube</a>:</p>
<p><iframe width="600" height="338" src="http://www.youtube.com/embed/XD5hdaWnQM4" frameborder="0" allowfullscreen></iframe></p>
<p>The visualizations I discussed included:</p>
<ul>
<li><a href="http://dtrace.org/blogs/brendan/2010/06/05/visualizing-system-latency/">Visualizing System Latency</a> as heat maps: which included <a href="http://dtrace.org/blogs/brendan/2009/06/12/latency-art-x-marks-the-spot/">X Marks the Spot</a> and the <a href="http://dtrace.org/blogs/brendan/2009/03/12/latency-art-rainbow-pterodactyl/">Rainbow Pterodactyl</a>.  These use DTrace to trace start and end events, associate timestamps, and populate lquantize() aggregations so that the full distribution can be efficiently captured and examined.</li>
<li><a href="http://dtrace.org/blogs/brendan/2012/03/26/subsecond-offset-heat-maps/">Subsecond Offset Heat Maps</a>, which use DTrace to either sample on-CPU thread execution, or trace events such as system calls, to show the timing of these events within a second.</li>
<li><a href="http://dtrace.org/blogs/brendan/2012/02/12/visualizing-process-execution/">Visualizing Process Execution</a>, which uses DTrace to trace process creation and destruction events.</li>
<li><a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">Flame Graphs</a>, which use DTrace to sample full stack traces, either user-land, kernel or both.  These are then visualized as an interactive SVG.</li>
</ul>
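<p>For readers unfamiliar with linear quantization, the bucketing that lquantize() performs can be sketched in a few lines of JavaScript. This is only an approximation for illustration: the real aggregation runs in the kernel, and the function below, its bucket labels, and the sample latencies are all hypothetical.</p>
<pre><code>// Count values into fixed-width buckets between lower and upper, with one
// extra bucket for values below the range and one for values at or above it.
function lquantize(values, lower, upper, step) {
  const buckets = new Map();
  for (const v of values) {
    let label;
    if (lower > v) {
      label = 'under ' + lower;
    } else if (v >= upper) {
      label = upper + ' and over';
    } else {
      label = String(lower + Math.floor((v - lower) / step) * step);
    }
    buckets.set(label, (buckets.get(label) || 0) + 1);
  }
  return buckets;
}

// Made-up latencies in milliseconds, bucketed in 10 ms steps up to 100 ms.
console.log(lquantize([3, 12, 18, 47, 250], 0, 100, 10));
</code></pre>
<p>Keeping only per-bucket counts is what makes it cheap to capture the full distribution, and each column of a latency heat map is essentially one such set of buckets for a time interval.</p>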
<p>My talk was followed by a couple more on DTrace-based visualizations: Theo Schlossnagle on <a href="http://www.youtube.com/watch?v=3Sqa8mmtnMM">Visualizations and USDT</a>, and <a href="http://www.youtube.com/watch?v=-B6u6wY3Iro">Richard Elling</a>.</p>
<p>The conference ran very well, and was attended by engineers working on DTrace with a variety of operating systems, including <a href="http://smartos.org/2012/04/09/dtrace-conf-2012-dtrace-on-freebsd/">FreeBSD</a>, Mac OS X, <a href="http://www.youtube.com/watch?v=6NqV_Uj8Ba4">Sony PlayStation</a>, and <a href="http://www.youtube.com/watch?v=NElog3MvUC8">Linux</a>.  Thanks to all who helped, especially <a href="http://www.beginningwithi.com/comments">Deirdré Straughan</a> for organizing it and videoing the sessions.</p>
<p>For more about the conference, see Adam&#8217;s <a href="http://dtrace.org/blogs/blog/2012/04/09/dtrace-conf12-wrap-up/">wrap-up</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2012/05/08/dtrace-conf-2012-videos/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Debugging node.js memory leaks</title>
		<link>http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/</link>
		<comments>http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/#comments</comments>
		<pubDate>Sat, 05 May 2012 20:07:27 +0000</pubDate>
		<dc:creator>bmc</dc:creator>
		
		<guid isPermaLink="false">http://dtrace.org/blogs/bmc/?p=356</guid>
		<description><![CDATA[Part of the value of dynamic and interpreted environments is that they handle the complexities of dynamic memory allocation. In particular, one needn&#8217;t explicitly free memory that is no longer in use: objects that are no longer referenceable are found automatically and destroyed via garbage collection. While garbage collection simplifies the program&#8217;s relationship with memory, [...]]]></description>
			<content:encoded><![CDATA[<p>Part of the value of dynamic and interpreted environments is that they handle the complexities of dynamic memory allocation.  In particular, one needn&#8217;t explicitly free memory that is no longer in use: objects that are no longer referenceable are found automatically and destroyed via garbage collection.  While garbage collection simplifies the program&#8217;s relationship with memory, it does not mean the end of all memory-based pathologies: if an application retains a reference to an object that is ultimately rooted in a global scope, that object won&#8217;t be considered garbage and the memory associated with it will not be returned to the system.  If enough such objects build up, allocation will ultimately fail (memory is, after all, finite) and the program will (usually) fail along with it.  While this is not strictly &#8212; from the native code perspective, anyway &#8212; a memory leak (the application has not leaked memory so much as neglected to unreference a semantically unused object), the effect is nonetheless the same and the same nomenclature is used.
<p>While all garbage collected environments create the potential to create such leaks, it can be particularly easy in JavaScript: closures create implicit references to variables within their scopes &#8212; references that might not be immediately obvious to the programmer.  And node.js adds a new dimension of peril with its strictly asynchronous interface with the system:  if backpressure from slow upstream services (I/O, networking, database services, etc.) isn&#8217;t carefully passed to downstream consumers, memory will begin to fill with the intermediate state.  (That is, what one gains in concurrency of operations one may pay for in memory.)  And of course, node.js is on the server &#8212; where the long-running nature of services means that the effect of a memory leak is much more likely to be felt and to affect service.  Take all of these together, and you can easily see why virtually anyone who has stood up node.js in production will identify memory leaks as their most significant open issue.
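<p>A contrived example of the closure case, offered only as a sketch: the <code>handlers</code> registry and <code>handleRequest()</code> function are hypothetical, but the shape (a long-lived collection holding closures that each capture per-request state) is a common way such leaks arise.</p>
<pre><code>// Hypothetical long-lived registry, reachable from module scope.
const handlers = [];

function handleRequest(id) {
  const payload = Buffer.alloc(1024 * 1024);   // 1 MB of per-request state
  // The closure below uses payload, so it keeps the buffer alive for as
  // long as the closure itself is reachable.
  handlers.push(function describe() {
    return 'request ' + id + ' carried ' + payload.length + ' bytes';
  });
}

for (const id of [1, 2, 3]) {
  handleRequest(id);
}

// Nothing ever removes entries from handlers, so every payload is retained.
console.log(handlers.length + ' handlers retained');
console.log(handlers[0]());
</code></pre>
<p>Nothing in handleRequest() looks like it stores the buffer anywhere, which is exactly why this kind of reference is easy to miss.</p>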
<p>The state of the art for node.js memory leak detection &#8212; as concisely described by <a href="https://twitter.com/#!/felixge">Felix</a> in his <a href="https://github.com/felixge/node-memory-leak-tutorial">node memory leak tutorial</a> &#8212; is to use the v8-profiler and <a href="https://twitter.com/#!/blinkenbyte">Danny Coates&#8217;</a> <a href="https://github.com/dannycoates/node-inspector">node-inspector</a>.  While this method is a lot better than nothing, it&#8217;s very much oriented to the developer in development.  This is a debilitating shortcoming because memory leaks are often observed only after days (hours, if you&#8217;re unlucky) of running in production.  At Joyent, we have long wanted to tackle the problem of providing the necessary tooling to identify node.js memory leaks in production.  When we are so lucky as to have these in native code (that is, node.js add-ons), we can use <a href="http://dtrace.org/blogs/ahl/2004/07/13/number-11-of-20-libumem/">libumem and ::findleaks</a> &#8212; a technology that we have used to find thousands of production memory leaks over the years.  But this is (unfortunately) not the common case:  the common case is a leak in JavaScript &#8212; be it node.js or application code &#8212; for which our existing production techniques are useless.
<p>So for node.js leaks in production, one has been left with essentially a Gedanken experiment: consider your load and peruse source code looking for a leak.  Given how diabolical these leaks can be, it&#8217;s amazing to me that anyone ever finds these leaks.  (And it&#8217;s less surprising that they take an excruciating amount of effort over a long period of time.) That&#8217;s been the (sad) state of things for quite some time, but recently, this issue boiled over for us: <a href="http://voxer.com">Voxer</a>, a Joyent customer and <a href="http://dtrace.org/blogs/bmc/2010/08/11/the-node-js-demographic/">long in the vanguard of node.js development</a>, was running into a nasty memory leak:  a leak sufficiently acute that the service in question was having to be restarted on a regular basis, but not so much that it was reproducible in development.  With the urgency high on their side, we again kicked around this seemingly hopeless problem, focussing on the concrete data that we had: core dumps exhibiting the high memory usage, obtained at our request via gcore(1) before service restart.  Could we do anything with those?
<p>As an aside, a few months ago, Joyent&#8217;s <a href="https://twitter.com/#!/dapsays">Dave Pacheco</a> added some unbelievable <a href="http://dtrace.org/blogs/dap/2012/01/13/playing-with-nodev8-postmortem-debugging/">postmortem debugging support for node</a>.  (If postmortem debugging is new to you, you might be interested in checking out <a href="http://www.infoq.com/presentations/Debugging-Production-Systems">my presentation on debugging production systems</a> from QCon San Francisco.  I had the privilege of demonstrating Dave&#8217;s work in that presentation &#8212; and if you listen very closely at 43:26, you can hear the audience gasp when I demonstrate the amazing ::jsprint.) But Dave hadn&#8217;t built the infrastructure for walking the data structures of the V8 heap from an MDB dmod &#8212; and it was clear that doing so would be brittle and error-prone.
<p>As Dave, <a href="https://twitter.com/#!/brendangregg">Brendan</a> and I were kicking around increasingly desperate ideas (including running strings(1) on the dump &#8212; an idea not totally without merit &#8212; and some wild visualization ideas), a much simpler idea collectively occurred to us:  given that we understood via Dave&#8217;s MDB V8 support how a given object is laid out, and given that an object needed quite a bit of self-consistency with referenced but otherwise orthogonal structures, what about just iterating over all of the anonymous memory in the core and looking for objects?  That is, iterate over every single address, and see if that address could be interpreted as an object.  On the one hand, it was terribly brute force &#8212; but given the level of consistency required of an object in V8, it seemed that it wouldn&#8217;t pick up too many false positives.  The more we discussed it, the more plausible it became, but with Dave fully occupied (on another saucy project we have cooking at Joyent &#8212; more on that later), I roped up and headed into his MDB support for V8&#8230;
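<p>The flavor of that brute-force scan can be conveyed with a toy sketch. Everything here is invented for illustration and bears no relation to V8&#8217;s real heap layout: in this made-up format an object is a word holding the index of a map followed by a property count, and a map is a magic word followed by a maximum property count. The point is only that requiring several independent structures to agree filters out most noise.</p>
<pre><code>const MAP_MAGIC = 0xfeedface;   // invented marker, not a V8 constant

// Could the words starting at index i be interpreted as an object?
function looksLikeObject(words, i) {
  const mapIndex = words[i];
  const nprops = words[i + 1];
  if (nprops === undefined) {
    return false;               // ran off the end of memory
  }
  if (words[mapIndex] !== MAP_MAGIC) {
    return false;               // does not point at something map-shaped
  }
  const maxProps = words[mapIndex + 1];
  if (maxProps === undefined) {
    return false;
  }
  return maxProps >= nprops;    // object and map must agree with each other
}

// Try every single address, keeping the ones that pass the checks.
function findObjects(words) {
  const hits = [];
  words.forEach(function (unused, i) {
    if (looksLikeObject(words, i)) {
      hits.push(i);
    }
  });
  return hits;
}

const core = [
  MAP_MAGIC, 4,   // map at index 0, allowing up to 4 properties
  0, 3,           // object at index 2: points at the map, 3 properties
  7, 7,           // noise
  0, 9,           // points at the map, but 9 properties is inconsistent
  0, 1            // object at index 8
];

console.log(findObjects(core));
</code></pre>
<p>Of the ten candidate addresses only two survive, and the near miss at index 6 is rejected because its property count disagrees with its map.</p>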
<p>The result &#8212; <a href="https://github.com/joyent/illumos-joyent/commit/e6446071ecdbdc60eda295956dbf3e1778663264">::findjsobjects</a> &#8212; takes only a few minutes to run on large dumps, but provides some important new light on the problem.  The output of ::findjsobjects consists of representative objects, the number of instances of that object and the number of properties on the object &#8212; followed by the constructor and first few properties of the objects.  For example, here is the dump on a gcore of a (non-pathological) Joyent node-based facility:
<pre>
&gt; <b>::findjsobjects -v</b>
findjsobjects:         elapsed time (seconds) =&gt; 20
findjsobjects:                   heap objects =&gt; 6079488
findjsobjects:             JavaScript objects =&gt; 4097
findjsobjects:              processed objects =&gt; 1734
findjsobjects:                 unique objects =&gt; 161
OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
fc4671fd        1      1 Object: flags
fe68f981        1      1 Object: showVersion
fe8f64d9        1      1 Object: EventEmitter
fe690429        1      1 Object: Credentials
fc465fa1        1      1 Object: lib
fc46300d        1      1 Object: free
fc4efbb9        1      1 Object: close
fc46c2f9        1      1 Object: push
fc46bb21        1      1 Object: uncaughtException
fe8ea871        1      1 Object: _idleTimeout
fe8f3ed1        1      1 Object: _makeLong
fc4e7c95        1      1 Object: types
fc46bae9        1      1 Object: allowHalfOpen
...
fc45e249       12      4 Object: type, min, max, default
fc4f2889       12      4 Object: index, fields, methods, name
fd2b8ded       13      4 Object: enumerable, writable, configurable, value
fc0f68a5       14      1 SlowBuffer: length
fe7bac79       18      3 Object: path, fn, keys
fc0e9d21       20      5 Object: _onTimeout, _idleTimeout, _idlePrev, ...
fc45facd       21      4 NativeModule: loaded, id, exports, filename
fc45f571       23      8 Module: paths, loaded, id, parent, exports, ...
fc4607f9       35      1 Object: constructor
fc0f86c9       56      3 Buffer: length, offset, parent
fc0fc92d       57      2 Arguments: length, callee
fe696f59       91      3 Object: index, fields, name
fc4f3785       91      4 Object: fields, name, methodIndex, classIndex
fe697289      246      2 Object: domain, name
fc0f87d9      308      1 Buffer:
</pre>
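<p>The grouping in that output can be approximated with a short sketch. It assumes, as a simplification that may not match the dmod exactly, that two objects are alike when they share a constructor and an ordered list of property names; the <code>summarize()</code> function, the addresses, and the sample objects are hypothetical.</p>
<pre><code>// Hypothetical stand-ins for objects discovered in a core dump.
const found = [
  { addr: 'obj1', ctor: 'Object', props: ['type', 'min', 'max', 'default'] },
  { addr: 'obj2', ctor: 'Buffer', props: ['length', 'offset', 'parent'] },
  { addr: 'obj3', ctor: 'Object', props: ['type', 'min', 'max', 'default'] },
  { addr: 'obj4', ctor: 'Object', props: ['domain', 'name'] }
];

// Group objects by signature, remembering the first one seen as the
// representative and counting the rest.
function summarize(objects) {
  const groups = new Map();
  for (const obj of objects) {
    const key = obj.ctor + ':' + obj.props.join(',');
    if (!groups.has(key)) {
      groups.set(key, {
        representative: obj.addr,
        count: 0,
        ctor: obj.ctor,
        props: obj.props
      });
    }
    groups.get(key).count++;
  }
  // Least common signatures first, as in the output above.
  return Array.from(groups.values()).sort(function (a, b) {
    return a.count - b.count;
  });
}

for (const g of summarize(found)) {
  console.log(g.representative, g.count, g.props.length,
      g.ctor + ': ' + g.props.join(', '));
}
</code></pre>
<p>When hunting a leak, the rows at the bottom of such a summary are the natural place to start, since that is where any runaway object count will end up.</p>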
<p>Now, any one of those objects can be printed with ::jsprint.  For example, let&#8217;s take fc45e249 from the above output:
<pre>
<b>&gt; fc45e249::jsprint</b>
{
    type: number,
    min: 10,
    max: 1000,
    default: 300,
}
</pre>
<p>Note that that&#8217;s only a <i>representative</i> object &#8212; there are (in the above case) 12 objects that have that same property signature.  ::findjsobjects can get you all of them when you specify the address of the reference object:
<pre>
&gt; <b>fc45e249::findjsobjects</b>
fc45e249
fc46fd31
fc467ae5
fc45ecb5
fc45ec99
fc45ec11
fc45ebb5
fc45eb41
fc45eb25
fc45e3d1
fc45e3b5
fc45e399
</pre>
<p>And because MDB is the debugger Unix was meant to have, that output can be piped to ::jsprint:
<pre>
&gt; <b>fc45e249::findjsobjects | ::jsprint</b>
{
    type: number,
    min: 10,
    max: 1000,
    default: 300,
}
{
    type: number,
    min: 0,
    max: 5000,
    default: 5000,
}
{
    type: number,
    min: 0,
    max: 5000,
    default: undefined,
}
...
</pre>
<p>Okay, fine &#8212; but where are these objects referenced?  ::findjsobjects has an option for that:
<pre>
&gt; <b>fc45e249::findjsobjects -r</b>
fc45e249 referred to by fc45e1e9.height
</pre>
<p>This tells us (or tries to) who is referencing that first (representative) object.  Printing that out (with the &#8220;-a&#8221; option to show the addresses of the objects):
<pre>
&gt; <b>fc45e1e9::jsprint -a</b>
fc45e1e9: {
    ymax: fe78e061: undefined,
    hues: fe78e061: undefined,
    <b>height: fc45e249: {
        type: fe78e361: number,
        min: 14: 10,
        max: 7d0: 1000,
        default: 258: 300,
    }</b>,
    selected: fc45e3fd: {
        type: fe7a2465: array,
        default: fc45e439: [...],
    },
    ...
</pre>
<p>So if we want to learn where all of these objects are referenced, we can again use a pipeline within MDB:
<pre>
&gt; <b>fc45e249::findjsobjects | ::findjsobjects -r</b>
fc45e249 referred to by fc45e1e9.height
fc46fd31 referred to by fc46b159.timeout
fc467ae5 is not referred to by a known object.
fc45ecb5 referred to by fc45eadd.ymax
fc45ec99 is not referred to by a known object.
fc45ec11 referred to by fc45eadd.nbuckets
fc45ebb5 referred to by fc45eadd.height
fc45eb41 referred to by fc45eadd.ymin
fc45eb25 referred to by fc45eadd.width
fc45e3d1 referred to by fc45e1e9.nbuckets
fc45e3b5 referred to by fc45e1e9.ymin
fc45e399 referred to by fc45e1e9.width
</pre>
<p>Of course, the proof of a debugger is in the debugging; would ::findjsobjects actually be of use on the Voxer dumps that served as its motivation?  Here is the (elided) output from running it on a big Voxer dump:
<pre>
&gt; <b>::findjsobjects -v</b>
findjsobjects:         elapsed time (seconds) =&gt; 292
findjsobjects:                   heap objects =&gt; 8624128
findjsobjects:             JavaScript objects =&gt; 112501
findjsobjects:              processed objects =&gt; 100424
findjsobjects:                 unique objects =&gt; 241
OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
fe806139        1      1 Object: Queue
fc424131        1      1 Object: Credentials
fc424091        1      1 Object: version
fc4e3281        1      1 Object: message
fc404f6d        1      1 Object: uncaughtException
...
fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd     1037      3 Object: aborted, data, end
 8045475     1060      1 Object:
fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...
fafc4245     1673      1 Object: slab
fafc44d5     1702      5 Object: search, query, path, href, pathname
fafc440d     1784     14 Client: buffered_writes, name, timing, ...
fafc41c5     1796      3 Object: error, timeout, close
fafc4469     1811      3 Object: address, family, port
fafc42a1     2197      2 Object: connection, host
fbe10625     2389      2 Arguments: callee, length
fafc4251     2759     15 IncomingMessage: statusCode, httpVersionMajor, ...
fafc42ad     3652      0 Object:
fafc6785    11746      1 Object: oncomplete
fb7abc29    15155      1 Object: buffer
fb7a6081    15155      3 Object: , oncomplete, cb
fb121439    15378      3 Buffer: offset, parent, length
</pre>
<p>This immediately confirmed a hunch that <a>Matt</a> had had that this was a buffer leak.  And for <a href="https://twitter.com/#!/izs">Isaac</a> &#8212; who had been working this issue from the Gedanken side and was already zeroing in on certain subsystems &#8212; this data was surprising inasmuch as it was so confirming:  he was already on the right path.  In short order, he <a href="https://github.com/joyent/node/commit/d1effbb338fe5916e5a25fc3d3587016d0725761">nailed it</a>, and the fix is in <a href="http://blog.nodejs.org/2012/05/04/version-0-6-17-stable/">node 0.6.17</a>.
<p>The fix was low risk, so Voxer redeployed with it immediately &#8212; and for the first time in quite some time, memory utilization was flat.  This was a huge win &#8212; and was the reason for Matt&#8217;s <a href="https://twitter.com/#!/mranney/status/198125651495096321">tantalizing tweet</a>.  The advantages of this approach are that it requires absolutely no modification to one&#8217;s node.js programs &#8212; no special flags and no different options.  And it operates purely postmortem.  Thanks to help from gcore(1), core dumps can be taken over time for a single process, and those dumps can then be analyzed off-line.
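<p>As a rough sketch of that workflow, a small shell loop can snapshot a running process at a fixed interval. The function name <code>snapshot_cores</code>, the <code>node-core</code> file prefix, and the argument order are illustrative assumptions, not part of gcore or MDB; the only thing relied on is that <code>gcore -o NAME PID</code> writes its dump to NAME.PID:</p>
<pre>
#!/bin/sh
# Hypothetical helper: take COUNT core dumps of process PID,
# INTERVAL seconds apart, without stopping the process for good.
snapshot_cores() {
    pid=$1 count=$2 interval=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # gcore -o NAME writes the dump to NAME.PID
        gcore -o "node-core.$i" "$pid"
        # no need to wait after the final dump
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
</pre>
<p>For example, <code>snapshot_cores 12345 6 600</code> would take six dumps ten minutes apart, each of which can then be examined off-line with ::findjsobjects as above.</p>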
<p>Even with ::findjsobjects, debugging node.js memory leaks is still tough.  And there are certainly lots of improvements to be made here &#8212; there are currently some objects that we do not know how to correctly interpret, and we know that we can improve our algorithm for finding object references &#8212; but this shines a bright new light into what had previously been a black hole!
<p>If you want to play around with this, you&#8217;ll need <a href="http://smartos.org">SmartOS</a> or <a href="http://wiki.illumos.org/display/illumos/Distributions">your favorite illumos distro</a> (which, it must be said, you can get simply by provisioning on <a href="http://www.joyentcloud.com/">the Joyent cloud</a>).  You&#8217;ll need an updated v8.so &#8212; which you can either build yourself from <a href="https://github.com/joyent/illumos-joyent">illumos-joyent</a> or you can <a href="http://dtrace.org/blogs/bmc/files/2012/05/v8.so_.tar_.gz">download a binary</a>.  From there, follow <a href="http://dtrace.org/blogs/dap/2012/01/13/playing-with-nodev8-postmortem-debugging/">Dave&#8217;s instructions</a>.  Don&#8217;t hesitate to ping <a href="https://twitter.com/#!/bcantrill">me</a> if you get stuck or have requests for enhancement &#8212; and here&#8217;s to flat memory usage on your production node.js service!<br />
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Profiling Node.js</title>
		<link>http://dtrace.org/blogs/dap/2012/04/25/profiling-node-js/</link>
		<comments>http://dtrace.org/blogs/dap/2012/04/25/profiling-node-js/#comments</comments>
		<pubDate>Wed, 25 Apr 2012 17:21:45 +0000</pubDate>
		<dc:creator>dap</dc:creator>
				<category><![CDATA[DTrace]]></category>
		<category><![CDATA[joyent]]></category>
		<category><![CDATA[Node.js]]></category>
		<category><![CDATA[Production]]></category>
		<category><![CDATA[smartos]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/dap/?p=1027</guid>
		<description><![CDATA[(For returning readers, this is basically a &#8220;tl;dr&#8221; version of my previous post on Node.js performance. The post below also appears on the Node.js blog.) It&#8217;s incredibly easy to visualize where your Node program spends its time using DTrace and node-stackvis (a Node port of Brendan Gregg&#8217;s FlameGraph tool): Run your Node.js program as usual. [...]]]></description>
			<content:encoded><![CDATA[<p>(For returning readers, this is basically a &#8220;tl;dr&#8221; version of my previous post on <a href="http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/">Node.js performance</a>. The post below also appears on the <a href="http://blog.nodejs.org/2012/04/25/profiling-node-js/">Node.js blog</a>.)</p>
<p>It&#8217;s incredibly easy to visualize where your Node program spends its time using DTrace and <a href="http://github.com/davepacheco/node-stackvis">node-stackvis</a> (a Node port of Brendan Gregg&#8217;s <a href="http://github.com/brendangregg/FlameGraph/">FlameGraph</a> tool):</p>
<ol>
<li>Run your Node.js program as usual.</li>
<li>In another terminal, run:
<pre>
$ dtrace -o stacks.out -n 'profile-97/execname == "node" &#038;&#038; arg1/{
    @[jstack(100, 8000)] = count(); } tick-60s { exit(0); }'</pre>
<p>        This will sample about 100 times per second for 60 seconds and emit results to stacks.out. <strong>Note that this will sample all running programs called &#8220;node&#8221;.  If you want a specific process, replace <code>execname == "node"</code> with <code>pid == 12345</code> (the process id).</strong>
    </li>
<li>Use the &#8220;stackvis&#8221; tool to transform this directly into a flame graph. First, install it:
<pre>$ npm install -g stackvis</pre>
<p>        then use <code>stackvis</code> to convert the DTrace output to a flamegraph:</p>
<pre>$ stackvis dtrace flamegraph-svg &lt; stacks.out &gt; stacks.svg</pre>
</li>
<li>Open stacks.svg in your favorite browser.</li>
</ol>
<p>You&#8217;ll be looking at something like this:</p>
<p><a href="http://www.cs.brown.edu/~dap/helloworld.svg"><img src="http://dtrace.org/blogs/dap/files/2012/04/helloworld-flamegraph-550x366.png" alt="" title="helloworld-flamegraph" width="550" height="366" class="aligncenter size-large wp-image-1047" /></a></p>
<p>This is a visualization of all of the profiled call stacks. This example is from the &#8220;hello world&#8221; HTTP server on the <a href="http://nodejs.org">Node.js</a> home page under load. Start at the bottom, where you have &#8220;main&#8221;, which is present in most Node stacks because Node spends most on-CPU time in the main thread. Above each row, you have the functions called by the frame beneath it. As you move up, you&#8217;ll see actual JavaScript function names. The boxes in each row are not in chronological order, but their width indicates how much time was spent there. When you hover over each box, you can see exactly what percentage of time is spent in each function. This lets you see at a glance where your program spends its time.</p>
<p>That&#8217;s the summary. There are a few prerequisites:</p>
<ul>
<li>You must gather data on a system that supports DTrace with the Node.js ustack helper. For now, this pretty much means <a href="http://illumos.org/">illumos</a>-based systems like <a href="http://smartos.org/">SmartOS</a>, including the Joyent Cloud. <strong>MacOS users:</strong> OS X supports DTrace, but not ustack helpers. The way to get this changed is to contact your Apple developer liaison (if you&#8217;re lucky enough to have one) or <strong>file a bug report at bugreport.apple.com</strong>. I&#8217;d suggest referencing existing bugs 5273057 and 11206497. More bugs filed (even if closed as dups) show more interest and make it more likely Apple will choose to fix this.</li>
<li>You must be on 32-bit Node.js 0.6.7 or later, built <code>--with-dtrace</code>. The helper doesn&#8217;t work with 64-bit Node yet. On illumos (including SmartOS), development releases (the 0.7.x train) include DTrace support by default.</li>
</ul>
<p>There are a few other notes:</p>
<ul>
<li>You can absolutely profile apps <strong>in production</strong>, not just development, since compiling with DTrace support has very minimal overhead. You can start and stop profiling without restarting your program.</li>
<li>You may want to run the stacks.out output through <code>c++filt</code> to demangle C++ symbols. Be sure to use the <code>c++filt</code> that came with the compiler you used to build Node. For example:
<pre>c++filt &lt; stacks.out &gt; demangled.out</pre>
<p>    then you can use demangled.out to create the flamegraph.
    </li>
<li>If you want, you can filter stacks containing a particular function.  The best way to do this is to first collapse the original DTrace output, then grep out what you want:
<pre>
$ stackvis dtrace collapsed &lt; stacks.out | grep SomeFunction &gt; collapsed.out
$ stackvis collapsed flamegraph-svg &lt; collapsed.out &gt; stacks.svg</pre>
</li>
<li>If you&#8217;ve used Brendan&#8217;s FlameGraph tools, you&#8217;ll notice the coloring is a little different in the above flamegraph. I ported his tools to Node first so I could incorporate it more easily into other Node programs, but I&#8217;ve also been playing with different coloring options. The current default uses hue to denote stack depth and saturation to indicate time spent. (These are also indicated by position and size.) Other ideas include coloring by module (so V8, JavaScript, libc, etc. show up as different colors.)
    </li>
</ul>
<p>For more on the underlying pieces, see my <a href="http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/">previous post on Node.js profiling</a> and <a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">Brendan&#8217;s post on Flame Graphs</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/dap/2012/04/25/profiling-node-js/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BTrace: DTrace for Java… ish</title>
		<link>http://dtrace.org/blogs/ahl/2012/04/24/btrace-dtrace-for-java-ish/</link>
		<comments>http://dtrace.org/blogs/ahl/2012/04/24/btrace-dtrace-for-java-ish/#comments</comments>
		<pubDate>Tue, 24 Apr 2012 07:29:16 +0000</pubDate>
		<dc:creator>ahl</dc:creator>
				<category><![CDATA[BTrace]]></category>
		<category><![CDATA[DTrace]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/ahl/?p=1139</guid>
		<description><![CDATA[DTrace first peered into Java in early 2005 thanks to an early prototype by Jarod Jenson that led eventually to the inclusion of USDT probes in the HotSpot JVM. If you want to see where, say, the java.net.SocketOutputStream.write() method is called, you can simply run this DTrace script: hotspot$target:::method-entry /copyinstr(arg1, arg2) == "java/net/SocketOutputStream" &#038;&#038; copyinstr(arg3, [...]]]></description>
			<content:encoded><![CDATA[<p>DTrace <a href="http://dtrace.org/blogs/ahl/2005/04/18/dtracing-java/">first peered into Java</a> in early 2005 thanks to an early prototype by Jarod Jenson that led eventually to the inclusion of USDT probes in the <a href="http://en.wikipedia.org/wiki/HotSpot">HotSpot JVM</a>. If you want to see where, say, the java.net.SocketOutputStream.write() method is called, you can simply run this DTrace script:</p>
<pre>hotspot$target:::method-entry
/copyinstr(arg1, arg2) == "java/net/SocketOutputStream" &amp;&amp;
 copyinstr(arg3, arg4) == "write"/
{
        jstack(50, 8000);
}</pre>
<p>And that will work as long as you remember to start your JVM with the -XX:+ExtendedDTraceProbes option or you use the jinfo utility to enable it after the fact. And as long as you don&#8217;t mind a crippling performance penalty (hint: you probably do).</p>
<p>Inspired by <a href="http://dtrace.org/blogs/ahl/2012/04/09/dtrace-conf12-wrap-up/">dtrace.conf</a> a few weeks ago, I wanted to sketch out what the real Java provider would look like:</p>
<pre>java$target:java.net.SocketOutputStream:write:entry
{
        jstack(50,8000);
}</pre>
<p>And check it out:</p>
<pre># <strong>jdtrace.pl -p $(pgrep java) -n 'java$target:java.net.SocketOutputStream::entry{ jstack(50,8000); }'</strong>
dtrace: script '/tmp/jdtrace.19092/jdtrace.d' matched 0 probes
CPU     ID                    FUNCTION:NAME
0  64991 Java_com_sun_btrace_BTraceRuntime_dtraceProbe0:event
libbtrace.so`Java_com_sun_btrace_BTraceRuntime_dtraceProbe0+0xbb
com/sun/btrace/BTraceRuntime.dtraceProbe0(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceRuntime.dtraceProbe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;)I
java/net/SocketOutputStream.$btrace$jdtrace$probe1(Ljava/lang/String;Ljava/lang/String;)V
java/net/SocketOutputStream.write([BII)V
sun/nio/cs/StreamEncoder.writeBytes()V
sun/nio/cs/StreamEncoder.implFlushBuffer()V
sun/nio/cs/StreamEncoder.implFlush()V
sun/nio/cs/StreamEncoder.flush()V
java/io/OutputStreamWriter.flush()V
java/io/BufferedWriter.flush()V
java/io/PrintWriter.newLine()V
java/io/PrintWriter.println()V
java/io/PrintWriter.println(Ljava/lang/String;)V
com/delphix/appliance/server/ham/impl/HAMonitorServerThread.run()V
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V
java/util/concurrent/ThreadPoolExecutor$Worker.run()V
java/lang/Thread.run()V
StubRoutines (1)
libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x21d
libjvm.so`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_+0x27
libjvm.so`__1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_+0x149
libjvm.so`__1cMthread_entry6FpnKJavaThread_pnGThread__v_+0x113
libjvm.so`__1cKJavaThreadDrun6M_v_+0x2c6
libjvm.so`java_start+0x1f2
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start</pre>
<p>Obviously there's something fishy going on. First, we're using perl -- the shibboleth of fake-o-ware -- and there's this BTrace stuff in the output.</p>
<h3>Faking it with BTrace</h3>
<p><a href="http://kenai.com/projects/btrace">BTrace is a dynamic instrumentation tool for Java</a>; it is both inspired by DTrace and contains some DTrace integration. The perl script above takes the DTrace syntax and generates a DTrace script and a BTrace-enabled Java source file.</p>
<p>Like DTrace, BTrace lets you specify the points of instrumentation in your Java program as well as the actions to take. Here's what our generated source file looks like.</p>
<pre>import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class jdtrace {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void probe1(@ProbeClassName String c,
            @ProbeMethodName String m) {
                String name = "entry";
                String p = Strings.strcat(c, Strings.strcat(":",
                    Strings.strcat(m, Strings.strcat(":", name))));
                D.probe(p, "");
        }
}</pre>
<p>Note that we specify where to trace (this can be a regular expression), and then take the action of joining the class, method, and "entry" string into a single string that we pass to the D.probe() method that causes a BTrace USDT probe to fire.</p>
<p>Here's what the D script looks like:</p>
<pre>btrace$target:::event
{
        this-&gt;__jd_arg = copyinstr(arg0);
        this-&gt;__jd_mod = strtok(this-&gt;__jd_arg, ":");
        this-&gt;__jd_func = strtok(NULL, ":");
        this-&gt;__jd_name = strtok(NULL, ":");
}

btrace$target:::event
/((this-&gt;__jd_mod == "java.net.SocketOutputStream" &amp;&amp;
 this-&gt;__jd_func == "write" &amp;&amp;
 this-&gt;__jd_name == "entry"))/
{
        jstack(50,8000);
}</pre>
<p>It's pretty simple. We parse the string that was passed to D.probe(), and disassemble it into the DTrace notion of module, function, and name. We then use that information so that the specified actions are executed as appropriate (we could have specified different Java methods to probe, and different actions to take for each). <a href="https://github.com/adamleventhal/jdtrace/blob/master/jdtrace/jdtrace.pl">Here's the code</a> if you're interested.</p>
<p>This isn't the real Java provider, but is it close enough? Unfortunately not. The most glaring problem is that BTrace sometimes renders my Java process unresponsive. Other times it leaves instrumentation behind with no way of extracting it. The word "safe" appears as the third word on the BTrace website ("BTrace is safe"), but apparently there's still some way to go to achieve the requisite level of safety.</p>
<h3>A Better BTrace</h3>
<p>BTrace is an interesting tool for examining Java programs, but one obvious obstacle is that the programs are pretty cumbersome to write. With BTrace, we should be able to write a simple one-liner to see where we are when the java.net.SocketOutputStream.write() method is called, but instead we have to write a fairly cumbersome program:</p>
<pre>import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class TraceWrite {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite() {
                jstack();
        }
}</pre>
<p>DTrace-inspired syntax would let users iterate much more quickly:</p>
<pre>$ <strong>dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ jstack(); }'</strong>
java.net.SocketOutputStream.write(SocketOutputStream.java)
sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
java.io.BufferedWriter.flush(BufferedWriter.java:236)
java.io.PrintWriter.newLine(PrintWriter.java:438)
java.io.PrintWriter.println(PrintWriter.java:585)
java.io.PrintWriter.println(PrintWriter.java:696)
com.delphix.appliance.server.ham.impl.HAMonitorServerThread.run(HAMonitorServerThread.java:56)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)</pre>
<p>With BTrace, you can trace nearly arbitrary information about a program's state, but instead of doing something like this:</p>
<pre>dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ printFields(this.impl); }'</pre>
<p>You have to do this:</p>
<pre>import com.sun.btrace.annotations.*;
import com.sun.btrace.AnyType;
import static com.sun.btrace.BTraceUtils.Reflective.*;
@BTrace
public class TraceWrite {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite(@Self Object self) {
                Object impl = get(field(classOf(self), "impl"), self);
                printFields(impl);
        }
}</pre>
<pre>$ <strong>./bin/btrace $(pgrep -n java) TraceWrite.java</strong>
{server=null, port=1080, external_address=null, useV4=false, cmdsock=null, cmdIn=null, cmdOut=null, applicationSetProxy=false, timeout=0, trafficClass=0, shut_rd=false, shut_wr=false, socketInputStream=java.net.SocketInputStream@9993a1, fdUseCount=0, fdLock=java.lang.Object@ab5443, closePending=false, CONNECTION_NOT_RESET=0, CONNECTION_RESET_PENDING=1, CONNECTION_RESET=2, resetState=0, resetLock=java.lang.Object@292936, fd1=null, anyLocalBoundAddr=null, lastfd=-1, stream=false, socket=Socket[addr=/127.0.0.1,port=38832,localport=8765], serverSocket=null, fd=java.io.FileDescriptor@50abcc, address=/127.0.0.1, port=38832, localport=8765, }</pre>
<p>BTrace needs a language that enables rapid iteration &#8212; piggybacking on Java is holding it back &#8212; and it needs some hard safety guarantees. With those, many developers and support engineers would use BTrace as part of their daily work &#8212; we certainly would here at <a href="http://www.delphix.com">Delphix</a>.</p>
<p>Back to DTrace. Even with a useable solution for Java only, the ability to have lightweight and focused tracing for Java (and other dynamic languages) could be highly valuable. We&#8217;ll see how far BTrace can take us.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/ahl/2012/04/24/btrace-dtrace-for-java-ish/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>dtrace.conf(12) wrap-up</title>
		<link>http://dtrace.org/blogs/ahl/2012/04/09/dtrace-conf12-wrap-up/</link>
		<comments>http://dtrace.org/blogs/ahl/2012/04/09/dtrace-conf12-wrap-up/#comments</comments>
		<pubDate>Mon, 09 Apr 2012 18:03:09 +0000</pubDate>
		<dc:creator>ahl</dc:creator>
				<category><![CDATA[BryanCantrill]]></category>
		<category><![CDATA[DavePacheco]]></category>
		<category><![CDATA[DTrace]]></category>
		<category><![CDATA[dtrace.conf]]></category>
		<category><![CDATA[EricSchrock]]></category>
		<category><![CDATA[GeorgeWilson]]></category>
		<category><![CDATA[KrisvanHees]]></category>
		<category><![CDATA[MattAhrens]]></category>
		<category><![CDATA[OEL]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/ahl/?p=1114</guid>
		<description><![CDATA[For the second time in as many quadrennial dtrace.confs, I was impressed at how well the unconference format worked out. Sharing coffee with the DTrace community, it was great to see some of the oldest friends of DTrace &#8212; Jarod Jenson, Stephen O&#8217;Grady, Jonathan Adams to name a few &#8212; and to put faces to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dtrace.org/blogs/ahl/files/2012/04/dtrace.conf12_tee.png"><img class="alignright size-full wp-image-1124" title="dtrace.conf(12) tee" src="http://dtrace.org/blogs/ahl/files/2012/04/dtrace.conf12_tee.png" alt="" width="96" height="96" /></a>For the <a href="http://dtrace.org/blogs/ahl/2008/05/05/dtrace-conf-post-post-mortem/">second time</a> in as many quadrennial <a href="http://wiki.smartos.org/display/DOC/dtrace.conf">dtrace.conf</a>s, I was impressed at how well the unconference format worked out. Sharing coffee with the DTrace community, it was great to see some of the oldest friends of DTrace &#8212; <a href="http://dl.acm.org/citation.cfm?id=1117399">Jarod Jenson</a>, <a href="https://twitter.com/#!/sogrady">Stephen O&#8217;Grady</a>, <a href="https://blogs.oracle.com/jwadams/">Jonathan Adams</a> to name a few &#8212; and to put faces to names &#8212; <a href="https://twitter.com/#!/slfritchie">Scott Fritchie</a>, <a href="https://twitter.com/#!/dlsspy">Dustin Sallings</a>, Blake Irvin, etc &#8212; of the many new additions to the DTrace community. You can see all the <a href="http://wiki.smartos.org/display/DOC/dtrace.conf+Schedule">slides and videos</a>; these are my thoughts and notes on the day.</p>
<p><iframe width="224" height="126" src="http://www.youtube.com/embed/l_7v7Fn7uMQ" frameborder="0" class="alignright"></iframe><a href="http://dtrace.org/blogs/bmc/">Bryan</a> provided a typically eloquent review of the state of the community. DTrace development is alive and well &#8212; after a lull while Oracle&#8217;s acquisition of Sun settled in &#8212; with new support for a variety of languages and runtimes, and new products that rely heavily on DTrace as a secret sauce. Bryan laid out some important development goals, areas where many have started straying from the edges of the completed DTrace features into the partially complete or starkly missing. We all then set to work hammering out a loose schedule for the day; I&#8217;ll admit that at first I was worried that we&#8217;d have too many listeners and not enough presenters, but the schedule quickly filled &#8212; and with more topics than we&#8217;d end up having time to cover.</p>
<h3>User-land CTF and Dynamic Translators</h3>
<p><iframe width="224" height="126" src="http://www.youtube.com/embed/0QF04ivO_WE" frameborder="0" class="alignright"></iframe>DTrace, from its inception, has been a systemic analysis tool, but the earliest development focused on kernel observability &#8212; not a surprise since Bryan, Mike, and I developed it while working in Solaris kernel development. After its use spread (quickly) beyond the kernel team, use shifted more and more to features focused on understanding C and C++ applications in user-land, and then to applications written in a variety of higher-level languages &#8212; Java, Ruby, Perl, JavaScript, Erlang, etc. <a href="http://dtrace.org/blogs/dap/2011/12/13/usdt-providers-redux/">User-land Statically Defined Tracing</a> (USDT) is the DTrace facility that enables rich tracing of higher-level languages. It was a relatively late addition to DTrace (integrated in 2004, well after the initial integration in 2003), and since then we&#8217;ve learned a lot about what we got right, what we got wrong, and where it&#8217;s rough &#8212; <a href="https://twitter.com/#!/bcantrill/status/187246955464884226">in some cases very rough</a> &#8212; around the edges.</p>
<p>In his opening remarks, Bryan identified USDT improvements as a key area for the community&#8217;s focus. In DTrace development we tried to focus on making the impossible possible rather than making the possible easier. In its current form, some things are still impossible with DTrace, namely consumption of type structures from user-land programs; stable, non-privileged use of DTrace; and support for different runtime versions. <a href="http://dtrace.org/blogs/dap/">Dave Pacheco</a> and I took the first  slot on the schedule and spoke (at length &#8212; sorry) about solutions to these problems.</p>
<p>While others had the benefit of a bit more time to prepare, I did have the advantage of spending many years idly contemplating the problem space and possible solutions. On the subject of user-land type information (in the form of CTF), I identified the key parts of the code that would need some work. For the USDT enhancements, we discussed dynamic translators &#8212; D code that would be linked and executed at runtime, contrasted with today&#8217;s static translators that are compiled into a D program &#8212; how they would address the problem, and how these ideas could be extended to the kernel (for once, user-land is actually a bit ahead).</p>
<p>In a future blog post, I&#8217;ll go into the details of our off-the-cuff proposals and delve into the code to firm up those ideas. Beyond the extensive implementation work we laid out, the next step is to gather the most complicated, extant USDT providers and proposals for other providers, and figure out what they should look like in the new, dynamic translator world.</p>
<h3>The D Language</h3>
<p><iframe width="224" height="126" src="http://www.youtube.com/embed/1NM7lAvCxFc" frameborder="0" class="alignright"></iframe>Next up, my long-time colleague, DTrace contributor, <a href="http://dtrace.org/blogs/eschrock/">Eric Schrock</a> led the discussion on D language additions. The format of a D program is heavily tied to DTrace&#8217;s implementation: all clauses must trace a fixed amount of data, and infinite loops are forbidden. For this reason, D lacks the backward branches needed for traditional looping, subroutines for common code, and if/else clauses for control flow. Each of these has a work-alike &#8212; unrolled loops, macros, and predicates or the ternary operator &#8212; but their absence renders D confusing to some &#8212; especially those unaware of the motivation. Further, the D language need not necessarily hold the underlying implementation so central.</p>
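<p>To make those work-alikes concrete, here is a minimal sketch (an illustration with assumed names, not something presented in the session). The shell function <code>count_reads_by_fd</code> and the aggregation names <code>@pred</code> and <code>@tern</code> are hypothetical. The D program inside it makes the same two-way decision twice: once as a pair of clauses with complementary predicates, and once in a single clause with the ternary operator:</p>
<pre>
#!/bin/sh
# Hypothetical wrapper; running dtrace requires appropriate privileges.
count_reads_by_fd() {
    dtrace -q -n '
    /* if/else written as two clauses with complementary predicates */
    syscall::read:entry /arg0 == 0/ { @pred["stdin"] = count(); }
    syscall::read:entry /arg0 != 0/ { @pred["other fd"] = count(); }

    /* the same decision in one clause, using the ternary operator */
    syscall::read:entry { @tern[arg0 == 0 ? "stdin" : "other fd"] = count(); }

    /* stop after ten seconds; aggregations print on exit */
    tick-10s { exit(0); }'
}
</pre>
<p>Calling <code>count_reads_by_fd</code> should print both aggregations with matching counts. The predicate form needs one clause per branch, which is part of why the missing if/else reads as confusing to newcomers.</p>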
<p>Eric discussed some proposals for how each might be addressed, and I noted that it would be possible to create a prototype environment where we could try out these &#8220;D++&#8221; features by compiling into D work-alikes. The next step is to identify the most complicated D scripts, and see what they might look like for various incarnations of those language features.</p>
<h3>Work with DTrace</h3>
<p>The next few sessions focused not on changes to DTrace, but interesting work done using DTrace:</p>
<p>John Thompson of Sony talked about their port of DTrace to the PlayStation Vita (!). Sony developers are given access to DTrace, but found it to be unfamiliar and unapproachable. John spoke about his attempts to remedy this by replacing D with a C++-like interface which he implemented by replacing the D compiler with Clang.</p>
<p>My Fishworks colleague, <a href="https://twitter.com/brendangregg">Brendan Gregg</a>, showed some of the beautiful visualizations they&#8217;ve been developing at Joyent, and talked about the analyses those visualizations enabled. As always, it was fascinating stuff. If you don&#8217;t read <a href="http://dtrace.org/blogs/brendan">Brendan&#8217;s blog</a>, you really should. Long-time DTrace advocate, <a href="http://twitter.com/postwait">Theo Schlossnagle</a>, talked about the visualizations they&#8217;re doing in <a href="http://circonus.com/">Circonus</a> &#8212; also fascinating stuff for anyone thinking about how to present system activity in comprehensible ways. <a href="http://twitter.com/richardelling">Richard Elling</a> showed the DTrace-based visualizations Nexenta used at VMworld to rave reviews.</p>
<p><a href="https://twitter.com/mcavage">Mark Cavage</a> <a href="http://mcavage.github.com/presentations/dtrace_conf_2012-04-03/">presented</a> Joyent&#8217;s work bringing DTrace to node.js; <a href="http://twitter.com/slfritchie">Scott Fritchie</a> <a href="http://www.snookles.com/scott/publications/dtrace.conf-2012.erlang-vm.pdf">talked about</a> DTrace for Erlang. Both were useful sources of ideas for how we could improve USDT.</p>
<p>Ryan Stone presented the state of DTrace on FreeBSD. That DTrace is not enabled in the build by default remains a key obstacle for adoption. I hope that Ryan et al. are able to persuade the FreeBSD leadership that their licensing fears are misguided.</p>
<h3>DTrace for OEL</h3>
<p><iframe width="224" height="126" src="http://www.youtube.com/embed/NElog3MvUC8" frameborder="0" class="alignright"></iframe>I was delighted that Kris van Hees was able to attend to present the Oracle port to Linux. DTrace for OEL was <a href="http://dtrace.org/blogs/ahl/2011/10/05/dtrace-for-linux-2/">announced at Oracle Open World 2011</a>, but the initial beta <a href="http://dtrace.org/blogs/ahl/2011/10/10/oel-this-is-not-dtrace/">didn&#8217;t live up to its billing</a> at OOW. As is often the case, this was more a failure of messaging than of engineering. Kris and his team are making steady progress. While it&#8217;s not yet in the public beta, they have the kernel function boundary tracing provider (fbt) implemented. Most heartening of all, Oracle intends to keep DTrace for OEL moving forward as the community evolves and improves DTrace &#8212; rather than forking it. How that plays out, and what that means for DTrace on Oracle Solaris will be interesting to see, but it&#8217;s great to hear that Kris sees the value of DTrace ubiquity and DTrace compatibility.</p>
<p>As was remarked several times, having DTrace available on the fastest growing deployment platform will be the single most significant accelerator for DTrace adoption. The work Kris and his team at Oracle are doing is probably the most important in the DTrace ecosystem, and I think that I speak for the entire DTrace community in offering to assist in any way possible.</p>
<h3>A ZFS DTrace Provider</h3>
<p>Matt Ahrens and George Wilson &#8212; respectively the co-inventor of ZFS, and the preeminent SPA developer &#8212; presented a <a href="https://docs.google.com/a/delphix.com/document/d/1wOxlXX6nLm56fccIUPS6iD1pgkX57OwdD78YhQWC8oQ/edit">proposal for a DTrace provider for ZFS</a>. ZFS is a highly sophisticated filesystem, but one that is also difficult to understand. Building in rich instrumentation is going to be a tremendous step forward for anyone using ZFS (for example, our mutual employer, Delphix).</p>
<h3>Whither DTrace?</h3>
<p>Jarod Jenson &#8212; the first DTrace user outside of Sun &#8212; took the stage in the final session to talk about DTrace adoption. Jarod has made DTrace a significant part of his business for many years. What continues to amaze him, despite numerous presentations, demonstrations, and lessons, is the relatively low level of DTrace adoption. DTrace is a tool that comes alive in the hands of a skilled, scientific, incisive practitioner &#8212; and in all of those, Jarod is superlative &#8212; but it can have a high bar of entry. There were many concrete suggestions for how to improve DTrace adoption. Most of them didn&#8217;t hold water for me &#8212; different avenues for education, further documentation, community outreach, higher level tools, visualizations, etc. &#8212; but two were quite compelling: DTrace for Linux, and DTrace on <a href="http://stackoverflow.com">stackoverflow.com</a> (and the like). I don&#8217;t know how much room there is to participate in the former, but by all means if there are DTrace one-liners that solve problems (on Mac OS X for example), post them, and get people covertly using DTrace.</p>
<p><iframe width="224" height="126" src="http://www.youtube.com/embed/rq4eR9NJMmU" frameborder="0" class="alignright"></iframe>The core DTrace community is growing. It was great to see old friends like Steve Peters who worked on porting DTrace to Mac OS X in the same room as Kris van Hees as he spoke about his port to Linux. It was inspiring to see so many new members of the community, eager to use, build and improve DTrace. And personally it inspired me to get back into the code to finish up some projects I had in flight, and to chart out the course for some of the projects we discussed.</p>
<p>Thanks to everyone who attended dtrace.conf in person or online. And thanks especially to Deirdre Straughan who made it happen.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/ahl/2012/04/09/dtrace-conf12-wrap-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Subsecond Offset Heat Maps</title>
		<link>http://dtrace.org/blogs/brendan/2012/03/26/subsecond-offset-heat-maps/</link>
		<comments>http://dtrace.org/blogs/brendan/2012/03/26/subsecond-offset-heat-maps/#comments</comments>
		<pubDate>Mon, 26 Mar 2012 18:16:16 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[cloud analytics]]></category>
		<category><![CDATA[DTrace]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[subsecond]]></category>
		<category><![CDATA[visualizations]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=2826</guid>
		<description><![CDATA[&#8220;Wow, that&#8217;s weird!&#8221;. My subsecond offset visualization type looked great, but others found it weird and unfamiliar. I developed it for inclusion in Joyent&#8217;s Cloud Analytics tool for the purposes of workload characterization. Given that it was so unfamiliar, I had some explaining to do. Voxer, a company that makes a walkie-talkie application for smart [...]]]></description>
			<content:encoded><![CDATA[<p>&#8220;Wow, that&#8217;s weird!&#8221;.  My subsecond offset visualization type looked great, but others found it weird and unfamiliar.  I developed it for inclusion in Joyent&#8217;s <a href="http://dtrace.org/blogs/dap/2011/03/01/welcome-to-cloud-analytics/">Cloud Analytics</a> tool for the purposes of <i>workload characterization</i>.  Given that it was so unfamiliar, I had some explaining to do.</p>
<div style="float:right;padding-top:0px;padding-left:8px"><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-010-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-010-250.png" alt="" title="subsec-010-250" width="250" height="251" class="alignnone size-full wp-image-2828" /></a></div>
<p><a href="http://www.voxer.com">Voxer</a>, a company that makes a walkie-talkie application for smart phones, had been seeing a performance issue with their Riak database.  The issue appeared to be related to TCP listen drops &#8211; when SYNs are dropped as the application can&#8217;t keep up with the accept() queue.  Voxer has millions of users whose numbers are growing fast, so I expected to see Riak hit 100% CPU usage when these drops occurred.  The subsecond offset heat map (top on the right) painted a different story, which led to an operating system kernel fix.</p>
<p>Weird but wonderful, this heat map helped solve a hard problem, and I was left with some interesting screenshots to help explain this visualization type.</p>
<p>In this post, I&#8217;ll explain subsecond offset heat maps using the Voxer issue as a case study, then show various other interesting examples from a production cloud environment.  This environment is a single datacenter that includes 200 physical servers and thousands of OS instances.  The heat maps are all generated by Joyent Cloud Analytics, which uses <a href="http://dtrace.org/blogs/about/">DTrace</a> to fetch the data.</p>
<h2>Description</h2>
<div style="float:right;padding-left:5px;padding-bottom:5px"><a href="http://dtrace.org/blogs/brendan/files/2012/03/axis-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/axis-126.png" alt="" title="axis-126" width="126" height="139" class="alignnone size-full wp-image-2969" /></a></div>
<p>The subsecond offset heat map puts time on two axes.  The x-axis shows the passage of time, with each column representing one second.  The y-axis shows the <i>time within a second</i>, spanning from 0.0s to 1.0s (time offsets).  The z-axis (color) shows the count of samples or events, quantized into x- and y-axis ranges (&#8220;buckets&#8221;), with the color darkness reflecting the event count (darker == more).  This relationship is shown to the right.</p>
<p>I previously explained the use of quantized heat maps in section 11 of <a href="http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/">Visualizing Device Utilization</a>.  I use them to show event <a href="http://dtrace.org/blogs/brendan/2011/06/03/file-system-latency-part-5/">latency</a> as well.</p>
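<p>The bucketing described above can be sketched in a few lines of Python.  This is a minimal illustration, not the Cloud Analytics implementation: the function name, the nanosecond input format and the number of rows per column are all assumptions made for the example.</p>
<pre>
```python
from collections import Counter

NS_PER_SEC = 1_000_000_000

def subsecond_buckets(timestamps_ns, rows=50):
    """Count events per (column, row) bucket.

    timestamps_ns: non-negative event times in nanoseconds (hypothetical input).
    rows: number of y-axis buckets per second (an arbitrary choice here).
    """
    counts = Counter()
    for t in timestamps_ns:
        column = t // NS_PER_SEC               # x-axis: which second
        offset_ns = t % NS_PER_SEC             # y-axis: time within that second
        row = offset_ns * rows // NS_PER_SEC   # quantize the offset into a row
        counts[(column, row)] += 1             # z-axis: event count for the bucket
    return counts

# Seven made-up events, given in milliseconds for readability.
events_ms = [100, 110, 950, 1100, 1500, 1510, 1520]
heat = subsecond_buckets([ms * 1_000_000 for ms in events_ms], rows=10)
for bucket in sorted(heat):
    print(bucket, heat[bucket])
```
</pre>
<p>Each key is a pixel in the heat map, and the count decides how dark it is drawn.  Only the counts are kept, not the individual events.</p>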
<h2>Time on Two Axes</h2>
<div style="float:left;padding-top:0px;padding-right:8px"><a href="http://dtrace.org/blogs/brendan/files/2012/03/wolfram.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/wolfram-350.png" alt="" title="wolfram-350" width="350" height="211" class="alignnone size-full wp-image-2837" /></a></div>
<p>Heat maps aren&#8217;t the weird part.  What&#8217;s weird is putting time on more than one axis.  Stephen Wolfram recently posted <a href="http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/">The Personal Analytics of My Life</a>, which included an amazing scatter plot (on the left).  This has time on both x- and y- axes.  I&#8217;ve included it as it may be a much easier example to grasp at first glance, before the subsecond offset heat maps.</p>
<p>His is at a much longer time scale: the x-axis shows days, and the y-axis shows offset within a day.  Using similar terminology, this could be called a &#8220;subday-offset&#8221; or &#8220;24hr-offset&#8221; scatter plot.  Each point on his plot shows when Wolfram sent an email, revealing his sleeping habits as the white gap in the morning.</p>
<p>Scatter plots are limited in the density of the points they can display, and don&#8217;t compress the data set (x &amp; y coordinates are kept for each event).   Heat maps solve both issues, allowing them to scale, which is especially important for the cloud computing uses that follow.  These use the subsecond offset scale, but other ranges are possible as well (minute-offset, hour-offset, day-offset).</p>
<h2>That&#8217;s No Artifact</h2>
<p>The screenshot at the top of this page (click any for full-res) used a subsecond offset heat map for CPU thread samples &#8211; showing when applications were on-CPU during the second.  The sampling was at 99 Hertz across all CPUs, to minimize overhead (instead of, say, 1000 Hz), and to avoid lockstep (with any power-of-10 Hz task).  These CPU samples are then quantized into the buckets seen as pixels.</p>
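<p>The lockstep concern can be illustrated with some arithmetic.  The sketch below assumes a hypothetical task that runs at exactly 100 Hz, and counts how many different positions within that task&#8217;s period a sampler lands on.  The function name and the rates are illustrative only.</p>
<pre>
```python
from fractions import Fraction

def distinct_phases(sample_hz, task_hz, samples):
    """Positions within the task's period that the sampler lands on."""
    sample_period = Fraction(1, sample_hz)
    task_period = Fraction(1, task_hz)
    return {(n * sample_period) % task_period for n in range(samples)}

# 1000 consecutive samples against a hypothetical 100 Hz task.
lockstep = distinct_phases(100, 100, 1000)   # sampler and task share a period
drifting = distinct_phases(99, 100, 1000)    # sampler slides past the task

print(len(lockstep), len(drifting))
```
</pre>
<p>A 100 Hz sampler always lands on the same point of the 100 Hz task&#8217;s cycle, so it would see that task either every time or never.  At 99 Hz the sample point drifts through the cycle, so the task is sampled in proportion to the time it actually spends on-CPU.</p>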
<p>The heat map revealed that CPU usage dropped at the same time as the TCP listen drops.  I was expecting the opposite.</p>
<p>By selecting Riak (as &#8220;beam.smp&#8221;, the Erlang VM it uses) and &#8220;Isolate selected&#8221;, only Riak is shown:</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-020-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-020-600.png" alt="" title="subsec-020-600" width="600" height="294" class="alignnone size-full wp-image-2858" /></a></p>
<p>Left of center shows two columns, each with about 40% of the offsets colored white.  Assuming no sampling issue, it means that the Riak database was entirely off-CPU for hundreds of consecutive milliseconds.  This is similar to the white gaps showing when Wolfram was asleep &#8212; except that we aren&#8217;t expecting the Riak database to take naps!  This was so bizarre that I first thought that something was wrong with the instrumentation, and that the white gaps were an artifact.</p>
<p>Application threads normally spend time off-CPU when blocked on I/O or waiting for work.  What&#8217;s odd here is that for so long the number of running Riak threads is zero, when normally it varies more quickly.  And this event coincided with TCP listen drops.</p>
<h2>The Shoe That Fits</h2>
<p>In Cloud Analytics, heat maps can be clicked to reveal details at that point.  I clicked inside the white gap, which revealed that a process called &#8220;zoneadmd&#8221; was running; isolating it:</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-030-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-030-600.png" alt="" title="subsec-030-600" width="600" height="369" class="alignnone size-full wp-image-2850" /></a></p>
<p>This fits the white gap closely, and a similar relationship was observed at other times as well.  This pointed suspicion to zoneadmd, which other observability tools had missed.  Some tools sampled the running processes every few seconds or minutes, and usually missed the short-lived zoneadmd completely.  Even watching every second was difficult: Riak&#8217;s CPU usage dropped for two seconds, at a different rate to what zoneadmd consumed (Riak is multi-threaded, so it can consume more CPU in the same interval than the single-threaded zoneadmd).  The subsecond offset heat map showed the clearest correlation: the duration of these events was similar, and the starting and ending points were nearby.</p>
<p>If zoneadmd was somehow blocking Riak from executing, it would explain the off-CPU gap and also the TCP listen drops &#8211; as Riak wouldn&#8217;t be running to accept the connections.</p>
<h2>Kernel Fix</h2>
<p>Investigation on the server using DTrace quickly found that Riak was getting blocked as it waited for an address space lock (as_lock) during mmap()/munmap() calls from its bitcask storage engine.  That lock was being held by zoneadmd for hundreds of milliseconds (see the Artifacts section later for a longer description).  zoneadmd enforces multi-tenant memory limits, and every couple of minutes checked the size of Riak.  It did this via kernel calls which scan memory pages while holding as_lock.  This scan took time, as Riak was tens of Gbytes in size.</p>
<p>We found other applications exhibiting the same behavior, including Riak&#8217;s &#8220;memsup&#8221; memory monitor.  All of these were blocking Riak, and with Riak off-CPU unable to accept() connections, the TCP backlog queue often hit its limit resulting in TCP listen drops (tcpListenDrop).  Jerry Jelinek of Joyent has been fixing these codepaths via kernel changes.</p>
<h2>Patterns</h2>
<p>The previous heat map included a &#8220;Distribution details&#8221; box at the bottom, summarizing the quantized bucket that I clicked on.  It shows that &#8220;zoneadmd&#8221; and &#8220;ipmitool&#8221; were running, each sampled twice in the range 743 &#8211; 763 ms (consistent with them being single threaded and sampled at 99 Hertz).</p>
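<p>The &#8220;sampled twice&#8221; figure can be checked with a quick calculation.  This sketch assumes the process stayed on one CPU for the whole bucket range; the function name is made up for the example.</p>
<pre>
```python
def expected_samples(sample_hz, start_ms, end_ms, threads=1):
    """Approximate samples for a process that is on-CPU for the whole range."""
    duration_s = (end_ms - start_ms) / 1000
    return sample_hz * duration_s * threads

# A 20 ms bucket sampled at 99 Hertz, one thread: roughly two samples.
print(round(expected_samples(99, 743, 763)))
```
</pre>
<p>A multi-threaded process running on several CPUs at once could be sampled more often than this in the same bucket, which is why a count of two fits a single-threaded process.</p>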
<div style="float:right;padding-left:8px"><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-040-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-040-300.png" alt="" title="subsec-040-300" width="300" height="147" class="alignnone size-full wp-image-2918" /></a><br /><center><i>ipmitool and zabbix_agentd</i></center></div>
<p>To check whether ipmitool was an issue, I isolated its on-CPU usage and found that it often did not coincide with Riak off-CPU time.  While checking this, I found an interesting pattern caused by zabbix_agentd.  These are shown on the right: ipmitool is highlighted in yellow, and zabbix_agentd in red.</p>
<p>Just based on the heat map, it would appear that zabbix_agentd is a single thread (or process) that wakes up every second to perform a small amount of work.  It then sleeps for an entire second.  Repeat.  This causes the rising diagonal line, the slope of which depends on how long zabbix_agentd worked before sleeping for the next full second: greater slopes (approaching 90 degrees) reflect more work performed before the next sleep.</p>
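<p>A small simulation shows why a work-then-sleep loop draws a diagonal.  This is a guess at the behavior based only on the heat map, as above; the work durations are invented and the function name is hypothetical.</p>
<pre>
```python
def burst_starts(work_ms, cycles, sleep_ms=1000):
    """(second, offset_ms) at which each work burst begins in a work-then-sleep loop."""
    starts = []
    t_ms = 0
    for _ in range(cycles):
        starts.append((t_ms // 1000, t_ms % 1000))
        t_ms += work_ms + sleep_ms   # next burst begins after the work plus a full sleep
    return starts

light = burst_starts(work_ms=5, cycles=4)    # rises 5 ms per column
heavy = burst_starts(work_ms=40, cycles=4)   # rises 40 ms per column
print(light)
print(heavy)
```
</pre>
<p>Because the sleep is a full second measured from the end of the work, each wakeup lands a little later in the following second.  The more work per cycle, the bigger the step between columns and the steeper the line.</p>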
<p>zabbix_agentd is part of the Zabbix monitoring software.  If it is supposed to perform work roughly every second, then it should be ok.  But if it is supposed to perform work <i>exactly</i> once a second, such as reading system counters to calculate the statistics it is monitoring, then there could be problems.</p>
<h2>Cloud Scale</h2>
<p>The previous examples showed CPU thread samples on a single server (each title included &#8220;predicated by server hostname == &#8230;&#8221;).  Cloud Analytics can show these for the entire cloud &#8211; which may be hundreds of systems (virtualized operating system instances).  I&#8217;ll show this with a different heat map type: instead of CPU thread samples, which show the CPU usage of applications, I&#8217;ll show the subsecond offsets of system calls (syscalls), which paint a different picture &#8211; one better reflecting the I/O behavior.  Tracing syscalls can reveal more processes than sampling, which can miss short-lived events.</p>
<p>The two images that follow show subsecond offsets for syscalls across an entire datacenter (200 physical servers, thousands of OS instances).  On the left are syscalls by &#8220;httpd&#8221; (Apache web server), and on the right are those by the &#8220;ls&#8221; command:</p>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-050-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-050-300.png" alt="" title="subsec-050-300" width="300" height="147" class="alignnone size-full wp-image-2899" /></a><br /><center><i>httpd</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-120-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-120-300.png" alt="" title="subsec-120-300" width="300" height="148" class="alignnone size-full wp-image-2909" /></a><br /><center><i>ls</i></center></td>
</tr>
</table>
<p>Neither of these may be very surprising.  The httpd syscalls will arrive at random times based on the client workload, and combining them for dozens of busy web servers results in a heat map with random color intensities (which have been enhanced due to the <a href="http://dtrace.org/blogs/dap/2011/06/20/heatmap-coloring/">rank-based</a> default color map).</p>
<p>Sometimes the heat maps are surprising.  The next two show zeus.flipper (web load balancing software), on the left for the entire cloud, and on the right for a single server:</p>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-060-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-060-300.png" alt="" title="subsec-060-300" width="300" height="147" class="alignnone size-full wp-image-2900" /></a><br /><center><i>zeus.flipper</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-070-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-070-300.png" alt="" title="subsec-070-300" width="300" height="147" class="alignnone size-full wp-image-2901" /></a><br /><center><i>zeus.flipper (single)</i></center></td>
</tr>
</table>
<p>The cloud-wide heat map does show that there is a pattern present, which has been isolated for a single server on the right.  It appears that multiple threads are present: many waking up more than once a second (the two large bands), and others waking up every two (<img src="http://dtrace.org/blogs/brendan/files/2012/03/clip-2sec.png" alt="" title="clip-2sec" width="48" height="11" class="alignnone size-full wp-image-2987" />), five (<img src="http://dtrace.org/blogs/brendan/files/2012/03/clip-5sec-d.png" alt="" title="clip-5sec-d" width="48" height="11" class="alignnone size-full wp-image-2988" />) and ten seconds (<img src="http://dtrace.org/blogs/brendan/files/2012/03/clip-10sec-d.png" alt="" title="clip-10sec-d" width="52" height="10" class="alignnone size-full wp-image-2989" />).</p>
<h2>Cloud Wide vs Single Server</h2>
<p>Here are some other examples comparing an entire cloud vs a single server (click for full screenshot).  These are also syscall subsecond offsets:</p>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-240-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-240-300.png" alt="" title="subsec-240-300" width="300" height="140" class="alignnone size-full wp-image-2943" /></a><br /><center><i>node.js</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-260-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-260-300.png" alt="" title="subsec-260-300" width="300" height="140" class="alignnone size-full wp-image-2954" /></a><br /><center><i>node.js (single)</i></center></td>
</tr>
</table>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-090-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-090-300.png" alt="" title="subsec-090-300" width="300" height="140" class="alignnone size-full wp-image-2896" /></a><br /><center><i>Java</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-100-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-100-300.png" alt="" title="subsec-100-300" width="300" height="140" class="alignnone size-full wp-image-2897" /></a><br /><center><i>Java (single)</i></center></td>
</tr>
</table>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-140-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-140-300.png" alt="" title="subsec-140-300" width="300" height="140" class="alignnone size-full wp-image-2906" /></a><br /><center><i>Python</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-150-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-150-300.png" alt="" title="subsec-150-300" width="300" height="140" class="alignnone size-full wp-image-2908" /></a><br /><center><i>Python (single)</i></center></td>
</tr>
</table>
<p>I&#8217;ve shown just one single-server example each for node.js, Java, and Python; however, each server can look quite different based on its use and workload.  Applications such as zeus.flipper are more likely to look similar as they serve the same function on every server.</p>
<h2>Cloud Identification Chart</h2>
<p>Some other cloud-wide examples, using syscall subsecond offsets:</p>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-200-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-200-300.png" alt="" title="subsec-200-300" width="300" height="140" class="alignnone size-full wp-image-2935" /></a><br /><center><i>awk</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-210-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-210-300.png" alt="" title="subsec-210-300" width="300" height="140" class="alignnone size-full wp-image-2937" /></a><br /><center><i>bash</i></center></td>
</tr>
</table>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-130-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-130-300.png" alt="" title="subsec-130-300" width="300" height="140" class="alignnone size-full wp-image-2905" /></a><br /><center><i>kstat</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-110-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-110-300.png" alt="" title="subsec-110-300" width="300" height="140" class="alignnone size-full wp-image-2903" /></a><br /><center><i>munin-node</i></center></td>
</tr>
</table>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-220-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-220-300.png" alt="" title="subsec-220-300" width="300" height="140" class="alignnone size-full wp-image-2939" /></a><br /><center><i>Perl</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-230-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-230-300.png" alt="" title="subsec-230-300" width="300" height="140" class="alignnone size-full wp-image-2941" /></a><br /><center><i>php-fpm</i></center></td>
</tr>
</table>
<p>The munin-node heat map has several lines of dots <img src="http://dtrace.org/blogs/brendan/files/2012/03/clip-munin10-d.png" alt="" title="clip-munin10-d" width="50" height="11" class="alignnone size-full wp-image-2992" />, each dot two seconds apart.  Can you guess what those might be?</p>
<h2>Color Maps</h2>
<p>The colors chosen for the heat map can either be rank-based or linear-based, which select color saturation differently.  The selected type for the previous heat maps can be seen after &#8220;COLOR BY:&#8221; in the full screenshots (click images).</p>
<p>This shows node.js processes across the entire cloud, to compare the color maps side-by-side:</p>
<table border=0 width="100%">
<tr>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-240-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-240-300.png" alt="" title="subsec-240-300" width="300" height="140" class="alignnone size-full wp-image-2943" /></a><br /><center><i>node.js (rank)</i></center></td>
<td><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-250-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-250-300.png" alt="" title="subsec-250-300" width="300" height="140" class="alignnone size-full wp-image-2945" /></a><br /><center><i>node.js (linear)</i></center></td>
</tr>
</table>
<p>The rank-based heat map highlights subtle variation well.  The linear colored heat map reflects reality.  This is an extreme example; often the heat maps look much more similar.  For a longer description of rank vs linear, see Dave Pacheco&#8217;s <a href="http://dtrace.org/blogs/dap/2011/06/20/heatmap-coloring/">heat map coloring</a> post, and the Saturation section in my <a href="http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/">Visualizing Device Utilization</a> post.</p>
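<p>To make the difference concrete, here is one simplified way the two color maps could be computed.  This is an illustration under my own assumptions, not the algorithm Cloud Analytics uses (see the posts linked above for that); both function names are hypothetical.</p>
<pre>
```python
def linear_saturation(counts):
    """Saturation proportional to each bucket's count (0.0 to 1.0)."""
    peak = max(counts)
    if peak == 0:
        return [0.0 for _ in counts]
    return [c / peak for c in counts]

def rank_saturation(counts):
    """Saturation proportional to each bucket's rank among the distinct counts."""
    distinct = sorted(set(counts))
    rank = {value: i + 1 for i, value in enumerate(distinct)}
    return [rank[c] / len(distinct) for c in counts]

counts = [1, 2, 3, 1000]   # three quiet buckets and one very busy one
print(linear_saturation(counts))
print(rank_saturation(counts))
```
</pre>
<p>With the linear map, the one busy bucket takes nearly all of the color range and the three quiet buckets are almost invisible.  With the rank map, the quiet buckets are spread evenly across the range, which brings out subtle variation but no longer shows how much busier the fourth bucket really is.</p>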
<h2>Artifacts</h2>
<div style="float:right;padding-left:8px"><a href="http://dtrace.org/blogs/brendan/files/2012/03/artifact01-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/artifact01-small.png" alt="" title="artifact01-small" width="319" height="154" class="alignnone size-full wp-image-2960" /></a></div>
<p>The first example I showed featured the Riak database being blocked by zoneadmd.  The blocking event was continuous, and lasted for almost a full second.  It was shown twice in the first subsecond offset column due to the way the data is currently collected by Joyent Cloud Analytics &#8211; resulting in an &#8220;artifact&#8221;.</p>
<p>This is shown on the right.  The time that a column of data is collected from the server does not occur at the 0.0 offset, but rather some other offset during the second.  This means that an activity that is in-flight will suddenly jump to the next column, as has happened here (at the &#8220;3&#8221; mark).  It also means that an activity at the top of the column can wrap and continue at the bottom of the same column (at the &#8220;2&#8221; mark), before the column switch occurs.  I think this is fixable by recalculating offsets relative to the data collection time, so the switch happens at offset 0.0.  (It hasn&#8217;t been fixed already since it usually isn&#8217;t annoying, and didn&#8217;t noticeably interfere with the many other examples I&#8217;ve shown.)</p>
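<p>The proposed fix amounts to shifting each timestamp by the collection offset before splitting it into a column and an offset.  The following is a sketch of that idea only, with a made-up collection offset of 300 ms; it is not code from Cloud Analytics, and the names are hypothetical.</p>
<pre>
```python
NS_PER_SEC = 1_000_000_000
NS_PER_MS = 1_000_000

def rebase(timestamp_ns, collection_offset_ns):
    """Map a timestamp to (column, offset_ns), with columns starting at the collection time."""
    shifted = timestamp_ns - collection_offset_ns
    return shifted // NS_PER_SEC, shifted % NS_PER_SEC

collect_at = 300 * NS_PER_MS          # assumed: data is collected 300 ms into each second
activity_ms = [900, 1000, 1100, 1200] # one continuous activity crossing a second boundary
rebased = [rebase(ms * NS_PER_MS, collect_at) for ms in activity_ms]
for column, offset_ns in rebased:
    print(column, offset_ns // NS_PER_MS)
```
</pre>
<p>With wall-clock offsets, this activity would run off the top of the column at 1.0s and wrap to the bottom of the same column.  After rebasing, all four points fall in one column at steadily increasing offsets, and the switch to the next column happens at offset 0.0.</p>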
<div style="float:left;padding-right:8px"><a href="http://dtrace.org/blogs/brendan/files/2012/03/artifact02-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/artifact02-small.png" alt="" title="artifact02-small" width="206" height="104" class="alignnone size-full wp-image-2964" /></a></div>
<p>On the left is a different type of artifact, one caused when data collection is delayed.  To minimize overhead, data is aggregated in-kernel, then read at a gentle rate (usually once per second) by a user-land process.  This problem occurs when the user-land process is delayed slightly for some reason, and the kernel aggregations include extra data (overlapping offsets) by the time they are read.  Those offsets are then missing from the next column, on the right.</p>
<h2>Thread Scheduling</h2>
<p>I intended to include a &#8220;checkerboard&#8221; heat map of CPU samples, like those Robert Mustacchi showed in his <a href="http://dtrace.org/blogs/rm/2011/08/16/visualizing-kvm/">Visualizing KVM</a> post.  This involves running two threads (or processes) that share one CPU, each performing the same CPU-bound workload.  When each is highlighted in different colors it should look like a checkerboard, as the kernel scheduler evenly switches between running them.</p>
<p>Robert was testing on the Linux kernel under KVM, and used DTrace to inspect running threads from the SmartOS host (by checking the VM MMU context).  I performed the experiment on SmartOS directly, which resulted in the following heat map:</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-160-crop.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-160-600.png" alt="" title="subsec-160-600" width="600" height="280" class="alignnone size-full wp-image-2888" /></a></p>
<p>This breaks my head.  Instead of a neat checkerboard, this is messy &#8211; showing uneven runtimes for the identical threads.  One column in particular is entirely red, which if true (not a sampling or instrumentation error) meant that the scheduler left the same thread running for an entire second, while another was in the ready-to-run state on the CPU dispatcher queue.  This is much longer than the maximum runtime quantum set by the scheduler (110 ms for the FSS class).  I confirmed this behavior using two different means (DTrace, and thread microstate accounting), and saw even worse instances &#8211; threads blocked for many seconds when they should have been running.</p>
<p>Jerry Jelinek has been wading into the scheduler code, finding evidence that this is a kernel bug (in code that hasn&#8217;t changed since Solaris) and developing the fix.  Fortunately, not many of our customers have hit this since it requires CPUs running at saturation (which isn&#8217;t normal for us). </p>
<h3>UPDATE (April 2nd)</h3>
<p>Jerry has fixed the code, which was a bug with how thread priorities were updated in the scheduler.  The following screenshot shows the same workload post-fix:</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2012/03/subsec-170.png"><img src="http://dtrace.org/blogs/brendan/files/2012/03/subsec-170-600.png" alt="" title="subsec-170-600" width="600" height="280" class="alignnone size-full wp-image-3012" /></a></p>
<p>This looks much better.  There are no longer any full seconds where one thread hogs the CPU, with the other thread waiting.  Looking more closely, there appear to be cases where the thread has switched early &#8211; which is much better than switching late.</p>
<p>We also found that the bug was indeed hurting a customer along with a confluence of other factors.</p>
<h2>Conclusion</h2>
<p>The subsecond offset heat map provides new visibility for software execution time, which can be used for workload characterization and performance analysis.  These are currently available in Joyent Cloud Analytics, from which I included screenshots of these heat maps for production environments.</p>
<p>Using these heat maps I identified two kernel scheduling issues, one of which was causing dropped TCP connections for a large scale cloud-based service.  Kernel fixes are being developed for both.  I also showed various applications running on single servers and the cloud, which produced fascinating patterns &#8211; providing a new way of understanding application runtime.</p>
<p>The examples I included here were based on sampled thread runtime, and traced system call execution times.  Other event sources can be visualized in this way, and these could also be produced on other time frames: sub-minute, sub-hour, etc.</p>
<h2>Acknowledgements</h2>
<ul>
<li><a href="http://dtrace.org/blogs/dap">Dave Pacheco</a> leads the Joyent Cloud Analytics project.  All of the heat maps shown here (except Wolfram&#8217;s) are from Cloud Analytics.</li>
<li>The Cloud Analytics team with whom I discussed this visualization.</li>
<li><a href="http://dtrace.org/blogs/bmc">Bryan Cantrill</a> wrote the prototype Cloud Analytics tool which I hacked to test out this idea in production, and he came up with the name &#8220;subsecond offset&#8221;.  He also fathered DTrace, which is used to fetch all the data shown by the heat maps.</li>
<li><a href="http://dtrace.org/blogs/rm">Robert Mustacchi</a> developed the Meta-D language for Cloud Analytics instrumentations, which made implementing subsecond offset heat maps trivial, and put them to use for KVM.</li>
<li><a href="https://blogs.oracle.com/jerrysblog/entry/writing_the_opensolaris_bible">Jerry Jelinek</a>, for fixing the tricky kernel bugs that these heat maps have recently been unearthing.</li>
<li><a href="http://blog.stephenwolfram.com/">Stephen Wolfram</a>&#8217;s recent <a href="http://blog.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/">post</a> was great timing, as it provided an intuitive example of a graph having time on both axes.</li>
<li><a href="http://www.edwardtufte.com/tufte/">Edward Tufte</a> for the idea of high definition images in text, and for inspiration to try harder in general (see any of his texts).</li>
<li><a href="http://www.beginningwithi.com/comments">Deirdré Straughan</a> for edits and suggestions.</li>
</ul>
<p>Thanks to the folk at <a href="http://www.voxer.com">Voxer</a> for realizing (earlier than I did) that something more than just normal bursts of load was causing the tcpListenDrops, and pushing for the real answer.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2012/03/26/subsecond-offset-heat-maps/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Linux Kernel Performance: Flame Graphs</title>
		<link>http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flame-graphs/</link>
		<comments>http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flame-graphs/#comments</comments>
		<pubDate>Sat, 17 Mar 2012 16:24:25 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[flamegraphs]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[perf_events]]></category>
		<category><![CDATA[systemtap]]></category>
		<category><![CDATA[visualizations]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=2756</guid>
		<description><![CDATA[To get the most out of your systems, you want detailed insight into what the operating system kernel is doing. A typical approach is to sample stack traces; however, the data collected can be time consuming to read or navigate. Flame Graphs are a new way to visualize sampled stack traces, and can be applied [...]]]></description>
			<content:encoded><![CDATA[<p>To get the most out of your systems, you want detailed insight into what the operating system kernel is doing.  A typical approach is to sample stack traces; however, the data collected can be time consuming to read or navigate.  <a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">Flame Graphs</a> are a new way to visualize sampled stack traces, and can be applied to the Linux kernel for some useful (and stunning!) visualizations.</p>
<p>I&#8217;ve used these many times to help solve other kernel and application issues.  Since I posted <a href="https://github.com/brendangregg/FlameGraph">the source</a> to GitHub, others have been using it too (eg, for <a href="http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/">node.js</a> and <a href="http://smartos.org/2012/02/28/using-flamegraph-to-solve-ip-scaling-issue-dce/">IP scaling</a>).  I&#8217;ve recently been using them to investigate Linux kernel performance issues (under KVM), which I&#8217;ll demonstrate in this post using a couple of different profiling tools: perf_events and SystemTap.</p>
<h2>Linux Perf Events</h2>
<p>This flame graph shows a network workload for the 3.2.9-1 Linux kernel, running as a KVM instance:</p>
<div style="padding-bottom:8px"><a href="http://www.beginningwithi.com/brendan/perf-kernel.svg"><img src="http://dtrace.org/blogs/brendan/files/2012/03/perf-kernel-600.png" alt="" title="perf-kernel-600" width="600" height="314" class="alignnone size-full wp-image-2768" /></a><br />
<center><i>click image for interactive SVG; larger PNG <a href="http://dtrace.org/blogs/brendan/files/2012/03/perf-kernel.png">here</a></i></center></div>
<p>Flame Graphs show the sample population across the x-axis, and stack depth on the y-axis.  Each function (stack frame) is drawn as a rectangle, with the width relative to the number of samples.  See my <a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">previous post</a> for the full description of how these work.</p>
<p>You can use the mouse to explore where kernel CPU time is spent, quickly quantifying code-paths and determining where performance tuning efforts are best spent.  This example shows that most time was spent in the vp_notify() code-path, spending 70.52% of all on-CPU samples performing iowrite16().</p>
<p>This was generated using perf_events and the <a href="https://github.com/brendangregg/FlameGraph">FlameGraph tools</a>:</p>
<pre>
# <b>perf record -a -g -F 1000 sleep 60</b>
# <b>perf script | ./stackcollapse-perf.pl > out.perf-folded</b>
# <b>cat out.perf-folded | ./flamegraph.pl > perf-kernel.svg</b>
</pre>
<p>The first command runs perf in sampling mode (polling) at 1000 Hertz (-F 1000; more on this later) across all CPUs (-a), capturing stack traces so that a call graph (-g) of function ancestry can be generated later.  The samples are saved in a perf.data file:</p>
<pre>
# <b>ls -lh perf.data</b>
-rw-------. 1 root root 15M Mar 12 20:13 perf.data
</pre>
<p>This can be processed in a variety of ways.  On recent versions, the <tt>perf report</tt> command launches an ncurses navigator for call graph inspection.  Older versions of perf (or newer versions, when the output is piped) print the call graph as a tree, annotated with percentages:</p>
<pre>
# <b>perf report | more</b>
# ========
# captured on: Wed Mar 14 00:09:59 2012
# hostname : fedora1
# os release : 3.2.9-1.fc16.x86_64
# perf version : 3.2.9-1.fc16.x86_64
# arch : x86_64
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : QEMU Virtual CPU version 0.14.1
# cpuid : GenuineIntel,6,2,3
# total memory : 1020560 kB
# cmdline : /usr/bin/perf record -a -g -F 1000 sleep 60
# event : name = cycles, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = ...
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# ========
#
# Events: 60K cpu-clock
#
# Overhead          Command          Shared Object                               Symbol
# ........  ...............  .....................  ...................................
#
    72.18%            iperf  [kernel.kallsyms]      [k] iowrite16
                      |
                      --- iowrite16
                         |
                         |--99.53%-- vp_notify
                         |          virtqueue_kick
                         |          start_xmit
                         |          dev_hard_start_xmit
                         |          sch_direct_xmit
                         |          dev_queue_xmit
                         |          ip_finish_output
                         |          ip_output
                         |          ip_local_out
                         |          ip_queue_xmit
                         |          tcp_transmit_skb
                         |          tcp_write_xmit
                         |          |
                         |          |--98.16%-- tcp_push_one
                         |          |          tcp_sendmsg
                         |          |          inet_sendmsg
                         |          |          sock_aio_write
                         |          |          do_sync_write
                         |          |          vfs_write
                         |          |          sys_write
                         |          |          system_call
                         |          |          0x369e40e5cd
                         |          |
                         |           --1.84%-- __tcp_push_pending_frames
[...]
</pre>
<p>This tree starts with the on-CPU functions and works back through the ancestry (this is a &#8220;callee based call graph&#8221;).  This follows the flame graph when reading the flame graph top-down.  (This behavior can be flipped by using the &#8220;caller&#8221; option to -g/--call-graph, instead of the &#8220;callee&#8221; default, generating a tree that follows the flame graph when read bottom-up.)  The hottest (most frequent) stack trace in the flame graph (at 70.52%) can be seen in this perf call graph as the product of the top three nodes (72.18% x 99.53% x 98.16%, which are relative rates).  <tt>perf report</tt> can also be run with &#8220;-g graph&#8221; to show absolute overhead rates, in which case &#8220;70.52%&#8221; is directly displayed on the node.</p>
<p>The perf report tree (and the ncurses navigator) do an excellent job of presenting this information as text.  However, with text there are limitations.  The output often does not fit in one screen (you could say it doesn&#8217;t need to, if the bulk of the samples are identified on the first page).  Also, identifying the hottest code paths requires reading the percentages.  With the flame graph, all the data is on screen at once, and the hottest code-paths are immediately obvious as the widest functions.</p>
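<p>The multiplication of relative rates described above can be checked with a few lines of Python.  This is only a sketch of the arithmetic; the function name is made up for illustration and is not part of perf or the FlameGraph tools:</p>

```python
def absolute_overhead(*relative_percentages):
    """Multiply a chain of relative percentages into one absolute percentage."""
    fraction = 1.0
    for pct in relative_percentages:
        fraction *= pct / 100.0
    return fraction * 100.0

# Relative rates read off the perf report tree:
# iowrite16 (72.18%), vp_notify (99.53%), tcp_push_one (98.16%)
hottest = absolute_overhead(72.18, 99.53, 98.16)
print(f"{hottest:.2f}%")  # prints 70.52%
```

<p>The result matches the 70.52% shown for that stack in the flame graph.</p>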
<p>For generating the flame graph, the <tt>perf script</tt> command (a newer addition to perf) was used to dump the stack samples, which are then aggregated by stackcollapse-perf.pl and folded into single lines per-stack.  That output is then converted by flamegraph.pl into the SVG.  I included a gratuitous &#8220;cat&#8221; command to make it clear that flamegraph.pl can process the output of a pipe, which could include Unix commands to filter or preprocess (grep, sed, awk).</p>
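<p>To make the folding step concrete, here is a minimal Python sketch of the idea.  It is not how stackcollapse-perf.pl is implemented (that script parses the text output of <tt>perf script</tt>); it assumes the stacks have already been parsed into lists of frame names, ordered leaf-first as perf prints them, and the function name is hypothetical:</p>

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse sampled stacks into 'root;...;leaf count' lines.

    samples: iterable of stacks, each a list of frame names ordered
    leaf-first (on-CPU function first).
    """
    counts = Counter()
    for frames in samples:
        # Reverse so the line reads root-first, then join with ';'.
        counts[";".join(reversed(frames))] += 1
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Three made-up samples; two of them share the same stack.
samples = [
    ["iowrite16", "vp_notify", "virtqueue_kick"],
    ["iowrite16", "vp_notify", "virtqueue_kick"],
    ["vp_notify", "virtqueue_kick"],
]
for line in fold_stacks(samples):
    print(line)
```

<p>Each output line is one unique stack followed by its sample count, which is the one-line-per-stack form that flamegraph.pl reads.</p>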
<p>The last two commands could be connected via a pipe:</p>
<pre>
# <b>perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf-kernel.svg</b>
</pre>
<p>In practice I don&#8217;t do this, as I often re-run flamegraph.pl multiple times, and this one-liner would execute everything multiple times.  The output of <tt>perf script</tt> can be many Mbytes, taking many seconds to process.  By writing the output of stackcollapse-perf.pl to a file, you&#8217;ve cached the slowest step, and can also edit the file (vi) to delete unimportant stacks. The one-line-per-stack output of stackcollapse-perf.pl is also great food for grep(1).  Eg:</p>
<pre>
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# <b>grep -v cpu_idle</b> out.perf-folded | ./flamegraph.pl > nonidle.svg
# <b>grep ext4</b> out.perf-folded | ./flamegraph.pl > ext4internals.svg
# <b>egrep 'system_call.*sys_(read|write)'</b> out.perf-folded | ./flamegraph.pl > rw.svg
</pre>
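<p>The same folded file can also be used to put a number on a filter before drawing anything.  The following Python sketch (a hypothetical helper, not part of the FlameGraph tools) reports what percentage of all samples match a regular expression, assuming each line is a semicolon-separated stack followed by a space and a count:</p>

```python
import re

def sample_share(folded_lines, pattern):
    """Return the percentage of samples whose folded stack matches pattern."""
    regex = re.compile(pattern)
    total = matched = 0
    for line in folded_lines:
        # Each folded line is 'frame;frame;... count'; split off the count.
        stack, _, count = line.rstrip().rpartition(" ")
        if not stack:
            continue  # skip blank or malformed lines
        total += int(count)
        if regex.search(stack):
            matched += int(count)
    return 100.0 * matched / total if total else 0.0

# Made-up folded stacks for illustration.
folded = [
    "cpu_idle;default_idle 60",
    "system_call;sys_write;vfs_write 30",
    "system_call;sys_read;vfs_read 10",
]
print(sample_share(folded, r"system_call.*sys_(read|write)"))  # prints 40.0
```

<p>With a real file, the lines could come from <tt>open("out.perf-folded")</tt> instead of the inline list.</p>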
<p>Note that it would be a little more efficient to process the output of <tt>perf report</tt> instead of <tt>perf script</tt>; better still, <tt>perf report</tt> could have a report style (eg, &#8220;-g folded&#8221;) that output folded stacks directly, obviating the need for stackcollapse-perf.pl.  There could even be a perf mode that output the SVG directly (which wouldn&#8217;t be the first one; see perf-timechart), although that would miss the value of being able to grep the folded stacks (which I use frequently).</p>
<p>If you&#8217;ve never used perf_events before, you may want to test before production use (it has had <a href="http://web.eecs.utk.edu/~vweaver1/projects/perf-events/kernel_panics.html">kernel panic</a> bugs in the past).  My experience has been a good one (no panics).</p>
<h2>SystemTap</h2>
<p>SystemTap can also sample stack traces via the timer.profile probe, which fires at the system clock rate (CONFIG_HZ).  Unlike perf, which dumps samples to a file for later aggregation and reporting, SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land.  The data collected and output generated can be customized much further via its scripting language.  The examples here were generated on Fedora 16 (where it works much better than <a href="http://dtrace.org/blogs/brendan/2011/10/15/using-systemtap/">Ubuntu/CentOS</a>).</p>
<p><a href="http://www.beginningwithi.com/brendan/stap-kernel.svg"><img src="http://dtrace.org/blogs/brendan/files/2012/03/stap-kernel-600.png" alt="" title="stap-kernel-600" width="600" height="345" class="alignnone size-full wp-image-2764" /></a></p>
<p>The commands for SystemTap version 1.6 are:</p>
<pre>
# <b>stap -s 32 -D MAXTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
    probe end { foreach (i in s+) { print_stack(i);
    printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
    > out.stap-stacks</b>
# <b>./stackcollapse-stap.pl out.stap-stacks > out.stap-folded</b>
# <b>cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg</b>
</pre>
<p>The six options used (-s 32, -D &#8230;) increase various SystemTap limits.  The only ones really necessary for flame graphs are &#8220;-D MAXTRACE=100 -D MAXSTRINGLEN=4096&#8221;, so that stack traces aren&#8217;t truncated; the others became necessary when sampling for long periods (in this case, 60 seconds) on busy workloads, since you can get errors such as:</p>
<pre>
WARNING: There were 233 transport failures.

ERROR: Array overflow, check MAXMAPENTRIES near identifier 's' at &lt;input&gt;:1:33

MAXACTION:
ERROR: MAXACTION exceeded near operator '{' at &lt;input&gt;:1:87

STP_OVERLOAD_THRESHOLD:
ERROR: probe overhead exceeded threshold
</pre>
<p>The &#8220;transport failures&#8221; warning is fixed by increasing the buffer size (-s); the other messages include the names of the tunables that need to be increased.</p>
<p>Also, be sure you have the fix for the <a href="http://sourceware.org/bugzilla/show_bug.cgi?id=13714">#13714</a> kernel panic (which led to <a href="https://bugzilla.redhat.com/show_bug.cgi?id=795913">CVE-2012-0875</a>), or the latest version of SystemTap.</p>
<p>On SystemTap v1.7 (latest):</p>
<pre>
# <b>stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
    probe end { foreach (i in s+) { print_stack(i);
    printf("\t%d\n", @count(s[i])); } } probe timer.s(60) { exit(); }' \
    > out.stap-stacks</b>
# <b>./stackcollapse-stap.pl out.stap-stacks > out.stap-folded</b>
# <b>cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg</b>
</pre>
<p>The only difference is that MAXTRACE became MAXBACKTRACE.</p>
<p>The &#8220;-v&#8221; option is used to provide details on what SystemTap is doing.  When running this one-liner for the first time, it printed:</p>
<pre>
Pass 1: parsed user script and 82 library script(s) using 200364virt/23076res/2996shr kb, in 100usr/10sys/260real ms.
Pass 2: analyzed script: 3 probe(s), 3 function(s), 0 embed(s), 1 global(s) using 200892virt/23868res/3228shr kb, in 0usr/0sys/9real ms.
Pass 3: translated to C into "/tmp/stapllG8kv/stap_778fac70871457bfb540977b1ef376d3_2113_src.c" using 361936virt/48824res/16640shr kb, in 710usr/90sys/1843real ms.
Pass 4: compiled C into "stap_778fac70871457bfb540977b1ef376d3_2113.ko" in 7630usr/560sys/19155real ms.
Pass 5: starting run.
</pre>
<p>This provides timing details for each initialization stage.  Compilation (Pass 4) took over 19 seconds of real time, during which the performance of the system dropped by 45%.  Fortunately, this only occurs on the first invocation.  SystemTap caches the compiled objects under ~/.systemtap, which subsequent executions use.  I haven&#8217;t tried, but I suspect it&#8217;s possible to compile on one machine (eg, lab, to test for panics), then transfer the cached objects to the target for execution &#8211; avoiding the compilation step.</p>
<h2>1000 Hertz</h2>
<p>The above examples both used 1000 Hertz, so that I could show them both doing the same thing.  Ideally, I&#8217;d sample at 997 Hertz (or something similar) to avoid sampling in lock-step with timed tasks (which can lead to over-sampling or under-sampling, misrepresenting what is actually happening).  With perf_events, the frequency can be set with -F; for example, &#8220;-F 997&#8221;.</p>
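<p>The lock-step problem can be illustrated with a small Python simulation.  This is a toy model under stated assumptions, not a measurement: it imagines a task that wakes 1000 times per second and stays on-CPU for 10% of each period, and counts how many samples in one second land while it is running.  The function name and parameters are invented for the example:</p>

```python
from fractions import Fraction

def hit_rate(sample_hz, task_hz=1000, duty=Fraction(1, 10), offset=Fraction(0)):
    """Fraction of samples in one second that land while a periodic task runs.

    The hypothetical task wakes task_hz times per second and stays on-CPU
    for `duty` of each period; `offset` shifts the sampler within a period.
    """
    hits = 0
    for k in range(sample_hz):
        t = Fraction(k, sample_hz)          # sample time in seconds
        phase = (t * task_hz + offset) % 1  # position within the task period
        if phase < duty:
            hits += 1
    return Fraction(hits, sample_hz)

print(float(hit_rate(1000)))                          # in lock-step, always hits
print(float(hit_rate(1000, offset=Fraction(1, 2))))   # in lock-step, never hits
print(float(hit_rate(997)))                           # drifts across the period
```

<p>In this model the 1000 Hertz sampler reports the task as using either 100% or 0% of the CPU, depending on where it happens to line up, while the 997 Hertz sampler reports 100 hits out of 997 samples, close to the true 10%.</p>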
<p>For SystemTap, sampling at 997 Hertz (or anything other than CONFIG_HZ) is currently difficult: the timer.hz(997) probe fires at the correct rate, but can&#8217;t read stack backtraces.  It&#8217;s possible that it can be done via the perf probes based on CPU reference cycle counts (eg, &#8220;probe perf.type(0).config(0).sample(N)&#8221;, where N = CPU_MHz * 1000000 / Sample_Hz).  See <a href="http://sourceware.org/bugzilla/show_bug.cgi?id=13820">#13820</a> for the status on this.</p>
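<p>As a worked example of the formula for N, this Python sketch computes the cycle count for a made-up 2400 MHz CPU sampled at 997 Hertz.  The clock speed is an assumption for illustration, and this only does the arithmetic; it says nothing about whether the resulting probe can read stack backtraces:</p>

```python
def cycles_per_sample(cpu_mhz, sample_hz):
    """N = CPU_MHz * 1000000 / Sample_Hz, rounded to a whole cycle count."""
    return round(cpu_mhz * 1_000_000 / sample_hz)

# Hypothetical 2400 MHz CPU sampled at 997 Hertz.
n = cycles_per_sample(2400, 997)
print(f"probe perf.type(0).config(0).sample({n})")
```

<p>For these inputs N comes out at 2407222 cycles between samples.</p>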
<h2>File System</h2>
<p>As an example of a different workload, this shows the kernel CPU time while an ext4 file system was being archived:</p>
<p><a href="http://www.beginningwithi.com/brendan/perf-ext4.svg"><img src="http://dtrace.org/blogs/brendan/files/2012/03/perf-ext4-600.png" alt="" title="perf-ext4-600" width="600" height="209" class="alignnone size-full wp-image-2789" /></a></p>
<p>This used perf_events (<a href="http://dtrace.org/blogs/brendan/files/2012/03/perf-ext4.png">PNG</a> version); the SystemTap version looks almost identical (<a href="http://www.beginningwithi.com/brendan/stap-ext4.svg">SVG</a>, <a href="http://dtrace.org/blogs/brendan/files/2012/03/stap-ext4.png">PNG</a>).</p>
<p>This shows how the file system was being read and where kernel CPU time was spent.  Most of the kernel time is in sys_newfstatat() and sys_getdents() &#8211; metadata work as the file system is walked.  sys_openat() is on the right, as files are opened to be read, which are then mmap()d (look to the right of sys_getdents(), these are in alphabetical order), and finally page faulted into user-space (see the page_fault() mountain on the left).  The actual work of moving bytes is then spent in user-land on the mmap&#8217;d segments (and not shown in this kernel flame graph).  Had the archiver used the read() syscall instead, this flame graph would look very different, and have a large sys_read() component.</p>
<h2>Short Lived Processes</h2>
<p>For this flame graph, I executed a workload of short-lived processes to see where kernel time is spent creating them (<a href="http://dtrace.org/blogs/brendan/files/2012/03/perf-execs.png">PNG</a> version):</p>
<p><a href="http://www.beginningwithi.com/brendan/perf-execs.svg"><img src="http://dtrace.org/blogs/brendan/files/2012/03/perf-execs-600.png" alt="" title="perf-execs-600" width="600" height="208" class="alignnone size-full wp-image-2787" /></a></p>
<p>Apart from performance analysis, this is also a great tool for learning the internals of the Linux kernel.</p>
<h2>oprofile</h2>
<p>Before anyone asks, oprofile could also be used for stack sampling.  I haven&#8217;t written a stackcollapse.pl version for oprofile yet.</p>
<h2>Notes</h2>
<p>All of the above flame graphs were generated on the Linux 3.2.9 kernel (Fedora 16 guest) running under KVM (Ubuntu host), with one virtual CPU.  Some code paths and sample ratios will be very different on bare-metal:  networking won&#8217;t be processed via the virtio-net driver, for a start.  On systems with a high degree of idle time, the flame graph can be dominated by the idle task, which can be filtered using &#8220;grep -v cpu_idle&#8221; of the folded stacks.  Note that by default the flame graph aggregates samples from multiple CPUs; with some shell scripting, you could aggregate samples from multiple hosts as well.  It&#8217;s sometimes useful, though, to generate separate flame graphs for individual CPUs: I&#8217;ve done this for mapped hardware interrupts, for example.</p>
<h2>Conclusion</h2>
<p>With the Flame Graph visualization, CPU time in the Linux kernel can be quickly understood and inspected.  In this post, I showed Flame Graphs for different workloads: networking, file system I/O, and process execution.  As an SVG in the browser, they can be navigated with the mouse to inspect element details, revealing percentages so that performance issues or tuning efforts can be quantified.</p>
<p>I used perf_events and SystemTap to sample stack traces, one task out of many that these powerful tools can do.  It shouldn&#8217;t be too hard to use oprofile to provide the data for Flame Graphs as well.</p>
<h2>References</h2>
<ul>
<li>The original Flame Graph <a href="http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/">post</a> and GitHub <a href="https://github.com/brendangregg/FlameGraph">repo</a></li>
<li>Linux perf_events <a href="https://perf.wiki.kernel.org/">wiki</a></li>
<li>The Unofficial Linux Perf Events <a href="http://web.eecs.utk.edu/~vweaver1/projects/perf-events/">Web-Page</a></li>
<li>The SystemTap <a href="http://sourceware.org/systemtap/">homepage</a></li>
<li>My <a href="http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/">Linux Performance Checklist</a>, which includes perf_events, SystemTap and other tools</li>
</ul>
<p>Thanks to those using flame graphs and putting them to use in other <a href="http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/">new areas</a>, and to the SystemTap engineers for answering questions and fixing bugs.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flame-graphs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

