<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>dtrace.org</title>
	<atom:link href="http://dtrace.org/blogs/feed/" rel="self" type="application/rss+xml" />
	<link>http://dtrace.org/blogs</link>
	<description></description>
	<lastBuildDate>Sun, 19 May 2013 21:56:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Revealing Hidden Latency Patterns</title>
		<link>http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns/</link>
		<comments>http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns/#comments</comments>
		<pubDate>Sun, 19 May 2013 21:56:16 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[heatmaps]]></category>
		<category><![CDATA[latency]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[visualizations]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=3411</guid>
		<description><![CDATA[Latency Heat Map Response time &#8211; or latency &#8211; is crucial to understand in detail, but many of the common presentations of this data hide important details and patterns. Latency heat maps are an effective way to reveal these. I often use tools that provide heat maps directly, but sometimes I have separate trace output [...]]]></description>
			<content:encoded><![CDATA[<div style="float:right;padding-left:20px;padding-bottom:1px"><a href="http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns/#heatmap"><img src="http://dtrace.org/blogs/brendan/files/2013/05/latency-heatmap-300.png" alt="" title="latency-heatmap-300" width="300" height="152" class="alignnone size-full wp-image-3415" /></a><br /><center><i>Latency Heat Map</i></center></div>
<p>Response time &#8211; or latency &#8211; is crucial to understand in detail, but many of the common presentations of this data hide important details and patterns. Latency heat maps are an effective way to reveal these. I often use tools that provide heat maps directly, but sometimes I have separate trace output that I&#8217;d like to convert into a heat map. To answer this need, I just wrote <a href="https://github.com/brendangregg/HeatMap">trace2heatmap.pl</a>, which generates interactive SVGs.</p>
<p>I explained how latency heat maps work in the 2010 article &#8220;Visualizing System Latency&#8221; (<a href="http://queue.acm.org/detail.cfm?id=1809426">ACMQ</a>, <a href="http://cacm.acm.org/magazines/2010/7/95062-visualizing-system-latency/fulltext">CACM</a>). I&#8217;ve previously shared interesting examples in: <a href="http://dtrace.org/blogs/brendan/2009/03/12/latency-art-rainbow-pterodactyl/">Rainbow Pterodactyl</a>, <a href="http://dtrace.org/blogs/brendan/2009/06/12/latency-art-x-marks-the-spot/">Icy Lake</a>, <a href="http://dtrace.org/blogs/brendan/2009/01/30/l2arc-screenshots/">ZFS L2ARC</a>.</p>
<h2>Problem</h2>
<p>I whipped up a simple example to explain this, using disk I/O latency (I have plenty of real-world examples, but explaining them tends to get sidetracked). This is a single-disk system, with a single process performing a sequential synchronous write workload.</p>
<p>Using iostat(1M) to examine average latency (asvc_t):</p>
<pre>
$ iostat -xnz 1
[...]
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  220.0    0.0 9635.8  0.0  1.0    0.0    4.6   0  99 c1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  203.0    0.0 8976.2  0.0  1.0    0.0    5.1   0  99 c1d0
</pre>
<p>I could plot average latency (as many monitoring products do), but the average is seriously misleading, and doesn&#8217;t explain what&#8217;s really happening. And since latency is so important for performance, I want to know exactly what <i>is</i> happening.</p>
<p>I had &#8220;<a href="http://www.brendangregg.com/dtrace.html">iosnoop</a> -Dots&#8221; running, which collected two minutes of per-I/O latency and other details:</p>
<pre>
# iosnoop -Dots &gt; out.iosnoop
^C
# more out.iosnoop
STIME(us)   TIME(us)    DELTA(us) DTIME(us) UID   PID D    BLOCK   SIZE      COMM PATHNAME
9339835115  9339835827  712       730       100 23885 W 253757952 131072    odsync &lt;none&gt;
9339835157  9339836008  850       180       100 23885 W 252168272   4096    odsync &lt;none&gt;
9339926948  9339927672  723       731       100 23885 W 251696640 131072    odsync &lt;none&gt;
[...15,000 lines truncated...]
</pre>
<p>I/O latency is the &#8220;DELTA(us)&#8221; column. This file was thousands of lines long &#8211; too much to read.</p>
<h2>Latency Histogram: With Outliers</h2>
<p>The latency distribution can be examined as a histogram (using R, and a subset of the trace file):</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2013/05/iosnoop_hist_full6-600.png"><img src="http://dtrace.org/blogs/brendan/files/2013/05/iosnoop_hist_full6-600.png" alt="" title="iosnoop_hist_full6-600" width="600" height="279" class="alignnone size-full wp-image-3417" /></a></p>
<p>This shows that the average has been dragged up by latency <i>outliers</i>: I/O with very high latency.</p>
<p>This is a fairly common occurrence, and it&#8217;s very useful to know when it has occurred. Those outliers may be individually causing problems, and can easily be plucked from the trace file for further analysis; eg:</p>
<pre>
# awk '$3 &gt; 50000' out.iosnoop_marssync01
STIME(us)   TIME(us)    DELTA(us) DTIME(us) UID   PID D    BLOCK   SIZE      COMM PATHNAME
9343218579  9343276502  57922     57398      0      0 W 142876112  4096      sched &lt;none&gt;
9343218595  9343276605  58010     103        0      0 W 195581749  5632      sched &lt;none&gt;
9343278072  9343328860  50788     50091      0      0 W 195581794  4608      sched &lt;none&gt;
[...]
</pre>
<p>Most of the I/O in the histogram was in a single column on the left.</p>
<h2>Latency Histogram: Zoomed</h2>
<p>Zooming in, by generating a histogram of the 0 &#8211; 2 ms range:</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2013/05/iosnoop_hist_2000-600.png"><img src="http://dtrace.org/blogs/brendan/files/2013/05/iosnoop_hist_2000-600.png" alt="" title="iosnoop_hist_2000-600" width="600" height="277" class="alignnone size-full wp-image-3416" /></a></p>
<p>The I/O distribution is bimodal. This also commonly occurs for latency or response time in different subsystems. Eg, the application has a &#8220;fast path&#8221; and a &#8220;slow path&#8221;, or a resource has cache hits vs cache misses, etc.</p>
<p>But there is still more hidden here. The average latency reported by iostat hinted that there was per-second variance. This histogram is reporting the entire two minutes of iosnoop output.</p>
<h2>Latency Histogram: Animation</h2>
<p>I rendered the iosnoop output as per-second histograms, and generated the following animation (a subset of the frames):</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2013/05/latency-histogram-animation.gif"><img src="http://dtrace.org/blogs/brendan/files/2013/05/latency-histogram-animation.gif" alt="" title="latency-histogram-animation" width="600" height="320" class="alignnone size-full wp-image-3420" /></a></p>
<p>Not only is this bimodal, but <i>the modes move over time</i>. This had been obscured by rendering all data as a single histogram.</p>
<p><a name="heatmap"></a></p>
<h2>Heat Map</h2>
<p>I used <a href="https://github.com/brendangregg/HeatMap">trace2heatmap.pl</a> to generate a heat map from the iosnoop output:</p>
<p><a href="http://www.beginningwithi.com/brendan/example-heatmap.svg"><img src="http://dtrace.org/blogs/brendan/files/2013/05/latency-heatmap-600.png" alt="" title="latency-heatmap-600" width="600" height="305" class="alignnone size-full wp-image-3414" /></a></p>
<p>Click for an <a href="http://www.beginningwithi.com/brendan/example-heatmap.svg">interactive SVG version</a>, and compare to the animation above.</p>
<p>The command used was:</p>
<pre>
$ awk '{ print $2, $3 }' out.iosnoop | ./trace2heatmap.pl --unitstime=us \
    --unitslatency=us --maxlat=2000 --grid &gt; heatmap.svg
</pre>
<p>Without &#8220;&#8211;grid&#8221;, the grid lines are not drawn (making it more Tufte-friendly); see the <a href="http://www.beginningwithi.com/brendan/example-heatmap-nogrid.svg">example</a>.</p>
<p>trace2heatmap.pl gets the job done, but it&#8217;s probably a bit buggy &#8211; I spent three hours writing it (and more than three hours writing this post about it!), really for just the trace files I don&#8217;t already have heat maps for.</p>
<h2>Heat Maps Explained</h2>
<p>It may already be obvious how these work. Each frame of the histogram animation becomes a column in a latency heat map, with the histogram bar height represented as a color:</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2013/05/heatmap-explained-1200.png"><img src="http://dtrace.org/blogs/brendan/files/2013/05/heatmap-explained-600.png" alt="" title="heatmap-explained-600" width="600" height="423" class="alignnone size-full wp-image-3439" /></a></p>
<p>Click for higher resolution.</p>
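<p>As a minimal sketch of that mapping, here is the same idea in Python. This is not the trace2heatmap.pl implementation: the function names, the step sizes, and the choice to clamp outliers into the top row are all assumptions made for illustration. Each event is counted into a (column, row) cell, where the column is the time step and the row is the latency range, and the count is then scaled to a color intensity:</p>
<pre>
from collections import Counter

def heatmap_cells(events, step_time_us=1000000, step_lat_us=100, max_lat_us=2000):
    # Map each (time_us, latency_us) event to a (column, row) cell and count it.
    # Column = which time step the event falls in (one histogram "frame").
    # Row = which latency range it falls in (one histogram "bar").
    cells = Counter()
    if not events:
        return cells
    start_us = min(time_us for time_us, _ in events)
    top_row = max_lat_us // step_lat_us - 1
    for time_us, latency_us in events:
        column = (time_us - start_us) // step_time_us
        # Clamp outliers into the top row rather than dropping them
        # (a choice made for this sketch only).
        row = min(latency_us // step_lat_us, top_row)
        cells[(column, row)] += 1
    return cells

def shade(count, largest):
    # Scale a cell count to a 0..255 color intensity: taller bar, darker cell.
    return 255 * count // largest if largest else 0

# Three I/O from the iosnoop output above: (TIME(us), DELTA(us)).
events = [(9339835827, 712), (9339836008, 850), (9339927672, 723)]
cells = heatmap_cells(events)
largest = max(cells.values())
for (column, row), count in sorted(cells.items()):
    print(column, row, count, shade(count, largest))
</pre>
<p>With one-second columns and 100 us rows, all three I/O land in the first column: two in the 700 &#8211; 799 us row and one in the 800 &#8211; 899 us row.</p>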
<h2>Production Use</h2>
<p>If you want to add heat maps to your monitoring solution, then great! However, note that tracing per-event latency can be expensive to perform.  DTrace minimizes the overheads as much as possible using per-CPU buffers and asynchronous kernel-user transfers; other tools (eg, strace, tcpdump) are expected to have higher overhead.  This can cause problems for production use: you want to understand the overhead, including when using DTrace, before tracing events.  </p>
<p>Heat maps have been used successfully in production &ndash; and recorded at a one-second granularity 24x7x365 &ndash; by some products built upon DTrace.  These use the DTrace aggregating feature to pass a quantized summary of latency, instead of every event, to user-level, reducing the data transfer by a large factor (eg, 1000x). This summary may consist of a per-second array with about 200 elements for different latency ranges, each containing the count of events, and is produced by the DTrace aggregating functions quantize(), lquantize(), or llquantize() (best).  This array is then resampled (downsampled) to the resolution desired for the heat map (usually down to 30 or so levels).</p>
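<p>The downsampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the sample data are hypothetical, and it merges adjacent buckets by index, which ignores the real latency range each bucket covers (a product using llquantize() would need to account for its log-linear ranges):</p>
<pre>
def downsample(counts, levels=30):
    # Merge adjacent latency buckets so that counts (one second of quantized
    # latency, eg about 200 elements) fits into the requested number of rows.
    # Every event is kept: each output row is the sum of the input buckets
    # that map onto it.
    rows = [0] * levels
    for index, count in enumerate(counts):
        rows[index * levels // len(counts)] += count
    return rows

# Hypothetical one-second summary: 200 buckets, with I/O in a few of them.
summary = [0] * 200
summary[7] = 150
summary[8] = 60
summary[199] = 3
rows = downsample(summary)
print(len(rows), sum(rows), rows[1], rows[29])
</pre>
<p>The total event count is preserved (213 in, 213 out); only the latency resolution is reduced.</p>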
<p>Examples of products that can generate heat maps in a production-efficient way are the Oracle ZFS Storage Appliance, <a href="http://dtrace.org/blogs/dap/2011/03/01/welcome-to-cloud-analytics/">Joyent Cloud Analytics</a>, and <a href="http://www.circonus.com/blog/understanding-data-with-histograms/">Circonus</a>.</p>
<p>There is also <a href="http://www.tracelytics.com/">Tracelytics</a>, although I can&#8217;t comment on its efficiency, as I don&#8217;t know its internals yet.</p>
<h2>Other Uses</h2>
<p>Heat maps (and trace2heatmap.pl) can be used to examine metrics other than latency, such as offset, I/O size, and utilization. For examples, see the heat maps in <a href="http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/">Visualizing Device Utilization</a>, and <a href="http://dtrace.org/blogs/brendan/2012/03/26/subsecond-offset-heat-maps/">Subsecond Offset Heat Maps</a>.</p>
<h2>Background</h2>
<p><a href="http://dtrace.org/blogs/bmc">Bryan</a> and I developed latency heat maps in 2008 for the ZFS Storage Appliance. For more background, see <a href="http://queue.acm.org/detail.cfm?id=1809426">Visualizing System Latency</a>.</p>
<p>Thanks to <a href="http://twitter.com/DeirdreS">Deirdre</a> for helping with another post!</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2013/05/19/revealing-hidden-latency-patterns/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Delphix and Flash</title>
		<link>http://dtrace.org/blogs/ahl/2013/05/06/delphix-and-flash/</link>
		<comments>http://dtrace.org/blogs/ahl/2013/05/06/delphix-and-flash/#comments</comments>
		<pubDate>Mon, 06 May 2013 04:28:33 +0000</pubDate>
		<dc:creator>ahl</dc:creator>
				<category><![CDATA[Delphix]]></category>
		<category><![CDATA[flash]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/ahl/?p=1269</guid>
		<description><![CDATA[I started working with flash in 2006 &#8212; fortunate timing as flash was just starting to take hold in the enterprise. I started asking customers I&#8217;d visit about flash. I&#8217;ll always remember the response from an early adopter when I asked about how he planned on using the new, expensive storage, &#8220;We just bought it, [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-full wp-image-1270" title="delphix" src="http://dtrace.org/blogs/ahl/files/2013/04/dellight.png" alt="" width="191" height="65" /></p>
<p>I started <a href="https://blogs.oracle.com/ahl/entry/fishworks_ssds">working with flash in 2006</a> &#8212; fortunate timing as flash was just starting to <a href="http://queue.acm.org/detail.cfm?id=1364796">take hold</a> in the enterprise. I started asking customers I&#8217;d visit about flash. I&#8217;ll always remember the response from an early adopter when I asked about how he planned on using the new, expensive storage, &#8220;We just bought it, and we have no idea.&#8221; It was a solution in search of a problem &#8212; the <a href="http://www.cua.uam.mx/biblio/ueaarticulos/AGarbagecanmodel.pdf">garbage can model</a> at play.</p>
<p>Flash has evolved significantly since then from a raw material used on its own to a component in systems of increasing complexity. I wrote recently about the <a href="http://queue.acm.org/detail.cfm?id=2463636">various techniques</a> being employed to get the most out of flash; all share the basic idea of trading compute and IOPS (abundant commodities) for capacity (still more expensive for flash than hard drives). The ideal use cases are the ones that benefit most from that trade-off, ones where compression and dedup consume cheap compute cycles rather than expensive space on the NAND flash. <strong>Flash storage is best with data that contains high degrees of redundancy that clever software can squeeze out.</strong> With those loose criteria, it&#8217;s been amazing to me how flash storage vendors have flocked to the <a href="http://en.wikipedia.org/wiki/Desktop_virtualization">VDI</a> use case. It&#8217;s certainly well-suited &#8212; big on IOPS with nearly identical data from different Windows installs that&#8217;s easily compressed and deduped &#8212; but seemingly <em><strong>every</strong></em> flash vendor has decided that it&#8217;s one part &#8212; if not <strong><em>the</em></strong> part &#8212; of the market they want to address. 
Take a look at the information on VDI from various flash storage vendors: <a href="http://www.fusionio.com/overviews/fusion-io-virtual-desktop-infrastructure-vdi-overview/">Fusion</a>, <a href="http://www.nimblestorage.com/solutions/vdi.php">Nimble</a>, <a href="http://www.purestorage.com/applications/vdi.html">Pure</a> <a href="https://twitter.com/PureStorage/status/330072189594398721">Storage</a>, <a href="http://www.tegile.com/blog/5-things-to-consider-when-deploying-vdi-storage">Tegile</a>, <a href="http://www.tintri.com/solutions/vdi">Tintri</a>, <a href="http://www.viadex.com/pdf/Violin_Memory_VDI_whitepaper_Final_June_2012_US.pdf">Violin</a>, <a href="http://www.virident.com/resources/virident-webinar-series/leveraging-flash-storage-for-vdi/">Virident</a>, <a href="http://whiptail.com/solutions/application-workloads/virtual-desktop/">Whiptail</a> &#8212; <a href="https://www.google.com/search?q=flash+storage+vdi">the list goes on and on</a>.</p>
<p>I worked extensively with flash until 2010, when I left Oracle for a startup. I ended up not sticking with flash precisely because it was &#8212; and is &#8212; such a crowded space. I&#8217;d happily bet on the space, but it was harder to pick <strong><em>one</em></strong> winner. One of the things that drew me to Delphix though was precisely its compatibility with flash. At Delphix we create virtual database copies by sharing blocks; think of it as dedup before the fact, or dedup but without the runtime tax. Creating a virtual copy happens almost instantaneously, saving tremendous amounts of administration time, unblocking developers, and accelerating projects &#8212; hence our credo of agile data. Unlike storage-based snapshots, Delphix virtual copies are database aware; provisioning is fully integrated and automated. Those virtual copies also take up much less physical space, but with as many or more IOPS hitting the aggregate of those virtual copies. Sound familiar yet? One tenth the capacity with the same workload &#8212; let&#8217;s call it 10x greater IOPS intensity &#8212; is ideally suited for flash storage.</p>
<p><strong>Flash storage is best when clever software can squeeze out redundancies; Delphix is that clever software for databases.</strong> Delphix customers are starting to combine our product with their flash storage purchases. An all-flash array that&#8217;s 5x the $/TB of disk storage suddenly becomes half the price of disk storage when combined with Delphix &#8212; with substantially better performance. We as an industry still haven&#8217;t realized the full potential of flash storage. Agile data through Delphix fills in another big piece of the flash picture.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/ahl/2013/05/06/delphix-and-flash/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Agile Data Technology</title>
		<link>http://dtrace.org/blogs/eschrock/2013/04/21/the-power-of-agile-data/</link>
		<comments>http://dtrace.org/blogs/eschrock/2013/04/21/the-power-of-agile-data/#comments</comments>
		<pubDate>Sun, 21 Apr 2013 20:00:26 +0000</pubDate>
		<dc:creator>eschrock</dc:creator>
				<category><![CDATA[Delphix]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/eschrock/?p=170</guid>
		<description><![CDATA[Applications are the nexus of the modern enterprise. They simplify operations, speed execution, and drive competitive advantage. Accelerating the application lifecycle means accelerating the business. Increasingly, organizations turn to public and private clouds, SaaS offerings, and outsourcing to hasten development and reduce risk, only to find themselves held hostage by their data. Applications are nothing [...]]]></description>
			<content:encoded><![CDATA[<p dir="ltr">Applications are the nexus of the modern enterprise. They simplify operations, speed execution, and drive competitive advantage. Accelerating the application lifecycle means accelerating the business. Increasingly, organizations turn to public and private clouds, SaaS offerings, and outsourcing to hasten development and reduce risk, only to find themselves held hostage by their data.</p>
<p dir="ltr">Applications are nothing without data. Enterprise applications have data anchored in infrastructure, tied down by production requirements, legacy architecture, and security regulations. But projects demand fresh data under the control of their developers and testers, requiring processes to work around these impediments. The suboptimal result leads to cost overruns, schedule delays, and poor quality.</p>
<p dir="ltr">Agile development requires agile data. Agile data empowers developers and testers to control their data on their schedule. It unburdens IT by efficiently providing data where it is needed independent of underlying infrastructure. And it accelerates application delivery by providing fresh and complete data whenever necessary. It grants its users <a title="The Power of Agile DAta" href="http://blog.delphix.com/blog/2013/03/08/the-power-of-agile-data/">super powers</a>.</p>
<p dir="ltr">Many technologies can solve part of the agile data problem, but a partial solution still leaves you with suboptimal processes that impede your business. A complete agile data solution must embrace the following attributes.</p>
<p dir="ltr"><strong>Non Disruptive Synchronization</strong></p>
<p dir="ltr">Production data is sensitive. The environment has been highly optimized and secured, and its continued operation is critical to the success of the business &#8211; introducing risk is unacceptable. An agile data solution must automatically synchronize with production data such that it can provide fresh and relevant data copies, but it cannot mandate changes to how the production environment is managed, nor can its operation jeopardize the performance or success of business critical activities.</p>
<p dir="ltr"><strong>Service Provisioning</strong></p>
<p dir="ltr">Data is more than just a sequence of bits. Projects access data through relational databases, NoSQL databases, REST APIs, or other APIs. An agile data solution must move beyond copying the physical representation of the data by instantiating and configuring the systems to access that data. Leaving this process to the end users induces delays and amplifies risk.</p>
<p dir="ltr"><strong>Source Freedom</strong></p>
<p dir="ltr">Data is pervasive. Efforts to mandate a single data representation, be it a particular relational or NoSQL system, rarely succeed and limit the ability of projects to choose the data representation most appropriate for their needs. As project needs diversify the data landscape, the ability to manage all data through a single experience becomes essential. This unified agile experience necessitates a solution not tied to a single data source.</p>
<p dir="ltr"><strong>Platform Independence</strong></p>
<p dir="ltr">The premier storage, virtualization, and compute platforms of today may be next year’s legacy architecture. Solutions limited to a single platform inhibit the ability of organizations to capitalize on advances in the industry, be it a high performance flash array or new private cloud software. Agility over time requires a solution that is not tied to the implementation of a particular hardware or software platform.</p>
<p dir="ltr"><strong>Efficient Copies</strong></p>
<p dir="ltr">Storage costs money, and time costs the business. Agile development requires a proliferation of data copies for each developer and tester, magnifying these effects. Working around the issue with partial data leads to costly errors that are caught late in the application lifecycle, if at all. An agile solution must be able to create, refresh, and roll back copies of production data in minutes while consuming a fraction of the space required for a full copy.</p>
<p dir="ltr"><strong>Workflow Customization</strong></p>
<p dir="ltr">Each development environment has its own application lifecycle workflow. Data may need to be masked, projects may need multiple branches with different schemas, or developers may need to restart services as data is refreshed. Pushing responsibility to the end user is error prone and impedes application delivery. An agile solution must provide stable interfaces for automation and customization such that it can adapt to any development workflow.</p>
<p dir="ltr"><strong>Self Service Data</strong></p>
<p>Developers and testers dictate the pace of their assigned tasks, and each task affects the data. Agile development mandates that developers have the ability to transform, refresh, and roll back their data without interference. This experience should shield the user from the implementation details of the environment to limit confusion and reduce opportunity for error.</p>
<p><strong>Resource Management</strong></p>
<p dir="ltr">Each data copy consumes resources through storage, memory, and compute. Once developers experience the power of agile data, they will want more copies, run workloads on them for which they were not designed, and forget to delete them when they are through. As these resources become scarce, the failure modes (such as poor performance) become more expensive to diagnose and repair. Combatting this data sprawl requires visibility into performance and capacity usage, accountability through auditing and reports, and proactive resource constraints.</p>
<p>&nbsp;</p>
<p>Delphix is the agile data platform of the future. You can sync to your production database, instantly provision virtual databases where they are needed using a minuscule amount of space, and provide each developer their own copy of the data that can be refreshed and rolled back on demand. This platform will become only more powerful over time as we add new data sources, provide richer workflows targeting specific applications and use cases, and streamline the self service model. An enterprise data strategy without Delphix is just a path to more data anchors, necessitating suboptimal processes that continue to slow application development and your business.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/eschrock/2013/04/21/the-power-of-agile-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enterprise Software Hackathons</title>
		<link>http://dtrace.org/blogs/eschrock/2013/02/28/enterprise-software-hackathons/</link>
		<comments>http://dtrace.org/blogs/eschrock/2013/02/28/enterprise-software-hackathons/#comments</comments>
		<pubDate>Thu, 28 Feb 2013 21:43:23 +0000</pubDate>
		<dc:creator>eschrock</dc:creator>
				<category><![CDATA[Delphix]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/eschrock/?p=166</guid>
		<description><![CDATA[At Delphix, we just concluded one of our recurring Engineering Kickoff events where we get everyone together for a few days of collaboration, discussion, idea sharing, and fun. In this case it included, for the first time, an all-day hackathon event. To be honest, it was a bit of an experiment and one where we [...]]]></description>
			<content:encoded><![CDATA[<p>At Delphix, we just concluded one of our recurring Engineering Kickoff events where we get everyone together for a few days of collaboration, discussion, idea sharing, and fun. In this case it included, for the first time, an all-day hackathon event. To be honest, it was a bit of an experiment and one where we were unsure of how it would be received. We had all read about, participated in, or heard praise of, hackathons at other companies, but these companies were always more consumer-focused or had technologies that were more easily assembled into different creations. As an enterprise software company, we were concerned that even the simplest projects would be too complex to turn around over the course of a day. Given the potential benefit, however, it was clearly something we wanted to experiment with &#8211; any failure would also be a learning opportunity.</p>
<p>Some companies go big or go home when it comes to hackathons &#8211; week-long activities, physical hacks, etc. We wanted to preserve freedom but be a little more targeted. The directive was simple: spend a day doing something unrelated to your normal day job that in some way connects to the business. People volunteered ideas and mentorship ahead of time so that even the newest engineers could meaningfully participate. The result was a resounding success. Whether people were able to give a demo, sketch on a whiteboard, or just speak to their ideas and the challenges they faced, everyone pushed themselves in new directions and walked away having learned something through the experience.</p>
<p>The set of activities covered a wide swath of engineering, including:</p>
<ul>
<li>Using D3.js for visualizing analytics data</li>
<li>&#8220;zero copy&#8221; iSCSI in illumos</li>
<li>web portal for customer data analysis</li>
<li>&#8220;zpool dump&#8221; to store pool metadata for offline zdb(1M) use</li>
<li>Real time engineering dashboard to aggregate commits, bugs, reviews, and more</li>
<li>&#8220;D++&#8221; DTrace syntactic sugar: function elapsed time, unrolled while loops, callers array</li>
<li>Mobile application to monitor Delphix alerts and faults</li>
<li>Global symbol tab completion for MDB</li>
<li>Network performance tool</li>
<li>Speeding up unit tests</li>
<li>Browser usage analytics</li>
<li>&#8216;zfs send&#8217; to a POSIX filesystem</li>
<li>BTrace++ (a.k.a. CATrace) to make Java tracing safe and easy</li>
<li>New V2P (virtual to physical) mechanisms in Delphix</li>
<li>Tools to more easily deploy changes to VMs</li>
</ul>
<p>For myself, I put together a prototype of a hosted SSH/HTTP proxy for use by our support organization. This was my first real foray into the world of true PaaS cloud software &#8211; running Node.js, Redis, and CloudAMQP in a Heroku instance &#8211; and it&#8217;s been incredibly interesting to finally play with all these tools I&#8217;ve read about but never had a reason to use. I will post details (and hopefully code) once I get it into slightly better shape.</p>
<p>Only a fraction of these are really what I would consider a contribution to the product itself, which is where our initial trepidation around a hackathon went awry. No matter how complex your product or how high the barriers to entry, engineers will find a way to build cool things and try out new ideas in a hackathon setting. Everything that people did, from learning how to make changes to our OS to improving our quality of life as engineers to testing new product ideas, will provide real value to the engineering organization. On top of that, it was incredibly fun and a great way to get everyone working together in different ways.</p>
<p>It&#8217;s something we&#8217;ll certainly look at doing again, and I&#8217;d recommend that every company, organization, or group find some way to get engineers together with the express purpose of working on ideas not directly related to their regular work. You&#8217;ll end up with some cool ideas and prototypes, and everyone will learn new things while having fun doing it.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/eschrock/2013/02/28/enterprise-software-hackathons/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Systems Software</title>
		<link>http://dtrace.org/blogs/ahl/2013/02/25/on-systems-software/</link>
		<comments>http://dtrace.org/blogs/ahl/2013/02/25/on-systems-software/#comments</comments>
		<pubDate>Mon, 25 Feb 2013 04:46:00 +0000</pubDate>
		<dc:creator>ahl</dc:creator>
				<category><![CDATA[software]]></category>
		<category><![CDATA[systems]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/ahl/?p=1255</guid>
		<description><![CDATA[A prospective new college hire recently related an odd comment from his professor: systems programming is dead. I was nonplussed; what could the professor have meant? Systems is clearly very much alive. Interesting and important projects march under the banner of systems. But as I tried to construct a less emotional rebuttal, I realized I [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dtrace.org/blogs/ahl/files/2013/02/cctv.jpeg"><img class="alignright size-medium wp-image-1258" title="cctv" src="http://dtrace.org/blogs/ahl/files/2013/02/cctv-278x300.jpg" alt="" width="278" height="300" /></a>A prospective new college hire recently related an odd comment from his professor: systems programming is dead. I was nonplussed; what could the professor have meant? Systems is clearly very much alive. Interesting and important projects march under the banner of systems. But as I tried to construct a less emotional rebuttal, I realized I lacked a crisp definition of what systems programming is.</p>
<p>Wikipedia <a href="http://en.wikipedia.org/wiki/System_software">defines systems software</a> in the narrowest terms: the stuff that interacts with hardware. But that covers a tiny fraction of modern systems. So what is systems software? It depends on <strong>when</strong> you&#8217;re asking the question. At one time, the web server was the application; now it&#8217;s the systems software on which many web-facing applications are built. At one time a database was the application; now it&#8217;s systems software that supports a variety of custom and off-the-shelf applications. Before my time, shells were probably considered a bleeding-edge application; now they&#8217;re systems software on which some of the lowest-level plumbing of modern operating systems is built.</p>
<p>Any layer on which people build applications of increasing complexity is systems software. Most software that endures the transition to systems software does so whether its authors intended it or not. People in the software industry often talk about standing on the shoulders of giants; the systems software accumulated and refined over decades are those giants.</p>
<p>Stable interfaces define systems software. The programs that consume those interfaces expect the underlying systems software to be perfect every time. Initially innovation might happen in the interfaces themselves &#8212; the concurrent model of <a href="http://nodejs.org">Node.js</a> is a great example. As software matures, the interfaces become commodified; innovation happens behind those stable interfaces. Systems is only &#8220;dead&#8221; at its edges. Interfaces might be flexible and well-designed, or sclerotic and poorly designed. Regardless, new or improved systems software can increase performance, enhance observability, or simply fit a different economic niche.</p>
<p>There are a few different types of systems software. First there&#8217;s <strong>supporting systems software</strong>, systems software written as a necessary foundation for some new application. This is systems software written with a purpose and designed to solve an unsolved &#8212; or poorly solved &#8212; problem. Chronologically, examples include UNIX, compilers, and libraries like jQuery. You write it because you need it, but it&#8217;s solving a problem that&#8217;s likely not unique to your particular application.</p>
<p>Then there&#8217;s <strong>accidental systems software</strong>. Stick everything from Apache to Excel to the Bourne shell in that category. These didn&#8217;t necessarily set out to be the foundation on which increasingly complex software would be written, but they definitely are. I&#8217;m sure there were times when indoctrination into systems-hood was painful, where the authors wanted to change interfaces, but good systems software respects its consumers and carries them forward. Somewhat famously, <a href="http://en.wikipedia.org/wiki/Make_(software)">make</a> preserved its arcane syntax because two consumers already existed. JavaScript started as a glorified web design tool; now it sits several layers beneath complex client-side applications. Even more recently, developers of Node.js (itself JavaScript-based) changed a commonly used interface, and the change broke many applications. Historical mistakes can be annoying to live with, but &#8212; as the Node.js team determined &#8212; <a href="https://github.com/joyent/node/issues/3577">compatibility trumps cleanliness</a>.</p>
<p>The largest bucket is <strong>replacement systems software</strong>. Linux, Java, ZFS, and DTrace fall into this category. At the time of their development, each was a notionally compatible replacement for something that already existed. Linux, of course, reimplemented the UNIX design to provide a free, compatible alternative. Java set about building a better runtime (the stable interface being a binary provided to customers to execute) designed to abstract away the operating system entirely. ZFS represented a completely new way of thinking about filesystems, but it did so within the tight constraints of POSIX interfaces and storage hardware. DTrace added new and unique observability to most of the stable interfaces that applications build on.</p>
<p>Finally, there&#8217;s <strong>intentional systems software</strong>. This is systems software by design, but unlike supporting systems software, there&#8217;s no consumer. Intentional systems software takes an &#8220;if you build it, they will come&#8221; approach. This is highly prone to failure &#8212; without an existence proof that your software solves a problem and exposes the right interfaces, it&#8217;s very difficult to know if you&#8217;re building the right thing.</p>
<p>Why define these categories? Knowing which you&#8217;re working with can inform your decisions. If you&#8217;ve written accidental systems software that has had systems-ness thrust upon it, realize that future versions need to respect the consumers &#8212; or willfully cast them aside. When writing replacement systems software, recognize the constraints on the system, and know exactly where you&#8217;re constrained and where you can innovate (or consider whether you want to use the existing solution at all). If you&#8217;ve written supporting systems software, know that others will inevitably need solutions to the same problems. Either invest in maintaining it and keeping it best of breed; resign yourself to the fact that it will need to be replaced as others invest in a different solution; or open source it and hope (or advocate) that it becomes that ubiquitous solution.</p>
<p>TL;DR?</p>
<p>What&#8217;s systems software? It is the increasingly complex, increasingly capable, increasingly diverse foundation on which applications are built. It&#8217;s that long and growing tail of the corpus of software at large. The interfaces might be static, but it&#8217;s a rich venue for innovation. As more and more critical applications build on an interface, the more value there is in improving the systems software beneath it. Systems software is defined by the constraints; it&#8217;s a mission and a mindset with unique challenges, and unique potential.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/ahl/2013/02/25/on-systems-software/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Holistic Engineer</title>
		<link>http://dtrace.org/blogs/ahl/2013/02/06/the-holistic-engineer/</link>
		<comments>http://dtrace.org/blogs/ahl/2013/02/06/the-holistic-engineer/#comments</comments>
		<pubDate>Wed, 06 Feb 2013 08:02:00 +0000</pubDate>
		<dc:creator>ahl</dc:creator>
				<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/ahl/?p=1236</guid>
		<description><![CDATA[The idea of the holistic engineer embodies the point of view that an engineer needs to consider the whole system, the whole body of work that makes a product successful. It bears no relation to holistic health &#8212; and it&#8217;s not some even newer age quackery. There are many specialist roles in the software industry &#8212; [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dtrace.org/blogs/ahl/files/2013/02/erectorsets.jpg"><img class="alignright size-medium wp-image-1242" title="erectorsets" src="http://dtrace.org/blogs/ahl/files/2013/02/erectorsets-300x205.jpg" alt="" width="300" height="205" /></a>The idea of the holistic engineer embodies the point of view that an engineer needs to consider the whole system, the whole body of work that makes a product successful. It bears no relation to <a href="http://en.wikipedia.org/wiki/Holistic_health">holistic health</a> &#8212; and it&#8217;s not some even newer age quackery. There are many specialist roles in the software industry &#8212; marketing, product management, project management, documentation, education, support, etc. &#8212; but the best software engineers are generalists who can assume a portion of each specialty. Further, some software is particularly well-suited for generalists who can combine a deep understanding of the market, the technology, and the implementation.</p>
<p>Software products are born of many different types of organizations, and even within similar organizations roles might have different names. Here&#8217;s a generic example with some names on the roles. New products and features start with product managers. Their role is to talk to customers and sales, educate themselves on the market, and determine the right product or enhancement. The handoff to engineering takes the form of a product requirements document (PRD) &#8212; it might sound like jargon, but the term is more or less universal. Software engineers execute against that PRD; QA engineers design tests that assert conformance to the PRD while developers steer the product from point A to point B as described by product management. Documentation writers and learning services take the PRD and the software to generate collateral that teaches customers how to use it. Product marketing makes the PowerPoints; sales presents them to customers.</p>
<p>And that&#8217;s where babies come from.</p>
<p>It&#8217;s not a perfect process, but it&#8217;s birthed many successful products. The shortcoming is that it can bury engineers under filters. Instead of learning about actual customer problems, engineers hear some processed form of what the customer said. Instead of raw critique of a new feature, engineers hear a softened and truncated form. The more technical the product and the market, the more those filters impede innovation and hamper the trajectory of the product.</p>
<p>The holistic engineer augments the jobs of those specialists, participating in each phase of product development. They join in those early conversations with customers, and share the responsibility of market comprehension. They partner to construct the requirements and design that those engineers will then implement. Along the way, engineers of course validate decisions with sales and customers &#8212; this is <a href="http://en.wikipedia.org/wiki/Agile_software_development">Agile</a> writ large &#8212; but engineers also participate in the outbound documentation, training, and marketing activities.</p>
<p>From start to finish, the process is designed to fuel innovation by arming creative engineers with data and understanding. Customers often tell you what they want; they rarely tell you what they need. The more technical or disruptive the product, the more value an engineer has in those conversations, extracting the essence of the problem from the noise of preconceptions. The relationships with customers and the full context around their problems keep engineers grounded as the inevitable gaps emerge in the product specification. Holistic engineers also help to educate the rest of the company and the rest of the world about new products and features. The process of explaining technology informs the way engineers design and build products. When we&#8217;re having a hard time explaining a feature or presenting a product, we need to revise our design. We&#8217;ve all heard engineering accused of building a product that was too complicated for the market, or engineers complain that a product failed because it was poorly marketed; both are symptoms of poor coordination. Giving engineers holistic responsibility guards against this problem &#8212; if the product is failing, the onus is on them to solve it.</p>
<p>Most important though are the feelings of ownership and agency associated with the whole-body approach. The holistic engineer is explicitly tasked with making a product succeed. That&#8217;s not to say that he or she goes it alone &#8212; specialists in all functions have major roles &#8212; rather the engineer is empowered to move the product through all stages; the other side of that coin is that there&#8217;s no opportunity to shrug off a responsibility as belonging to someone else.</p>
<p>In this model, everyone in every role at the company has the opportunity to engage in product management. Indeed, there&#8217;s still value in explicit product management. Channels of communication need to be easy and open for people with ideas to connect to people who will distill them into implementation. And it&#8217;s not enough to just create the right environment; hiring processes need to identify broad thinkers, and mentorship needs to nurture and reward holistic execution. Not every engineer can &#8212; or wants to &#8212; take on those additional responsibilities, but the best thrive with market and technology awareness, unencumbered by filters. They want responsibility and authority to make their ideas succeed.</p>
<p>The idea of the holistic engineer isn&#8217;t theoretical; it&#8217;s the model we stumbled into in the Solaris Kernel Group, and later implemented deliberately at <a href="http://dtrace.org/blogs/bmc/2008/11/10/fishworks-now-it-can-be-told/">Fishworks</a>. There, a small team took on wide-ranging responsibilities to build a product that&#8217;s now doing $400m/year for Oracle. At Delphix we&#8217;re again inculcating and hiring for holistic thinking. At all three I&#8217;ve seen engineers develop new products and features that address customer needs that would have otherwise never emerged from customers&#8217; initial requests. It&#8217;s not easy to find the right kind of engineers, but if a company can empower the right engineers in the right ways &#8212; and they can live up to the responsibility &#8212; the payoff is a better product, built more efficiently.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/ahl/2013/02/06/the-holistic-engineer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Virtualization Performance: Zones, KVM, Xen</title>
		<link>http://dtrace.org/blogs/brendan/2013/01/11/virtualization-performance-zones-kvm-xen/</link>
		<comments>http://dtrace.org/blogs/brendan/2013/01/11/virtualization-performance-zones-kvm-xen/#comments</comments>
		<pubDate>Fri, 11 Jan 2013 23:58:13 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[Cloud]]></category>
		<category><![CDATA[DTrace]]></category>
		<category><![CDATA[KVM]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[xen]]></category>
		<category><![CDATA[zones]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=3328</guid>
		<description><![CDATA[At Joyent we run a high-performance public cloud based on two different virtualization technologies: Zones and KVM. We have historically run Xen as well, but have phased it out for KVM on SmartOS. My job is to make things go fast, which often means using DTrace to analyze the kernel, applications, and those virtualization technologies. [...]]]></description>
			<content:encoded><![CDATA[<p>At Joyent we run a high-performance public cloud based on two different virtualization technologies: <a href="http://en.wikipedia.org/wiki/Solaris_Zones">Zones</a> and <a href="http://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine">KVM</a>. We have historically run <a href="http://en.wikipedia.org/wiki/Xen">Xen</a> as well, but have phased it out for KVM on <a href="http://smartos.org">SmartOS</a>. My job is to make things <i>go fast</i>, which often means using <a href="http://www.slideshare.net/brendangregg/dtracecloud2012">DTrace</a> to analyze the kernel, applications, and those virtualization technologies. In this post I&#8217;ll summarize their performance in four ways: characteristics, block diagrams, internals, and results.</p>
<table border=1>
<tr>
<th width="19%">Attribute</th>
<th width="25%">Zones</th>
<th width="28%">Xen</th>
<th width="27%">KVM</th>
</tr>
<tr>
<td>CPU Performance</td>
<td bgcolor="80f080">high</td>
<td bgcolor="80f080">high (with CPU support)</td>
<td bgcolor="80f080">high (with CPU support)</td>
</tr>
<tr>
<td>CPU Allocation</td>
<td bgcolor="80f080">flexible (FSS + &#8220;bursting&#8221;)</td>
<td>fixed to VCPU limit</td>
<td>fixed to VCPU limit</td>
</tr>
<tr>
<td>I/O Throughput</td>
<td bgcolor="80f080">high (no intrinsic overhead)</td>
<td>low or medium (with paravirt)</td>
<td>low or medium (with paravirt)</td>
</tr>
<tr>
<td>I/O Latency</td>
<td bgcolor="80f080">low (no intrinsic overhead)</td>
<td>some (I/O proxy overhead)</td>
<td>some (I/O proxy overhead)</td>
</tr>
<tr>
<td>Memory Access Overhead</td>
<td bgcolor="80f080">none</td>
<td>some (EPT/NPT or shadow page tables)</td>
<td>some (EPT/NPT or shadow page tables)</td>
</tr>
<tr>
<td>Memory Loss</td>
<td bgcolor="80f080">none</td>
<td>some (extra kernels; page tables)</td>
<td>some (extra kernels; page tables)</td>
</tr>
<tr>
<td>Memory Allocation</td>
<td bgcolor="80f080">flexible (unused guest memory used for file system cache)</td>
<td>fixed (and possible double-caching)</td>
<td>fixed (and possible double-caching)</td>
</tr>
<tr>
<td>Resource Controls</td>
<td bgcolor="d0f0d0">many (depends on OS)</td>
<td>some (depends on hypervisor)</td>
<td bgcolor="80f080">most (OS + hypervisor)</td>
</tr>
<tr>
<td>Observability: from the host</td>
<td bgcolor="80f080">highest (see everything)</td>
<td>low (resource usage, hypervisor statistics)</td>
<td bgcolor="d0f0d0">medium (resource usage, hypervisor statistics, OS inspection of hypervisor)</td>
</tr>
<tr>
<td>Observability: from the guest</td>
<td bgcolor="d0f0d0">medium (see everything permitted, incl. some physical resource stats)</td>
<td>low (guest only)</td>
<td>low (guest only)</td>
</tr>
<tr>
<td>Hypervisor Complexity</td>
<td bgcolor="80f080">low (OS partitions)</td>
<td>high (complex hypervisor)</td>
<td>medium</td>
</tr>
<tr>
<td>Different OS Guests</td>
<td>usually no (sometimes possible with syscall translation)</td>
<td bgcolor="80f080">yes</td>
<td bgcolor="80f080">yes</td>
</tr>
</table>
<p></p>
<p>There are variations in how these can be configured, and details in this table may vary. At the very least, this can serve as a checklist of characteristics to confirm, which may also be helpful if you are considering other technologies (e.g., VMware). Wikipedia also has a <a href="http://en.wikipedia.org/wiki/Operating_system-level_virtualization#Implementations">table</a> of general characteristics.</p>
<p>The three in this table represent different types: <a href="http://en.wikipedia.org/wiki/Operating_system-level_virtualization">OS Virtualization</a> (Zones), and <a href="http://en.wikipedia.org/wiki/Hardware_virtualization">Hardware Virtualization</a> of both <a href="http://en.wikipedia.org/wiki/Hypervisor#Classification">Type 1</a> (Xen) and <a href="http://en.wikipedia.org/wiki/Hypervisor#Classification">Type 2</a> (KVM) varieties.</p>
<p>The delivered performance of these is critical. In general, we use fast server hardware, 10 GbE networks, <a href="http://en.wikipedia.org/wiki/ZFS">ZFS</a> for all file systems, <a href="http://dtrace.org/blogs/about/">DTrace</a> for performance analysis, and <a href="http://en.wikipedia.org/wiki/Solaris_Zones">Zones</a> wherever possible. We also performed our own port of <a href="http://dtrace.org/blogs/bmc/2011/08/15/kvm-on-illumos">KVM to illumos</a>, and run KVM instances <i>inside</i> Zones, providing additional resource controls that can be applied, and improved security (&#8220;double-hulled virtualization&#8221;).</p>
<p>There are many characteristics I&#8217;d like to discuss in more detail. In this post, I&#8217;ll look at the I/O path (network, disk) and its overhead.</p>
<h2>I/O Path</h2>
<p>How does I/O differ between traditional Unix and Zones?</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2013/01/virtualization_unix_zones.png"><img src="http://dtrace.org/blogs/brendan/files/2013/01/virtualization_unix_zones-600.png" alt="" title="virtualization_unix_zones-600" width="600" height="267" class="alignnone size-full wp-image-3333" /></a></p>
<p>Performance is exactly the same &ndash; there is no overhead. Zones partition the OS in the same way that chroot isolates a process in the file system. There isn&#8217;t necessarily an extra layer in the software stack to make this work.</p>
<p>Now for Xen and KVM (simplified!):</p>
<p><a href="http://dtrace.org/blogs/brendan/files/2013/01/virtualization_xen_kvm.png"><img src="http://dtrace.org/blogs/brendan/files/2013/01/virtualization_xen_kvm-600.png" alt="" title="virtualization_xen_kvm-600" width="600" height="285" class="alignnone size-full wp-image-3335" /></a></p>
<p>GK is Guest Kernel, and domU on Xen runs the guest OS. Some of these arrows indicate the <i>control-path</i>, where components inform each other, either synchronously or asynchronously, that more data is ready to transfer. The <i>data-path</i> may be implemented in some cases by shared memory and ring buffers. There are also different ways this can be configured. For example, Xen can use Isolated Driver Domains (IDD), or stub-domains, to run the I/O proxies in isolation.</p>
<p>With <b>Xen</b>, the hypervisor performs CPU scheduling for the domains, and then each domain has its own OS kernel for thread scheduling. The hypervisor supports different CPU scheduling classes, including Borrowed Virtual Time (BVT), Simple Earliest Deadline First (SEDF), and Credit-Based. The domains use the OS kernel scheduler, and whatever regular scheduler classes and policies they provide.</p>
<p>The extra overhead of multiple schedulers costs performance. Having multiple schedulers can also create complex issues with how they interact, adding CPU latency in the wrong situations. Debugging this can be very difficult, especially since the Xen hypervisor is running out of reach of the usual OS performance tools (try xentrace instead).</p>
<p>Sending I/O via the I/O proxy processes (which are usually qemu) involves context-switching and more overhead. There has been lots of work to minimize this, including shared memory transports, buffering, I/O coalescing, and paravirtualization drivers.</p>
<p>With <b>KVM</b>, the hypervisor is a kernel module (kvm) which is scheduled by the OS scheduler. It can be tuned using the usual OS kernel scheduler classes, policies and priorities. The I/O path takes fewer steps than Xen. (The original Qumranet KVM <a href="http://www.linuxinsight.com/files/kvm_whitepaper.pdf">paper</a> described it as five steps vs ten, although this description doesn&#8217;t include paravirtualization.)</p>
<p>With <b>Zones</b>, there&#8217;s no comparison. The I/O path &ndash; which for high-speed networking is very sensitive &ndash; has none of these extra steps. While this has been well known in the Solaris community for years (Zones being a Solaris technology), and also the FreeBSD community (as Zones are based on FreeBSD jails), the Linux community is still learning about them and developing their own version: <b>Linux Containers</b>. Glauber Costa described them in his talk &#8220;<a href="http://events.linuxfoundation.org/images/stories/pdf/lcna_co2012_costa.pdf">The failure of Operating Systems, and how we can fix it</a>&#8221; for Linuxcon 2012, and listed various use cases where KVM was currently used. Many of the use cases could be served by Containers, and didn&#8217;t actually need KVM.</p>
<p>Sometimes you (and our customers) really do need Hardware Virtualization, as your applications depend on a particular version of the Linux kernel, or on Windows. We provide this with KVM (we&#8217;ve phased out Xen).</p>
<h2>Internals</h2>
<p>Some deeper insights into how these work (often using DTrace).</p>
<h3>Network I/O, Zones</h3>
<p>The following two stack traces show how a network packet is transmitted from the global zone (the host, which is the same as a bare-metal install) and from a zone (the guest):</p>
<pre>
  <b>Global Zone:                            Zone:</b>
  mac`mac_tx+0xda                         mac`mac_tx+0xda
  dld`str_mdata_fastpath_put+0x53         dld`str_mdata_fastpath_put+0x53
  ip`ip_xmit+0x82d                        ip`ip_xmit+0x82d
  ip`ire_send_wire_v4+0x3e9               ip`ire_send_wire_v4+0x3e9
  ip`conn_ip_output+0x190                 ip`conn_ip_output+0x190
  ip`tcp_send_data+0x59                   ip`tcp_send_data+0x59
  ip`tcp_output+0x58c                     ip`tcp_output+0x58c
  ip`squeue_enter+0x426                   ip`squeue_enter+0x426
  ip`tcp_sendmsg+0x14f                    ip`tcp_sendmsg+0x14f
  sockfs`so_sendmsg+0x26b                 sockfs`so_sendmsg+0x26b
  sockfs`socket_sendmsg+0x48              sockfs`socket_sendmsg+0x48
  sockfs`socket_vop_write+0x6c            sockfs`socket_vop_write+0x6c
  genunix`fop_write+0x8b                  genunix`fop_write+0x8b
  genunix`write+0x250                     genunix`write+0x250
  genunix`write32+0x1e                    genunix`write32+0x1e
  unix`_sys_sysenter_post_swapgs+0x14     unix`_sys_sysenter_post_swapgs+0x14
</pre>
<p>I spent (way) too much time double-checking that I didn&#8217;t switch these two stacks by accident, since they are <i>identical</i>. The stack on the right shows the same code path taken.</p>
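<p>If you would rather not compare stacks by eye, a few lines of Python can do the double-checking. This is only a sketch: <code>first_difference</code> is a hypothetical helper (not part of DTrace), and it assumes each stack has been pasted in as plain text with one frame per line. Only the first three frames from the listing above are used here, to keep it short.</p>

```python
from itertools import zip_longest


def first_difference(stack_a, stack_b):
    """Compare two stack traces frame by frame.

    Returns None when they match exactly, otherwise a tuple of
    (frame index, frame from stack_a, frame from stack_b).
    A frame missing from the shorter stack is reported as None.
    """
    frames_a = [line.strip() for line in stack_a.strip().splitlines()]
    frames_b = [line.strip() for line in stack_b.strip().splitlines()]
    for index, (a, b) in enumerate(zip_longest(frames_a, frames_b)):
        if a != b:
            return (index, a, b)
    return None


# First three frames of each column from the listing above.
global_zone = """
  mac`mac_tx+0xda
  dld`str_mdata_fastpath_put+0x53
  ip`ip_xmit+0x82d
"""
zone = """
  mac`mac_tx+0xda
  dld`str_mdata_fastpath_put+0x53
  ip`ip_xmit+0x82d
"""

print(first_difference(global_zone, zone))  # None: the frames match
```

<p>The comparison includes the offsets (such as <code>+0xda</code>), so a match means the same instruction in the same function, not just the same function name.</p>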
<p>You could configure Zones in a way that does add overhead, just as you could on a normal system. For example, enabling a firewall for network I/O, or mounting file systems via lofs instead of directly. These are optional, and may be worth the extra performance overhead for certain use cases.</p>
<h3>Network I/O, KVM</h3>
<p>The full code path for performing network I/O is complex.</p>
<p>The first part is the guest process writing to its driver. In this case, I&#8217;m demonstrating a <b>Linux</b> Fedora guest with <a href="https://github.com/dtrace4linux/linux">DTrace-for-Linux</a>, and tracing the paravirt driver:</p>
<pre>
guest# <b>dtrace -n 'fbt:virtio_net:start_xmit:entry { @[stack(100)] = count(); }'</b>
dtrace: description 'fbt:virtio_net:start_xmit:entry ' matched 1 probe
^C
[...]
              kernel`start_xmit+0x1
              kernel`dev_hard_start_xmit+0x322
              kernel`sch_direct_xmit+0xef
              kernel`dev_queue_xmit+0x184
              kernel`eth_header+0x3a
              kernel`neigh_resolve_output+0x11e
              kernel`nf_hook_slow+0x75
              kernel`ip_finish_output
              kernel`ip_finish_output+0x17e
              kernel`ip_output+0x98
              kernel`__ip_local_out+0xa4
              kernel`ip_local_out+0x29
              kernel`ip_queue_xmit+0x14f
              kernel`tcp_transmit_skb+0x3e4
              kernel`__kmalloc_node_track_caller+0x185
              kernel`sk_stream_alloc_skb+0x41
              kernel`tcp_write_xmit+0xf7
              kernel`__alloc_skb+0x8c
              kernel`__tcp_push_pending_frames+0x26
              kernel`tcp_sendmsg+0x895
              kernel`inet_sendmsg+0x64
              kernel`sock_aio_write+0x13a
              kernel`do_sync_write+0xd2
              kernel`security_file_permission+0x2c
              kernel`rw_verify_area+0x61
              kernel`vfs_write+0x16d
              kernel`sys_write+0x4a
              kernel`sys_rt_sigprocmask+0x84
              kernel`system_call_fastpath+0x16
             2015
</pre>
<p>That&#8217;s the Linux 3.2.6 network transmit path.</p>
<p>Control is passed by KVM to the qemu I/O proxy, which then transmits it on the host OS via the usual means (native driver). Here is the <b>SmartOS</b> stack in this case:</p>
<pre>
host# <b>dtrace -n 'fbt::igb_tx:entry { @[stack()] = count(); }'</b>
dtrace: description 'fbt::igb_tx:entry ' matched 1 probe
^C
[...]
              igb`igb_tx_ring_send+0x33
              mac`mac_hwring_tx+0x1d
              mac`mac_tx_send+0x5dc
              mac`mac_tx_single_ring_mode+0x6e
              mac`mac_tx+0xda
              dld`str_mdata_fastpath_put+0x53
              ip`ip_xmit+0x82d
              ip`ire_send_wire_v4+0x3e9
              ip`conn_ip_output+0x190
              ip`tcp_send_data+0x59
              ip`tcp_output+0x58c
              ip`squeue_enter+0x426
              ip`tcp_sendmsg+0x14f
              sockfs`so_sendmsg+0x26b
              sockfs`socket_sendmsg+0x48
              sockfs`socket_vop_write+0x6c
              genunix`fop_write+0x8b
              genunix`write+0x250
              genunix`write32+0x1e
              unix`_sys_sysenter_post_swapgs+0x149
             1195
</pre>
<p>Both of these stacks are pretty complex to begin with. Then there is the stuff in-between the <b>Linux kernel</b> and the <b>illumos kernel</b>, which gets even more complicated and involved. Basically, the paravirt code paths allow the two kernel stacks to make intimate love.</p>
<p>When <a href="http://dtrace.org/blogs/rm">Robert Mustacchi</a> of Joyent last investigated these code paths in detail, he drew up some wonderful ASCII diagrams like the following:</p>
<pre>
/*
 *                  GUEST                        #       QEMU
 * #####################################################################
 *                                               #
 *    +----------+                               #
 *    |  start_  | (1)                           #
 *    |  xmit()  |                               #
 *    +----------+                               #
 *         ||                                    #
 *         ||       +-----------+                #
 *         ||------&gt;|free_old_  | (2)            #
 *         ||------&gt;|xmit_skbs()|                #
 *         ||       +-----------+                #
 *         \/                        (3)         #
 *    +---------+        +-------------+     + - #--- PIO write to VNIC
 *    |  xmit_  |-------&gt;|virtqueue_add|     |   #    PCI config space (6)
 *    |  skb()  |-------&gt;|_buf_gfp()   |     |   #
 *    +---------+        +-------------+     |   #
 *        ||                                 |   # +- VM exit
 *        ||         +- iff interrupts       |   # |  KVM driver exit (7)
 *        \/         |  unmasked (4)         |   # |
 *    +---------+    |     +-----------+(5)  |   # |  +---------+
 *    |virtqueue|----*----&gt;|vp_notify()|-----*---#-*-&gt;| handle  | (8)
 *    |_kick()  |----*----&gt;|           |-----*---#-*-&gt;|PIO write|
 *    +---------+          +-----------+         #    +---------+
 *        ||                                     #        ||
 *        ||   (13)                              #        ||
 *        **-----+ iff avail ring                #        \/      (9)
 *        ||       capacity &lt; 20                 # +-----------------+
 *        ||       else return                   # |virtio_net_handle|
 *        ||                                     # |tx_timer()       |
 *        \/   (14)                              # +-----------------+
 *    +----------+                               #  ||
 *    |netif_stop|                               #  ||             (10)
 *    |_queue()  |                               #  ||   +---------+
 *    +----------+                               #  ||--&gt;|qemu_mod_|
 *        ||                                     #  ||--&gt;|timer()  |
 *        ||     (15)                 (16)       #  ||   +---------+
 *    +----------------+     +----------+        #  ||
 *    |virtqueue_enable|----&gt;|unmask    |        #  ||              (11)
 *    |_cb_delayed()   |----&gt;|interrupts|        #  ||  +------------+
 *    +----------------+     +----------+        #  |+-&gt;|virtio_     |
 *        ||                   ||                #  +--&gt;|queue_set_  |
 *        || (18)              ||       (17)     #      |notification|
 *        ||  +-return   +-------------------+   #      +------------+
 *        ||  | iff ----&gt;|check if the number|   #       |
 *        **--+ is false |of unprocessed used|   #       |  disable host
 *        ||             |ring entries is &gt;  |   #       +- interrupts
 *        ||             |3/4s of the avail  |   #          (12)
 *        \/   (19)      |ring index - the   |   #
 *   +-----------+       |last freed used    |   #
 *   |free_old_  |       |ring index         |   #
 *   |xmit_skbs()|       +-------------------+   #
 *   +-----------+                               #
 *        ||                                     #
 *        ||     (20)                            #
 *        **-----+ iff avail ring                #
 *        ||       capacity is                   #
 *        ||       now &gt; 20                      #
 *        \/                                     #
 *   +-----------+                               #
 *   |netif_start| (21)                          #
 *   |_queue()   |                               #
 *   +-----------+                               #
 *        ||                                     #
 *        ||                                     #
 *        \/  (22)               (23)            #
 *   +------------+      +----------+            #
 *   |virtqueue_  |-----&gt;|mask      |            #
 *   |disable_cb()|-----&gt;|interrupts|            #
 *   +------------+      +----------+            #
 *                                               #
 *                                               #
 */
		  <b>Figure II: Guest / Host Packet TX Part 1</b>
</pre>
<p>I included this diagram just to give you a sense of what happens. And that&#8217;s only part <i>1</i>.</p>
<p>In brief, this uses ring buffers in shared memory to transfer the data, and a notification mechanism to inform when data is ready to transfer. When everything is working as intended, performance can be quite reasonable. It isn&#8217;t bare-metal fast (or Zones fast), but it isn&#8217;t terrible either. I&#8217;ve included some numbers later in this post.</p>
<p>The CPU overhead and reduced network performance are one thing. Another is the complexity this introduces, which hampers analysis and performance investigations. With Zones, there is one kernel TCP/IP stack to study and tune. Given its complexity, one is more than enough! With KVM, there are two different kernel TCP/IP stacks, plus KVM and paravirt. Investigating performance can take ten times longer, or so long that it becomes prohibitive. This is why I included &#8220;Observability&#8221; as a key characteristic in my comparison table. If it&#8217;s harder to see, it&#8217;s harder to tune.</p>
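<p>As a rough illustration of the ring-buffer-plus-notification idea described above, here is a toy Python model. It is a sketch only: the class and method names are hypothetical, it is not the virtio implementation, and it simplifies the diagram by stopping the guest queue only when the ring is completely full. The point it tries to show is that the guest notifies (&#8220;kicks&#8221;) the host once, and the host then drains everything that has accumulated before notifications are turned back on.</p>
<pre>
from collections import deque


class ToyRing:
    """Toy model of a shared transmit ring (hypothetical, not virtio)."""

    def __init__(self, size):
        self.size = size              # total slots in the ring
        self.entries = deque()        # buffers posted by the guest
        self.notify_enabled = True    # host wants a kick for the next buffer
        self.kicks = 0                # guest-to-host notifications sent
        self.stopped = False          # guest queue stopped because ring is full

    def guest_send(self, packet):
        """Guest side: post a buffer and kick the host only if it asked."""
        if self.stopped:
            return False
        self.entries.append(packet)
        if self.notify_enabled:
            self.kicks += 1               # the expensive exit to the host
            self.notify_enabled = False   # host is awake; suppress more kicks
        if len(self.entries) == self.size:
            self.stopped = True           # ring full: stop the guest queue
        return True

    def host_drain(self):
        """Host side: consume all posted buffers, then re-enable kicks."""
        drained = list(self.entries)
        self.entries.clear()
        self.stopped = False
        self.notify_enabled = True
        return drained


ring = ToyRing(size=4)
for packet in ("a", "b", "c"):
    ring.guest_send(packet)
print(ring.kicks, ring.host_drain())
</pre>
<p>In this toy, three packets cost one notification rather than three, which is the kind of batching that lets the real mechanism perform reasonably when everything is working as intended.</p>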
<h3>Network I/O, Xen</h3>
<p>The guest transmit and I/O proxy transmit stacks are the same. The in-between bit gets more complex. The hypervisor can&#8217;t be inspected using OS observability and debugging tools, since it&#8217;s running directly on bare metal. There is xentrace, which looks pretty useful, as it instruments many event types in the Xen scheduler using static probes. (Even if it isn&#8217;t real-time and programmatic like DTrace, and requires me to learn Yet Another Tracer.)</p>
<h3>/proc, Zones</h3>
<p>While the I/O path may have zero extra overhead by default, there are some overheads with OS Virtualization, usually for administration or observability, and not in the CPU or I/O &#8220;hot path&#8221;.</p>
<p>For example, a Zone cannot see other guests on the same system via /proc, as read by <tt>prstat(1M)</tt>, <tt>top(1)</tt>, etc. This is implemented in usr/src/uts/common/fs/proc/prvnops.c:</p>
<pre>
static int
pr_readdir_procdir(prnode_t *pnp, uio_t *uiop, int *eofp)
{
[...]
        /*
         * Loop until user's request is satisfied or until all processes
         * have been examined.
         */
        while ((error = gfs_readdir_pred(&amp;gstate, uiop, &amp;n)) == 0) {
                uint_t pid;
                int pslot;
                proc_t *p;

                /*
                 * Find next entry.  Skip processes not visible where
                 * this /proc was mounted.
                 */
                mutex_enter(&amp;pidlock);
                while (n &lt; v.v_proc &amp;&amp;
                    ((p = pid_entry(n)) == NULL || p-&gt;p_stat == SIDL ||
                    (zoneid != GLOBAL_ZONEID &amp;&amp; p-&gt;p_zone-&gt;zone_id != zoneid) ||
                    secpolicy_basic_procinfo(CRED(), p, curproc) != 0))
                        n++;
[...]
</pre>
<p>The full list of processes is scanned, and just the local Zone&#8217;s processes are returned. This might sound a bit inefficient &ndash; couldn&#8217;t a linked list be added to <tt>proc_t</tt> so that Zone processes could be walked directly? Sure, but let&#8217;s be data driven.</p>
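<p>The scan-and-filter logic in the kernel code above can be sketched in a few lines of Python. This is a simplified model, not the kernel implementation: <tt>Proc</tt> is a hypothetical stand-in for <tt>proc_t</tt>, the process states are plain strings, the global zone ID is assumed to be 0, and the <tt>secpolicy_basic_procinfo()</tt> check is left out.</p>
<pre>
from collections import namedtuple

# Hypothetical stand-in for the kernel proc_t; only the fields used here.
Proc = namedtuple("Proc", ["pid", "zone_id", "state"])

GLOBAL_ZONEID = 0   # assumption: the global zone, which sees every process
SIDL = "SIDL"       # process still being created; always skipped


def visible_pids(proc_table, zoneid):
    """Scan every slot and return only the PIDs visible from zoneid."""
    pids = []
    for p in proc_table:            # full scan of the table, as in the kernel
        if p is None or p.state == SIDL:
            continue                # empty slot or half-created process
        if zoneid != GLOBAL_ZONEID and p.zone_id != zoneid:
            continue                # belongs to another zone: hide it
        pids.append(p.pid)
    return pids


table = [Proc(1, 0, "SRUN"), None, Proc(100, 1, "SRUN"),
         Proc(101, 1, "SIDL"), Proc(200, 2, "SRUN")]
print(visible_pids(table, 1))
print(visible_pids(table, GLOBAL_ZONEID))
</pre>
<p>The cost of the scan grows with the total number of slots examined, not with the number of processes in the calling Zone, which is the inefficiency in question.</p>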
<p>Here&#8217;s the time to read /proc from a Zone by the <tt>prstat(1M)</tt> command, measuring using DTrace:</p>
<pre>
# <b>dtrace -n 'fbt::pr_readdir_procdir:entry /execname == "prstat"/ {
    self->ts = timestamp; } fbt::pr_readdir_procdir:return /self->ts/ {
    @["ns"] = avg(timestamp - self->ts); self->ts = 0; }'</b>
dtrace: description 'fbt::pr_readdir_procdir:entry ' matched 2 probes
^C
  ns                                                           544584
</pre>
<p>On average, that&#8217;s 544 us (microseconds).</p>
<p>Now with an extra 1000 processes in another Zone (which represents a typical dozen extra guests):</p>
<pre>
# <b>dtrace -n 'fbt::pr_readdir_procdir:entry /execname == "prstat"/ {
   self->ts = timestamp; } fbt::pr_readdir_procdir:return /self->ts/ {
   @["ns"] = avg(timestamp - self->ts); self->ts = 0; }'</b>
dtrace: description 'fbt::pr_readdir_procdir:entry ' matched 2 probes
^C
  ns                                                           594254
</pre>
<p>That added 50 us. For a /proc read &ndash; which shouldn&#8217;t be hot path. If it is, and 50 us matters, we can look at it then.</p>
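<p>The arithmetic behind that figure, using only the two averages printed by DTrace above, works out as follows. The per-process cost assumes the added time scales evenly across the 1000 extra processes, which was not measured separately.</p>
<pre>
# Average pr_readdir_procdir() times reported by DTrace above, in nanoseconds.
baseline_ns = 544584     # prstat in a Zone, before adding processes
loaded_ns = 594254       # same measurement with 1000 extra processes elsewhere
extra_procs = 1000

added_ns = loaded_ns - baseline_ns
per_proc_ns = added_ns / extra_procs          # assumes even scaling
increase_pct = 100 * added_ns / baseline_ns

print(f"added: {added_ns / 1000:.2f} us in total")
print(f"about {per_proc_ns:.2f} ns per extra process")
print(f"relative increase: {increase_pct:.2f}%")
</pre>
<p>That is roughly 50 ns per extra process scanned, or about a 9% increase on this particular read.</p>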
<p>(While I was here, I also checked <tt>pidlock</tt>, which is, ahem, <i>global</i>. It is not currently a problem. This was also checked using DTrace.)</p>
<h2>Network Throughput Results</h2>
<p>I try not to share performance testing results without triple checking numbers, and I don&#8217;t have time for that right now (this was just supposed to be a quick blog post). I can share some previous numbers from a few months ago, when I did have the time to test carefully and perform <a href="http://dtrace.org/blogs/brendan/2012/10/23/active-benchmarking/">Active Benchmarking</a>.</p>
<p>This was a series of network throughput and IOPS tests using iperf, to test differences with default installations of 1 Gbyte SmartOS Zones and CentOS KVM instances (Xen wasn&#8217;t tested). The client and server were in the same datacenter, but not on the same physical host, so that the full network stack was used.</p>
<p>I should make it clear that these results are not a &#8220;max config&#8221; for our cloud. It&#8217;s a <b>minimum config</b> (1 Gbyte instances). If this were a marketing activity, I&#8217;d probably be compelled to test the max config. Which, for our SmartOS kernel, will be a lot of work, as it can drive multiple 10 GbE ports at line rate, which requires a lot of load-generating clients to perform.</p>
<p>For these results, YMMV based on workload, platform kernel type, and tuning. If you are to use them, think carefully about how they would apply, and to what degree. If your workload is CPU- or File System-bound, then you are probably better off testing their performance than using these network results.</p>
<p>A typical invocation on the <b>server</b>:</p>
<pre>
iperf -s -l 128k
</pre>
<p>And on the <b>client</b>:</p>
<pre>
iperf -c server -l 128k -P 4 -i 1 -t 30
</pre>
<p>The thread count (-P) was varied to investigate limits. The final result &ndash; the average over 30 seconds &ndash; was used.</p>
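<p>A small helper like the following could generate the client command lines for such a sweep. It is a sketch: the function name is hypothetical, it only builds the argument list shown above without running iperf, and the thread counts are the ones that appear in the throughput table.</p>
<pre>
def iperf_client_cmd(server, threads, length="128k", seconds=30):
    """Build the client argv shown above, varying only the thread count (-P)."""
    return ["iperf", "-c", server, "-l", length,
            "-P", str(threads), "-i", "1", "-t", str(seconds)]


for threads in (1, 2, 4, 8):
    print(" ".join(iperf_client_cmd("server", threads)))
</pre>
<p>Each list can be passed to <tt>subprocess.run()</tt> if you want to drive the sweep from a script.</p>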
<h3>Throughput</h3>
<p>Searching for the highest Gbits/sec:</p>
<table border=1>
<tr>
<th>source</th>
<th>dest</th>
<th>threads</th>
<th>result</th>
<th>suspected limiter</th>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>1 </td>
<td>2.75 Gbits/sec </td>
<td>client iperf @80% CPU, and network latency</td>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>2 </td>
<td>3.32 Gbits/sec </td>
<td>dest iperf up to 19% LAT, and network latency</td>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>4 </td>
<td><b>4.54 Gbits/sec</b> </td>
<td>client iperf over 10% LAT, hitting CPU caps</td>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>8 </td>
<td>1.96 Gbits/sec </td>
<td>client iperf LAT, hitting CPU caps</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>1 </td>
<td><b>400 Mbits/sec</b> </td>
<td>network/KVM latency (dest 60% of the 1 VCPU)</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>2 </td>
<td>394 Mbits/sec </td>
<td>network/KVM latency (dest 60% of the 1 VCPU)</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>4 </td>
<td>388 Mbits/sec </td>
<td>network/KVM latency (dest 60% of the 1 VCPU)</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>8 </td>
<td>389 Mbits/sec </td>
<td>network/KVM latency (dest 70% of the 1 VCPU)</td>
</tr>
</table>
<p></p>
<p>The peak <b>Zones</b> performance was 4.54 Gbits/sec with 4 threads. More threads hit the CPU caps for the 1 Gbyte (small) instance, with the CPU scheduler latency causing TCP breakdown. Larger SmartOS instances have higher CPU caps, and should be able to take performance further.</p>
<p>For the <b>KVM</b> test, these were default CentOS instances. I know that with a more modern Linux kernel with network stack tuning, we can improve throughput much further. The most I&#8217;ve reached is around 900 Mbits/sec for 1 VCPU KVM Linux (this was after we tuned KVM up from <a href="http://dtrace.org/blogs/brendan/2012/08/09/10-performance-wins/">110 Mbits/sec</a> using a lot of DTrace analysis). Even at 900 Mbits/sec, it&#8217;s still 5x slower than Zones.</p>
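<p>For reference, the ratios quoted here come straight from the numbers above; this short calculation only restates them and adds no new measurements.</p>
<pre>
# Best results from the throughput table and text above, in Mbits/sec.
zones_peak = 4540      # SmartOS Zones, 4 threads
kvm_default = 400      # default CentOS KVM, 1 thread
kvm_tuned = 900        # best tuned 1 VCPU KVM Linux result mentioned above

print(f"Zones vs default KVM: {zones_peak / kvm_default:.2f}x")
print(f"Zones vs tuned KVM:   {zones_peak / kvm_tuned:.2f}x")
</pre>
<p>So the default instances differ by about 11x, and the tuned KVM case by about 5x.</p>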
<p>Note the &#8220;suspected limiter&#8221; column. This is essential to confirm what was actually tested, and comes from Active Benchmarking. It means I did performance analysis for <i>every single result</i> (including those not listed here to save room). In case you are wondering, it took a full day to perform all tests and analyze each result (again, using DTrace).</p>
<h3>IOPS</h3>
<p>Searching for the highest packets/sec:</p>
<table border=1>
<tr>
<th>source</th>
<th>dest</th>
<th>threads</th>
<th>result</th>
<th>suspected limiter</th>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>1 </td>
<td>14000 packets/sec </td>
<td>client/dest thread count (each thread about 18% CPU total)</td>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>2 </td>
<td>23000 packets/sec </td>
<td>client/dest thread count</td>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>4 </td>
<td>36000 packets/sec </td>
<td>client/dest thread count</td>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>8 </td>
<td>60000 packets/sec </td>
<td>client/dest thread count</td>
</tr>
<tr>
<td>SmartOS 1 GB </td>
<td>SmartOS 1 GB </td>
<td>16 </td>
<td><b>78000 packets/sec</b> </td>
<td>both client &amp; dest CPU cap</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>1 </td>
<td>1180 packets/sec </td>
<td>network/KVM latency, thread count (client thread about 10% CPU)</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>2 </td>
<td>2300 packets/sec </td>
<td>network/KVM latency, thread count</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>4 </td>
<td>4400 packets/sec </td>
<td>network/KVM latency, thread count</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>8 </td>
<td>7900 packets/sec </td>
<td>network/KVM latency, thread count (threads now using about 30% CPU each; plenty idle)</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>16 </td>
<td>13500 packets/sec </td>
<td>network/KVM latency, thread count (~50% idle on both)</td>
</tr>
<tr>
<td>KVM CentOS 1 GB </td>
<td>KVM CentOS 1 GB </td>
<td>32 </td>
<td><b>18000 packets/sec</b> </td>
<td>CPU (dest &gt;90% of the 1 VCPU)</td>
</tr>
</table>
<p></p>
<p>In this case, Zones is 4x the packet rate of KVM. As before, the limiting factor becomes the cloud CPU limits, and I was only testing <i>small</i> 1 Gbyte servers. Bigger servers get higher CPU quotas, and all of these numbers should scale higher.</p>
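<p>The 4x figure compares the two peaks. A quick calculation over the table above (no new data, just the published rows) also shows the ratio at matching thread counts:</p>
<pre>
# Packets/sec from the IOPS table above, keyed by iperf thread count.
zones = {1: 14000, 2: 23000, 4: 36000, 8: 60000, 16: 78000}
kvm = {1: 1180, 2: 2300, 4: 4400, 8: 7900, 16: 13500, 32: 18000}

peak_ratio = max(zones.values()) / max(kvm.values())
print(f"peak Zones / peak KVM: {peak_ratio:.2f}x")

# Compare at thread counts that were tested on both platforms.
for threads in sorted(set(zones).intersection(kvm)):
    ratio = zones[threads] / kvm[threads]
    print(f"{threads:2d} threads: {ratio:.1f}x")
</pre>
<p>At equal thread counts the gap in these numbers is wider than at the peaks, since KVM needed 32 threads to reach its best result while Zones reached its CPU cap at 16.</p>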
<h2>Conclusion</h2>
<p>In this post, I summarized performance characteristics of three virtualization technologies &ndash; Zones, Xen, and KVM &ndash; and then investigated the I/O path in more detail. Zones add no overhead, whereas Xen and KVM do, which could limit network throughput to a quarter of what it could be.</p>
<p>By default we encourage customers to deploy on Zones, for reasons of performance, observability, and simplicity (debuggability). This may mean compiling their applications for <a href="http://smartos.org">SmartOS</a> (our illumos-based OS which hosts the Zones) if they aren&#8217;t already in the repo. In cases where they absolutely must have Linux or Windows, and the applications can&#8217;t run elsewhere, then it&#8217;s Hardware Virtualization (KVM).</p>
<p>There are more performance characteristics to consider that I didn&#8217;t explore here, except briefly in the summary table, including how CPU allocation and VCPUs work, how memory allocation and file system caches work, and more. These could be topics for follow-up posts.</p>
<p>This post wasn&#8217;t supposed to be so much about DTrace, but it&#8217;s the essential tool in so much of our high-performance work that it would be hard not to mention. We use it to improve overall performance for Zones and KVM, to track down latency outliers, explain benchmark results, study the effects of multi-tenancy, and to improve the performance of applications and the OS.</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2013/01/11/virtualization-performance-zones-kvm-xen/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>zfsday: ZFS Performance Analysis and Tools</title>
		<link>http://dtrace.org/blogs/brendan/2012/12/29/zfsday-zfs-performance-analysis-and-tools/</link>
		<comments>http://dtrace.org/blogs/brendan/2012/12/29/zfsday-zfs-performance-analysis-and-tools/#comments</comments>
		<pubDate>Sun, 30 Dec 2012 01:04:23 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[performance]]></category>
		<category><![CDATA[slides]]></category>
		<category><![CDATA[talk]]></category>
		<category><![CDATA[video]]></category>
		<category><![CDATA[ZFS]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=3312</guid>
		<description><![CDATA[At zfsday 2012, I gave a talk on ZFS performance analysis and tools, discussing the role of old and new observability tools for investigating ZFS, including many based on DTrace. This was a fun talk &#8211; probably my best so far &#8211; spanning performance analysis from the application level down through the kernel and to [...]]]></description>
			<content:encoded><![CDATA[<p>At <a href="http://zfsday.com/">zfsday</a> 2012, I gave a talk on ZFS performance analysis and tools, discussing the role of old and new observability tools for investigating ZFS, including many based on DTrace. This was a fun talk &ndash; probably my best so far &ndash; spanning performance analysis from the application level down through the kernel and to the storage device level.</p>
<p>My background with ZFS includes leading various performance work for the world&#8217;s first ZFS-based storage appliance at Sun Microsystems and later Oracle, and now further analysis and tuning as Joyent&#8217;s lead performance engineer where we run a public cloud on ZFS. Given the risk of other tenants (noisy neighbors) interfering with your performance, I can&#8217;t imagine running a cloud on anything else. This talk includes the tools and tuning we use to make sure ZFS runs smoothly.</p>
<p>The video is on <a href="http://www.youtube.com/watch?v=xkDqe6rIMa0">youtube</a>:</p>
<p><center><object width="560" height="315"><param name="movie" value="http://www.youtube.com/v/xkDqe6rIMa0?hl=en_US&amp;version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/xkDqe6rIMa0?hl=en_US&amp;version=3" type="application/x-shockwave-flash" width="560" height="315" allowscriptaccess="always" allowfullscreen="true"></embed></object></center></p>
<p>The slides are available on <a href="http://www.slideshare.net/brendangregg/zfsperftools2012">slideshare</a> and as a <a href="http://www.brendangregg.com/Slides/zfsperftools2012.pdf">PDF</a>:</p>
<p><center><iframe src="http://www.slideshare.net/slideshow/embed_code/14562681" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe></center></p>
<p>During the talk I referenced several DTrace scripts for ZFS analysis. These are either from the File System chapter of the DTrace book, and are online at <a href="http://www.dtracebook.com/index.php/File_Systems">dtracebook.com</a>, or from the fs directory of the dtrace-cloud-tools project, online at <a href="https://github.com/brendangregg/dtrace-cloud-tools/tree/master/fs">github/brendangregg</a>.</p>
<p>Thanks to <a href="http://www.beginningwithi.com">Deirdré</a> for organizing a great conference, and filming and editing my video, and to all who spoke, attended, and helped out. For more about zfsday, see Adam&#8217;s <a href="http://dtrace.org/blogs/ahl/2012/11/25/illumos-and-zfs-days/">summary</a>. I&#8217;m looking forward to the next zfsday!</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2012/12/29/zfsday-zfs-performance-analysis-and-tools/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The USE Method: SmartOS Performance Checklist</title>
		<link>http://dtrace.org/blogs/brendan/2012/12/19/the-use-method-smartos-performance-checklist/</link>
		<comments>http://dtrace.org/blogs/brendan/2012/12/19/the-use-method-smartos-performance-checklist/#comments</comments>
		<pubDate>Wed, 19 Dec 2012 17:23:41 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[illumos]]></category>
		<category><![CDATA[omnios]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[smartos]]></category>
		<category><![CDATA[Solaris]]></category>
		<category><![CDATA[usemethod]]></category>
		<category><![CDATA[zones]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=3283</guid>
		<description><![CDATA[The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies. In this post, I&#8217;ll provide an example of a USE-based metric list for [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/">USE Method</a> provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors.  For each system resource, metrics for utilization, saturation and errors are identified and checked.  Any issues discovered are then investigated using further strategies.</p>
<p>In this post, I&#8217;ll provide an example of a USE-based metric list for use within a <a href="http://smartos.org/">SmartOS</a> SmartMachine (Zone), such as those provided by the <a href="http://www.joyent.com">Joyent Public Cloud</a>. These use the <a href="http://illumos.org">illumos</a> kernel, and so this list should also be mostly relevant for <a href="http://omnios.omniti.com/">OmniOS</a> Zones, and to a lesser degree (due to some missing features) Solaris Zones. This is primarily intended for users of the zones. For the system administrators of the physical systems (via the Global Zone), also see the <a href="http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist/">Solaris</a> checklist, which has greater visibility.</p>
<p>Cloud limits (software resource controls) are listed first, as they are usually encountered before the physical limits.</p>
<style type="text/css">
tt {
	font-size:13px;
}
</style>
<h2>Cloud Limits</h2>
<p>These cover CPU, memory, disk I/O (file system), and network.</p>
<table border=1 cellpadding=2 width="100%">
<tr>
<th>component</th>
<th>type</th>
<th>metric</th>
</tr>
<tr>
<td>CPU cap</td>
<td>utilization</td>
<td><tt>sm-cpuinfo</tt> (previously <tt>jinf -c</tt>); raw counters: <tt>kstat -p caps::cpucaps_zone*:</tt>, &#8220;usage&#8221; == current CPU used, &#8220;value&#8221; == CPU cap</td>
</tr>
<tr>
<td>CPU cap</td>
<td>saturation</td>
<td><tt>uptime</tt> load averages are zone-aware; per-process: <tt>prstat -mLc 1</tt>, &#8220;LAT&#8221;; rough counter: <tt>kstat -p caps::cpucaps_zone*:above_sec</tt></td>
</tr>
<tr>
<td>CPU cap</td>
<td>errors</td>
<td><i>N/A</i></td>
</tr>
<tr>
<td>Memory cap</td>
<td>utilization</td>
<td><tt>sm-meminfo rss</tt> for main memory (previously <tt>jinf -m</tt>); <tt>sm-meminfo swap</tt> for virtual memory; <tt>zonememstat</tt>, &#8220;RSS&#8221; vs &#8220;CAP&#8221;; <tt>prstat -Z</tt>, zone &#8220;RSS&#8221;, &#8220;SIZE&#8221; (VM); raw counters: <tt>kstat -p memory_cap:::</tt>, &#8220;rss&#8221; vs &#8220;physcap&#8221;, &#8220;swap&#8221; vs &#8220;swapcap&#8221;</td>
</tr>
<tr>
<td>Memory cap</td>
<td>saturation</td>
<td><tt>zonememstat</tt>, increasing &#8220;NOVER&#8221; (# over) and &#8220;POUT&#8221; (paged out); per-process: <tt>prstat -mLc 1</tt>, &#8220;DFL&#8221;; some raw counters: <tt>kstat -p memory_cap:::anonpgin</tt></td>
</tr>
<tr>
<td>Memory cap</td>
<td>errors</td>
<td>DTrace failed malloc()s; raw counters: <tt>kstat -p memory_cap:::anon_alloc_fail</tt></td>
</tr>
<tr>
<td>FS I/O throttle</td>
<td>utilization</td>
<td><i>N/A &#8211; it kicks in only when needed (see saturation)</i></td>
</tr>
<tr>
<td>FS I/O throttle</td>
<td>saturation</td>
<td><tt>vfsstat</tt>, &#8220;d/s&#8221; (delays/sec), and magnitude of &#8220;del_t&#8221; (average delay time, us)</td>
</tr>
<tr>
<td>FS I/O throttle</td>
<td>errors</td>
<td><i>N/A</i></td>
</tr>
<tr>
<td>FS capacity</td>
<td>utilization</td>
<td><tt>df -h</tt>, &#8220;used&#8221; / &#8220;size&#8221;</td>
</tr>
<tr>
<td>FS capacity</td>
<td>saturation</td>
<td>once it&#8217;s full, ENOSPC</td>
</tr>
<tr>
<td>FS capacity</td>
<td>errors</td>
<td>DTrace errno for FS syscalls; /var/adm/messages file system full messages</td>
</tr>
<tr>
<td>Network cap</td>
<td>utilization</td>
<td><tt>dladm show-linkprop -p maxbw</tt> for max bandwidth (if set); <tt>dladm show-link -s -i 1 <i>net0</i></tt>, for current throughput; <tt>nicstat</tt> can also show throughput</td>
</tr>
<tr>
<td>Network cap</td>
<td>saturation</td>
<td><i>not available from within a zone (need to DTrace mac_bw_state &amp; SRS_BW_ENFORCED)</i></td>
</tr>
<tr>
<td>Network cap</td>
<td>errors</td>
<td><i>N/A</i></td>
</tr>
</table>
<p></p>
<ul>
<li>For the Joyent Public Cloud, the CPU cap is the <i>bursting</i> limit. You are bursting when your CPU usage is over the kstat &#8220;caps::cpucaps_zone*:baseline&#8221;. If everyone bursts at the same time, your minimum usage should be the baseline, which is provided by the Fair Share Scheduler (FSS).</li>
<li><tt>sm-cpuinfo</tt> is from the <a href="http://wiki.joyent.com/wiki/display/jpc2/SmartMachine+Tools+Package">smtools</a> package.</li>
</ul>
<p>Storage devices (disks) are not listed, since limits for storage I/O are imposed at the file system layer.</p>
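<p>As a sketch of how the raw CPU cap counters above could be turned into a utilization figure, the following Python parses <tt>kstat -p</tt> style output (one <tt>module:instance:name:statistic</tt> key and a value per line). The sample text, zone ID, and values are hypothetical, and the bursting test simply applies the rule described above (usage over baseline).</p>
<pre>
def parse_kstat(text):
    """Map the statistic name (last key field) to its value for each line."""
    stats = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue                      # skip blank or malformed lines
        key, value = fields
        stats[key.split(":")[-1]] = value.strip()
    return stats


# Hypothetical sample output; real zone IDs and values will differ.
sample = """\
caps:5:cpucaps_zone_5:baseline\t50
caps:5:cpucaps_zone_5:usage\t120
caps:5:cpucaps_zone_5:value\t400
"""

stats = parse_kstat(sample)
usage, cap, baseline = (int(stats[k]) for k in ("usage", "value", "baseline"))
utilization_pct = 100 * usage / cap
bursting = max(usage, baseline) != baseline   # True when usage is over baseline
print(f"CPU cap utilization: {utilization_pct:.0f}%, bursting: {bursting}")
</pre>
<p>The same parsing approach should apply to the <tt>memory_cap</tt> counters (&#8220;rss&#8221; vs &#8220;physcap&#8221;), since they are reported in the same format.</p>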
<h2>Physical Resources</h2>
<p>Since Zones are <i>OS-Virtualization</i> (OS partitioning), the physical resources are not emulated or virtualized, and many of the observability tools will show you the entire physical system. This can be both good &ndash; you can really understand what&#8217;s going on, and confusing &ndash; why are the resources busy when my system is idle? (it&#8217;s someone else; you can&#8217;t see their process address space).</p>
<table border=1 cellpadding=2 width="100%">
<tr>
<th>component</th>
<th>type</th>
<th>metric</th>
</tr>
<tr>
<td>CPU</td>
<td>utilization</td>
<td>per-cpu: <tt>mpstat 1</tt>, &#8220;idl&#8221;; system-wide: <tt>vmstat 1</tt>, &#8220;id&#8221;; per-process: <tt>prstat -c 1</tt> (&#8220;CPU&#8221; == recent), <tt>prstat -mLc 1</tt> (&#8220;USR&#8221; + &#8220;SYS&#8221;); per-kernel-thread: <i>not available from within a zone</i></td>
</tr>
<tr>
<td>CPU</td>
<td>saturation</td>
<td>system-wide: <tt>vmstat 1</tt>, &#8220;r&#8221;; per-process: <tt>prstat -mLc 1</tt>, &#8220;LAT&#8221;</td>
</tr>
<tr>
<td>CPU</td>
<td>errors</td>
<td><tt>fmdump</tt></td>
</tr>
<tr>
<td>Memory capacity</td>
<td>utilization</td>
<td>system-wide: <tt>vmstat 1</tt>, &#8220;free&#8221; (main memory), &#8220;swap&#8221; (virtual memory); per-process: <tt>prstat -c</tt>, &#8220;RSS&#8221; (main memory), &#8220;SIZE&#8221; (virtual memory)</td>
</tr>
<tr>
<td>Memory capacity</td>
<td>saturation</td>
<td>system-wide: <tt>vmstat 1</tt>, &#8220;sr&#8221; (bad now), &#8220;w&#8221; (was very bad); <tt>vmstat -p 1</tt>, &#8220;api&#8221; (anon page ins == pain), &#8220;apo&#8221;; per-process: <tt>prstat -mLc 1</tt>, &#8220;DFL&#8221;</td>
</tr>
<tr>
<td>Memory capacity</td>
<td>errors</td>
<td><tt>fmdump</tt>; DTrace failed malloc()s</td>
</tr>
<tr>
<td>Network Interfaces</td>
<td>utilization</td>
<td><tt>nicstat</tt> (see notes below); <tt>kstat</tt> (look for physical interface kstats, eg, <tt>kstat -p | grep ifspeed</tt> to find their names, and then <tt>kstat -p ixgbe::mac:</tt> for ixgbe interfaces)</td>
</tr>
<tr>
<td>Network Interfaces</td>
<td>saturation</td>
<td><tt>nicstat</tt>; <tt>kstat</tt> for whatever custom statistics are available (eg, "nocanputs", "defer", "norcvbuf", "noxmtbuf"); <tt>netstat -s</tt>, retransmits</td>
</tr>
<tr>
<td>Network Interfaces</td>
<td>errors</td>
<td><tt>netstat -i</tt>, error counters; <tt>kstat</tt> for extended errors, look in the interface and "link" statistics (there are often custom counters for the card); <i>driver internals not available from within a zone</i></td>
</tr>
<tr>
<td>Storage device I/O</td>
<td>utilization</td>
<td>system-wide: <tt>iostat -xnz 1</tt>, "%b"</td>
</tr>
<tr>
<td>Storage device I/O</td>
<td>saturation</td>
<td><tt>iostat -xnz 1</tt>, "wait"</td>
</tr>
<tr>
<td>Storage device I/O</td>
<td>errors</td>
<td><tt>iostat -En</tt>; <i>driver internals not available from within a zone</i></td>
</tr>
<tr>
<td>Storage capacity</td>
<td>utilization</td>
<td>swap: <tt>swap -s</tt>; file systems: <tt>df -h</tt></td>
</tr>
<tr>
<td>Storage capacity</td>
<td>saturation</td>
<td>once it's full, ENOSPC</td>
</tr>
<tr>
<td>Storage capacity</td>
<td>errors</td>
<td>DTrace errno on FS syscalls; /var/adm/messages file system full messages</td>
</tr>
<tr>
<td>Storage controller</td>
<td>utilization</td>
<td><tt>iostat -Cxnz 1</tt>, compare to known IOPS/tput limits per-card</td>
</tr>
<tr>
<td>Storage controller</td>
<td>saturation</td>
<td>look for kernel queueing: sd (iostat "wait" again)</td>
</tr>
<tr>
<td>Storage controller</td>
<td>errors</td>
<td>/var/adm/messages; <i>driver internals not available from within a zone</i></td>
</tr>
<tr>
<td>Network controller</td>
<td>utilization</td>
<td>infer from <tt>kstat</tt> or <tt>nicstat</tt> and known controller max tput</td>
</tr>
<tr>
<td>Network controller</td>
<td>saturation</td>
<td>see network interface saturation</td>
</tr>
<tr>
<td>Network controller</td>
<td>errors</td>
<td><tt>kstat</tt> for whatever is there; <i>driver internals not available from within a zone</i></td>
</tr>
<tr>
<td>CPU interconnect</td>
<td>utilization</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>CPU interconnect</td>
<td>saturation</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>CPU interconnect</td>
<td>errors</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>Memory interconnect</td>
<td>utilization</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>Memory interconnect</td>
<td>saturation</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>Memory interconnect</td>
<td>errors</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>I/O interconnect</td>
<td>utilization</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>I/O interconnect</td>
<td>saturation</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>I/O interconnect</td>
<td>errors</td>
<td><i>not available from within a zone</i></td>
</tr>
</table>
<p></p>
<ul>
<li>For Joyent SmartMachines, <a href="http://dtrace.org/blogs/dap/2011/03/01/welcome-to-cloud-analytics/">Cloud Analytics</a> (both the GUI and API) provide additional details from the physical system (global zone) that are not directly visible from within the SmartMachine.</li>
<li>CPU utilization: a single hot CPU can be caused by a single hot thread, or mapped hardware interrupt.  Relief of the bottleneck usually involves tuning to use more CPUs in parallel.</li>
<li>nicstat may already be available in the SmartMachine; if not, there are both <a href="https://blogs.oracle.com/timc/entry/nicstat_the_solaris_and_linux">C</a> and <a href="http://www.brendangregg.com/K9Toolkit/nicstat">Perl</a> versions (either may need a little tweaking to work properly).</li>
<li>vmstat "r": this is coarse as it is only updated once per second.</li>
<li>Memory capacity utilization: interpreting vmstat's "free" has been tricky across different Solaris versions (we documented it in the Perf &amp; Tools book), due to different ways it was calculated, and tunables that affect when the system will kick off the page scanner.  It'll also typically shrink as the kernel uses unused memory for caching (ZFS ARC).</li>
<li>Be aware that kstat can report bad data (so can any tool); there isn't really a test suite for kstat data, and engineers can add new code paths and forget to add the counters.</li>
</ul>
<h2>Software Resources</h2>
<table border=1 width="100%">
<tr>
<th>component</th>
<th>type</th>
<th>metric</th>
</tr>
<tr>
<td>Kernel mutex</td>
<td>utilization</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>Kernel mutex</td>
<td>saturation</td>
<td><tt>mpstat</tt> "smtx"</td>
</tr>
<tr>
<td>Kernel mutex</td>
<td>errors</td>
<td><i>not available from within a zone</i></td>
</tr>
<tr>
<td>User mutex</td>
<td>utilization</td>
<td><tt>plockstat -H</tt> (held time); DTrace plockstat provider</td>
</tr>
<tr>
<td>User mutex</td>
<td>saturation</td>
<td><tt>plockstat -C</tt> (contention); <tt>prstat -mLc 1</tt>, "LCK"; DTrace plockstat provider</td>
</tr>
<tr>
<td>User mutex</td>
<td>errors</td>
<td>DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C)</td>
</tr>
<tr>
<td>Process capacity</td>
<td>utilization</td>
<td><tt>kstat</tt>, "unix:0:var:v_proc" for system-wide max, system-wide current usage isn't available in a zone, but "unix:0:process_cache:slab_alloc" gives a rough idea; zone: "unix:0:system_misc:nproc" for current zone usage; <tt>prctl -n zone.max-processes -i zone <i>ZONE</i></tt>, "privileged/system" for zone max, and "usage" for current usage.</td>
</tr>
<tr>
<td>Process capacity</td>
<td>saturation</td>
<td>queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full.</td>
</tr>
<tr>
<td>Process capacity</td>
<td>errors</td>
<td>"can't fork()" messages</td>
</tr>
<tr>
<td>Thread capacity</td>
<td>utilization</td>
<td>user-level: <tt>prctl -n zone.max-lwps -i zone <i>ZONE</i></tt>, "privileged/system" for zone max, and "usage" for current zone usage; kernel: limited by system memory - see memory usage.</td>
</tr>
<tr>
<td>Thread capacity</td>
<td>saturation</td>
<td>threads blocking on memory allocation - see memory cap usage.</td>
</tr>
<tr>
<td>Thread capacity</td>
<td>errors</td>
<td>user-level: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: <i>not available from within a zone</i></td>
</tr>
<tr>
<td>File descriptors</td>
<td>utilization</td>
<td>system-wide (no limit other than RAM); per-process: <tt>pfiles</tt> vs <tt>ulimit</tt> or <tt>prctl -t basic -n process.max-file-descriptor PID</tt>; a quicker check than pfiles is <tt>ls /proc/PID/fd | wc -l</tt></td>
</tr>
<tr>
<td>File descriptors</td>
<td>saturation</td>
<td>I don't think there is any queueing or blocking, other than on memory allocation.</td>
</tr>
<tr>
<td>File descriptors</td>
<td>errors</td>
<td><tt>truss</tt> or DTrace (better) to look for errno == EMFILE on syscalls returning fds (e.g., open(), accept(), ...).</td>
</tr>
</table>
<ul>
<li>plockstat often drops events due to load; I often roll my own to avoid this using the DTrace plockstat provider (examples in the <a href="http://www.dtracebook.com">DTrace book</a>).</li>
<li>File descriptor utilization: while other OSes have a system-wide limit, Solaris doesn't (at least at the moment, this could change; see my <a href="http://dtrace.org/blogs/brendan/files/2012/03/solaris_fd_limit.txt">writeup</a> about it).</li>
</ul>
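<p>The per-process file descriptor check in the table can be scripted. The following is a minimal sketch, not a polished tool: <tt>fd_count</tt> and <tt>fd_util_pct</tt> are hypothetical helper names, and the limit is passed in by hand (taken from <tt>ulimit -n</tt> in the target's environment, or read off the <tt>prctl</tt> output), since parsing <tt>prctl</tt> output is left out.</p>

```shell
# Count the open file descriptors of a process via /proc (PID as $1).
# tr strips the padding that some wc implementations add.
fd_count() {
    ls "/proc/$1/fd" | wc -l | tr -d ' '
}

# Integer percentage of a limit in use: fd_util_pct COUNT LIMIT
fd_util_pct() {
    echo $(( $1 * 100 / $2 ))
}
```

<p>For example, <tt>fd_util_pct "$(fd_count 1234)" 256</tt> prints the percentage of a 256-descriptor limit in use by the (hypothetical) PID 1234.</p>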
<h2>What's Next</h2>
<p>See <a href="http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/">the USE Method</a> for the follow-up strategies after identifying a possible bottleneck.  If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.</p>
<p>Also see the <a href="http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist/">Solaris Performance Checklist</a> if you have access to the physical host (global zone).</p>
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2012/12/19/the-use-method-smartos-performance-checklist/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>USENIX LISA 2012: Performance Analysis Methodology</title>
		<link>http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/</link>
		<comments>http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/#comments</comments>
		<pubDate>Thu, 13 Dec 2012 22:51:19 +0000</pubDate>
		<dc:creator>Brendan Gregg</dc:creator>
				<category><![CDATA[methodology]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[slides]]></category>
		<category><![CDATA[talk]]></category>
		<category><![CDATA[usemethod]]></category>

		<guid isPermaLink="false">http://dtrace.org/blogs/brendan/?p=3255</guid>
		<description><![CDATA[At USENIX LISA 2012, I gave a talk titled Performance Analysis Methodology. This covered ten performance analysis anti-methodologies and methodologies, including the USE Method. I wrote about these in the ACMQ article Thinking Methodically about Performance, which is worth reading for more detail. I&#8217;ve also posted USE Method-derived checklists for Solaris- and Linux-based systems. The [...]]]></description>
			<content:encoded><![CDATA[<p>At <a href="https://www.usenix.org/conference/lisa12">USENIX LISA</a> 2012, I gave a talk titled <a href="https://www.usenix.org/conference/lisa12/tech-schedule/technical-sessions">Performance Analysis Methodology</a>. This covered ten performance analysis anti-methodologies and methodologies, including the USE Method. I wrote about these in the ACMQ article <a href="http://queue.acm.org/detail.cfm?id=2413037">Thinking Methodically about Performance</a>, which is worth reading for more detail. I&#8217;ve also posted USE Method-derived checklists for <a href="http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist/">Solaris</a>- and <a href="http://dtrace.org/blogs/brendan/2012/03/07/the-use-method-linux-performance-checklist/">Linux</a>-based systems.</p>
<p>The <a href="https://www.usenix.org/conference/lisa12/performance-analysis-methodology">video of the talk</a> is on the LISA site, and the slides are below, also available as a <a href="http://www.brendangregg.com/Slides/LISA2012_methodologies.pdf">PDF</a>.</p>
<p><center><iframe src="http://www.slideshare.net/slideshow/embed_code/15627732" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe></center></p>
<p>I&#8217;ve summarized the methodologies in the talk below.</p>
<h2>Methodology Summaries</h2>
<p>Blame-Someone-Else Anti-Method:</p>
<ol>
<li>Find a system or environment component you are not responsible for</li>
<li>Hypothesize that the issue is with that component</li>
<li>Redirect the issue to the responsible team</li>
<li>When proven wrong, go to 1</li>
</ol>
<p>Streetlight Anti-Method:</p>
<ol>
<li>Pick observability tools that are:
<ul>
<li>familiar</li>
<li>found on the Internet</li>
<li>found at random</li>
</ul>
</li>
<li>Run tools</li>
<li>Look for obvious issues</li>
</ol>
<p>Ad Hoc Checklist Method:</p>
<ol>
<li>..N. Run A, if B, do C</li>
</ol>
<p>Problem Statement Method:</p>
<ol>
<li>What makes you think there is a performance problem?</li>
<li>Has this system ever performed well?</li>
<li>What has changed recently? (Software? Hardware? Load?)</li>
<li>Can the performance degradation be expressed in terms of latency or run time?</li>
<li>Does the problem affect other people or applications (or is it just you)?</li>
<li>What is the environment? What software and hardware is used? Versions? Configuration?</li>
</ol>
<p>Scientific Method:</p>
<ol>
<li>Question</li>
<li>Hypothesis</li>
<li>Prediction</li>
<li>Test</li>
<li>Analysis</li>
</ol>
<p>Workload Characterization Method:</p>
<ol>
<li>Who is causing the load? PID, UID, IP addr, &#8230;</li>
<li>Why is the load called? code path</li>
<li>What is the load? IOPS, tput, type</li>
<li>How is the load changing over time?</li>
</ol>
<p>Drill-Down Analysis Method:</p>
<ol>
<li>Start at highest level</li>
<li>Examine next-level details</li>
<li>Pick most interesting breakdown</li>
<li>If problem unsolved, go to 2</li>
</ol>
<p>Latency Analysis Method:</p>
<ol>
<li>Measure operation time (latency)</li>
<li>Divide into logical synchronous components</li>
<li>Continue division until latency origin is identified</li>
<li>Quantify: estimate speedup if problem fixed</li>
</ol>
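<p>As a worked example of the final "quantify" step: if one component accounts for part of an operation's latency, removing it entirely bounds the achievable speedup at total / (total - component). A minimal sketch, where <tt>est_speedup</tt> is a hypothetical helper and the numbers are illustrative rather than measured:</p>

```shell
# est_speedup TOTAL COMPONENT
# Upper bound on speedup if COMPONENT's latency were eliminated from TOTAL.
# Both values must use the same unit (e.g., ms), with COMPONENT smaller than TOTAL.
est_speedup() {
    awk -v t="$1" -v c="$2" 'BEGIN { printf "%.2f\n", t / (t - c) }'
}

# Illustrative numbers: a 100 ms operation with 60 ms spent in one component.
est_speedup 100 60    # prints 2.50
```

<p>This is only an upper bound: it assumes the component's time disappears completely and that nothing else slows down as a result.</p>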
<p>USE Method:</p>
<p>For every resource, check:</p>
<ol>
<li>Utilization</li>
<li>Saturation</li>
<li>Errors</li>
</ol>
<p>Stack Profile Method:</p>
<ol>
<li>Profile thread stack traces (on- and off-CPU)</li>
<li>Coalesce</li>
<li>Study stacks bottom-up</li>
</ol>
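<p>The coalesce step can be as simple as counting identical stacks. A minimal sketch, assuming each sampled stack has already been collapsed to one line of semicolon-separated frames with no spaces (that input format and the function names in the sample are assumptions for illustration, not output from a real profiler); <tt>coalesce_stacks</tt> is a hypothetical helper:</p>

```shell
# Read one stack per line on stdin; print "COUNT STACK", most frequent first.
# Assumes stacks contain no whitespace, so awk sees exactly two fields.
coalesce_stacks() {
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Made-up sample input: three samples, two of them identical.
printf '%s\n' 'main;read_file;read' 'main;parse' 'main;read_file;read' |
    coalesce_stacks
```

<p>With the sample above, the stack seen twice is listed first with a count of 2, which is where to start studying.</p>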
]]></content:encoded>
			<wfw:commentRss>http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
