The Total Cost of Unmasked Data

Data breaches make headlines at a regular cadence. Each is a surprise, but they are not, as a whole, surprising. While the extensive and sophisticated Target breach stuck in the headlines, a significant breach at three South Korean credit card companies happened around the same time. The theft of personal information for 20m subscribers didn’t have near the level of sophistication. Developers and contractors were simply given copies of production databases filled with personal information that they shouldn’t have been able to access.

When talking to Delphix customers and prospects, those that handle personal or sensitive information (typically financial services or heath care) inevitably ask how Delphix can help with masking. Turning the question around, asking how they mask data today sucks the air out of the room. Some deflect, talking about relevant requirements and regulations; others, pontificate obliquely about solutions they’ve bought; no one unabashedly claims to be fully implemented and fully compliant.

Data masking is hard to deploy consistently. I hear it from (honest) customers, and from data masking vendors. The striking attribute of the South Korean breach was that the Economist and other non-technical news sources called out unmasked data as the root cause:

“In 2012 a law was passed requiring the encryption of most companies’ databases, yet the filched data were not encoded. The contractor should never have been given access to customer records, he says; dummy data would have sufficed.”

These were non-production database copies, used for development and testing. There was no need for employees or contractors to interact with sensitive data. Indeed, those companies have a legal obligation not to keep production data in their development environments. All three credit card companies, and the credit bureau are customers of vendors that provide masking solutions. The contractor who loaded data for 20m individuals onto a USB stick didn’t need the real data, and should never been granted access. As with the customers I talk to, data masking surely proved too difficult to roll out in a manner that was secure and didn’t slow development, so it was relegated to shelfware.

Delphix fully automates the creation of non-production environments. It integrates with masking tools from Axis, Informatica, IBM, and others to ensure that every one of those environments is masked as a matter of mechanism rather than a manual process. What is the cost of unimplemented data masking? Obviously there are the fines and negative press, the lawsuits, and the endless mea culpas. At these credit card companies though literally dozens of executives resigned for failing to secure data, from all three CEOs on down. And in all likelihood, they had data masking solutions on the shelf, cast aside as too hard to implement.

Posted on February 12, 2014 at 10:01 am by ahl · Permalink · Comments Closed
In: Delphix · Tagged with: 

The OpenZFS write throttle

In my last blog post, I wrote about the ZFS write throttle, and how we saw it lead to pathological latency variability on customer systems. Matt Ahrens, the co-founder of ZFS, and I set about to fix it in OpenZFS. While the solution we came to may seem obvious, we arrived at it only through a bit of wandering in a wide open solution space.

The ZFS write throttle was fundamentally flawed — the data indelibly supported this diagnosis. The cure was far less clear. The core problem involved the actual throttling mechanism, allowing many fast writes while stalling some writes nearly without bound, with some artificially delayed writes ostensibly to soften the landing. Further, the mechanism relied on an accurate calculation of the backend throughput — a problem in itself, but one we’ll set aside for the moment.

On a frictionless surface in a vacuum…

Even in the most rigorously contrived, predictable cases, the old write throttle would yield high variance in the latency of writes. Consider a backend that can handle an unwavering 100MB/s (or 1GB/s or 10GB/s — pick your number). For a client with 10 threads executing 8KB async writes (again to keep it simple) to hit 100MB/s, the average latency would be around 780µs — not unreasonable.

Here’s how that scenario would play out with the old write throttle assuming full quiesced and syncing transaction groups (you may want to refer to my last blog post for a refresher on the mechanism and some of the numbers). With a target of 5 seconds to write out its contents, the currently open transaction group would be limited to 500MB. Recall that after 7/8ths of the limit is consumed, the old write throttle starts inserting a 10ms delay, so the first 437.5MB would come sailing in, say, with an average latency of 780µs, but then the remaining writes would average at least 10ms (scheduling delay could drive this even higher). With this artificially steady rate, the delay would occur 7/8ths of the way into our 5 second window, with 1/8th of the total remaining. So with 5/8ths of a second left, and an average latency of 10ms, the client would be able to write only and additional 500KB worth of data. More simply: data would flow at 100MB/s most of the time, and at less than 1MB/s the rest.

In this example the system inserted far too much delay — indeed, no delay was needed. In another scenario it could just have easily inserted too little.

Consider a case where we would require writers to be throttled. This time, let’s say the client has 1000 threads, and — since it’s now relevant — let’s say we’re limited to the optimistic 10GbE speed of 1GB/s. In this case the client would hit the 7/8ths in less than a second. 1000 threads writing 8KB every 10ms still push data at 800MB/s so we’d hit the hard limit just a fraction of a second later. With the quota exhausted, all 1000 threads would then block for about 4 seconds. A backend that can do 100MB/s x 5 seconds = 500MB = 64,000 x 8KB; the latency of those 64,000 writes breaks down like this: 55000 super fast, 8000 at 10ms, and 1000 at 4 seconds. Note that the throughput would be only slightly higher than in the previous example; the average latency would be approximately 1000 times higher which is optimal and expected.

In this example we delayed way too little, and paid the price with enormous 4 second outliers.

How to throttle

Consistency is more important than the average. The VP of Systems at a major retailer recently told me that he’d take almost always take a higher average for lower variance. Our goal for OpenZFS was to have consistent latency without lowering the average (if we could improve the average, hey so much the better). Given the total amount of work, there is a certain amount of delay we’d need to insert. The ZFS write throttle does so unequally. Our job was to delay all writes a little bit rather than some a lot.

One of our first ideas was to delay according to measured throughput. As with the example above, let’s say that the measured throughput of the backend was 100MB/s. If the transaction group had been open for 500ms, and we had accumulated 55MB so far, the next write would be delayed for 50ms, enough time to reduce the average to 100MB/s.

Think of it like a diagonal line on a graph from 0MB at time zero to the maximum size (say, 500MB) at the end of the transaction group (say, 5s). As the accumulated data pokes above that line, subsequent writes would be delayed accordingly. If we hit the data limit per transaction group then writes would be delayed as before, but it should be less likely as long as we’ve measured the backend throughput accurately.

There were two problems with this solution. First, calculating the backend throughput isn’t possible to do accurately. Performance can fluctuate significantly due to the location of writes, intervening synchronous activity (e.g. reads), or even other workloads on a multitenant storage array. But even if we could calculate it correctly, ZFS can’t spend all its time writing user data; some time must be devoted to writing metadata and doing other housekeeping.

Size doesn’t matter

Erasing the whiteboard, we added one constraint and loosened another: don’t rely an estimation of backend throughput, and don’t worry too much about transaction group duration.

Rather than capping transaction groups to a particular size, we would limit the amount of system memory that could be dirty (modified) at any given time. As memory filled past a certain point we would start to delay writes proportionally.

OpenZFS didn’t have a mechanism to track the outstanding dirty data. Adding it was non-trivial as it required communication across the logical (DMU) and physical (SPA) boundaries to smoothly retire dirty data as physical IOs completed. Logical operations given data redundancy (mirrors, RAID-Z, and ditto blocks) have multiple associated physical IOs. Waiting for all of them to complete would lead to lurches in the measure of outstanding dirty data. Instead, we retire a fraction of the logical size each time a physical IO completes.

By using this same metric of outstanding dirty data, we observed that we could address a seemingly unrelated, but chronic problem observed in ZFS — so called “picket-fencing”, the extreme burstiness of writes that ZFS issues to its disks. ZFS has a fixed number of concurrent outstanding IOs it issues to a device. Instead the new IO scheduler would issues a variable number of writes proportional to the amount of dirty data. With data coming in at a trickle, OpenZFS would trickle data to the backend, issuing 1 IO at a time. As incoming data rate increased, the IO scheduler would work harder, scheduling more concurrent writes in order to keep up (up to a fixed limit). As noted above, if OpenZFS couldn’t keep up with the rate of incoming data, it would insert delays also proportional to the amount of outstanding dirty data.


The goal was improved consistency with no increase in the average latency. The results of our tests speak for themselves (log-log scale).

Note the single-moded distribution of OpenZFS compared with the highly varied results from ZFS. You can see by the dashed lines that we managed to slightly improve the average latency (1.04ms v. 1.27ms).

OpenZFS now represents a significant improvement over ZFS with regard to consistency both of client write latency and of backend write operations. In addition, the new IO scheduler improves upon ZFS when it comes to tuning. The mysterious magic numbers and inscrutable tuneables of the old write throttle have been replaced with knobs that are comprehensible, and can be connected more directly with observed behavior. In the final post in this series, I’ll look at how to tune the OpenZFS write throttle.

Posted on February 10, 2014 at 3:55 am by ahl · Permalink · 8 Comments
In: ZFS · Tagged with: , ,

ZFS fundamentals: the write throttle

It’s no small feat to build a stable, modern filesystem. The more I work with ZFS, the more impressed I am with how much it got right, and how malleable it’s proved. It has evolved to fix shortcomings and accommodate underlying technological shifts. It’s not surprising though that even while its underpinnings have withstood the test of production use, ZFS occasionally still shows the immaturity of the tween that it is.

Even before the ZFS storage appliance launched in 2008, ZFS was heavily used and discussed Solaris and OpenSolaris communities, the frequent subject of praise and criticism. A common grievance was that write-heavy workloads would consume massive amounts of system memory… and then render the system unusable as ZFS dutifully deposited the new data onto the often anemic storage (often a single spindle for OpenSolaris users).

For workloads whose ability to generate new data far outstripped the throughput of persistent storage, it became clear that ZFS needed to impose some limits. ZFS should have effective limits on the amount of system memory devoted to “dirty” (modified) data. Transaction groups should be bounded to prevent high latency IO and administrative operations. At a high level, ZFS transaction groups are just collections of writes (transactions), and there can be three transaction groups active at any given time; for a more thorough treatment, check out last year’s installment of ZFS knowledge.

Write Throttle 1.0 (2008)

The proposed solution appealed to an intuitive understanding of the system. At the highest level, don’t let transaction groups grow indefinitely. When a transaction reached a prescribed size, ZFS would create a new transaction group; if three already existed, it would block waiting for the syncing transaction group to complete. Limiting the size of each transaction group yielded a number of benefits. ZFS would no longer consume vast amounts of system memory (quelling outcry from the user community). Administrative actions that execute at transaction group boundaries would be more responsive. And synchronous, latency-sensitive operations wouldn’t have to contend with a deluge of writes from the syncing transaction group.

So how big should transaction groups be? The solution included a target duration for writing out a transaction group (5 seconds). The size of each transaction group would be based on that time target and an inferred write bandwidth. Duration times bandwidth equals target size. The inferred bandwidth would be recomputed after each transaction group.

When the size limit for a transaction group was reached, new writes would wait for the next transaction group to open. This could be nearly instantaneous if there weren’t already three transaction groups active, or it could incur a significant delay. To ameliorate this, the write throttle would insert a 10ms delay for all new writes once 7/8th of the size had been consumed.

See the gory details in the git commit.


That initial write throttle made a comprehensible, earnest effort to address some critical problems in ZFS. And, to a degree, it succeeded. Though the lack of rigorous ZFS performance testing at that time is reflected in the glaring deficiencies with that initial write throttle. A simple logic bug lingered for other two months, causing all writes to be delayed by 10ms, not just those executed after the transaction group had reached 7/8ths of its target capacity — trivial, yes, but debilitating and telling. The computation of the write throttle resulted in values that varied rapidly; eventually a slapdash effort at hysteresis was added.

Stepping back, the magic constants arouse concern. Why should transaction groups last 5 seconds? Yes, they should be large enough to amortize metadata updates within a transaction group, and they should not be so large that they cause administrative unresponsiveness. For the ZFS storage appliance we experimented with lower values in an effort to smooth out the periodic bursts of writes — an effect we refer to as “picket-fencing” for its appearance in our IO visualization interface. Even more glaring, where did the 7/8ths cutoff come from or the 10ms worth of delay? Even if the computed throughput was dead accurate, the algorithm would lead to ZFS unnecessarily delaying writes. At first blush, this scheme was not fatally flawed, but surely arbitrary, disconnected from real results, and nearly impossible to reason about on a complex system.


The write throttle demonstrated problems more severe than the widely observed picket-fencing. While ZFS attempted to build a stable estimate of write throughput capacity, the computed number would, in practice, swing wildly. As a result, ZFS would variously over-throttle and under-throttle. It would often insert the 10ms delay, but that delay was intended merely as a softer landing than the hard limit. Once reached, the hard limit — still the primary throttling mechanism — could impose delays well in excess of a second.

The graph below shows the frequency (count) and total contribution (time) for power-of-two IO latencies from a production system.

The latency frequencies clearly show a tri-modal distribution: writes that happen at the speed of software (much less than 1ms), writes that are delayed by the write throttle (tens of milliseconds), and writes that bump up against the transaction group size (hundred of milliseconds up to multiple seconds).

The total accumulated time for each latency bucket highlights the dramatic impact of outliers. The 110 operations taking a second or longer contribute more to the overall elapsed time than the time of the remaining 16,000+ operations.

A new focus

The first attempt at the write throttle addressed a critical need, but was guided by the need to patch a hole rather than an understanding of the fundamental problem. The rate at which ZFS can move data to persistent storage will vary for a variety of reasons: synchronous operations will consume bandwidth; not all writes impact storage in the same way — scattered writes to areas of high fragmentation may be slower than sequential writes. Regardless of the real, instantaneous throughput capacity, ZFS needs to pass on the effective cost — as measured in write latency — to the client. Write throttle 1.0 carved this cost into three tranches: writes early in a transaction group that pay nothing, those late in a transaction group that pay 10ms each, and those at the end that pick up the remainder of the bill.

If the rate of incoming data was less than the throughput capacity of persistent storage the client should be charged nothing — no delay should be inserted. The write throttle failed by that standard as well, delaying 10ms in situations that warranted no surcharge.

Ideally ZFS should throttle writes in a way that optimizes for minimized and consistent latency. As we developed a new write throttle, our objectives were low variance for write latency, and steady and consistent (rather than bursty) writes to persistent storage. In my next post I’ll describe the solution that Matt Ahrens and I designed for OpenZFS.

Posted on December 27, 2013 at 12:40 am by ahl · Permalink · Comments Closed
In: ZFS · Tagged with: , , , , ,

OpenZFS: the next phase of ZFS development

I’ve been watching ZFS from moments after its inception at the hands of Matt Ahrens and Jeff Bonwick, so I’m excited to see it enter its newest phase of development in OpenZFS. While ZFS has long been regarded as the hottest filesystem on 128 bits, and has shipped in many different products, what’s been most impressive to me about ZFS development has been the constant iteration and reinvention.

Before shipping in Solaris 10 update 2, major components of ZFS had already advanced to “2.0″ and “3.0″. I’ve been involved with several ZFS-related products: Solaris 10, the ZFS Storage Appliance (nee Sun Storage 7000), and the Delphix Engine. Each new product and each new use has stressed ZFS in new ways, but also brought renewed focus to development. I’ve come to realize that ZFS will never be completed. I thought I’d use this post to cover the many ways that ZFS had failed in the products I’ve worked on over the years — and it has failed spectacularly at time — but this distracted from the most important aspect of ZFS. For each new failure in each new product with each new use and each new workload ZFS has adapted and improved.

OpenZFS doesn’t need a caretaker community for a finished project; if that were the case, porting OpenZFS to Linux, FreeBSD, and Mac OS X would have been the end. Instead, it was the beginning. The need for the OpenZFS community grew out of the porting efforts who wanted the world’s most advanced filesystem on their platforms and in their products. I wouldn’t trust my customers’ data to a filesystem that hadn’t been through those trials and triumphs over more than a decade. I can’t wait to see the next phase of evolution that OpenZFS brings.


If you’re at LinuxCon today, stop by the talk by Matt Ahrens and Brian Behlendor for more on OpenZFS; follow @OpenZFS for all OpenZFS news.

Posted on September 17, 2013 at 2:00 am by ahl · Permalink · 4 Comments
In: ZFS · Tagged with: , , , , , ,

Delphix plus three years

Today marks my third anniversary of joining Delphix. Joining a startup, I knew there would be lots to learn — indeed there’s been a lesson nearly once-a-day. Here are my top three lessons from my first three years at a startup. Even if the points themselves should have been obvious to me, the degree of their impact certainly wasn’t.

3. Tar-Babies are everywhere

Generalists thrive at a startup — there are many more tasks and problems than there are people. The things you touch inevitably stick to you, for better or for worse. Early on I was the DTrace guy, and the OS guy, and the debugging guy. Later on I became the performance go-to, and upgrade guy, and (proudly) the git neck beard. But I was also lab manager, and the cabler (running cat 6 when we moved offices, and frenetically stripping wires with my teeth as we discovered a second wiring closet), and the real estate guy. When I got approval to open a San Francisco office I asked about the next steps — “figure it out”. And so it goes for pretty much everything that happens at a startup. At big companies roles are subdivided and specialists own their small domains. During my time at Sun I didn’t think about many of the things those people did: they seemingly just happened. Being at a startup makes you intimately aware of all the excruciating minutiae that make a company go.

The more you do the more you own. The answer is not to avoid doing work that needs to be done, but to actively and aggressively recruit people to take over tasks. The stickiness was surprising and the need to offload can be uncomfortable. But you’re asking people to take on tasks — important, trivial, or unglamorous — so you can take on some additional load.

Ownership is important, but you need to share the load or else these tar babies will drag you down.

2. Hiring über alles

It’s not surprising that it’s all about the people. The right people and the right culture are the hardest problems to solve. It was surprising how much work it is to recruit and hire the best folks. We have a great team at Delphix and I love working with them. The big lesson for me was that hiring is far and away the highest leverage thing you can do for your startup. Hiring great people, great engineers, is hard and time consuming. I can’t count the number of coffees, and beers I’ve had for Delphix — “first dates” with prospective hires.

College recruiting had been an important focus for me during my years at Sun. I had only been at Delphix a few weeks when I convinced our CEO that we should recruit from colleges, and left to interview at my alma mater, Brown University. Delphix had been more conservative on hiring; some people regarded college recruiting as a bit of a flier, but it has paid off in a huge way. Our first college hire, two years in, is now a clear engineering leader. We’ve expanded the program from just me going to just one school to a dozen engineers going to ten this fall. About a quarter of our development team joined Delphix straight out of college. The initial effort is high, and you then have to wait 6-9 months for them to show up. But done right it can be the most efficient way to hire great engineers.

Work your ass off to hire great people; they’ll repay you for the time you’ve spent. If you don’t feel like hiring is a major time suck you’re probably doing it wrong.

1. Everything is your fault

The greatest blessing for a startup is also the greatest curse: having customers. Once people are paying for and using your product they’ll find the problems with it and you’ll work all hours to fix them. The surprise was how many problems became Delphix problems. Our product has its tendrils throughout the datacenter. We touch databases, operating systems, systems, hypervisors, networks, SANs and storage. Any butterfly flapping its wings in the datacenter can result in a hurricane on Delphix. As both the new component and the new vendor, we now expect and encourage our customers to talk to us early when diagnosing a problem. Our excellent support team (see hiring above) have diagnosed problems as diverse as poor network performance from a damaged cat 6 cable to over-provisioned ESX servers and misconfigured init.ora files. Obviously they’ve also addressed problems in our own software. But we always take an expansive view of the Delphix solution and work with the customer to chase the problem wherever it leads us.

This realization has also informed the way we build our software. We not only build our software to be resilient to abnormalities and to detect and report problems, but we also use tools to find problems early. A couple of years ago customers might connect Delphix to poor-performing storage — but that would just look like Delphix performing poorly. Now we run a series of storage benchmarks during every installation and report a grade. We build paranoia into our software and into our sales and support processes.

As a startup it’s even more crucial to insulate ourselves against environmental problems, build facilities to detect problems everywhere, and own the total solution with customers.

Starting year four

I hope others can benefit from those lessons; it took me a while to fully realize them. I’m sure there will be many more as I start year four at Delphix. Leave a comment and tell me about your own startup successes, failures, and lessons learned.

Posted on September 13, 2013 at 3:47 pm by ahl · Permalink · 2 Comments
In: Delphix

Topics in post-mortem debugging

A couple of weeks ago, Joyent hosted A Midsummer Night’s Systems meetup, a fun event with talks ranging from Node.js fatwas to big data for Mario Kart 64. My colleague Jeremy Jones had recently done some amazing work, perfect for the meetup, but with his first child less than a day old, Jeremy allowed me to present in his stead. In this short video (16 minutes) I talk about Jeremy’s investigation of a nasty bug that necessitated the creation of two awesome post-mortem tools. The first is what I call jdump, a Volatility plugin that takes a VMware snapshot and produces an illumos kernel crash dump. The second is ::gcore, an mdb command that can extract a fully functioning core file from a kernel crash dump. Together, they let us at Delphix scoop up all the state we’d need for a diagnosis with minimal interruption even when there’s no hard failure. Jeremy’s tools are close to magic, and without them the problem was close to undebuggable.

Thanks, Jeremy for letting me present on your great work. And thanks to Deirdre Straughan and Joyent for the great event!

Posted on August 22, 2013 at 5:00 am by ahl · Permalink · Comments Closed
In: illumos

Delphix and Flash

I started working with flash in 2006 — fortunate timing as flash was just starting to take hold in the enterprise. I started asking customers I’d visit about flash. I’ll always remember the response from an early adopter when I asked about how he planned on using the new, expensive storage, “We just bought it, and we have no idea.” It was a solution in search of a problem — the garbage can model at play.

Flash has evolved significantly since then from a raw material used on its own to a component in systems of increasing complexity. I wrote recently about the various techniques being employed to get the most out of flash; all share the basic idea of trading compute and IOPS (abundant commodities) for capacity (still more expensive for flash than hard drives). The ideal use cases are the ones that benefit most from that trade-off, ones where compression and dedup consume cheap compute cycles rather than expensive space on the NAND flash. Flash storage is best with data that contains high degrees of redundancy that clever software can squeeze out. With those loose criteria, it’s been amazing to me how flash storage vendors have flocked to the VDI use case. It’s certainly well-suited — big on IOPS with nearly identical data from different Windows installs that’s easily compressed and deduped — but seemingly every flash vendor has decided that it’s one part — if not the part — of the market they want to address. Take a look at the information on VDI from various flash storage vendors: Fusion, Nimble, Pure Storage, Tegile, Tintri, Violin, Virident, Whiptailthe list goes on and on.

I worked extensively with flash until leaving Oracle in 2010 when I decided to leave for a start up. I ended up not sticking with flash precisely because it was — and is — such a crowded space. I’d happily bet on the space, but it was harder to pick one winner. One of the things that drew me to Delphix though was precisely its compatibility with flash. At Delphix we create virtual database copies by sharing blocks; think of it as dedup before the fact, or dedup but without the runtime tax. Creating a virtual copy happens almost instantaneously saving tremendous amounts of administration time, unblocking developers, and accelerating projects — hence our credo of agile data. Unlike storage-based snapshots, Delphix virtual copies are database aware, provisioning is fully integrated and automated. Those virtual copies also take up much less physical space, but with as many or more IOPS hitting the aggregate of those virtual copies. Sound familiar yet? One tenth the capacity with the same workload — let’s call it 10x greater IOPS intensity — is ideally suited for flash storage.

Flash storage is best when clever software can squeeze out redundancies; Delphix is that clever software for databases. Delphix customers are starting to combine our product with their flash storage purchases. An all-flash array that’s 5x the $/TB as disk storage suddenly becomes half the price of disk storage when combined with Delphix — with substantially better performance. We as an industry still haven’t realized the full potential of flash storage. Agile data through Delphix fills in another big piece of the flash picture.

Posted on May 6, 2013 at 4:28 am by ahl · Permalink · Comments Closed
In: Delphix, Flash

On Systems Software

A prospective new college hire recently related an odd comment from his professor: systems programming is dead. I was nonplussed; what could the professor have meant? Systems is clearly very much alive. Interesting and important projects march under the banner of systems. But as I tried to construct a less emotional rebuttal, I realized I lacked a crisp definition of what systems programming is.

Wikipedia defines systems software in the narrowest terms: the stuff that interacts with hardware. But that covers a tiny fraction of modern systems. So what is systems software? It depends on when you’re asking the question. At one time, the web server was the application; now it’s the systems software on which many web-facing applications are built. At one time a database was the application; now it’s systems software that supports a variety of custom and off-the-shelf applications. Before my time, shells were probably considered a bleeding edge application; now they’re systems software on which some of the lowest-level plumbing of modern operating systems are built.

Any layer on which people build applications of increasing complexity is systems software. Most software that endures the transition to systems software does so whether its authors intended it or not. People in the software industry often talk about standing on the shoulders of giants; the systems software accumulated and refined over decades are those giants.

Stable interfaces define systems software. The programs that consume those interfaces expect the underlying systems software to be perfect every time. Initially innovation might happen in the interfaces themselves — the concurrent model of Node.js is a great example. As software matures, the interfaces become commodified; innovation happens behind those stable interfaces. Systems is only “dead” at its edges. Interfaces might be flexible and well-designed, or sclerotic and poorly designed. Regardless, new or improved systems software can increase performance, enhance observability, or simply fit a different economic niche.

There are a few different types of systems software. First there’s supporting systems software, systems software written as necessary foundation for some new application. This is systems software written with a purpose and designed to solve an unsolved — or poorly solved — problem. Chronologically, examples include UNIX, compilers, and libraries like jQuery. You write it because you need it, but it’s solving a problem that’s likely not unique to your particular application.

Then there’s accidental systems software. Stick everything from Apache to Excel to the Bourne shell in that category. These didn’t necessarily set out to be the foundation on which increasingly complex software would be written, but they definitely are. I’m sure there were times when indoctrination into systems-hood was painful, where the authors wanted to change interfaces, but good systems software respects its consumers and carries them forward. Somewhat famously make preserved its arcane syntax because two consumers already existed. JavaScript started as a glorified web design tool; now it sits several layers beneath complex client-side applications. Even more recently, developers of Node.js (itself  JavaScript-based) changed a commonly used interface that broke many applications. Historical mistakes can be annoying to live with, but — as the Node.js team determined — compatibility trumps cleanliness.

The largest bucket is replacement systems software. Linux, Java, ZFS, and DTrace fall into this category. At the time of their development, each was a notionally compatible replacement for something that already existed. Linux, of course, reimplemented the UNIX design to provide a free, compatible alternative. Java set about building a better runtime (the stable interface being a binary provided to customers to execute) designed to abstract away the operating system entirely. ZFS represented a completely new way of thinking about filesystems, but it did so within the tight constraints of POSIX interfaces and storage hardware. DTrace added new and unique observability to most of the stable interfaces that applications build on.

Finally, there’s intentional systems software. This is systems software by design, but unlike supporting systems software, there’s no consumer. Intentional systems software takes an “if you build it, they will come” approach. This is highly prone to failure — without an existence proof that your software solves a problem and exposes the right interfaces, it’s very difficult to know if you’re building the right thing.

Why define these categories? Knowing which you’re working with can inform your decisions. If you’ve written accidental systems software that has had systems-ness thrust upon it, realize that future versions need to respect the consumers — or willfully cast them aside. When writing replacement systems software, recognize the constraints on the system, and know exactly where you’re constrained and where you can innovate (or consider if you don’t want to use the existing solution). If you’ve written supporting systems software, know that others will inevitably need solutions to the same problems. Either invest in maintaining it and keeping it best of breed; resign to the fact that it will need to be replaced as others invest in a different solution; or open source it and hope (or advocate) that it becomes that ubiquitous solution.


What’s systems software? It is the increasingly complex, increasingly capable, increasingly diverse foundation on which applications are built. It’s that long and growing tail of the corpus of software at large. The interfaces might be static, but it’s a rich venue for innovation. As more and more critical applications build on an interface, the more value there is in improving the systems software beneath it. Systems software is defined by the constraints; it’s a mission and a mindset with unique challenges, and unique potential.

Posted on February 25, 2013 at 4:46 am by ahl · Permalink · 5 Comments
In: Software · Tagged with: 

The Holistic Engineer

The idea of the holistic engineer embodies the point of view that an engineer needs to consider the whole system, the whole body of work that makes a product successful. It bears no relation to holistic health — and it’s not some even newer age quackery. There are many specialist roles in the software industry — marketing, product management, project management, documentation, education, support, etc. — but the best software engineers are generalists who can assume a portion of each specialty. Further, some software is particularly well-suited for generalists who can combine a deep understanding of the market, the technology, and the implementation.

Software products are born of many different types of organizations, and even within similar organizations roles might have different names. Here’s a generic example with some names on the roles. New products and features start with product managers. Their role is to talk to customers and sales, educate themselves on the market, and determine the right product or enhancement. The handoff to engineering takes the form of a product requirements document (PRD) — it might sound like jargon, but the term is more or less universal. Software engineers execute against that PRD; QA engineers design tests that assert conformance to the PRD while developers steer the product from point A to point B as described by product management. Documentation writers and learning services take the PRD and the software to generate collateral that teaches customers how to use it. Product marketing makes the PowerPoints; sales presents them to customers.

And that’s where babies come from.

It’s not a perfect process, but it’s birthed many successful products. The shortcoming is that it can bury engineers under filters. Instead of learning about actual customer problems, engineers hear some processed form of what the customer said. Instead of raw critique of a new feature, engineers hear a softened and truncated form. The more technical the product and the market, the more those filters impede innovation and hamper the trajectory of the product.

The holistic engineer augments the jobs of those specialists, participating in each phase of product development. They join in those early conversations with customers, and share the responsibility of market comprehension. They partner to construct the requirements and design that those engineers will then implement. Along the way, engineers of course validate decisions with sales and customers — this is Agile writ large — but engineers also participate in the outbound documentation, training, and marketing activities.

From start to finish, the process is designed to fuel innovation by arming creative engineers with data and understanding. Customers often tell you what they want; they rarely tell you what they need. The more technical or disruptive the product, the more value an engineer has in those conversations, extracting the essence of the problem from the noise of preconceptions. The relationships with customer and the full context around their problems keeps engineers grounded as the inevitable gaps emerge in the product specification. Holistic engineers also help to educate the rest of the company and the rest of the world about new products and features. The process of explaining technology advises the way engineers design and build products. When we’re having a hard time explaining a feature or presenting a product, we need to revise our design. We’ve all heard engineering accused of building a product that was too complicated for the market, or engineers complain that a product failed because it was poorly marketed; both are symptoms of poor coordinating. Giving engineers holistic responsibility guards against this problem — if the product is failing the onus is on them to solve it.

Most important though are the feelings of ownership and agency associated with the whole-body approach. The holistic engineer is explicitly tasked with making a product succeed. That’s not to say that he or she goes it alone — specialists in all functions have major roles — rather the engineer is empowered to move the product through all stages; the other side of that coin is that there’s no opportunity to shrug off a responsibility as belonging to someone else.

In this model, everyone in every role at the company has the opportunity to engage in product management. Indeed, there’s still value in explicit product management. Channels of communication need to be easy and open for people with ideas to connect to people who will distill them into implementation. And it’s not enough to just create the right environment; hiring processes need to identify broad thinkers, and mentorship needs to nurture and reward holistic execution. Not every engineer can — or wants to — take on those additional responsibilities, but the best thrive with market and technology awareness, unencumbered by filters. They want responsibility and authority to make their ideas succeed.

The idea of the holistic engineer isn’t theoretical, it’s the model we stumbled into in the Solaris Kernel Group, and later implemented deliberately at Fishworks. There, a small team took on wide ranging responsibilities to build a product that’s now doing $400m/year for Oracle. At Delphix we’re again inculcating and hiring for holistic thinking. At all three I’ve seen engineers develop new products and features that address customer needs that would have otherwise never emerged from customers’ initial requests. It’s not easy to find the right kind of engineers, but if a company can empower the right engineers in the right ways — and they can live up to the responsibility — the payoff is a better product, built more efficiently.

Posted on February 6, 2013 at 8:02 am by ahl · Permalink · One Comment
In: Software

ZFS fundamentals: transaction groups

I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A particular point of interest has been the ZFS write throttle, the mechanism ZFS uses to avoid filling all of system memory with modified data. I’m eager to write about the strides we’re making in that regard at Delphix, but it’s hard to appreciate without an understanding of how ZFS batches data. Unfortunately that explanation is literally nowhere to be found. Back in 2001 I had not yet started working on DTrace, and was talking to Matt and Jeff, the authors of ZFS, about joining them. They had only been at it for a few months; I was fortunate to be in a conference with them as the ideas around transaction groups formulated. Transaction groups are how ZFS batches up chunks of data to be written to disk (“groups” of “transactions”). Jeff stood at the whiteboard and drew the progression of states for transaction groups, from open, accepting new transactions, to quiescing, allowing transactions to complete, to syncing, writing data out to disk. As far as I can tell, that was both the first time that picture had been drawn and the last. If you search for information on ZFS transaction groups you’ll find mention of those states… and not much else. The header comment in usr/src/uts/common/fs/zfs/txg.c isn’t particularly helpful:

 * Pool-wide transaction groups.

I set out to write a proper description of ZFS transaction groups. I’m posting it here first, and I’ll be offering it as a submission to illumos. Many thanks to Matt Ahrens, George Wilson, and Max Bruning for their feedback.

ZFS Transaction Groups

ZFS transaction groups are, as the name implies, groups of transactions that act on persistent state. ZFS asserts consistency at the granularity of these transaction groups. Each successive transaction group (txg) is assigned a 64-bit consecutive identifier. There are three active transaction group states: open, quiescing, or syncing. At any given time, there may be an active txg associated with each state; each active txg may either be processing, or blocked waiting to enter the next state. There may be up to three active txgs, and there is always a txg in the open state (though it may be blocked waiting to enter the quiescing state). In broad strokes, transactions — operations that change in-memory structures — are accepted into the txg in the open state, and are completed while the txg is in the open or quiescing states. The accumulated changes are written to disk in the syncing state.


When a new txg becomes active, it first enters the open state. New transactions — updates to in-memory structures — are assigned to the currently open txg. There is always a txg in the open state so that ZFS can accept new changes (though the txg may refuse new changes if it has hit some limit). ZFS advances the open txg to the next state for a variety of reasons such as it hitting a time or size threshold, or the execution of an administrative action that must be completed in the syncing state.


After a txg exits the open state, it enters the quiescing state. The quiescing state is intended to provide a buffer between accepting new transactions in the open state and writing them out to stable storage in the syncing state. While quiescing, transactions can continue their operation without delaying either of the other states. Typically, a txg is in the quiescing state very briefly since the operations are bounded by software latencies rather than, say, slower I/O latencies. After all transactions complete, the txg is ready to enter the next state.


In the syncing state, the in-memory state built up during the open and (to a lesser degree) the quiescing states is written to stable storage. The process of writing out modified data can, in turn modify more data. For example when we write new blocks, we need to allocate space for them; those allocations modify metadata (space maps)… which themselves must be written to stable storage. During the sync state, ZFS iterates, writing out data until it converges and all in-memory changes have been written out. The first such pass is the largest as it encompasses all the modified user data (as opposed to filesystem metadata). Subsequent passes typically have far less data to write as they consist exclusively of filesystem metadata.

To ensure convergence, after a certain number of passes ZFS begins overwriting locations on stable storage that had been allocated earlier in the syncing state (and subsequently freed). ZFS usually allocates new blocks to optimize for large, continuous, writes. For the syncing state to converge however it must complete a pass where no new blocks are allocated since each allocation requires a modification of persistent metadata. Further, to hasten convergence, after a prescribed number of passes, ZFS also defers frees, and stops compressing.

In addition to writing out user data, we must also execute synctasks during the syncing context. A synctask is the mechanism by which some administrative activities work such as creating and destroying snapshots or datasets. Note that when a synctask is initiated it enters the open txg, and ZFS then pushes that txg as quickly as possible to completion of the syncing state in order to reduce the latency of the administrative activity. To complete the syncing state, ZFS writes out a new uberblock, the root of the tree of blocks that comprise all state stored on the ZFS pool. Finally, if there is a quiesced txg waiting, we signal that it can now transition to the syncing state.

What else?

Please let me know if you have suggestions for how to improve the descriptions above. There’s more to be written on the specifics of the implementation, transactions, the DMU, and, well, ZFS in general. One thing that I’d note is that Matt mentioned to me recently that were he starting from scratch, he might eliminate the quiescing state. I didn’t understand fully until I researched the subsystem. Typically transactions take a very brief amount of time to “complete”, time measured by CPU latency as opposed, say, to I/O latency. Had the quiescing phase been merged into the syncing phase, the design would be slightly simpler, and it would eliminate the mostly idle intermediate phase where a bunch of dirty data can sit in memory relatively idle.

Next I’ll write about the ZFS write throttle, it’s various brokenness, and our ideas for how to fix it.

Posted on December 13, 2012 at 6:17 am by ahl · Permalink · 4 Comments
In: ZFS · Tagged with: , , , ,