Adam Leventhal's blog


Category: ZFS

Prologue (2006)

I attended my first WWDC in 2006 to participate in Apple’s launch of their DTrace port to the next version of Mac OS X (Leopard). Apple completed all but the fiddliest finishing touches without help from the DTrace team. Even when they did meet with us we had no idea that they were mere weeks away from the finished product being announced to the world. It was a testament both to Apple’s engineering acumen as well as their storied secrecy.

At that same WWDC Apple announced Time Machine, a product that would record file system versions through time for backup and recovery. How were they doing this? We were energized by the idea that there might be another piece of adopted Solaris technology. When we launched Solaris 10, DTrace shared the marquee with ZFS, a new filesystem that was to become the standard against which other filesystems are compared. Key among the many features of ZFS were snapshots that made it simple to capture the state of a filesystem, send the changes around, recover data, etc. Time Machine looked for all the world like a GUI on ZFS (indeed the GUI that we had imagined but knew to be well beyond the capabilities of Sun).

Of course Time Machine had nothing to do with ZFS. After the keynote we rushed to an Apple engineer we knew. With shame in his voice he admitted that it was really just a bunch of hard links to directories. For those who don’t know a symlink from a symtab this is the moral equivalent of using newspaper as insulation: it’s fine until the completely anticipated calamity destroys everything you hold dear.

So there was no ZFS in Mac OS X, at least not yet.

Not So Fast (2007)

A few weeks before WWDC 2007 nerds like me started to lose their minds: Apple really was going to port ZFS to Mac OS X. It was actually going to happen! Beyond the snapshots that would make backing up a cinch, ZFS would dramatically advance the state of data storage for Apple users. HFS was introduced in System 2.1 (“System” being what we called “Mac OS” in the days before operating systems gained their broad and ubiquitous sex appeal). HFS improved upon the Macintosh File System by adding—wait for it—hierarchy! No longer would files accumulate in a single pile; you could organize them in folders. Not that there were many to organize on those 400K floppies, but progress is progress. And that filesystem has limped along for more than 30 years, nudged forward, rewritten to avoid in-kernel Pascal code (though retaining Pascal-style, length-prefixed strings), but never reimagined or reinvented. Even in its most modern form, HFS lacks the most basic functionality around data integrity. Bugs, power failures, and expected and inevitable media failures all mean that data is silently altered. Pray that your old photos are still intact. When’s the last time you backed up your Mac? I’m backing up right now just like I do every time I think about the neglectful stewardship of HFS.

ZFS was to bring to Mac OS X data integrity, compression, checksums, redundancy, snapshots, etc., etc., etc. But while it energized Mac/ZFS fans, Sun CEO Jonathan Schwartz clumsily disrupted the momentum that ZFS had been gathering in Apple’s walled garden. Apple had been working on a port of ZFS to Mac OS X and was planning to mention it at the upcoming WWDC. Jonathan, brought into the loop either out of courtesy or legal necessity, violated the cardinal rule of the Steve Jobs-era Apple. Only one person at Steve Jobs’ company announces new products: Steve Jobs. “In fact, this week you’ll see that Apple is announcing at their Worldwide Developer Conference that ZFS has become the file system in Mac OS 10,” mused Jonathan at a press event, apparently to bolster Sun’s own credibility.

Less than a week later, Apple spoke about ZFS only when it became clear that a port was indeed present in a developer version of Leopard albeit in a nascent form. Yes, ZFS would be there, sort of, but it would be read-only and no one should get their hopes up.

Ray of Hope (2008)

By the next WWDC it seemed that Sun had been forgiven. ZFS was featured in the keynotes, it was on the developer disc handed out to attendees, and it was even mentioned on the Mac OS X Server website. Apple had been working on their port since 2006 and now it was functional enough to be put on full display. I took it for a spin myself; it was really real. The feature that everyone wanted (but most couldn’t say why) was coming!

The Little Engine That Couldn’t (2009)

By the time Snow Leopard shipped only a careful examination of the Apple web site would turn up the odd reference to ZFS left unscrubbed. Whatever momentum ZFS had enjoyed within the Mac OS X product team was gone. I’ve heard a couple of theories and anecdotes from people familiar with the situation; first some relevant background.

Sun was dying. After failed love affairs with IBM and HP (the latter formed, according to former Sun CEO Scott McNealy, by two garbage trucks colliding), Oracle scooped up the aging dame with dim prospects. The nearly yearlong process of closing the acquisition was particularly hard on Sun, creating uncertainty around its future and damaging its bottom line. Despite the well-documented personal friendship between Steve Jobs and Oracle CEO Larry Ellison (more on this later), I’m sure this uncertainty had some impact on the decision about whether to continue with ZFS.

In the meantime Sun and NetApp had been locked in a lawsuit over ZFS and other storage technologies since mid-2007. While Jonathan Schwartz had blogged about protecting Apple and its users (as well as Sun customers of course), this likely led to further uncertainty. On top of that, filesystem transitions are far from simple. When Apple included DTrace in Mac OS X a point in favor was that it could be yanked out should any sort of legal issue arise. Once user data hit ZFS it would take years to fully reverse the decision. While the NetApp lawsuit never seemed to have merit (ZFS uses unique and from-scratch mechanisms for snapshots), it indisputably represented risk for Apple.

Finally, and perhaps most significantly, personal egos and NIH (not invented here) syndrome certainly played a part. I’m told by folks at Apple at the time that certain leads and managers preferred to build their own rather than adopt external technology—even technology that was best of breed. They pitched their own project, an Apple project, that would bring modern filesystem technologies to Mac OS X. The design center for ZFS was servers, not laptops—and certainly not phones, tablets, and watches—so their argument was likely that it would be better to start from scratch than to adapt ZFS. Combined with the uncertainty above and, I’m told, no shortage of political savvy, their arguments carried the day. Licensing FUD was thrown into the mix; even today folks at Apple see the ZFS license as nefarious and toxic in some way whereas the DTrace license works just fine for them. Note that both use the same license with the same grants and the same restrictions. Maybe the technical arguments really were overwhelming (note, however, that ZFS was working internally on the iPhone), and maybe the risks really were insurmountable. I obviously have my own opinions, and think this was a great missed opportunity for the industry, but I never had the burden of weighing the totality of the facts and deciding. Nevertheless Apple put an end to its ZFS work; Apple’s from-scratch filesystem efforts were underway.

The Little Engine That Still Couldn’t (2010)

Amazingly that wasn’t quite the end for ZFS at Apple. The architect for ZFS at Apple had left, the project had been shelved, but there were high-level conversations between Sun and Apple about reviving the port. Apple would get indemnification and support for their use of ZFS. Sun would get access to the Apple Filing Protocol (AFP—which, ironically, seems to have been collateral damage with the new APFS), and, more critically, Sun’s new ZFS-based storage appliance (which I helped develop) would be a natural server and backup agent for millions of Apple devices. It seemed to make some sort of sense.

The excruciatingly, debilitatingly slow acquisition of Sun finally closed. The Apple-ZFS deal was brought for Larry Ellison’s approval, the firstborn child of the conquered land brought to be blessed by the new king. “I’ll tell you about doing business with my best friend Steve Jobs,” he apparently said, “I don’t do business with my best friend Steve Jobs.”

(Amusingly the version of the story told quietly at WWDC 2016 had the friends reversed with Steve saying that he wouldn’t do business with Larry. Still another version I’ve heard calls into question the veracity of their purported friendship, and has Steve instead suggesting that Larry go f*ck himself. Normally the iconoclast, that would, if true, represent Steve’s most mainstream opinion.)

And that was the end.

Epilogue (2016)

In the 7 years since ZFS development halted at Apple, they’ve worked on a variety of improvements in HFS and Core Storage, and hacked at at least two replacements for HFS that didn’t make it out the door. This week Apple announced their new filesystem, APFS, after 2 years in development. It’s not done; some features are still in development, and they’ve announced the ambitious goal of rolling it out to laptop, phone, watch, and TV within the next 18 months. At Sun we started ZFS in 2001. It shipped in 2005, and that was really the starting line, not the finish line. Since then I’ve shipped the ZFS Storage Appliance in 2008 and Delphix in 2010, and each has required investment in ZFS / OpenZFS to make them ready for prime time. A broadly featured, highly functional filesystem takes a long time.

APFS has merits (more in my next post), but it will always disappoint me that Apple didn’t adopt ZFS irrespective of how and why that decision was made. Dedicated members of the OpenZFS community have built and maintain a port. It’s not quite the same as having Apple as a member of that community, embracing and extending ZFS rather than building their own incipient alternative.

Canonical announced a few weeks ago that ZFS will be included in the next release of Ubuntu Linux, on by default and fully supported. And it’s no exaggeration when Dustin Kirkland describes ZFS as “one of the most exciting new features Linux has seen in a very long time.” In the words of our 47th Vice President, this is a big F—ing deal. Ubuntu is an extremely popular Linux distribution, particularly so for servers, and while the Linux ecosystem doesn’t want for variety when it comes to filesystem choices, there is not a clear champion when it comes to stability, functionality, and performance. By throwing their full weight behind ZFS, Canonical brings the Linux community an enterprise-class, modern filesystem, stable, but still under active development, designed to perform well for a variety of workloads.

What’s ZFS?

ZFS was originally developed at Sun Microsystems for the Solaris operating system. Some of the most demanding production environments have depended on ZFS for well over a decade. At its core are the principles of data integrity, ease of use, and performance.[1] Most notably, ZFS has first-class support for arbitrary numbers of snapshots and writable clones, serialization for replication, compression, and data repair. I’ve contributed code to ZFS at Sun, then Oracle, and to OpenZFS after Oracle abandoned the project in 2010. I’ve also built two products around ZFS: the ZFS Storage Appliance, a NAS box, and Delphix, a copy data virtual appliance.

Why ZFS?

While the distinguishing features of ZFS are broadly useful, they have become specifically relevant in a containerizing world. Users need to save, clone, and replicate containers at will; ZFS provides key facilities for doing so. Containers and ZFS are a fantastic match, something I’ve seen my friends at Joyent demonstrate decisively for the past decade. Ubuntu has selected the most capable technology for our modern computing ecosystem.
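The primitives involved are all first-class ZFS operations. As a rough sketch (the pool and dataset names here are made up, and your layout will differ):

# take a point-in-time snapshot of a container's dataset
zfs snapshot tank/containers/web@v1

# spin up a writable clone of that snapshot almost instantly
zfs clone tank/containers/web@v1 tank/containers/web-test

# replicate the snapshot to another machine
zfs send tank/containers/web@v1 | ssh backup-host zfs receive tank/backup/web

Snapshots take effectively constant time, and clones share blocks with their origin — exactly the behavior you want when stamping out many nearly identical container filesystems.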

No good deed…

So high fives and bro hugs all around, right? Not quite. Enter the licensing boogie man. The Linux kernel is licensed under the GPL v2; OpenZFS is licensed under the CDDL. Both are open source, true, but some contend that they are incompatible. Most folks in the tech world—myself among them—spend somewhere in the vicinity of no time at all considering the topic. Far from ignoring it, Canonical had their lawyers review the licenses and deemed their use of Linux and OpenZFS to be in compliance. I’m not a lawyer; I don’t have an informed opinion. But there are those who vehemently disagree with Canonical. Notably, the Software Freedom Conservancy, whose mission is to “promote, improve, develop, and defend Free, Libre, and Open Source Software (FLOSS) projects,” has posted a lengthy wag of the finger at Canonical. Note that none of this has been specifically tested in the courts so it’s currently just a theoretical disagreement between lawyers (and in many cases, people who engage in lawyerly cosplay).

The wisdom of the crowd has proposed a couple of solutions:

“What if we ask Oracle super nicely?”

Oracle holds a copyright on most of OpenZFS since it was forked from the original ZFS code base. It would be within their rights to decide to relicense ZFS under the GPL. Problem solved! No way, and not quite. Starting with the easier problem: there are many other copyright holders in OpenZFS. It’s not an impossibly large list, but why would they bother? What benefits would they reap when even goodwill isn’t noticeably on offer? And it is the height of delusion to think that Oracle would grow ears to hear, a heart to care, and a brain to decide. Oracle explicitly backed away from OpenSolaris, shutting down the project in 2010; it does not want to encourage open source use of its component technologies. While open source is arguably the most significant shift in technology over the last decade (Stephen O’Grady’s The Software Paradox is a must-read), large companies and startups continue to be terrified, confused, and irrational when it comes to open source. Oracle ain’t coming to help.[2]

“Let’s re-write it! How hard could it be?”

It’s hard to dignify this with an explanation, but OpenZFS is the product of hundreds of person-years. It contains some of the most sophisticated mechanisms I’ve seen designed, by some of the world’s most capable engineers. Re-writing it would probably be no easier than writing it the first time. By way of commentary, this is what makes NIH so distressing. Too often technologies are copied poorly instead of being used and improved, or understood and replaced with something truly superior.

Now what?

Now that you understand a bit of the context here’s my suggestion: consider the licenses, but focus on the technology. Canonical has (one would presume) chosen to include OpenZFS because it offers the best solution to Ubuntu users. Containers and ZFS are highly complementary with further room to grow together. As with anything, evaluate the technology, evaluate the risks, and move on. Ignore pedants who would deride your pragmatic use of technology as heretical or immoral.

I personally could not be more excited by the announcement. The Ubuntu community is going to have built-in support for a filesystem that’s better and more capable than anything they’ve had in the past. The OpenZFS community is going to have a ton more users, more interest, and more drivers for innovation. Both are going to be stronger together.

 

 


[1] “ZFS crashed on me once!” Me too, more than once. “ZFS was slow for me!” That happens. “[some other Linux filesystem] is better!” Could be, but I doubt it. I’m not denying the events, but this kind of Inhofian logic doesn’t nudge ZFS from its perch.

[2] In 2011 the source code for Oracle’s new Solaris 11 operating system appeared on the web replete with CDDL (open source) license notification. For all appearances this looked like open source code, a new version of OpenSolaris. The community asked for clarification: were these stolen goods or something given away intentionally? Was Solaris 11 free and open? Even then Oracle declined to comment.

In previous posts I discussed the problems with the legacy ZFS write throttle that cause degraded performance and wildly variable latencies. I then presented the new OpenZFS write throttle and I/O scheduler that Matt Ahrens and I designed. In addition to solving several problems in ZFS, the new approach was designed to be easy to reason about, measure, and adjust. In this post I’ll cover performance analysis and tuning — using DTrace of course. These details are intended for those using OpenZFS and trying to optimize performance — if you have only a casual interest in ZFS consider yourself warned!

Buffering dirty data

OpenZFS limits the amount of dirty data on the system according to the tunable zfs_dirty_data_max. Its default value is 10% of memory, up to 4GB. The tradeoffs are pretty simple:

Lower | Higher
Less memory reserved for use by OpenZFS | More memory reserved for use by OpenZFS
Able to absorb less workload variation before throttling | Able to absorb more workload variation before throttling
Less data in each transaction group | More data in each transaction group
Less time spent syncing out each transaction group | More time spent syncing out each transaction group
More metadata written due to less amortization | Less metadata written due to more amortization

 

Most workloads contain variability. Think of the dirty data as a buffer for that variability. Let’s say the LUNs assigned to your OpenZFS storage pool are able to sustain 100MB/s in aggregate. If a workload consistently writes at 100MB/s then only a very small buffer would be required. If instead the workload oscillates between 200MB/s and 0MB/s for 10 seconds each, then a small buffer would limit performance. A buffer of 800MB would be large enough to absorb the full 20 second cycle over which the average is 100MB/s. A buffer of only 200MB would cause OpenZFS to start to throttle writes — inserting artificial delays — after less than 2 seconds during which the LUNs could flush 200MB of dirty data while the client tried to generate 400MB.

Track the amount of outstanding dirty data within your storage pool to know which way to adjust zfs_dirty_data_max:

txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool
dtrace: script 'dirty.d' matched 2 probes
CPU ID FUNCTION:NAME
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used

The write throttle kicks in once the amount of dirty data exceeds zfs_delay_min_dirty_percent of the limit (60% by default). If the amount of dirty data fluctuates above and below that threshold, it might be possible to avoid throttling by increasing the size of the buffer. If the metric stays low, you may reduce zfs_dirty_data_max. Weigh this tuning against other uses of memory on the system (a larger value means that there’s less memory for applications or the OpenZFS ARC, for example).
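How you inspect or change zfs_dirty_data_max depends on the platform; as a sketch (the 4GB value below is just an example, not a recommendation):

# illumos: read the current value, in bytes, from the running kernel
echo zfs_dirty_data_max/E | mdb -k

# illumos: change it on the fly...
echo zfs_dirty_data_max/Z 0x100000000 | mdb -kw

# ...or persistently, by adding this line to /etc/system
set zfs:zfs_dirty_data_max = 0x100000000

# ZFS on Linux: the same tunable is exposed as a module parameter
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max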

A larger buffer also means that flushing a transaction group will take longer. This is relevant for certain OpenZFS administrative operations (sync tasks) that occur when a transaction group is committed to stable storage such as creating or cloning a new dataset. If the interactive latency of these commands is important, consider how long it would take to flush zfs_dirty_data_max bytes to disk. You can measure the time to sync transaction groups (recall, there are up to three active at any given time) like this:

txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

# dtrace -s duration.d pool
dtrace: script 'duration.d' matched 2 probes
CPU ID FUNCTION:NAME
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG). zfs_dirty_data_max puts a hard limit on the amount of data in memory that has not yet been written to the main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast devices such as the DDRDrive make for great log devices. As an aside, consider the ostensible upgrade that Oracle brought to the ZFS Storage Appliance a few years ago, replacing the 18GB “Logzilla” with a 73GB model.

I/O scheduler

Where ZFS had a single IO queue for all IO types, OpenZFS has five IO queues, one for each of the different IO types: sync reads (for normal, demand reads), async reads (issued from the prefetcher), sync writes (to the intent log), async writes (bulk writes of dirty data), and scrub (scrub and resilver operations). Note that the bulk dirty data described above is scheduled in the async write queue. See vdev_queue.c for the related tunables:

uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

Each of these queues has tunable values for the min and max number of outstanding operations of the given type that can be issued to a leaf vdev (LUN). The tunable zfs_vdev_max_active limits the total number of IOs issued to a single vdev. If its value is less than the sum of the zfs_vdev_*_max_active tunables, then the minimums come into play: the minimum number from each queue is scheduled first, and the remainder of zfs_vdev_max_active is issued from the queues in priority order. For example, if zfs_vdev_max_active were set to 30 with the defaults above (whose minimums sum to 23), each queue would first get its minimum and the remaining 7 slots would be filled from the queues in priority order, up to their per-queue maximums.

At a high level, the appropriate values for these tunables will be specific to your LUNs. Higher maximums lead to higher throughput with potentially higher latency. On some devices such as storage arrays with distinct hardware for reads and writes, some of the queues can be thought of as independent; on other devices such as traditional HDDs, reads and writes will likely impact each other.

A simple way to tune these values is to monitor I/O throughput and latency under load. Increase values by 20-100% until you find a point where throughput no longer increases, but latency is acceptable.

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

io:::start
{
        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;
}

io:::done
/ts[args[0]->b_edev, args[0]->b_lblkno]/
{
        this->delta = (timestamp - ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000;
        this->name = (args[0]->b_flags & (B_READ | B_WRITE)) == B_READ ?
            "read " : "write ";

        @q[this->name] = quantize(this->delta);
        @a[this->name] = avg(this->delta);
        @v[this->name] = stddev(this->delta);
        @i[this->name] = count();
        @b[this->name] = sum(args[0]->b_bcount);

        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;
}

END
{
        printa(@q);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@b, (timestamp - start) / 1000000000 * 1024);

        printf("%-30s %11s %11s %11s %11s\n", "", "avg latency", "stddev",
            "iops", "throughput");
        printa("%-30s %@9uus %@9uus %@9u/s %@8uk/s\n", @a, @v, @i, @b);
}

# dtrace -s rw.d -c 'sleep 60'

  read
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         23
             128 |@                                        655
             256 |@@@@                                     1638
             512 |@@                                       743
            1024 |@                                        380
            2048 |@@@                                      1341
            4096 |@@@@@@@@@@@@                             5295
            8192 |@@@@@@@@@@@                              5033
           16384 |@@@                                      1297
           32768 |@@                                       684
           65536 |@                                        400
          131072 |                                         225
          262144 |                                         206
          524288 |                                         127
         1048576 |                                         19
         2097152 |                                         0

  write
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         47
             128 |                                         469
             256 |                                         591
             512 |                                         327
            1024 |                                         924
            2048 |@                                        6734
            4096 |@@@@@@@                                  43416
            8192 |@@@@@@@@@@@@@@@@@                        102013
           16384 |@@@@@@@@@@                               60992
           32768 |@@@                                      20312
           65536 |@                                        6789
          131072 |                                         860
          262144 |                                         208
          524288 |                                         153
         1048576 |                                         36
         2097152 |                                         0

                               avg latency      stddev        iops  throughput
write                              19442us     32468us      4064/s   261889k/s
read                               23733us     88206us       301/s    13113k/s

Async writes

Dirty data governed by zfs_dirty_data_max is written to disk via async writes. The I/O scheduler treats async writes a little differently than other operations. The number of concurrent async writes scheduled depends on the amount of dirty data on the system. Recall that there is a fixed (but tunable) limit of dirty data in memory. With a small amount of dirty data, the scheduler will only schedule a single operation (zfs_vdev_async_write_min); the idea is to preserve low latency of synchronous operations when there isn’t much write load on the system. As the amount of dirty data increases, the scheduler will push the LUNs harder to flush it out by issuing more concurrent operations.

The old behavior was to schedule a fixed number of operations regardless of the load. This meant that the latency of synchronous operations could fluctuate significantly. While writing out dirty data ZFS would slam the LUNs with writes, contending with synchronous operations and increasing their latency. After the syncing transaction group had completed, there would be a period of relatively low async write activity during which synchronous operations would complete more quickly. This phenomenon was known as “picket fencing” due to the square wave pattern of latency over time. The new OpenZFS I/O scheduler is optimized for consistency.

In addition to tuning the minimum and maximum number of concurrent operations sent to the device, there are two other tunables related to asynchronous writes: zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent. Along with the min and max operation counts (zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active), these four tunables define a piece-wise linear function that determines the number of operations scheduled, as depicted in this lovely ASCII art graph excerpted from the comments:

 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent

In a relatively steady state we’d like to see the amount of outstanding dirty data stay in a narrow band between the min and max percentages, by default 30% and 60% respectively.

Tune zfs_vdev_async_write_max_active as described above to maximize throughput without hurting latency. The only reason to increase zfs_vdev_async_write_min_active is if additional writes have little to no impact on latency. While this could be used to make sure data reaches disk sooner, an alternative approach is to decrease zfs_vdev_async_write_active_min_dirty_percent, thereby starting to flush data earlier, while less dirty data has accumulated.

To tune the min and max percentages, watch both latency and the number of scheduled async write operations. If the operation count fluctuates wildly and impacts latency, you may want to flatten the slope by decreasing the min and/or increasing the max (note that you will likely want to increase zfs_delay_min_dirty_percent if you increase zfs_vdev_async_write_active_max_dirty_percent — see below).

#pragma D option aggpack
#pragma D option quiet

fbt::vdev_queue_max_async_writes:entry
{
        self->spa = args[0];
}
fbt::vdev_queue_max_async_writes:return
/self->spa && self->spa->spa_name == $$1/
{
        @ = lquantize(args[1], 0, 30, 1);
}

tick-1s
{
        printa(@);
        clear(@);
}

fbt::vdev_queue_max_async_writes:return
/self->spa/
{
        self->spa = 0;
}

# dtrace -s q.d dcenter

min .--------------------------------. max | count
< 0 : ▃▆ : >= 30 | 23279

min .--------------------------------. max | count
< 0 : █ : >= 30 | 18453

min .--------------------------------. max | count
< 0 : █ : >= 30 | 27741

min .--------------------------------. max | count
< 0 : █ : >= 30 | 3455

min .--------------------------------. max | count
< 0 : : >= 30 | 0

Write delay

In situations where LUNs cannot keep up with the incoming write rate, OpenZFS artificially delays writes to ensure consistent latency (see the previous post in this series). Until a certain amount of dirty data accumulates there is no delay. When enough dirty data accumulates OpenZFS gradually increases the delay. By delaying writes OpenZFS effectively pushes back on the client to limit the rate of writes by forcing artificially higher latency. There are two tunables that pertain to delay: how much dirty data there needs to be before the delay kicks in, and the factor by which that delay increases as the amount of outstanding dirty data increases.

The tunable zfs_delay_min_dirty_percent determines when OpenZFS starts delaying writes. The default is 60%; note that we don’t start delaying client writes until the IO scheduler is pushing out data as fast as it can (zfs_vdev_async_write_active_max_dirty_percent also defaults to 60%).

The other relevant tunable, zfs_delay_scale, is really the only magic number here. It roughly corresponds to the inverse of the maximum number of operations per second (denominated in nanoseconds), and is used as a scaling factor.
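To make that concrete, here is a sketch of the delay curve as described in the comments accompanying the OpenZFS code: once dirty data exceeds the threshold, each write is delayed by roughly

delay = zfs_delay_scale * (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty)

where delay_min_bytes is zfs_delay_min_dirty_percent of zfs_dirty_data_max, and the result is capped at zfs_delay_max_ns. With the default zfs_delay_scale of 500,000ns (roughly 2,000 operations per second, hence the “inverse” above), delays start out tiny just past the threshold and grow hyperbolically as dirty data approaches the limit.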

Delaying writes is an aggressive step to ensure consistent latency. It is required if the client really is pushing more data than the system can handle, but unnecessarily delaying writes degrades overall throughput. There are two goals to tuning delay: reduce or remove unnecessary delay, and ensure consistent delays when needed.

First check to see how often writes are delayed. This simple DTrace one-liner does the trick:

# dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'

If a relatively small percentage of writes are delayed, consider increasing the amount of dirty data allowed (zfs_dirty_data_max) or even pushing out the point at which delays start (zfs_delay_min_dirty_percent). When increasing zfs_dirty_data_max, consider the other users of DRAM on the system, and also note that a small number of small delays does not impact performance significantly.

If many writes are being delayed, the client really is trying to push data faster than the LUNs can handle. In that case, check for consistent latency, again, with a DTrace one-liner:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'

With high variance or if many write operations are being delayed for the maximum zfs_delay_max_ns (100ms by default) then try increasing zfs_delay_scale by a factor of 2 or more, or try delaying earlier by reducing zfs_delay_min_dirty_percent (remember to also reduce zfs_vdev_async_write_active_max_dirty_percent).

Summing up

Our experience at Delphix tuning the new write throttle has been so much better than in the old ZFS world: each tunable has a clear and comprehensible purpose, their relationships are well-defined, and the issues in tension pulling values up or down are both easy to understand and — most importantly — easy to measure. I hope that this tuning guide helps others trying to get the most out of their OpenZFS systems, whether on Linux, FreeBSD, Mac OS X, or illumos — not to mention the support engineers for the many products that incorporate OpenZFS into a larger solution.

In my last blog post, I wrote about the ZFS write throttle, and how we saw it lead to pathological latency variability on customer systems. Matt Ahrens, the co-founder of ZFS, and I set about to fix it in OpenZFS. While the solution we came to may seem obvious, we arrived at it only through a bit of wandering in a wide open solution space.

The ZFS write throttle was fundamentally flawed — the data indisputably supported this diagnosis. The cure was far less clear. The core problem involved the actual throttling mechanism: it allowed many fast writes while stalling some writes nearly without bound, with some artificially delayed writes ostensibly to soften the landing. Further, the mechanism relied on an accurate calculation of the backend throughput — a problem in itself, but one we’ll set aside for the moment.

On a frictionless surface in a vacuum…

Even in the most rigorously contrived, predictable cases, the old write throttle would yield high variance in the latency of writes. Consider a backend that can handle an unwavering 100MB/s (or 1GB/s or 10GB/s — pick your number). For a client with 10 threads executing 8KB async writes (again to keep it simple) to hit 100MB/s, the average latency would be around 780µs — not unreasonable.
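(To see where that number comes from: 100MB/s of 8KB writes is about 12,800 writes per second; spread across 10 threads, each thread must complete roughly 1,280 writes per second, or one write about every 780µs.)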

Here’s how that scenario would play out with the old write throttle, assuming already-full quiesced and syncing transaction groups (you may want to refer to my last blog post for a refresher on the mechanism and some of the numbers). With a target of 5 seconds to write out its contents, the currently open transaction group would be limited to 500MB. Recall that after 7/8ths of the limit is consumed, the old write throttle starts inserting a 10ms delay, so the first 437.5MB would come sailing in, say, with an average latency of 780µs, but then the remaining writes would average at least 10ms (scheduling delay could drive this even higher). With this artificially steady rate, the delay would occur 7/8ths of the way into our 5 second window, with 1/8th of the total remaining. So with 5/8ths of a second left, and an average latency of 10ms, the client would be able to write only an additional 500KB worth of data. More simply: data would flow at 100MB/s most of the time, and at less than 1MB/s the rest.

In this example the system inserted far too much delay — indeed, no delay was needed. In another scenario it could just as easily have inserted too little.

Consider a case where we would require writers to be throttled. This time, let’s say the client has 1000 threads, and — since it’s now relevant — let’s say we’re limited to the optimistic 10GbE speed of 1GB/s. In this case the client would hit the 7/8ths in less than a second. 1000 threads writing 8KB every 10ms still push data at 800MB/s so we’d hit the hard limit just a fraction of a second later. With the quota exhausted, all 1000 threads would then block for about 4 seconds. A backend that can do 100MB/s x 5 seconds = 500MB = 64,000 x 8KB; the latency of those 64,000 writes breaks down like this: 55000 super fast, 8000 at 10ms, and 1000 at 4 seconds. Note that the throughput would be only slightly higher than in the previous example; the average latency would be approximately 1000 times higher which is optimal and expected.

In this example we delayed way too little, and paid the price with enormous 4 second outliers.

How to throttle

Consistency is more important than the average. The VP of Systems at a major retailer recently told me that he’d almost always take a higher average for lower variance. Our goal for OpenZFS was to have consistent latency without increasing the average (if we could improve the average, hey, so much the better). Given the total amount of work, there is a certain amount of delay we’d need to insert. The ZFS write throttle does so unequally. Our job was to delay all writes a little bit rather than some a lot.

One of our first ideas was to delay according to measured throughput. As with the example above, let’s say that the measured throughput of the backend was 100MB/s. If the transaction group had been open for 500ms, and we had accumulated 55MB so far, the next write would be delayed for 50ms, enough time to reduce the average to 100MB/s.

Think of it like a diagonal line on a graph from 0MB at time zero to the maximum size (say, 500MB) at the end of the transaction group (say, 5s). As the accumulated data pokes above that line, subsequent writes would be delayed accordingly. If we hit the data limit per transaction group then writes would be delayed as before, but it should be less likely as long as we’ve measured the backend throughput accurately.

There were two problems with this solution. First, calculating the backend throughput isn’t possible to do accurately. Performance can fluctuate significantly due to the location of writes, intervening synchronous activity (e.g. reads), or even other workloads on a multitenant storage array. But even if we could calculate it correctly, ZFS can’t spend all its time writing user data; some time must be devoted to writing metadata and doing other housekeeping.

Size doesn’t matter

Erasing the whiteboard, we added one constraint and loosened another: don’t rely on an estimate of backend throughput, and don’t worry too much about transaction group duration.

Rather than capping transaction groups to a particular size, we would limit the amount of system memory that could be dirty (modified) at any given time. As memory filled past a certain point we would start to delay writes proportionally.

OpenZFS didn’t have a mechanism to track the outstanding dirty data. Adding it was non-trivial as it required communication across the logical (DMU) and physical (SPA) boundaries to smoothly retire dirty data as physical IOs completed. Logical operations given data redundancy (mirrors, RAID-Z, and ditto blocks) have multiple associated physical IOs. Waiting for all of them to complete would lead to lurches in the measure of outstanding dirty data. Instead, we retire a fraction of the logical size each time a physical IO completes (for a block on a two-way mirror, for example, each of the two physical writes retires half of the block’s logical dirty size as it completes).

By using this same metric of outstanding dirty data, we observed that we could address a seemingly unrelated but chronic problem observed in ZFS — so called “picket-fencing”, the extreme burstiness of writes that ZFS issues to its disks. ZFS has a fixed number of concurrent outstanding IOs it issues to a device. Instead, the new IO scheduler issues a variable number of writes proportional to the amount of dirty data. With data coming in at a trickle, OpenZFS would trickle data to the backend, issuing 1 IO at a time. As the incoming data rate increased, the IO scheduler would work harder, scheduling more concurrent writes in order to keep up (up to a fixed limit). As noted above, if OpenZFS couldn’t keep up with the rate of incoming data, it would insert delays also proportional to the amount of outstanding dirty data.

Results

The goal was improved consistency with no increase in the average latency. The results of our tests speak for themselves (log-log scale).

Note the single-moded distribution of OpenZFS compared with the highly varied results from ZFS. You can see by the dashed lines that we managed to slightly improve the average latency (1.04ms v. 1.27ms).

OpenZFS now represents a significant improvement over ZFS with regard to consistency both of client write latency and of backend write operations. In addition, the new IO scheduler improves upon ZFS when it comes to tuning. The mysterious magic numbers and inscrutable tuneables of the old write throttle have been replaced with knobs that are comprehensible, and can be connected more directly with observed behavior. In the final post in this series, I’ll look at how to tune the OpenZFS write throttle.

It’s no small feat to build a stable, modern filesystem. The more I work with ZFS, the more impressed I am with how much it got right, and how malleable it’s proved. It has evolved to fix shortcomings and accommodate underlying technological shifts. It’s not surprising though that even while its underpinnings have withstood the test of production use, ZFS occasionally still shows the immaturity of the tween that it is.

Even before the ZFS storage appliance launched in 2008, ZFS was heavily used and discussed in the Solaris and OpenSolaris communities, the frequent subject of praise and criticism. A common grievance was that write-heavy workloads would consume massive amounts of system memory… and then render the system unusable as ZFS dutifully deposited the new data onto the often anemic storage (frequently a single spindle for OpenSolaris users).

For workloads whose ability to generate new data far outstripped the throughput of persistent storage, it became clear that ZFS needed to impose some limits. ZFS should have effective limits on the amount of system memory devoted to “dirty” (modified) data. Transaction groups should be bounded to prevent high latency IO and administrative operations. At a high level, ZFS transaction groups are just collections of writes (transactions), and there can be three transaction groups active at any given time; for a more thorough treatment, check out last year’s installment of ZFS knowledge.

Write Throttle 1.0 (2008)

The proposed solution appealed to an intuitive understanding of the system. At the highest level, don’t let transaction groups grow indefinitely. When a transaction group reached a prescribed size, ZFS would create a new transaction group; if three already existed, it would block waiting for the syncing transaction group to complete. Limiting the size of each transaction group yielded a number of benefits. ZFS would no longer consume vast amounts of system memory (quelling outcry from the user community). Administrative actions that execute at transaction group boundaries would be more responsive. And synchronous, latency-sensitive operations wouldn’t have to contend with a deluge of writes from the syncing transaction group.

So how big should transaction groups be? The solution included a target duration for writing out a transaction group (5 seconds). The size of each transaction group would be based on that time target and an inferred write bandwidth. Duration times bandwidth equals target size. The inferred bandwidth would be recomputed after each transaction group.

When the size limit for a transaction group was reached, new writes would wait for the next transaction group to open. This could be nearly instantaneous if there weren’t already three transaction groups active, or it could incur a significant delay. To ameliorate this, the write throttle would insert a 10ms delay for all new writes once 7/8th of the size had been consumed.

See the gory details in the git commit.

Commentary

That initial write throttle made a comprehensible, earnest effort to address some critical problems in ZFS. And, to a degree, it succeeded. The lack of rigorous ZFS performance testing at the time, though, is reflected in the glaring deficiencies of that initial write throttle. A simple logic bug lingered for another two months, causing all writes to be delayed by 10ms, not just those executed after the transaction group had reached 7/8ths of its target capacity — trivial, yes, but debilitating and telling. The computation of the write throttle resulted in values that varied rapidly; eventually a slapdash effort at hysteresis was added.

Stepping back, the magic constants arouse concern. Why should transaction groups last 5 seconds? Yes, they should be large enough to amortize metadata updates within a transaction group, and they should not be so large that they cause administrative unresponsiveness. For the ZFS storage appliance we experimented with lower values in an effort to smooth out the periodic bursts of writes — an effect we refer to as “picket-fencing” for its appearance in our IO visualization interface. Even more glaring, where did the 7/8ths cutoff come from or the 10ms worth of delay? Even if the computed throughput was dead accurate, the algorithm would lead to ZFS unnecessarily delaying writes. At first blush, this scheme was not fatally flawed, but surely arbitrary, disconnected from real results, and nearly impossible to reason about on a complex system.

Problems

The write throttle demonstrated problems more severe than the widely observed picket-fencing. While ZFS attempted to build a stable estimate of write throughput capacity, the computed number would, in practice, swing wildly. As a result, ZFS would variously over-throttle and under-throttle. It would often insert the 10ms delay, but that delay was intended merely as a softer landing than the hard limit. Once reached, the hard limit — still the primary throttling mechanism — could impose delays well in excess of a second.

The graph below shows the frequency (count) and total contribution (time) for power-of-two IO latencies from a production system.

The latency frequencies clearly show a tri-modal distribution: writes that happen at the speed of software (much less than 1ms), writes that are delayed by the write throttle (tens of milliseconds), and writes that bump up against the transaction group size (hundreds of milliseconds up to multiple seconds).

The total accumulated time for each latency bucket highlights the dramatic impact of outliers. The 110 operations taking a second or longer contribute more to the overall elapsed time than the time of the remaining 16,000+ operations.

A new focus

The first attempt at the write throttle addressed a critical need, but was guided by the need to patch a hole rather than an understanding of the fundamental problem. The rate at which ZFS can move data to persistent storage will vary for a variety of reasons: synchronous operations will consume bandwidth; not all writes impact storage in the same way — scattered writes to areas of high fragmentation may be slower than sequential writes. Regardless of the real, instantaneous throughput capacity, ZFS needs to pass on the effective cost — as measured in write latency — to the client. Write throttle 1.0 carved this cost into three tranches: writes early in a transaction group that pay nothing, those late in a transaction group that pay 10ms each, and those at the end that pick up the remainder of the bill.

If the rate of incoming data was less than the throughput capacity of persistent storage the client should be charged nothing — no delay should be inserted. The write throttle failed by that standard as well, delaying 10ms in situations that warranted no surcharge.

Ideally ZFS should throttle writes in a way that optimizes for minimized and consistent latency. As we developed a new write throttle, our objectives were low variance for write latency, and steady and consistent (rather than bursty) writes to persistent storage. In my next post I’ll describe the solution that Matt Ahrens and I designed for OpenZFS.

I’ve been watching ZFS from moments after its inception at the hands of Matt Ahrens and Jeff Bonwick, so I’m excited to see it enter its newest phase of development in OpenZFS. While ZFS has long been regarded as the hottest filesystem on 128 bits, and has shipped in many different products, what’s been most impressive to me about ZFS development has been the constant iteration and reinvention.

Before shipping in Solaris 10 update 2, major components of ZFS had already advanced to “2.0” and “3.0”. I’ve been involved with several ZFS-related products: Solaris 10, the ZFS Storage Appliance (nee Sun Storage 7000), and the Delphix Engine. Each new product and each new use has stressed ZFS in new ways, but also brought renewed focus to development. I’ve come to realize that ZFS will never be completed. I thought I’d use this post to cover the many ways that ZFS has failed in the products I’ve worked on over the years — and it has failed spectacularly at times — but that would distract from the most important aspect of ZFS. For each new failure in each new product with each new use and each new workload, ZFS has adapted and improved.

OpenZFS doesn’t need a caretaker community for a finished project; if that were the case, porting OpenZFS to Linux, FreeBSD, and Mac OS X would have been the end. Instead, it was the beginning. The need for the OpenZFS community grew out of the porting efforts of those who wanted the world’s most advanced filesystem on their platforms and in their products. I wouldn’t trust my customers’ data to a filesystem that hadn’t been through those trials and triumphs over more than a decade. I can’t wait to see the next phase of evolution that OpenZFS brings.

 

If you’re at LinuxCon today, stop by the talk by Matt Ahrens and Brian Behlendorf for more on OpenZFS; follow @OpenZFS for all OpenZFS news.
