Adam Leventhal's blog


Category: ZFS

I’ve continued to explore ZFS as I try to understand performance pathologies and improve performance. A particular point of interest has been the ZFS write throttle, the mechanism ZFS uses to avoid filling all of system memory with modified data. I’m eager to write about the strides we’re making in that regard at Delphix, but it’s hard to appreciate without an understanding of how ZFS batches data. Unfortunately that explanation is literally nowhere to be found. Back in 2001 I had not yet started working on DTrace, and was talking to Matt and Jeff, the authors of ZFS, about joining them. They had only been at it for a few months; I was fortunate to be in a conference room with them as the ideas around transaction groups took shape. Transaction groups are how ZFS batches up chunks of data to be written to disk (“groups” of “transactions”). Jeff stood at the whiteboard and drew the progression of states for transaction groups, from open, accepting new transactions, to quiescing, allowing transactions to complete, to syncing, writing data out to disk. As far as I can tell, that was both the first time that picture had been drawn and the last. If you search for information on ZFS transaction groups you’ll find mention of those states… and not much else. The header comment in usr/src/uts/common/fs/zfs/txg.c isn’t particularly helpful:

/*
 * Pool-wide transaction groups.
 */

I set out to write a proper description of ZFS transaction groups. I’m posting it here first, and I’ll be offering it as a submission to illumos. Many thanks to Matt Ahrens, George Wilson, and Max Bruning for their feedback.

ZFS Transaction Groups

ZFS transaction groups are, as the name implies, groups of transactions that act on persistent state. ZFS asserts consistency at the granularity of these transaction groups. Each successive transaction group (txg) is assigned a 64-bit consecutive identifier. There are three active transaction group states: open, quiescing, or syncing. At any given time, there may be an active txg associated with each state; each active txg may either be processing, or blocked waiting to enter the next state. There may be up to three active txgs, and there is always a txg in the open state (though it may be blocked waiting to enter the quiescing state). In broad strokes, transactions — operations that change in-memory structures — are accepted into the txg in the open state, and are completed while the txg is in the open or quiescing states. The accumulated changes are written to disk in the syncing state.
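
To make the pipeline a little more concrete, here’s a minimal sketch — not the illumos code; the names txg_state_t and txg_sketch_t are invented for this illustration — of the three states and the up-to-three txgs that can be active at once, one per state:

#include <stdint.h>
#include <stdio.h>

typedef enum {
        TXG_STATE_OPEN,         /* accepting new transactions */
        TXG_STATE_QUIESCING,    /* no new transactions; in-flight ones finish */
        TXG_STATE_SYNCING       /* accumulated changes being written to disk */
} txg_state_t;

typedef struct {
        uint64_t        txg_number;     /* 64-bit consecutive identifier */
        txg_state_t     txg_state;
} txg_sketch_t;

int
main(void)
{
        /* At most three txgs are active at any time, one per state. */
        txg_sketch_t active[3] = {
                { 102, TXG_STATE_OPEN },
                { 101, TXG_STATE_QUIESCING },
                { 100, TXG_STATE_SYNCING },
        };

        for (int i = 0; i < 3; i++)
                printf("txg %llu: state %d\n",
                    (unsigned long long)active[i].txg_number,
                    active[i].txg_state);
        return (0);
}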

Open

When a new txg becomes active, it first enters the open state. New transactions — updates to in-memory structures — are assigned to the currently open txg. There is always a txg in the open state so that ZFS can accept new changes (though the txg may refuse new changes if it has hit some limit). ZFS advances the open txg to the next state for a variety of reasons, such as hitting a time or size threshold, or the execution of an administrative action that must be completed in the syncing state.
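
As a rough illustration of the kinds of conditions that close the open txg, here’s a hypothetical check — the names and threshold values below are invented for this sketch and are not the illumos implementation:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_TXG_TIMEOUT_SEC  5               /* assumed time threshold */
#define SKETCH_TXG_DIRTY_MAX    (1ULL << 30)    /* assumed size threshold */

static bool
txg_should_advance(uint64_t open_sec, uint64_t dirty_bytes, bool admin_pending)
{
        return (open_sec >= SKETCH_TXG_TIMEOUT_SEC ||   /* time threshold hit */
            dirty_bytes >= SKETCH_TXG_DIRTY_MAX ||      /* size threshold hit */
            admin_pending);     /* e.g. a snapshot waiting to reach syncing */
}

int
main(void)
{
        printf("%d\n", txg_should_advance(6, 0, false));            /* 1: time */
        printf("%d\n", txg_should_advance(1, 2ULL << 30, false));   /* 1: size */
        printf("%d\n", txg_should_advance(1, 0, false));            /* 0: stay open */
        return (0);
}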

Quiescing

After a txg exits the open state, it enters the quiescing state. The quiescing state is intended to provide a buffer between accepting new transactions in the open state and writing them out to stable storage in the syncing state. While quiescing, transactions can continue their operation without delaying either of the other states. Typically, a txg is in the quiescing state very briefly since the operations are bounded by software latencies rather than, say, slower I/O latencies. After all transactions complete, the txg is ready to enter the next state.

Syncing

In the syncing state, the in-memory state built up during the open and (to a lesser degree) the quiescing states is written to stable storage. The process of writing out modified data can, in turn, modify more data. For example, when we write new blocks, we need to allocate space for them; those allocations modify metadata (space maps)… which themselves must be written to stable storage. During the syncing state, ZFS iterates, writing out data until it converges and all in-memory changes have been written out. The first such pass is the largest as it encompasses all the modified user data (as opposed to filesystem metadata). Subsequent passes typically have far less data to write as they consist exclusively of filesystem metadata.

To ensure convergence, after a certain number of passes ZFS begins overwriting locations on stable storage that had been allocated earlier in the syncing state (and subsequently freed). ZFS usually allocates new blocks to optimize for large, continuous writes. For the syncing state to converge, however, it must complete a pass in which no new blocks are allocated, since each allocation requires a modification of persistent metadata. Further, to hasten convergence, after a prescribed number of passes ZFS also defers frees and stops compressing.
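
The convergence of the syncing state can be pictured with a toy model — invented here purely for illustration, not taken from the illumos code — in which each pass writes what’s dirty, and the allocations that writing requires dirty a much smaller amount of metadata for the next pass:

#include <stdint.h>
#include <stdio.h>

#define SKETCH_DEFER_PASS       4       /* assumed pass after which frees are deferred */

int
main(void)
{
        uint64_t dirty = 1000;  /* units of modified data to write in pass 1 */
        int pass = 0;

        while (dirty > 0) {
                pass++;
                printf("pass %d: writing %llu units%s\n", pass,
                    (unsigned long long)dirty,
                    pass >= SKETCH_DEFER_PASS ?
                    " (deferring frees, not compressing)" : "");
                /*
                 * Writing data requires allocations, which modify metadata
                 * (space maps); model that as a small fraction of what was
                 * just written.
                 */
                dirty /= 20;
        }
        printf("converged after %d passes\n", pass);
        return (0);
}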

In addition to writing out user data, we must also execute synctasks during the syncing context. A synctask is the mechanism by which some administrative activities, such as creating and destroying snapshots or datasets, are carried out. Note that when a synctask is initiated, it enters the open txg, and ZFS then pushes that txg as quickly as possible to completion of the syncing state in order to reduce the latency of the administrative activity. To complete the syncing state, ZFS writes out a new uberblock, the root of the tree of blocks that comprise all state stored on the ZFS pool. Finally, if there is a quiesced txg waiting, we signal that it can now transition to the syncing state.

What else?

Please let me know if you have suggestions for how to improve the descriptions above. There’s more to be written on the specifics of the implementation, transactions, the DMU, and, well, ZFS in general. One thing I’d note is that Matt mentioned to me recently that were he starting from scratch, he might eliminate the quiescing state. I didn’t understand fully until I researched the subsystem. Typically transactions take a very brief amount of time to “complete” — time measured in CPU latency as opposed to, say, I/O latency. Had the quiescing phase been merged into the syncing phase, the design would be slightly simpler, and it would eliminate the intermediate phase during which a bunch of dirty data sits in memory, mostly idle.

Next I’ll write about the ZFS write throttle, its various brokenness, and our ideas for how to fix it.

Back in October I was pleased to attend — and my employer, Delphix, was pleased to sponsor — illumos day and ZFS day, run concurrently with Oracle Open World. Inspired by the success of dtrace.conf(12) in the Spring, the goal was to assemble developers, practitioners, and users of ZFS and illumos-derived distributions to educate, share information, and discuss the future.

illumos day

The week started with the developer-centric illumos day. While illumos picked up the torch when Oracle re-closed OpenSolaris, each project began with a very different focus. Sun and the OpenSolaris community obsessed over inclusion and developer adoption — often counterproductively. The illumos community is led by those building products based on the unique technologies in illumos — DTrace, ZFS, Zones, COMSTAR, etc. While all are welcome, it’s those who contribute the most whose voices are most relevant.

I was asked to give a talk about technologies unique to illumos that are unlikely to appear in Oracle Solaris. It was only when I started to prepare the talk that the differing priorities of illumos and Oracle Solaris came into sharp focus. In the illumos community, we’ve advanced technologies such as ZFS in ways that would benefit Oracle Solaris greatly, but Oracle has made it clear that open source is anathema for its version of Solaris. For example, at Delphix we’ve recently been fixing bugs, asking ourselves, “how has Oracle never seen this?”.

Yet the differences between illumos and Oracle Solaris are far deeper. In illumos we’re building products that rely on innovation and differentiation in the operating system, and it’s those higher-level products that our various customers use. At Oracle, the priorities are more traditional: support for proprietary SPARC platforms, packaging and updating for administrators, and ease-of-use. In my talk, rather than focusing on the sundry contributions to illumos, I picked a few of my favorites. The slides are more or less incomprehensible on their own; many thanks to Deirdre Straughan for posting the video (and for putting together the event!) — check out 40:30 for a photo of Jean-Luc Picard attending the DTrace talk at OOW.

Video: https://www.youtube.com/watch?v=7YN6_eRIWWc&t=0m19s

ZFS day

While illumos day was for developers, ZFS day was for users of ZFS to learn from each other’s experiences, and hear from ZFS developers. I had the ignominious task of presenting an update on the Hybrid Storage Pool (HSP). We developed the HSP at Fishworks as the first enterprise storage system to add flash memory into the storage hierarchy to accelerate reads and writes. Since then, economics and physics have thrown up some obstacles: DRAM has gotten cheaper, and flash memory is getting harder and harder to turn into a viable enterprise solution. In addition, the L2ARC, which adds flash as a ZFS read cache, has languished; it has serious problems that no one has been motivated or proficient enough to address.

I’ll warn you that after the explanation of the HSP, it’s mostly doom and gloom (also I was sick as a dog when I prepared and gave the talk), but check out the slides and video for more on the promise and shortcomings of the HSP.

Video: http://www.youtube.com/watch?v=P77HEEgdnqE&feature=youtu.be

Community

For both illumos day and ZFS day, it was a mostly full house. Reuniting with the folks I already knew was fun as always, but even better was meeting so many people I had no idea were building on illumos or using ZFS. The events highlighted that we need to facilitate more collaboration — especially around ZFS — between the illumos distros, FreeBSD, and Linux — hell, even Oracle Solaris would be welcome.

Lately, I’ve been rooting around in the bowels of ZFS as we’ve explored some long-standing performance pathologies. To that end I’ve been fortunate to learn at the feet of Matt Ahrens, who was half of the ZFS founding team, and George Wilson, who has forgotten more about ZFS than most people will ever know. I wanted to start sharing some of the interesting details I’ve unearthed.

For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200:

void
vdev_metaslab_set_size(vdev_t *vd)
{
        /*
         * Aim for roughly 200 metaslabs per vdev.
         */
        vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
        vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
}

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev.c#1553

Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or less? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.

The space in the vdev is allotted proportionally and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger disk or when an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs — an answer whose simplicity was only obvious in retrospect.

For example, let’s say we start with a 2TB disk; then we’ll have 200 metaslabs of 10GB each. If we then grow the LUN to 4TB, we’ll have 400 metaslabs. If we started instead from a 200GB LUN that we eventually grew to 4TB, we’d end up with 4,000 metaslabs (each 1GB). Further, if we started with a 40TB LUN (why not) and grew it by 100GB, ZFS would not have enough space to allocate a full metaslab, and we’d therefore not be able to use that additional space.
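
Here’s a small sketch that reproduces the arithmetic — written for this post, not copied from illumos; highbit64() stands in for the kernel’s highbit(), and SKETCH_SPA_MAXBLOCKSHIFT assumes the 128K value of SPA_MAXBLOCKSHIFT at the time. Note that the real formula rounds the metaslab size to a power of two, so the actual counts land near, not exactly on, the round numbers used above:

#include <stdint.h>
#include <stdio.h>

#define SKETCH_SPA_MAXBLOCKSHIFT        17      /* 128K; assumed value */

/* Position of the highest set bit, counting from 1 (like illumos highbit()). */
static int
highbit64(uint64_t x)
{
        int h = 0;

        while (x != 0) {
                h++;
                x >>= 1;
        }
        return (h);
}

static void
show(const char *label, uint64_t initial_asize, uint64_t grown_asize)
{
        int ms_shift = highbit64(initial_asize / 200);

        if (ms_shift < SKETCH_SPA_MAXBLOCKSHIFT)
                ms_shift = SKETCH_SPA_MAXBLOCKSHIFT;

        /* The shift is fixed at vdev creation; expansion only adds metaslabs. */
        printf("%s: %llu MB metaslabs, %llu before growing, %llu after\n",
            label, (unsigned long long)((1ULL << ms_shift) >> 20),
            (unsigned long long)(initial_asize >> ms_shift),
            (unsigned long long)(grown_asize >> ms_shift));
}

int
main(void)
{
        show("2TB -> 4TB", 2ULL << 40, 4ULL << 40);
        show("200GB -> 4TB", 200ULL << 30, 4ULL << 40);
        show("40TB -> +100GB", 40ULL << 40, (40ULL << 40) + (100ULL << 30));
        return (0);
}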

At Delphix our metaslabs can become highly fragmented because most of our datasets use an 8K record size (read up on space maps to understand how metaslabs are managed — truly fascinating), and our customers often expand LUNs as a mechanism for adding more space. It’s not clear how much room there is for improvement, but these are curious phenomena that we intend to investigate along with the structure of space maps, the idiosyncrasies of the allocation path, and other aspects of ZFS as we continue to understand and improve performance. Stay tuned.

Exactly 10 years ago today, Jeff Bonwick and Matt Ahrens got their first ZFS prototype working in user-land. Jeff had scrapped his previous attempt at reinventing filesystems, working through the established filesystem management and engineering channels at Sun, and this time started with a clean sheet of paper. Matt had joined Sun that June shortly after graduating from Brown University. Both prodigious coders, the duo, in remarkably short order, showed us a glint of what ZFS would be. A year later, the master and apprentice had ZFS working in the kernel, moving data from end to end. Three years after that, standing in front of a team of a dozen engineers, Matt typed ‘putback’ to integrate ZFS into Solaris. The distance ZFS has traveled these past 10 years has been monumental, and ZFS has indelibly impacted the industry. ZFS is one of the load-bearing pillars here at Delphix; without it, our task would have been too ambitious to even begin. Congratulations to our own Matt Ahrens on this milestone, as well as to Jeff, and everyone else who has contributed to ZFS over the last 10 years including the growing community building new products around ZFS and illumos.

Update: Check out Matt’s blog post on the subject.

Apple recently announced a new iMac model — in itself, only as notable as the seasons — but with an interesting option: users can choose to have both an HDD and an SSD. Their use of these two is absolutely pedestrian, as noted on the Apple store:

If you configure your iMac with both the solid-state drive and a Serial ATA hard drive, it will come preformatted with Mac OS X and all your applications on the solid-state drive. Then you can use the hard drive for videos, photos, and other files.

Fantastic. This is hierarchical storage management (HSM) as it was conceived by Alan Turing himself as he toiled against the German war machine (if I remember my history correctly). The onus of choosing the right medium for given data is completely on the user. I guess the user who forks over $500 for this fancy storage probably is savvy enough to copy files to and fro, but aren’t computers pretty good at automating stuff like that?

Back at Sun, we built the ZFS Hybrid Storage Pool (HSP), a system that combines disk, DRAM, and, yes, flash. A significant part of this is the L2ARC implemented by Brendan Gregg that uses SSDs as a cache for often-used data. Hey Apple, does this sound useful?

Curiously, the new iMacs contain the Intel Z68 chipset which provides support for SSD caching. Similar to what ZFS does in software, the Intel chipset stores a subset of the data from the HDD on the SSD. By the time the hardware sees the data, it’s stripped of all semantic meaning — it’s just offsets and sizes. In ZFS, however, the L2ARC knows more about the data and so can do a better job of retaining data that’s relevant. But iMac users suffer a more fundamental problem: the SSD caching feature of the Z68 doesn’t appear to be used.

It’s a shame that Apple abandoned the port of ZFS they had completed, ostensibly due to “licensing issues” (DTrace in Mac OS X uses the same license — perhaps a subject for another blog post). Fortunately, Ten’s Complement has picked up the reins. Apple systems with HDDs and SSDs could be the ideal use case for ZFS in the consumer environment.

Chris George from DDRdrive put together a great presentation at the OpenStorage summit looking at the ZFS intent log (ZIL), and how their product is particularly well-suited as a ZIL device. Chris did an especially interesting analysis of the I/O pattern ZFS generates to ZIL devices (using DTrace, of course). Writes to a single ZFS dataset are almost 100% sequential, but with activity to multiple datasets, writes become significantly more non-sequential. The ZIL was initially designed to accelerate performance with a dedicated hard drive, but the Hybrid Storage Pool found a significantly better ZIL device in write-optimized, flash-based SSDs.

In the 7000 series, the performance of these SSDs — called Logzillas — isn’t particularly sensitive to random write patterns. Less sophisticated, cheaper SSDs are more significantly impacted by randomness in that both performance and longevity can suffer.

Chris concludes that NV-DRAM is better suited than flash for the ZIL (Oracle’s Logzilla, built by STEC, actually contains a large amount of NV-DRAM). I completely agree; further, if HDDs and commodity SSDs continue to be targeted as ZIL devices, ZFS could and should do more to ensure that writes are sequential.
