Expand-O-Matic RAID-Z

I was having a conversation with an OpenBSD user and developer the other day, and he mentioned some ongoing work in the community to consolidate support for RAID controllers. The problem, he was saying, was that each controller had a different administrative model and utility — but all I could think was that the real problem was the presence of a RAID controller in the first place! As far as I’m concerned, ZFS and RAID-Z have obviated the need for hardware RAID controllers.

ZFS users seem to love RAID-Z, but a frustratingly frequent request is to be able to expand the width of a RAID-Z stripe. While the ZFS community may care about solving this problem, it’s not the highest priority for Sun’s customers and, therefore, for the ZFS team. It’s common for a home user to want to increase his total storage capacity by a disk or two at a time, but enterprise customers typically want to grow by multiple terabytes at once so adding on a new RAID-Z stripe isn’t an issue. When the request has come up on the ZFS discussion list, we have, perhaps unhelpfully, pointed out that the code is all open source and ready for that contribution. Partly, it’s because we don’t have time to do it ourselves, but also because it’s a tricky problem and we weren’t sure how to solve it.

Jeff Bonwick did a great job explaining how RAID-Z works, so I won’t go into it too much here, but the structure of RAID-Z makes it a bit trickier to expand than other RAID implementations. On a typical RAID with N+M disks, N sectors of data are written along with M sectors of parity. Those N data sectors may contain unrelated data, so modifying the data on just one disk involves reading the data off that disk and then updating both that data and the parity. Expanding a RAID stripe in such a scheme is as simple as adding a new disk and updating the parity (if necessary). With RAID-Z, blocks are never rewritten in place, and there may be multiple logical RAID stripes (and multiple parity sectors) in a given row; we therefore can’t expand the stripe nearly as easily.

A couple of weeks ago, I had lunch with Matt Ahrens to come up with a mechanism for expanding RAID-Z stripes — we were both tired of having to deflect reasonable requests from users — and, lo and behold, we figured out a viable technique that shouldn’t be very tricky to implement. While Sun still has no plans to allocate resources to the problem, this roadmap should lend credence to the suggestion that someone in the community might work on the problem.

The rest of this post will discuss the implementation of expandable RAID-Z; it’s not intended for casual users of ZFS, and there are no alchemic secrets buried in the details. It would probably be useful to familiarize yourself with the basic structure of ZFS, space maps (totally cool by the way), and the code for RAID-Z.

Dynamic Geometry

ZFS uses vdevs — virtual devices — to store data. A vdev may correspond to a disk or a file, or it may be an aggregate such as a mirror or RAID-Z. Currently the RAID-Z vdev determines the stripe width from the number of child vdevs. To allow for RAID-Z expansion, the geometry would need to be a more dynamic property. The storage pool code that uses the vdev would need to determine the geometry for the current block and then pass that as a parameter to the various vdev functions.

There are two ways to record the geometry. The simplest is to use the GRID bits (an 8-bit field) in the DVA (Device Virtual Address) which have already been set aside, but are currently unused. In this case, the vdev would need a new callback to set the contents of the GRID bits, and then a parameter to several of its other functions to pass in the GRID bits to indicate the geometry of the vdev when the block was written. An alternative approach suggested by Jeff and Bill Moore is something they call time-dependent geometry. The basic idea is that we store a record each time the geometry of a vdev is modified and then use the creation time of a block to infer the geometry to pass to the vdev. This has the advantage of conserving precious bits in the fixed-width DVA (though at 128 bits it’s still quite big), but it is a bit more complex since it would require essentially new metadata hanging off each RAID-Z vdev.
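To make the time-dependent idea concrete, here is a minimal sketch in Python, not the actual ZFS code: the class, method names, and the use of a sorted (txg, width) list are all invented for the example. The lookup assumes block birth times are transaction group numbers, which are monotonic.

```python
import bisect

class RaidzGeometryLog:
    """Hypothetical per-vdev record of geometry changes: a sorted list of
    (txg, width) pairs, one appended each time the vdev is widened."""

    def __init__(self, initial_width):
        # txg 0 carries the width the vdev was created with
        self.changes = [(0, initial_width)]

    def expand(self, txg, new_width):
        # record that the stripe spans new_width children as of this txg
        self.changes.append((txg, new_width))

    def width_at(self, birth_txg):
        # find the last geometry change at or before the block's birth txg
        i = bisect.bisect_right(self.changes, (birth_txg, float("inf"))) - 1
        return self.changes[i][1]

geom = RaidzGeometryLog(initial_width=4)   # a 3+1 RAID-Z
geom.expand(txg=1000, new_width=5)         # widened to 4+1

assert geom.width_at(999) == 4    # block written before the expansion
assert geom.width_at(1000) == 5   # block written at or after it
```

The vdev would consult something like `width_at()` for each block rather than its current child count; the GRID-bits alternative would instead carry the width in the block pointer itself.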

Metaslab Folding

When the user requests a RAID-Z vdev be expanded (via an existing or new zpool(1M) command-line option) we’ll apply a new fold operation to the space map for each metaslab. This transformation will take into account the space we’re about to add with the new devices. Each range [a, b] under a fold from width n to width m will become

[ m * (a / n) + (a % n), m * (b / n) + (b % n) ]
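Assuming the divisions above are integer divisions, the fold can be sketched in a few lines of Python; `fold_range` is a hypothetical name for illustration only:

```python
def fold_range(a, b, n, m):
    """Map a range [a, b] of offsets laid out with stripe width n onto the
    equivalent range under the new stripe width m: each offset's row index
    (offset // n) is rescaled to the new row size m, and its position
    within the row (offset % n) is preserved."""
    return (m * (a // n) + a % n, m * (b // n) + b % n)

# Widening a 4-wide vdev to 5: the range [0, 7] spans rows 0 and 1,
# and row 1 now starts at offset 5 rather than 4.
assert fold_range(0, 7, 4, 5) == (0, 8)
```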

The alternative would have been to account for m – n free blocks at the end of every stripe, but that would have been overly onerous both in terms of processing and in terms of bookkeeping. For space maps that are resident, we can simply perform the operation on the AVL tree by iterating over each node and applying the necessary transformation. For space maps which aren’t in core, we can do something rather clever: by taking advantage of the log structure, we can simply append a new type of space map entry that indicates that this operation should be applied. Today we have allocated, free, and debug; this would add fold as an additional operation. We’d apply that fold operation to each of the 200 or so space maps for the given vdev. Alternatively, using the idea of time-dependent geometry above, we could simply append a marker to the space map and access the geometry from that repository.
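A toy replay of such a log might look like the following. This is a hedged sketch under invented names: it only handles allocations that exactly match a prior free entry, whereas real space maps merge and split ranges in the AVL tree.

```python
# Hypothetical replay of a log-structured space map in which FOLD has been
# added alongside ALLOC and FREE. Ranges are (start, end) offset pairs.

def fold_range(a, b, n, m):
    # the fold transformation from the post, with integer division
    return (m * (a // n) + a % n, m * (b // n) + b % n)

def replay(log):
    free = set()  # stand-in for the in-core AVL tree of free ranges
    for entry in log:
        op = entry[0]
        if op == "free":
            free.add((entry[1], entry[2]))
        elif op == "alloc":
            free.discard((entry[1], entry[2]))
        elif op == "fold":
            # rewrite every extant range under the new geometry; the
            # earlier on-disk log entries don't need to be touched
            n, m = entry[1], entry[2]
            free = {fold_range(a, b, n, m) for (a, b) in free}
    return free

log = [("free", 0, 3), ("free", 4, 7), ("alloc", 0, 3), ("fold", 4, 5)]
# the surviving free range [4, 7] becomes [5, 8] under the new width
```

The appeal of the log-structured approach is visible here: the fold costs one appended entry, and the transformation is applied lazily whenever the space map is next loaded and replayed.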

Normally, we only rewrite the space map when the on-disk log structure is twice as large as necessary. I’d argue that the fold operation should always trigger a rewrite, since processing it always requires an O(n) operation, but that’s really an ancillary point.

vdev Update

At the same time as the previous operation, the vdev metadata will need to be updated to reflect the additional device. This is mostly just bookkeeping, and a matter of chasing down the relevant code paths to modify and augment.

With the steps above, we’re actually done, for some definition of done: new data will be written in stripes that include the newly added device. The problem is that extant data will still be stored in the old geometry, and most of the capacity of the new device will be inaccessible. The solution is to scrub the data, reading off every block and rewriting it to a new location. Currently this isn’t possible in ZFS, but Matt and Mark Maybee have been working on something they call block pointer rewrite, which is needed to solve a variety of other problems and nicely completes this solution as well.

That’s It

After Matt and I had finished thinking this through, I think we were both pleased by the relative simplicity of the solution. That’s not to say that implementing it is going to be easy — there are still plenty of gaps to fill in — but the basic algorithm is sound. A nice property that falls out is that in addition to changing the number of data disks, the same mechanism could be used to add an additional parity disk to go from single- to double-parity RAID-Z — another common request.

So I can now extend a slightly more welcoming invitation to the ZFS community to engage on this problem and contribute in a very concrete way. I’ve posted some diffs which I used to sketch out some ideas; those might be a useful place to start. If anyone would like to create a project on OpenSolaris.org to host any ongoing work, I’d be happy to help set that up.

Posted on April 7, 2008 at 9:59 pm by ahl · Permalink
In: ZFS

16 Responses


  1. Written by MC
    on April 7, 2008 at 11:10 pm

    Some of us have been waiting years to read this post :)

  2. Written by Adam Leventhal
    on April 8, 2008 at 12:07 am

    @MC that’s good to hear, but bear in mind, this just means there’s one less excuse to make it happen. It’s still going to require someone to do the heavy lifting.

  3. Written by Don MacAskill
    on April 8, 2008 at 7:32 am

    I can’t wait until you’re right – for ZFS to have "obviated the need for hardware RAID controllers" but I’m afraid we’re just not there yet.
    The big, glaring hole is the lack of a battery-backed write cache.
    I’m aware of some of the upcoming hardware solutions to this problem, but they’re not here yet, and until they are, ZFS isn’t a complete solution for an enterprise needing lots of IOPS. :(

  4. Written by Wes W.
    on April 8, 2008 at 8:18 am

    Does a system UPS circumvent the need for a battery-backed write cache?

  5. Written by Adam Leventhal
    on April 8, 2008 at 9:05 am

    @Don I think you’re conflating two issues. RAID controllers often supply a battery-backed, non-volatile cache and this can certainly reduce latency and drive IOPS as synchronous writes will be bounded on the speed of that cache rather than on the speed of the drives. A fairly recent addition to ZFS is support for separate intent log devices. If you configure your pool with a fast, non-volatile intent log, you can easily push IOPS to match what you’d see with a RAID controller. I recently submitted an article to the ACM that discusses the issue of improving latency and IOPS. I’ll post here when that’s published.
    @Wes Unfortunately not. Power loss is just one possible cause for service interruption. Software failures have similar results and aren’t mitigated with a UPS. Sadly.

  6. Written by Nigel Smith
    on April 8, 2008 at 1:32 pm

    Adam, many thanks for writing up these notes.
    Let’s hope someone will pick this up & run with it.
    BTW, you have some bad links in the post:
    Your link ‘the basic structure of ZFS’ just links back to your post.
    And Bill Moore’s blog is simply http://blogs.sun.com/bill/
    (A pity he does not post more often.)
    Your link is to Bill Moffitt’s blog!

  7. Written by Adam Leventhal
    on April 8, 2008 at 1:46 pm

    @Nigel Thanks for catching the problems — I’ve fixed those. Next time I see Bill I’ll let him know that he’s in danger of dipping below one post per year ;-)

  8. Written by Andrei Dorofeev
    on April 8, 2008 at 3:03 pm

    Great post! I think shrinking of raid-z pools would also be useful. That’s assuming there’s enough free space in those pools to allow whole data (not parity) disks to be removed.

  9. Written by Adam Leventhal
    on April 8, 2008 at 11:03 pm

    @Andrei Sure. First we need to be able to remove top level vdevs (that work is ongoing) and shrinking a RAID-Z vdev becomes a degenerate case. The order of operations would be just a bit different from the above: update bookkeeping, scrub to move data off the device, and then remove the device.

  10. Written by Jeff Bonwick
    on April 10, 2008 at 12:11 am

    The metaslab fold is clever — much nicer than adding a zillion frags. I still prefer time-dependent geometry to using the GRID bits, however, because that way we’re storing the minimum possible information. I want to reserve the GRID bits for properties that *must* be stored per-block, like data-dependent replication models. (More on that later.)
    The only real problem is that when adding (say) one disk, you lose the guarantee that even under theoretical worst-case fragmentation, all space is still usable via gang blocks. However, we can address this by doing a re-layout in the background once we have bp rewrite support.

  11. Written by Adam Leventhal
    on April 10, 2008 at 12:24 am

    @Jeff I figured just posting this would be a better way to elicit a response anyway ;-)
    While I agree that bits in the block pointer are a precious resource, someone could certainly prototype this whole thing by using the GRID bits to start and then swap in time-dependent geometry later on. And I think the fold is clever too — of course, it’s all made possible by the cleverness of space maps.
    This all presumes the existence of BP rewrite anyway. You’re probably adding a disk because your pool is mostly full; in that case even an optimal algorithm wouldn’t have enough contiguous free space to do anything useful.

  12. Written by Haudy Kazemi
    on April 12, 2008 at 12:38 am

    Here’s an idea that I ran by Neil Perrin a few days ago to grow a raidz2 via use of slices and zpool replace.
    The thought is to partition/slice 5 drives into 10 slices, then make a raidz2 of those 10 slices (2 slices per drive). With raidz2, 5 drives, and 2 slices per drive, I believe I’d start out with a redundancy level equivalent of raidz1 over 5 drives.
    To grow the pool, the next step is to ‘zpool replace’ the 10 slices with 10 larger slices on 5 larger drives (or even 10 drives). Replacing the 5 smaller drives one by one with 5 larger drives, with a scrub in between each replacement should maintain raidz1 level redundancy. To give the pool true raidz2 level redundancy, one would use 10 real drives.
    To add more of the same size drive, this technique should also work to turn, say, a 5x500gb drive array into a 10x500gb drive array, by replacing the original 10x250gb slices with 10x500gb drives (i.e. reuse the original 500gb drives and add 5 new 500gb drives, to create a 10x500gb drive raidz2). I was able to work out the drive replacement order on paper to do this in-place.
    If this technique works (I’ve tested the zpool replace part, but not create slices part), this would be one way to grow a raidz2 by replacing the existing drives in the pool, and starting with only 5 drives, it is a plausible case for a small budget user (e.g. home user). With 5 drives, N-1 remain usable for storage (like in raidz1), while an expanded array to 10 drives turns into a raidz2 level redundancy N-2). One reason I targeted drives in units of 5, is that internal SATA enclosures that convert 3×5.25" bays to 5×3.5" hot-swap bays are available for under $200. These seem to be nearly ideal for a small home server.
    Granted it is not as flexible as simply adding one drive at a time, but adding/replacing in blocks of 5 may be good enough until the algorithm Adam is proposing (or something similar/equivalent) is actually implemented.
    Buying 5 drives at one shot isn’t unreasonably expensive even for home users… Best Buy had 320gb Seagate SATAs for $60 last November; 5×$60 = $300, and recently had 1 TB Seagate SATAs for $200 each. Frys recently had 750gb Seagate SATAs for $140. 500gb Seagate SATAs have been near $100.

  13. Written by Adam Leventhal
    on April 14, 2008 at 10:59 am

    @Haudy That’s certainly an interesting idea, and it seems like it could be used to solve the problem. It does, however, sound like quite an administrative burden, and I’m fairly sure that it would impose a significant performance penalty as ZFS would be treating each drive as 10 distinct devices. But it’s hard to complain about the bird in hand…

  14. Written by Andrey Kuzmin
    on May 2, 2008 at 1:01 pm

    > I still prefer time-dependent geometry to using
    > the GRID bits, however, because that way we’re
    > storing the minimum possible information.
    > Posted by Jeff Bonwick on April 10, 2008 at 12:11
    > AM PDT #
    Me too, but I’m somewhat concerned about using time: making geometry dependent on the wall clock doesn’t look very reliable. Is there anything in zfs with guaranteed monotonicity, a transaction id for instance?

  15. Written by Wout Mertens
    on May 5, 2008 at 8:38 am

    I wish that instead of implementing this, a parity scheme would be added to the "copies=xx" attribute.
    The "copies=xx" logic is the higher-layer, flexible implementation of a mirror. Adding a parity scheme would gain ZFS a higher-layer, flexible implementation of a RAID-Z pool. So in addition to RAID-Z zpools there would be RAID-Z filesystems.
    Sure, there would be a performance hit relative to RAID-Z, but the target audience for this tweak is the home user. They would probably prefer extreme flexibility. The enterprise users would have no changes to their code paths, keeping it lean and mean.
    Just to make sure I’m not misunderstood, I mean that a block in a filesystem with "copies=8z" would be split up in 7 data blocks and one parity block (6+2 for "copies=8z2"). Those blocks would then be optimally distributed by the block allocator.
    The advantages?
    - Redundancy/time/space trade-off tweakable per filesystem
    - The block allocator would maintain RAID-Z properties as long as there’s enough space on enough vdevs.
    - Growing the RAID-Z filesystem would be as simple as adding or replacing a disk
    - If there aren’t enough vdevs, allocation can still continue, although the redundancy would be impacted. A disk could then no longer fail without data loss, but surface defects would still be fixed. The user should be notified that he should add diskspace + scrub.
    - No impact to enterprise users
    The disadvantages?
    - Performance hit of unknown magnitude due to explicit block allocation versus geometry-enforced allocation.
    - Added complexity of splitting up the blocks, I don’t even know if that’s possible. I assume that the case of 2+1 could be supported with the 3 block pointers, but for more blocks I suppose a sort of redundant linked list would be in order?
    I could have my ZFS internals all mixed up, apologies if so. I just wish ZFS was more like a Drobo in the "just add disks" department. Doing the above would make it so, no?

  16. Written by Adam Leventhal
    on May 7, 2008 at 2:59 pm

    @Wout Those are some really good ideas. If I can rephrase a bit, the idea would be to add disks to a storage pool without specifying the layout (e.g. RAID-Z, mirror, etc.), and then to choose the replication and fault resiliency at the dataset level. Jeff and I discussed this a bit when we were talking about this expandable RAID-Z stuff — apparently one of the initial ideas for ZFS was to do something like this. For any dataset you could choose mirroring, RAID-Z, or whatever. One of the problems is that a scheme like this requires the kernel to have extensive knowledge of the fault domains so that a space allocation would be sure to provide the requisite reliability — the kind of knowledge that an administrator applies when he chooses a particular zpool layout and configuration. For example, if you’re making a zpool on a thumper, you want to lay out your RAID-Z2 groups so that the loss of a fan or controller wouldn’t cause data loss.
    This is all possible, and could lead to a much better system. We could probably do this with a new vdev type to which it was possible to communicate fault domains and allocation policies. It’s not a small amount of work, and, unfortunately, the problem it solves isn’t the biggest problem with ZFS today. I look forward to the day Sun picks up some projects like this, or the greater ZFS community gets involved with some tough coding projects like this.
