Adam Leventhal's blog


Double-parity RAID, or RAID-6, is the de facto industry standard for storage; when I started talking about triple-parity RAID for ZFS earlier this year, the need wasn’t always immediately obvious. Double-parity RAID, of course, provides protection from up to two failures (data corruption or whole-drive failure) within a RAID stripe. The necessity of triple-parity RAID arises from the observation that while hard drive capacity has roughly followed Kryder’s law, doubling annually, hard drive throughput has improved far more modestly. Accordingly, the time to populate a replacement drive in a RAID stripe is increasing rapidly. Today, a 1TB SAS drive takes about 4 hours to fill at its theoretical peak throughput; in a real-world environment that number can easily double, and the 2TB and 3TB drives expected this year and next won’t move data much faster. Those long periods spent in a degraded state increase the exposure to the bit errors and additional drive failures that would in turn lead to data loss. The industry moved to double-parity RAID because one parity disk was insufficient; longer resilver times mean that we’re spending more and more time effectively back at single-parity, from which it was obvious that double-parity would soon become insufficient. (I’m working on an article that examines these phenomena quantitatively, so stay tuned… update Dec 21, 2009: you can find the article here.)

Last week I integrated triple-parity RAID into ZFS. You can take a look at the implementation and the details of the algorithm here, but rather than describing the specifics, I wanted to describe its genesis. For double-parity RAID-Z, we drew on the work of Peter Anvin, which was also the basis of RAID-6 in Linux. This work was more or less a tutorial for systems programmers, simplifying some of the more subtle underlying mathematics with an eye towards optimization. While a systems programmer by trade, I have a background in mathematics, so I was interested to understand the foundational work. James S. Plank’s paper A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems describes a technique for generalized N+M RAID. Not only was it simple to implement, but it could easily be made to perform well. Yet I struggled for far too long trying to make the code work before discovering trivial flaws in the math itself. A bit more digging revealed that the author himself had published Note: Correction to the 1997 Tutorial on Reed-Solomon Coding 8 years later addressing those same flaws.

Predictably, the mathematically accurate version was far harder to optimize, stifling my enthusiasm for the generalized case. My more serious concern was that the double-parity RAID-Z code suffered from some similar systemic flaw. That fear was quickly assuaged as I verified that the RAID-6 algorithm was sound. Further, from this investigation I was able to find a related method for doing triple-parity RAID-Z that was nearly as simple as its double-parity cousin. The math is a bit dense, but the key observation was that since 3 is the smallest factor of 255 (the largest value representable by an unsigned byte), it was possible to find exactly 3 different seed or generator values after which there were collections of failures that formed uncorrectable singularities. Using that technique I was able to implement a triple-parity RAID-Z scheme that performed nearly as well as the double-parity version.

As far as generic N-way RAID-Z goes, it’s still something I’d like to add to ZFS. Triple-parity will suffice for quite a while, but we may want more parity sooner for a variety of reasons. Plank’s revised algorithm is an excellent start. The test will be whether it can be made to perform well enough or whether some new clever algorithm will need to be devised. Now, as for what to call these additional RAID levels, I’m not sure. RAID-7 or RAID-8 seems a bit ridiculous, and RAID-TP and RAID-QP aren’t any better. Fortunately, in ZFS triple-parity RAID is just raidz3.

A little over three years ago, I integrated double-parity RAID-Z into ZFS, a feature expected of enterprise-class storage. This was in the early days of Fishworks, when much of our focus was on addressing functional gaps. The move to triple-parity RAID-Z comes in the wake of a number of our unique advancements to the state of the art, such as DTrace-powered Analytics and the Hybrid Storage Pool, as the Sun Storage 7000 series products meet and exceed the standards set by the industry. Triple-parity RAID-Z will, of course, be a feature included in the next major software update for the 7000 series (2009.Q3).

When ZFS first started, it was just Jeff trying to pair old problems with new solutions in margins too small to contain either. Then Matt joined up to bring some young blood to the project. By the time the project putback, the team had grown to more than a dozen. And now I’ve been pulled in — if only for a cameo.

When ZFS first hit the streets, Jeff wrote about RAID-Z, an implementation of RAID designed for ZFS. RAID-Z improves upon previous RAID schemes primarily in that it eliminates the so-called “write hole” by using a full (and variable-sized) stripe for all write operations. It’s worth noting that RAID-Z exploits the fact that ZFS is an end-to-end solution such that metadata (traditionally associated with the filesystem layer) is used to interpret the RAID layout on disk (an operation usually ascribed to a volume manager). In that post, Jeff mentioned that a double-parity version of RAID-Z was in the works. What he actually meant was that he had read a paper and thought it might work out — you’d be forgiven for inferring that actual code had been written.

Over lunch, Bill — yet another elite ZFS hacker — mentioned double-parity RAID-Z and their plans for implementing it. I pressed for details, read the paper, got interested in the math, and started yakking about it enough for Bill to tell me to put up or shut up.

RAID-6

The basic notion behind double-parity RAID or RAID-6 is that a stripe can survive two failures without losing data where RAID-5 can survive only a single failure. There are a number of different ways of implementing double-parity RAID; the way Jeff and Bill had chosen (due to its computational simplicity and lack of legal encumbrance) was one described by H. Peter Anvin in this paper. It’s a nice read, but I’ll attempt to summarize some of the math (warning: this summary is going to be boring and largely unsatisfying so feel free to skip it).

For a given stripe of n data blocks, D0 .. Dn-1, RAID-5 computes the contents of the parity disk P by taking the bitwise XOR of those data blocks. If any Dn is corrupted or missing, we can recover it by taking the XOR of all other data blocks with P. With RAID-6, we need to compute another parity disk Q using a different technique such that Q alone can reconstruct any Dn and P and Q together can reconstruct any two data blocks.
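To make the XOR relationship concrete, here’s a minimal sketch of generating P and rebuilding a single missing block for a stripe of n data blocks of len bytes each. The function names are mine, not the ZFS code:

#include <stdint.h>
#include <stddef.h>

/*
 * RAID-5-style parity: P is the bitwise XOR of the n data blocks,
 * each len bytes long.
 */
void
raid5_gen_p(uint8_t **data, size_t n, size_t len, uint8_t *p)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t v = 0;
		for (size_t d = 0; d < n; d++)
			v ^= data[d][i];
		p[i] = v;
	}
}

/*
 * Recover a missing data block x by XORing P with the surviving blocks.
 */
void
raid5_recover_p(uint8_t **data, size_t n, size_t len, const uint8_t *p,
    size_t x)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t v = p[i];
		for (size_t d = 0; d < n; d++)
			if (d != x)
				v ^= data[d][i];
		data[x][i] = v;
	}
}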

To talk about this, it’s easier — believe it or not — to define a Galois field (or a finite field as I learned it) over the integers [0..255] — the values that can be stored in a single byte. The field’s addition operation (+) is just bitwise XOR. Multiplication (×) by 2 is given by the following bitwise operations, where y = x × 2 and x7..x0 and y7..y0 are the bits of x and y:

y7 = x6
y6 = x5
y5 = x4
y4 = x3 + x7
y3 = x2 + x7
y2 = x1 + x7
y1 = x0
y0 = x7
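In C, that bit shuffle is a shift followed by a conditional XOR; the constant 0x1d encodes exactly which bits of y pick up x7 in the table above. A sketch:

static uint8_t
gf_mul2(uint8_t x)
{
	/* Shift left; if x7 was set, reduce by XORing with 0x1d. */
	return ((uint8_t)(x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}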

A couple of simple things worth noting: addition (+) is the same as subtraction (-), 0 is the additive identity and the multiplicative annihilator, and 1 is the multiplicative identity. Slightly more subtle: each element of the field except for 0 (i.e. [1..255]) can be represented as 2^n for some n. And importantly: x^-1 = x^254. Also note that x × y can be rewritten as 2^(log x) × 2^(log y) or 2^(log x + log y) (where + in that case is normal integer addition).
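That observation is what makes multiplication cheap in practice: precompute logarithm and exponent tables over the generator 2, then multiply with a couple of table lookups. A sketch (the table and function names are mine), assuming gf_mul2() from above:

static uint8_t gf_exp[510];	/* doubled so log a + log b needs no modulo */
static uint8_t gf_log[256];

static void
gf_init(void)
{
	uint8_t v = 1;
	for (int i = 0; i < 255; i++) {
		gf_exp[i] = gf_exp[i + 255] = v;
		gf_log[v] = i;
		v = gf_mul2(v);	/* successive powers of 2 hit all of [1..255] */
	}
}

static uint8_t
gf_mul(uint8_t a, uint8_t b)
{
	if (a == 0 || b == 0)
		return (0);
	return (gf_exp[gf_log[a] + gf_log[b]]);
}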

We compute Q as

Q = 2^(n-1) × D0 + 2^(n-2) × D1 + … + Dn-1

or equivalently, in Horner form,

Q = (…((D0 × 2 + D1) × 2 + D2) × 2 + …) × 2 + Dn-1.
Computing Q isn’t much slower than computing P since we’re just dealing with a few simple bitwise operations.
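Here’s the Horner form as a sketch in C (again mine, not the actual RAID-Z code), assuming gf_mul2() from above; P falls out of the same loop:

void
raid6_gen_pq(uint8_t **data, size_t n, size_t len, uint8_t *p, uint8_t *q)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t pv = data[0][i];
		uint8_t qv = data[0][i];
		for (size_t d = 1; d < n; d++) {
			pv ^= data[d][i];
			qv = gf_mul2(qv) ^ data[d][i];	/* Horner step */
		}
		p[i] = pv;
		q[i] = qv;
	}
}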

With P and Q we can recover from any two failures. If Dx fails, we can repair it with P. If P also fails, we can recover Dx using Q: compute Qx where Qx = Q + 2^(n-1-x) × Dx (easily done by performing the same computation as for generating Q but with Dx set to 0); Dx is then (Qx + Q) / 2^(n-1-x) = (Qx + Q) × 2^(x+1-n). Once we solve for Dx, we recompute P as we did initially.
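As a sketch (mine, with the helpers above, and assuming a stripe width of at most 255): recompute Q with Dx zeroed, then divide out Dx’s coefficient using the log/exp tables:

void
raid6_recover_q(uint8_t **data, size_t n, size_t len, const uint8_t *q,
    size_t x)
{
	size_t e = n - 1 - x;	/* Dx's coefficient in Q is 2^e */

	for (size_t i = 0; i < len; i++) {
		/* Qx for this byte, with Dx treated as zero. */
		uint8_t qx = 0;
		for (size_t d = 0; d < n; d++)
			qx = gf_mul2(qx) ^ (d == x ? 0 : data[d][i]);
		uint8_t diff = qx ^ q[i];	/* equals 2^e * Dx */
		data[x][i] = (diff == 0) ? 0 :
		    gf_exp[(gf_log[diff] + 255 - e) % 255];
	}
	/* P is then regenerated from the repaired stripe. */
}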

When two data disks, Dx and Dy, are missing, that’s when the rubber really meets the road. We compute Pxy and Qxy such that Pxy + Dx + Dy = P and Qxy + 2^(n-1-x) × Dx + 2^(n-1-y) × Dy = Q (as before, by running the same computations with Dx and Dy set to 0). Using those two expressions and some basic algebra, we can solve for Dx and then plug that in to solve for Dy. The actual expressions are a little too hairy for HTML, but you can check out equation 16 in the paper or the code for the gory details.
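The algebra is more compact in C than in HTML. With a = 2^(n-1-x) and b = 2^(n-1-y), substituting Dy = (P + Pxy) + Dx into the Q equation yields Dx = (b × (P + Pxy) + (Q + Qxy)) / (a + b). A sketch using the helpers above (not the actual RAID-Z code):

void
raid6_recover_2(uint8_t **data, size_t n, size_t len, const uint8_t *p,
    const uint8_t *q, size_t x, size_t y)
{
	uint8_t a = gf_exp[n - 1 - x];		/* Dx's coefficient in Q */
	uint8_t b = gf_exp[n - 1 - y];		/* Dy's coefficient in Q */
	uint8_t inv = gf_exp[255 - gf_log[a ^ b]];	/* 1 / (a + b) */

	for (size_t i = 0; i < len; i++) {
		/* P and Q for this byte with Dx and Dy zeroed. */
		uint8_t pxy = 0, qxy = 0;
		for (size_t d = 0; d < n; d++) {
			uint8_t v = (d == x || d == y) ? 0 : data[d][i];
			pxy ^= v;
			qxy = gf_mul2(qxy) ^ v;
		}
		uint8_t ap = pxy ^ p[i];	/* Dx + Dy */
		uint8_t aq = qxy ^ q[i];	/* a*Dx + b*Dy */
		uint8_t dx = gf_mul(gf_mul(b, ap) ^ aq, inv);
		data[x][i] = dx;
		data[y][i] = ap ^ dx;
	}
}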

Double-Parity RAID-Z

As of build 42 of OpenSolaris, RAID-Z comes in a double-parity version to complement the existing single-parity version — and it only took about 400 additional lines of code. Check out the code here. Of particular interest are the code to generate both parity blocks and the code to do double block reconstruction. What’s especially cool about ZFS is that we don’t just blithely reconstruct data, but we can verify it against the known checksum. This means, for example, that we could get back seemingly valid data from all disks, but fail the checksum; in that case we’d first try reconstructing each individual block, and then try reconstructing every pair of blocks until we’ve found something that checksums. You can see the code for combinatorial reconstruction here.
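The combinatorial search itself is straightforward. Here’s a sketch of the idea with hypothetical stand-ins — checksum_ok(), reconstruct1(), and reconstruct2() are not the actual ZFS interfaces, and a real implementation restores the original data between attempts:

#include <stdbool.h>
#include <stddef.h>

bool checksum_ok(void);			/* verify stripe against its checksum */
void reconstruct1(size_t x);		/* rebuild block x from P */
void reconstruct2(size_t x, size_t y);	/* rebuild blocks x and y from P, Q */

bool
try_reconstruct(size_t n)
{
	if (checksum_ok())
		return (true);
	for (size_t x = 0; x < n; x++) {	/* single-block attempts */
		reconstruct1(x);
		if (checksum_ok())
			return (true);
	}
	for (size_t x = 0; x < n; x++)		/* pairwise attempts */
		for (size_t y = x + 1; y < n; y++) {
			reconstruct2(x, y);
			if (checksum_ok())
				return (true);
		}
	return (false);
}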

Using raidz2

To make a double-parity RAID-Z vdev, specify raidz2 to zpool(1M):

# zpool create pool raidz2 c1t0d0 c1t0d1 c1t0d2 c1t0d3 c1t0d4

This will create a pool with a double-parity RAID-Z vdev of width 5 whose data can sustain up to two failures, whether corrupt data coming off the drives or drives that have failed or gone missing. The raidz vdev type continues to mean single-parity RAID-Z, as does the new alias raidz1.

Double-parity RAID-Z is probably going to supplant the use of its single-parity predecessor in many if not most cases. As Dave Hitz of NetApp helpfully noted in a recent post, double-parity RAID doesn’t actually cost you any additional space because you’ll typically have wider stripes. Rather than having two single-parity stripes of 5 disks each, you’ll have one double-parity stripe with 10 disks: the same capacity with extra protection against failures. It also shouldn’t cost you in terms of performance because the total number of disk operations will be the same, and the additional math, while slightly more complex, is still insignificant compared with actually getting bits on disk. So enjoy the extra parity.

