RAID – Adam Leventhal's blog

APFS in Detail: Data Integrity

June 19, 2016

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.

Data Integrity

Arguably the most important job of a file system is preserving data integrity. Here’s my data, don’t lose it, don’t change it. If file systems could be trusted absolutely then the “only” reason for backup would be the idiot operators (i.e. you and me). There are a few mechanisms that file systems employ to keep data safe.

Redundancy

APFS makes no claims with regard to data redundancy. As Apple’s Eric Tamura noted at WWDC, most Apple devices have a single storage device (i.e. one logical SSD) making RAID, for example, moot. Instead redundancy comes from lower layers such as Apple RAID (apparently a thing), hardware RAID controllers, SANs, or even the “single” storage devices themselves..

As an aside note that SSDs in most Apple products where APFS will run include multiple more-or-less independent NAND chips. High-end SSDs do implement data redundancy within the device, but it comes at the price of reduced capacity and performance. As noted above, the “flash-optimization” of APFS doesn’t actually extend much below the surface of the standard block device interface, but the raw materials for innovation are there.

Also, APFS removes the most common way of a user achieving local data redundancy: copying files. A copied file in APFS actually creates a lightweight clone with no duplicated data. Corruption of the underlying device would mean that both “copies” were damaged whereas with full copies localized data corruption would affect just one.

Crash Consistency

Computer systems can fail at any time—crashes, bugs, power outages, etc.—so file systems need to anticipate and recover from these scenarios. The old old old school method is to plod along and then have a special utility to check and repair the file system during boot (fsck, short for file system check). More modern systems labor to achieve an always consistent format, or only narrow windows of inconsistency, obviating the need for the full, expensive fsck. ZFS, for example, builds up new state on disk and then atomically transitions from the previous state to the new one with a single atomic operation.

Overwriting data creates the most obvious opening for inconsistency. If the file system needs to overwrite several regions there is a window where some regions represent the new state and some represent the former state. Copy-on-write (COW) is a method to avoid this by always allocating new regions and then releasing old ones for reuse rather than modifying data in-place. APFS claims to implement a “novel copy-on-write metadata scheme”; APFS lead developer Dominic Giampaolo emphasized the novelty of this approach without delving into the details. In conversation later, he made it clear that APFS does not employ the ZFS mechanism of copying all metadata above changed user data which allows for a single, atomic update of the file system structure.

It’s surprising to see that APFS includes fsck_apfs—even after asking Dominic I’m not sure why it would be necessary. For comparison I don’t believe there’s been an instance where fsck for ZFS would have found a problem that the file system itself didn’t already know how to detect. But Dominic was just as confused about why ZFS would forego fsck, so perhaps it’s just a matter of opinion.

Checksums

Notably absent from the APFS intro talk was any mention of checksums. A checksum is a digest or summary of data used to detect (and correct) data errors. The story here is surprisingly nuanced. APFS checksums its own metadata but not user data. The justification for checksumming metadata is strong: there’s relatively not much of it (so the checksums don’t consume much storage) and losing metadata can cast a potentially huge shadow of data loss. If, for example, metadata for a top level directory is corrupted then potentially all data on the disk could be rendered inaccessible. ZFS duplicates metadata (and triple duplicates top-level metadata) for exactly this reason.

Explicitly not checksumming user data is a little more interesting. The APFS engineers I talked to cited strong ECC protection within Apple storage devices. Both flash SSDs and magnetic media HDDs use redundant data to detect and correct errors. The engineers contend that Apple devices basically don’t return bogus data. NAND uses extra data, e.g. 128 bytes per 4KB page, so that errors can be corrected and detected. (For reference, ZFS uses a fixed size 32 byte checksum for blocks ranging from 512 bytes to megabytes. That’s small by comparison, but bear in mind that the SSD’s ECC is required for the expected analog variances within the media.) The devices have a bit error rate that’s tiny enough to expect no errors over the device’s lifetime. In addition, there are other sources of device errors where a file system’s redundant check could be invaluable. SSDs have a multitude of components, and in volume consumer products they rarely contain end-to-end ECC protection leaving the possibility of data being corrupted in transit. Further, their complex firmware can (does) contain bugs that can result in data loss.

The Apple folks were quite interested in my experience with regard to bit rot (aging data silently losing integrity) and other device errors. I’ve seen many instances where devices raised no error but ZFS (correctly) detected corrupted data. Apple has some of the most stringent device qualification tests for its vendors; I trust that they really do procure the best components. Apple engineers I spoke with claimed that bit rot was not a problem for users of their devices, but if your software can’t detect errors then you have no idea how your devices really perform in the field. ZFS has found data corruption on multi-million dollar storage arrays; I would be surprised if it didn’t find errors coming from TLC (i.e. the cheapest) NAND chips in some of Apple’s devices. Recall the (fairly) recent brouhaha regarding storage problems in the high capacity iPhone 6. At least some of Apple’s devices have been imperfect.

As someone who has data he cares about on a Mac, who has seen data lost from HFS, and who knows that even expensive, enterprise-grade equipment can lose data, I would gladly sacrifice 16 bytes per 4KB–less than 1% of my device’s size.

Scrub

As data ages you might occasionally want to check for bit rot. Likely fsck_apfs can accomplish this; as noted though there’s no data redundancy and no checksums for user data, so scrub would only help to find problems and likely wouldn’t help to correct them. And if it makes it any easier for Apple to reverse course, let’s say it’s for the el cheap-o drive I bought from Fry’s not for the gold-plated device I got from Apple.

Next in this series: Conclusions

On RAID-6 recovery

September 21, 2011

[latexpage] RAID algorithms have become a particular fascination of mine, and I recently read a very interesting paper that describes an optimization for RAID reconstruction (by Xiang, Xu, Lui, Chang, Pan, and Li). Before writing double- and triple-parity RAID algorithms for ZFS, I spent a fair bit of time researching the subject and have stayed interested since. Before describing the reconstruction optimization, some preamble is required. RAID algorithms can be divided into two buckets: one-dimensional algorithms, and multi-dimensional algorithms (terms of my own choosing; I haven’t seen this distinction discussed in literature).

One-dimensional RAID

A one-dimensional algorithm is one in which all data in a single RAID stripe is used to compute all parities. The RAID algorithm used by ZFS falls into this category as do most algorithms derived from Reed-Solomon coding. For a given RAID stripe’s set of data blocks, $D$, we can compute the nth parity block with some function $p(D, n)$. For example, ZFS, roughly, uses the formula \[ p(D,n) = \sum_{i=1}^{\left|{D}\right|} 2^{(i-1)(n-1)} \cdot D_{i} \] Here, addition and multiplication are defined over a Galois Field – the explanation would be far longer than it would be interesting or relevant so I’ll omit it from this post. It is worth noting that this particular approach only works for three parity disks or fewer, but that too is an entirely different subject (albeit an interesting one). Reconstruction of a missing block in a one-dimensional algorithm requires reading the available data, and performing some computation; each stripe may be reconstructed separately (and thus, in parallel).

Multi-dimensional RAID

A multi-dimensional algorithm is one in which parts of multiple logical RAID stripes may contribute to parity calculation. Examples of this include IBM’s EVENODD and NetApp’s slight simplification, Row-Diagonal Parity (RDP). These are most easily conveyed through diagrams:

With both EVENODD and RDP, calculation of the first parity block simply XOR the data blocks in that RAID stripe. The second parity block is calculated by a simple XOR of data values across RAID stripes more or less diagonally. Both of these techniques place constraints on the width of a RAID stripe.

Optimizing RAID reconstruction for fewer reads

The paper, A Hybrid Approach of Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation, describes a optimization for reconstruction under multi-dimensional RAID algorithms. The key insight is that with parity calculations that effectively overlap, a clever reconstruction algorithm can use fewer blocks, thus incurring fewer disk reads. As described in the paper, normally when a given disk fails, all remaining data blocks and blocks from the first parity disk are used for reconstruction:

It is, however, possible to read fewer total blocks by taking advantage of the fact that certain blocks can be multiply used. In the reconstruction below, blocks with circles are used for “row” reconstruction, and blocks with squares are used for “diagonal” reconstruction.

Note that the simple approach requires reading 36 blocks (none from disk 7) whereas the reconstruction described in the paper requires reading only 27 blocks. This applies generally: the new approach requires 25% fewer blocks to perform the same reconstruction. And the paper includes a method of balancing the reads between disks.

Disappointing results

Unfortunately, optimizing for fewer reads didn’t translate to significant performance improvements in the overall recovery. For RDP it was about 12% better in the best case, but typically closer to 7%. For EVENODD the improvement was less than 5% in all cases. Why? The naïve recovery algorithm streams data off the healthy hard drives, performs a simple computation, and streams good data onto the replacement drive. Streaming is what drives do best – 3.5” or 2.5”, 7200, 10K, or 15K RPM; SATA, SAS, or FC they all stream pretty well. There may be some contention for I/O resources, but either that contention isn’t severe or the “skips” in the I/O patterns interrupt the normal streaming efficiency.

Applicability in flash

There’s another medium, however, that has throughput and IOPS to spare where this RAID reconstruction approach could be highly effective: flash. With SSDs, it’s possible to see throughput that strains the limits of I/O systems; reading less data could be a significant improvement, and the non-contiguous read patterns wouldn’t degrade performance as they do with HDDs. For all-flash arrays, this sort of optimization may be one of many in its class; with a surplus of IOPS, compute, and memory, the RAID algorithms designed for slow disks, slow CPUs, and a dearth of DRAM, may need to be scrapped and rethought.

What is RAID-Z?

July 21, 2010

The mission of ZFS was to simplify storage and to construct an enterprise level of quality from volume components by building smarter software — indeed that notion is at the heart of the 7000 series. An important piece of that puzzle was eliminating the expensive RAID card used in traditional storage and replacing it with high performance, software RAID. To that end, Jeff invented RAID-Z; it’s key innovation over other software RAID techniques was to close the “RAID-5 write hole” by using variable width stripes. RAID-Z, however, is definitely not RAID-5 despite that being the most common comparison.

RAID levels

Last year I wrote about the need for triple-parity RAID, and in that article I summarized the various RAID levels as enumerated by Gibson, Katz, and Patterson, along with Peter Chen, Edward Lee, and myself:

RAID-0 Data is striped across devices for maximal write performance. It is an outlier among the other RAID levels as it provides no actual data protection.
RAID-1 Disks are organized into mirrored pairs and data is duplicated on both halves of the mirror. This is typically the highest-performing RAID level, but at the expense of lower usable capacity.
RAID-2 Data is protected by memory-style ECC (error correcting codes). The number of parity disks required is proportional to the log of the number of data disks.
RAID-3 Protection is provided against the failure of any disk in a group of N+1 by carving up blocks and spreading them across the disks — bitwise parity. Parity resides on a single disk.
RAID-4 A group of N+1 disks is maintained such that the loss of any one disk would not result in data loss. A single disks is designated as the dedicated parity disk. Not all disks participate in reads (the dedicated parity disk is not read except in the case of a failure). Typically parity is computed simply as the bitwise XOR of the other blocks in the row.
RAID-5 N+1 redundancy as with RAID-4, but with distributed parity so that all disks participate equally in reads.
RAID-6 This is like RAID-5, but employs two parity blocks, P and Q, for each logical row of N+2 disk blocks.
RAID-7 Generalized M+N RAID with M data disks protected by N parity disks (without specifications regarding layout, parity distribution, etc).

RAID-Z: RAID-5 or RAID-3?

Initially, ZFS supported just one parity disk (raidz1), and later added two (raidz2) and then three (raidz3) parity disks. But raidz1 is not RAID-5, and raidz2 is not RAID-6. RAID-Z avoids the RAID-5 write hole by distributing logical blocks among disks whereas RAID-5 aggregates unrelated blocks into fixed-width stripes protected by a parity block. This actually means that RAID-Z is far more similar to RAID-3 where blocks are carved up and distributed among the disks; whereas RAID-5 puts a single block on a single disk, RAID-Z and RAID-3 must access all disks to read a single block thus reducing the effective IOPS.

RAID-Z takes a significant step forward by enabling software RAID, but at the cost of backtracking on the evolutionary hierarchy of RAID. Now with advances like flash pools and the Hybrid Storage Pool, the IOPS from a single disk may be of less importance. But a RAID variant that shuns specialized hardware like RAID-Z and yet is economical with disk IOPS like RAID-5 would be a significant advancement for ZFS.

Adam Leventhal's blog

Tag: RAID

APFS in Detail: Data Integrity

Data Integrity

Redundancy

Crash Consistency

Checksums

Scrub

On RAID-6 recovery

One-dimensional RAID

Multi-dimensional RAID

Optimizing RAID reconstruction for fewer reads

Disappointing results

Applicability in flash

What is RAID-Z?

RAID levels

RAID-Z: RAID-5 or RAID-3?

Recent Posts

Austin API Summit Wrap-up

Rust and JSON Schema: odd couple or perfect strangers

Oxide and Friends Season 4

DTrace probes in Rust

From Prometheus to Sisyphus

DTrace at Home

Archives

Archives