On Disk Failure

And we never failed to fail
It was the easiest thing to do

– Stephen Stills, Rick and Michael Curtis; “Southern Cross” (1981)

With Brian Beach’s article on disk drive failure continuing to stir up popular press and criticism, I’d like to discuss a much-overlooked facet of disk drive failure.  Namely, the failure itself.  Ignoring for a moment whether Beach’s analysis is any good or the sample populations even meaningful, the real highlight for me from the above back-and-forth was this comment from Brian Wilson, CTO of BackBlaze, in response to a comment on Mr. Beach’s article:

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary – maybe $5,000 total? The 30,000 drives costs you $4 million.

The $5k/$4million means the Hitachis are worth 1/10th of 1 percent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. :-)
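The arithmetic in that comment is easy to check.  The following is only a sketch that reproduces the quoted figures; every input comes from the comment itself, and the function and variable names are mine.

```python
# Reproduces the arithmetic in the quoted comment; all inputs come from the quote.
DRIVES = 30_000
MINUTES_PER_REPLACEMENT = 15
FLEET_COST_USD = 4_000_000
LABOUR_SAVING_USD = 5_000  # the quote's own estimate ("maybe $5,000 total?")


def replacement_hours(drives: int, failure_percent: int, minutes_each: int) -> float:
    """Hands-on hours needed to swap every failed drive."""
    failed = drives * failure_percent / 100
    return failed * minutes_each / 60


hours_at_2pct = replacement_hours(DRIVES, 2, MINUTES_PER_REPLACEMENT)
hours_at_1pct = replacement_hours(DRIVES, 1, MINUTES_PER_REPLACEMENT)
workdays_at_2pct = hours_at_2pct / 8  # eight-hour days, as in the quote
saving_fraction = LABOUR_SAVING_USD / FLEET_COST_USD

print(f"2% failures: {hours_at_2pct:.0f} h ({workdays_at_2pct:.2f} eight-hour days)")
print(f"1% failures: {hours_at_1pct:.0f} h")
print(f"labour saving as a share of fleet cost: {saving_fraction:.3%}")
```

The numbers are internally consistent.  The only cost the model counts is hands-on swap time.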

He later went on to disclaim in a followup comment, after being rightly taken to task by other commenters for, among other things, ignoring the effect of higher component failure rates on overall durability, that his observation applies only to his company’s application.  That rang hollow for me.  Here’s why.

The two modern papers on disk reliability that are more or less required reading for anyone in the field are the CMU paper by Schroeder and Gibson and the Google paper by Pinheiro, Weber, and Barroso.  Both repeatedly emphasise the difficulty of assessing failure and the countless ways that devices can fail.  Both settle on the same metric for failure: if the operator decided to replace the disk, it failed.  If you’re looking for a stark damnation of a technology stack, you won’t find a much better example than that: the only really meaningful way we have to assess failure is the decision a human made after reviewing the data (often a polite way of saying “groggily reading a pager at 3am” or “receiving a call from a furious customer”).  Everyone who has actually worked for any length of time for a manufacturer or large-scale consumer of disk-based storage systems knows all of this; it may not make for polite cocktail party conversation, but it’s no secret.  And that, much more than any methodological issues with Mr. Beach’s work, casts doubt on Mr. Wilson’s approach.  Even ignoring for a moment the overall reduction in durability that unreliable components create in a system, some but not all of which can be mitigated by increasing the spare and parity device counts at increased cost, the assertion that the cost of dealing with a disk drive failure that does not induce permanent data loss is the cost of 15 minutes of one employee’s time is indefensible.  True, it may take only 15 minutes for a data centre technician armed with a box of new disk drives and a list of locations of known-faulty components to wander the data centre verifying that each has its fault LED helpfully lit, replacing each one, and moving on to the next, but that’s hardly the whole story.

"That's just what I wanted you to think, with your soft, human brain!"

Given the failure metric we’ve chosen out of necessity, it seems like we need to account for quite a bit of additional cost.  After all, someone had to assemble the list of faulty devices and their exact locations (or cause their fault indicators to be activated, or both).  Replacements had to be ordered, change control meetings held, inventories updated, and all the other bureaucratic accoutrements made up and filed in triplicate.  The largest operators and their supply chain partners have usually automated some portion of this, but that’s really beside the point: however it’s done, it costs money that’s not accounted for in the delightfully naive “15-minute model” of data centre operations.  Last, but certainly not least, we need to consider the implied cost of risk.  But the most interesting part of the problem, at least to me personally, is the first step in the process: identifying failure.  Just what does that entail, and what does it cost?

Courteous Failure

Most people not intimately familiar with storage serviceability probably assume, as I once did, that disk failure is a very simple matter.  Something inside the disk stops working (probably generating the dreaded clicky-clicky noise in the process), and as a result the disk starts responding to every request made of it with some variant of “I’m broken”.  The operating system notices this, stops using the disk (bringing a spare online if one is available), lights the fault LED on that disk’s bay, and gets on with life until the operator goes on site and replaces the broken device with a fresh one.  Sure enough, that can happen, but it’s exceedingly rare.  I’ll call this disk drive failure mode “courteous failure”; it has three highly desirable attributes:

Even in the case of courteous failure, many things still have to happen for the desired behaviour to occur.  All illumos-derived operating systems support FMA, the fault management architecture.  The implementation of this ranges from telemetry sources in both the kernel and userland to diagnosis and retirement engines that determine which component(s) are faulty and remove them from service.  For operators, the main documented components in this system are fmd(1M), fmadm(1M), and syseventd(1M).  With the cooperation of hardware, firmware, device drivers, kernel infrastructure, and the sysevent transport, these components are responsible for determining that a disk (or other component) is broken, informing the operator of this fact, and perhaps taking other actions such as turning on a fault indicator or instructing ZFS to stop using the device in favour of a spare.  While this sounds great, the practical reality leaves much to be desired.  Let’s take a look at the courteous failure case first.

Our HBA will transport the error status (normally CHECK CONDITION for SCSI devices) back to sd(7D), where we will generate a REQUEST SENSE command to obtain further details from the disk drive.  Since we are assuming courteous failure, this command will succeed and provide us with useful detail when we then end up in sd_ssc_ereport_post().  Based on the specific sense data we retrieved from the device, we’ll then generate an appropriate ereport via scsi_fm_ereport_post() and ultimately fm_dev_ereport_postv().  This interface is private; however, DDI-compliant device drivers and many HBA drivers also generate their own telemetry via ddi_fm_ereport_post(9F) which has similar semantics.  The posting of the ereport will result in a message (a sysevent) being delivered by the kernel to syseventd.  One or more of fmd’s modules will subscribe to sysevents in classes relevant to disk devices and other I/O telemetry, and will receive the event.  In our case, the Eversholt module will do this work.  Eversholt is actually a general-purpose diagnosis engine that can diagnose faults in a range of devices; the relevant definitions for disks may be found in disk.esc.  Since we’re assuming courteous failure due to a fatal media flaw, we’ll assume we end up diagnosing a media error fault.  Once we do so, the disk will show up in our list of broken devices from ‘fmadm faulty’, and the sysevent generated, of class ‘fault.io.scsi.cmd.disk.dev.rqs.merr’, will be passed along to the disk-lights engine.  This mechanism is responsible for (don’t laugh) turning on the LED for the faulty disk and turning it back off again when the disk is replaced.  We’re almost done; all that remains is to tell ZFS to stop using the device and replace it with a spare if possible.  The mechanism responsible for this is zfs-retire.  
As part of retiring the broken device, ZFS will also select a spare and begin resilvering onto it from the other devices in the same vdev (whether by mirroring or reconstructing from parity).  Of course, all of these steps are reversed when the disk is eventually replaced, with the unfortunate exception of ZFS requiring an operator to execute the zpool(1M) replace command to trigger resilvering onto the new device and return the spare to the spares list.
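To make the flow of events easier to follow, here is a toy model of the courteous path: telemetry is posted, a diagnosis turns it into a fault, and response agents react.  This is a sketch and not how fmd is implemented.  The fault class string is the one quoted above; the ereport class name is assumed by analogy, and every other name (FaultManager, the bay labels, the pool dictionary) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Fault class quoted in the text; the ereport class is an assumed analogue.
MEDIA_FAULT = "fault.io.scsi.cmd.disk.dev.rqs.merr"
MEDIA_EREPORT = "ereport.io.scsi.cmd.disk.dev.rqs.merr"  # assumption


@dataclass
class Event:
    cls: str     # event class string
    device: str  # hypothetical device label, here a bay name


@dataclass
class FaultManager:
    """Toy stand-in for fmd: diagnoses ereports and notifies response agents."""
    agents: List[Callable[[Event], None]] = field(default_factory=list)
    faulty: List[str] = field(default_factory=list)

    def post_ereport(self, ereport: Event) -> None:
        # Diagnosis (Eversholt's role in the text), reduced here to a lookup.
        rules: Dict[str, str] = {MEDIA_EREPORT: MEDIA_FAULT}
        fault_cls = rules.get(ereport.cls)
        if fault_cls is None or ereport.device in self.faulty:
            return  # undiagnosable telemetry, or device already diagnosed
        self.faulty.append(ereport.device)
        for agent in self.agents:
            agent(Event(fault_cls, ereport.device))


leds: Dict[str, bool] = {}
pool = {"active": ["bay0", "bay1"], "spares": ["bay2"], "retired": []}


def disk_lights(fault: Event) -> None:
    leds[fault.device] = True  # turn on the fault indicator


def zfs_retire(fault: Event) -> None:
    if fault.device in pool["active"] and pool["spares"]:
        slot = pool["active"].index(fault.device)
        pool["active"][slot] = pool["spares"].pop(0)  # resilver would begin here
        pool["retired"].append(fault.device)


fm = FaultManager(agents=[disk_lights, zfs_retire])
fm.post_ereport(Event(MEDIA_EREPORT, "bay1"))
print(fm.faulty, leds, pool)
```

Even in this reduced form, the outcome depends on every stage doing its job: if the ereport is never posted, nothing downstream happens at all.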

If that sounds like a lot of moving pieces, that’s because it is.  When they all work properly, the determination that a disk has failed is very easy to make: fmd and ZFS agree that the device is broken, all documented tools report that fact, and the device is automatically taken out of use and its fault LED turned on.  If all failures manifested themselves in this way, there’d be little to talk about, and the no-trouble-found rate for disk drive RMAs would be zero.  There would also be little or no customer impact; a few I/O requests might be slightly delayed as they’re retried against the broken device’s mirror or the data rebuilt from other devices in a RAIDZ set, but it’s unlikely anyone other than the operator would even notice.  Unfortunately, the courteous failure mode I’ve just detailed is exceedingly rare.  I can’t actually recall ever seeing it happen this way, although I’m sure that’s a product of selective memory.  Let’s take a look at all the things that can shatter this lovely vision of crisp, effective error handling.

Discourteous Failure, a.k.a. Real-World Failure

First, note that we make this diagnosis only when the underlying telemetry indicates that it’s fatal.  This won’t happen if the request that triggered it is eligible to be retried, and by default, illumos’s SCSI stack will retry most commands many, many times before giving up (we’ve greatly reduced this behaviour in SmartOS).  Since each retry can take seconds, even a minute, it can easily be minutes or even hours before the first error telemetry is generated!  This is perhaps the most common, and also among the worst, disk drive failure modes: the endless timeout.  This failure mode seems to be caused by firmware or controller issues; the drive simply never responds to any request, or to certain requests.  The best approach here is to be much more aggressive about failing requests quickly; for example, the B_FAILFAST option used by ZFS will abort commands immediately if it can be determined that the underlying device has been physically removed from the system or is otherwise known to be unreachable.  But this does not address the case in which the device is clearly present but simply refuses to respond.  A milder variant of this failure mode is the long-retry case, in which the disk will internally retry a read on a marginal sector, trying to position the head accurately enough to recover the data.  Most enterprise drives will retry for a few seconds before giving up; some consumer-grade devices will keep trying more or less forever.  Of course, firmware can fail as well; modern disk drives have a CPU and an OS on them; that OS can panic like any other, and will do so due to bugs or unexpected hardware faults.  Should this occur, requests are lost and will usually time out.  When any of these failures occur, application software is perceived to hang for long periods of time or proceed extremely slowly.  
Other, less common, failure modes include returning bad data successfully, which only ZFS can detect; returning inaccurate sense data, precluding correct telemetry generation; and, most infuriatingly of all, working correctly but with excessively high latency.  Most of these failure modes are not handled well, if at all, by current software.
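A rough way to see how retries turn seconds into hours is to multiply them out.  The sketch below assumes that each layer of the stack retries independently, so attempt counts multiply; the timeout and retry values are hypothetical and are not any driver's actual defaults.

```python
from math import prod
from typing import Sequence


def worst_case_stall_seconds(timeout_s: float, retries_per_layer: Sequence[int]) -> float:
    """Upper bound on how long one request can stall before an error surfaces.

    Each layer makes 1 + retries attempts, and every attempt by an outer layer
    re-runs all attempts of the layers beneath it, so attempt counts multiply.
    """
    attempts = prod(1 + r for r in retries_per_layer)
    return timeout_s * attempts


# Hypothetical values for illustration only.
single_layer = worst_case_stall_seconds(60, [5])
three_layers = worst_case_stall_seconds(60, [5, 3, 2])

print(f"one retrying layer:    {single_layer / 60:.0f} minutes")
print(f"three retrying layers: {three_layers / 60:.0f} minutes")
```

With these made-up inputs a single request stalls for six minutes through one retrying layer and 72 minutes through three, all before any error telemetry exists to diagnose.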

One additional discourteous failure mode highlights the fundamental challenge of diagnosis especially well.  Sometimes, whether because of a firmware fault or a hardware defect or fault, a disk drive will simply “go away”.  The visible impact of this from software is exactly the same as if the disk drive were physically removed from the system.  The industry standard practice in storage system serviceability is known as “surprise hotplug”; the system must support the unannounced removal and replacement of disk drives (limited only by the storage system’s redundancy attributes) without failing or indicating an error.  It’s easy to see that satisfying this requirement and diagnosing the vanishing disk drive failure are mutually exclusive.  One option is to ignore the surprise hotplug requirement in favour of something like cfgadm(1M).  While simpler to implement, this approach really transfers the burden onto the operator, one of the hallmarks of poor design.  Another option is to declare that a given hardware configuration requires the presence of disk drives in certain bays, and treat the absence of one as a fault.  In one sense, this has the right semantics: a disk drive that goes away will be diagnosed as faulty in the same way as any other failure mode and its indicator illuminated.  When it’s removed, it will already be faulty, so no new diagnosis will be made; when the replacement is inserted, the repair will be noticed.  But there’s another problem here: let’s think about all the underlying causes that could lead to this failure mode:

Notice that many of these root causes actually have nothing to do with the disk drive and will recur (often intermittently) on the same phy or bay, or in some cases on arbitrary phys or bays, even after the “faulty” disk drive is replaced.  These failure modes are not academic; I’ve seen at least three of them in the field.  The simplest way to distinguish mechanical removal of a disk drive from the various electrical and software-related failure modes is to place a microswitch in each disk drive bay and present its state to software.  But few if any enclosures have this feature, and a quick scan of our list of possible root causes shows that it wouldn’t be terribly effective anyway: even distinguishing the removal case doesn’t tell us whether the disk drive, the enclosure, the HBA, the backplane, or one of several firmware and software components is the true source of the problem.  Those that are not specific to the disk can occur intermittently for months before finally being root-caused, and can lead to a great many incorrect “faulty disk” diagnoses in the meantime.  Similar problems can occur when reading self-reported status from the disk, such as via the acceptable temperature range log pages.  Unfortunately, many disks have faulty firmware that will report absurd acceptable ranges or incorrect values; this particular issue has resulted in hundreds of incorrect diagnoses in the field against at least two different disk drive models.  So even seemingly reliable data can easily result in false positives, many of them with no provably correct alternative.  Taking the human completely out of the loop is not merely difficult but outright impossible; a skilled storage FE will never lack for work.
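One triage hint follows directly from the observation that bay-level or phy-level problems recur after the disk is swapped: if drives with different serial numbers keep vanishing from the same bay, the bay deserves suspicion.  The sketch below is an illustrative heuristic of my own, not an existing FMA diagnosis; the event format and names are hypothetical, and at best it tells a human where to look first.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def suspect_bays(vanish_events: Iterable[Tuple[str, str]], min_serials: int = 2) -> List[str]:
    """Return bays from which drives with different serial numbers have vanished.

    vanish_events is a sequence of (bay, drive_serial) pairs, one per
    disappearance.  Repeated disappearances of the same drive do not count
    against the bay, since they may well be the drive's fault.
    """
    serials_by_bay: Dict[str, Set[str]] = defaultdict(set)
    for bay, serial in vanish_events:
        serials_by_bay[bay].add(serial)
    return sorted(bay for bay, serials in serials_by_bay.items()
                  if len(serials) >= min_serials)


# Hypothetical history: bay3 lost drive SN-A twice, then its replacement SN-C.
events = [("bay3", "SN-A"), ("bay3", "SN-A"), ("bay7", "SN-B"), ("bay3", "SN-C")]
flagged = suspect_bays(events)
print(flagged)
```

Note what this cannot do: it says nothing about whether the enclosure, the backplane, the HBA, or firmware is at fault, and it is useless against problems that wander across arbitrary bays.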

Small-scale storage users rarely have occasion to notice any of this.  For starters, disk drives are really quite reliable and few people will experience any of these failure modes if they have only a handful of devices.  For another, most commodity operating systems don’t even make an effort to diagnose and report hardware failures, so at best they are limited to faults self-reported via mechanisms such as SMART.  In the absence of self-reported failure, a disk drive failure on a desktop or laptop system is quite easy to confuse with any of a number of other possible faults: the system becomes excessively slow or simply hangs and stops working altogether.  Larger-scale users, however, are familiar with many of these failure modes.  Users of commercial storage arrays may experience them less often, as the larger established vendors not only have had many years to improve their diagnosis capabilities but also tend to diagnose faults aggressively and rely instead on an extensive burn-in protocol and highly capable RMA processing to manage false positives.  That, unfortunately, leaves the rest of us in something of a tough middle ground.  Fortunately, ZFS also has its own rudimentary error handling mechanism, though it does not deal with slow devices or endless timeouts any better than illumos itself.  It can detect invalid data, however (other filesystems generally cannot), and on operating systems that lack illumos’s FMA mechanisms provides at least some ability to remove faulty disks from service.  ZFS also handles sparing automatically, and will resilver spares and repair blocks with bad data without operator intervention, minimising the window of vulnerability to additional failures.
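The idea behind detecting and repairing bad data can be shown in a few lines.  The sketch below is not ZFS code: it assumes a two-way mirror held in a Python list and uses SHA-256 as a stand-in checksum.  The one property it borrows from ZFS is that the expected checksum is stored apart from the data it covers, so a device that returns wrong bytes “successfully” is still caught.

```python
import hashlib
from typing import List, Optional


def checksum(data: bytes) -> str:
    # SHA-256 is a stand-in; the choice of algorithm is an assumption of this sketch.
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies: List[bytes], expected: str) -> Optional[bytes]:
    """Return a copy whose checksum matches, rewriting any bad copies in place."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        return None  # every copy is damaged: the block is unrecoverable
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good  # repair the damaged side from the good one
    return good


block = b"customer data"
expected = checksum(block)              # kept separately from the block itself
mirror = [b"customer dat\x00", block]   # first side silently returns bad data
result = self_healing_read(mirror, expected)
print(result == block, mirror[0] == block)
```

None of this helps with a device that is slow or never answers; the read has to complete before there is anything to verify.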

Taken together, these capabilities are really just a promising start; as should be obvious from the discussion of discourteous failure modes, there’s a lot of open work here.  I’m very happy that illumos has at least some of the infrastructure necessary to improve the situation; other than storage-specific proprietary operating systems, we’re in the best shape of anyone.  Some promising work is being done in this area by the team at Nexenta, for example, building on the mechanism I described above.  Even so, the situation remains ugly.  Almost all of these discourteous failure modes will inevitably be customer-visible to some extent.  Most of them will require an operator to diagnose the problem manually, often from little more than a support ticket stating that “nothing works” (this support ticket is often entirely accurate).  Not only does that take time, but it is time during which a customer, or even several, is in pain.  Manual diagnosis of discourteous disk drive failure is a common cause of poor customer experience among all storage providers (whether public or private), and in fact is among the more obnoxious routine challenges operators face.  Independent of actually replacing the faulty devices themselves, operators will often have to spend considerable time observing a system, often at odd hours and under pressure, to determine (a) that a disk drive has failed, and (b) which one is to blame.  As in some of the cases enumerated above, the problem may be intermittent or related to a firmware version or interoperability issue, and these problems can take arbitrarily long to thoroughly root-cause and correct.  We’ve experienced several of the simpler discourteous failure modes ourselves over the past year, and even in the best cases, with many years of experience and deep knowledge of the entire technology stack, it’s rare that I’ve gone from problem statement to confirmed root cause in less than 15 minutes.  
An inexperienced first-line operator has no chance, and time will be lost escalating the issue, re-explaining it, gathering data, and so on.  Multiple people will be involved, some to diagnose the problem, others to write incident reports, communicate with affected customers, or set up repairs.  All of these processes can be streamlined to one degree or another, and many can be (and usually are) automated.  But given the complexity I’ve outlined here, the idea that handling a discourteous disk drive failure requires a total of 15 minutes of effort from a single employee and never has any indirect costs is every bit as silly as it sounds.  We have a long way to go before illumos reaches that level of reliability and completeness, and everyone else is still farther behind.

The Implied Cost of Risk

That brings us to the part we’ve been ignoring thus far: the knock-on effects of unreliable components on overall durability and the implied cost of risk.  If one uses manufacturer-supplied AFRs and ignores the possibility of data loss caused by software, it’s very easy to “prove” that the MTTDL of an ordinary double-parity RAID array with a couple of hot spares is in the tens of thousands of years.  But as is discussed at length in the CMU paper, AFR is not a constant; actual failure rates are best described by a Weibull distribution.  To make matters worse, replacement events do not seem to be independent of one another and resilvering times are rising as disks become larger but not faster.  When one further considers that both the CMU and Google researchers concluded that actual failure rates (presumably even among the best manufacturers’ products) are considerably higher than published, suddenly the prospect of data loss does not seem so remote.  But what does data loss cost?  Storage is a trust business; storage systems are in service much longer than most other IT infrastructure, and they occupy the bottom layer in the application stack.  They have to work, and customers are understandably unhappy when they fail.  The direct cost to customers of lost data can be considerable, and it’s not a stretch to suggest that a major data loss incident could doom a business like BackBlaze or Joyent’s Manta object storage service.
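For reference, the kind of calculation that produces such comforting numbers looks like the sketch below.  It uses the textbook double-parity approximation, MTTDL ≈ MTTF³ / (N(N−1)(N−2)·MTTR²), which assumes independent failures at a constant rate: exactly the assumptions questioned above.  Hot spares are not modelled, and the disk count, AFRs, and repair times are made-up inputs.

```python
HOURS_PER_YEAR = 8760


def mttf_hours_from_afr(afr: float) -> float:
    """Approximate MTTF from an annualised failure rate (valid for small AFR)."""
    return HOURS_PER_YEAR / afr


def naive_mttdl_years(n_disks: int, mttf_hours: float, mttr_hours: float) -> float:
    """Textbook double-parity estimate: data is lost when a third disk fails
    while two earlier failures are still being repaired.

    Assumes independent failures at a constant rate, which the studies
    discussed in the text found not to hold in practice.
    """
    combos = n_disks * (n_disks - 1) * (n_disks - 2)
    return mttf_hours ** 3 / (combos * mttr_hours ** 2) / HOURS_PER_YEAR


# Hypothetical inputs: a 12-disk set, data-sheet AFR versus a less rosy one.
optimistic = naive_mttdl_years(12, mttf_hours_from_afr(0.01), mttr_hours=24)
pessimistic = naive_mttdl_years(12, mttf_hours_from_afr(0.04), mttr_hours=72)

print(f"optimistic:  {optimistic:.3g} years")
print(f"pessimistic: {pessimistic:.3g} years")
print(f"ratio:       {optimistic / pessimistic:.0f}x")
```

The absolute figures deserve no trust, but the sensitivity is instructive: the estimate falls with the cube of the failure rate and the square of the repair time, so quadrupling the AFR and tripling the resilver time cuts it by a factor of 576 even before correlated failures are considered.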

Instead of assessing risk by plugging manufacturer specifications into RAID formulae, I’d like to suggest a thought exercise courtesy of our over-financialised economy: Suppose you’re Ajit Jain and a medium-sized technology service company like BackBlaze or Joyent came to you asking for a policy that would compensate it for all the direct, incidental, and consequential damages that would arise from a major data loss incident induced by disk drive failure(s).  You’ve read the CMU paper.  You’ve read the Google paper.  You’ve interviewed storage system vendors and experienced large-scale operators.  What would you charge for such a policy?  If you really believe that MTTDL is in the tens of thousands of years, you’d argue that such a policy should cost perhaps a thousand dollars a year.  I expect Mr. Jain would ask a premium several orders of magnitude larger, probably on the order of a million dollars a year, implying that the true MTTDL across your entire service is at most a few decades (which it probably is).  A cheap disk drive might easily cut that MTTDL by 80% (remember, failures are not independent and resilvering is not instantaneous!), at least quintupling the cost of our notional insurance policy.  Instead of saving you $5000 a year, the more reliable drives are now saving you $4 million: the entire cost of your disk drive population.  Mr. Wilson suggests that the marginal cost of 30,000 better disk drives amounts to perhaps $250,000 over the course of a disk’s 5-year lifetime, or $50,000 a year.  We can quibble over the specific numbers, but if you take your operational knowledge of disk drive failure and put yourself in Mr. Jain’s shoes, would you really write this policy for less than that?
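The arithmetic of the thought exercise can be laid out explicitly.  The sketch below takes every dollar figure from the paragraph above and adds one assumption of its own: that the premium scales inversely with MTTDL, so that cutting MTTDL by 80% multiplies the premium by five.

```python
BASE_PREMIUM_USD = 1_000_000       # the text's guess at the annual premium
MTTDL_REDUCTION_PERCENT = 80       # "cut that MTTDL by 80%"
MARGINAL_DRIVE_COST_USD = 250_000  # the text's figure for 30,000 better drives
DRIVE_LIFETIME_YEARS = 5


def scaled_premium(base_usd: int, mttdl_reduction_percent: int) -> int:
    """Annual premium after MTTDL shrinks, assuming premium is proportional
    to 1 / MTTDL (an assumption of this sketch, not an actuarial model)."""
    remaining_percent = 100 - mttdl_reduction_percent
    return base_usd * 100 // remaining_percent


cheap_drive_premium = scaled_premium(BASE_PREMIUM_USD, MTTDL_REDUCTION_PERCENT)
extra_risk_cost = cheap_drive_premium - BASE_PREMIUM_USD
better_drive_cost_per_year = MARGINAL_DRIVE_COST_USD / DRIVE_LIFETIME_YEARS

print(f"premium with cheap drives:   ${cheap_drive_premium:,} per year")
print(f"extra cost of risk:          ${extra_risk_cost:,} per year")
print(f"extra cost of better drives: ${better_drive_cost_per_year:,.0f} per year")
print(f"risk / hardware ratio:       {extra_risk_cost / better_drive_cost_per_year:.0f}x")
```

On these figures the notional cost of the added risk is 80 times the annual premium paid for better drives, which leaves a great deal of room to quibble over the inputs without changing the conclusion.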

The implied cost of risk alone is sufficient to cast considerable doubt on the economic superiority of the worse-is-better approach.  Perhaps when he said this approach is appropriate only to his business, Mr. Wilson really meant his actual metric is not MTTDL but mean time until someone notices that something they lost in the last 30 days was also lost on the backup service after the last upload and then complains loudly enough to cause a PR disaster.  That may be uncharitable, but I can’t otherwise reconcile his position with the facts.  If that’s what he meant, then substantially everyone can safely ignore BackBlaze as a useful example; we’re not seeking to optimise the same metrics.

Not Quite the Easiest Thing to Do

Mr. Stills is one hell of a musician, but we’ll need to look elsewhere for a realistic assessment of the fine art of failure.  Far from being the easiest thing to do, failing is messy, inexact, and often takes far longer than any reasonable person would expect.  Other times it’s so complete as to be indistinguishable from a magical disappearing act.  Disk drives frequently fail to fail, and that’s why Mr. Wilson’s simplistic analysis, methodology aside, is grossly ignorant: it is predicated on a world that looks nothing like reality.  The approach I’ve taken at Joyent is one informed by years of misery at Fishworks building, selling, and above all supporting disk-based storage products.  It’s a multi-pronged approach: like BackBlaze, we design for failure, not only at the application level but also through improvements and innovation at the operating system level and even integration at the hardware and firmware levels; unlike Mr. Wilson, we acknowledge the real-world challenges in identifying failure and the true risks and costs associated with excess failure rates.  As such, we work hard to identify and populate our data centres with the best components we can afford (which is to say, the best components our customers are willing to pay for), including the very HGST disk drives Mr. Beach concludes are superior but his company refuses to purchase.  They’re not the cheapest components on the market, but they make up for their modest additional cost by enabling us to offer a better experience to our customers and reducing overhead for our Operations and Support teams.  At the same time, we continue working toward an ideal future in which fault detection is crisp and response automatic and correct.  Fault detection and response pose a systemic challenge, requiring a systemic response across all layers of the stack; it’s not enough to design for failure at the application layer and ignore opportunities to innovate in the operating system.  
Nor is the answer to disregard the implied cost of excess risk and just hope your customers won’t notice.

Posted on February 20, 2014 at 22:45 by wesolows