Adam Leventhal's blog

Tag: Fishworks

At 4:08pm today, we will launch Delphix Server at DEMO. At the presentation, Richard Rothschild from TiVo will describe how they have been using Delphix. TiVo, of course, has been canonized as technology that changes the way we live or work. My past work on DTrace was described by users as “TiVo for the kernel” — a revolutionary technology from which there is no going back. The comment that “Delphix is like TiVo for databases” is all the more profound coming from the Senior Director of IT at TiVo.

I was introduced to the notion of virtualization in 2000 when some of my classmates signed on at VMware. Later in my work at Fishworks building storage products, storage virtualization was an important topic. Virtualization of both servers and storage decouples the work that’s being done, the computation or I/O and data persistence, from specific, inflexible physical resources. Those physical resources can then be used more efficiently (saving money) and moved around more quickly and easily (saving time).

Databases, while not themselves physical resources, are bound to physical resources in an analogous way. They are tied to equipment, duplicated many times over, and complex to provision and deploy — much like both servers and storage. Delphix brings virtualization to the realm of databases, abstracting them away from their physical resources, and creating the same benefits as virtualizing compute or disk. In contrast though, since databases are not tangible like a rack-mount server or storage array, databases can be seamlessly pulled into the virtualization framework, or, just as easily, recast and deployed as physical databases.

Check out the broadcast live today (or here after the fact), or take a look at today’s eWeek article. We’ve also got some videos posted with more info on the product and customer testimonials. As I write this, I’m midway through my second day at the company; I look forward to writing more about the technology as I dive in.

I joined the Solaris Kernel Group in 2001 at what turned out to be a remarkable place and time for the industry. More by luck and intuition than by premonition, I found myself surrounded by superlative engineers working on revolutionary technologies that were the products of their own experience and imagination rather than managerial fiat. I feel very lucky to have worked with Bryan and Mike on DTrace; it was amazing that just down the hall our colleagues reinvented the operating system with Zones, ZFS, FMA, SMF and other innovations.

With Solaris 10 behind us, lauded by customers and pundits, I was looking for that next remarkable place and time, and found it with Fishworks. The core dozen or so are some of the finest engineers I could hope to work with, but there were so many who contributed to the success of the 7000 series. From the executives who enabled our fledgling skunkworks in its nascent days, to our Solaris colleagues building fundamental technologies like ZFS, COMSTAR, SMF, networking, I/O, and IPS, and the OpenStorage team who toiled to bring a product to market, educating us without deflating our optimism in the process.

I would not trade the last 9 years for anything. There are many engineers who never experience a single such confluence of talent, organizational will, and success; I’m grateful to my colleagues and to Sun for those two opportunities. Now I’m off to look for my next remarkable place and time beyond the walls of Oracle. My last day will be August 20th, 2010.

Thank you to the many readers of this blog. After six years and 130 posts I’d never think of giving it up. You’ll be able to find my new blog at dtrace.org/blogs/ahl (comments to this post are open there); I can’t wait to begin chronicling my next endeavors. You can reach me by email here: my initials at alumni dot brown dot edu. I look forward to your continued comments and emails. Thanks again!

This year’s Flash Memory Summit got me thinking about our use of SSDs over the years at Fishworks. The picture on the left is a visual history of SSD evals in rough chronological order from the oldest at the bottom to the newest at the top (including some that have yet to see the light of day).

Early Days

When we started Fishworks, we were inspired by the possibilities presented by ZFS and Thumper. Those components would be key building blocks in the enterprise storage solution that became the 7000 series. An immediate deficiency we needed to address was how to deliver competitive performance using 7,200 RPM disks. Folks like NetApp and EMC use PCI-attached NV-DRAM as a write accelerator. We evaluated something similar, but found the solution lacking because it had limited scalability (the biggest NV-DRAM cards at the time were 4GB), consumed our limited PCIe slots, and required a high-speed connection between nodes in a cluster (e.g. IB, further eating into our PCIe slot budget).

The idea we had was to use flash. None of us had any experience with flash beyond cell phones and USB sticks, but we had the vague notion that flash was fast and getting cheaper. By luck, flash SSDs were just about to be where we needed them. In late 2006 I started evaluating SSDs on behalf of the group, looking for what we would eventually call Logzilla. At that time, SSDs were getting affordable, but were designed primarily for environments such as military use where ruggedness was critical. The performance of those early SSDs was typically awful.

Logzilla

STEC — still Simpletech in those days — realized that their early samples didn’t really suit our needs, but they had a new device (partly due to the acquisition of Gnutech) that would be a good match. That first sample was fibre-channel and took some finagling to get working (memorably, it required metric screws of an odd depth), but the Zeus IOPS, an 18GB 3.5″ SATA SSD using SLC NAND, eventually became our Logzilla (we’ve recently updated it with a SAS version for our updated SAS-2 JBODs). Logzilla addressed write performance economically and scalably, in a way that also simplified clustering; the next challenge was read performance.

Readzilla

Intent on using commodity 7,200 RPM drives, we realized that our random read latency would be about twice that of 15K RPM drives (duh). Fortunately, most users don’t access all of their data randomly (regardless of how certain benchmarks are designed). We already had much more DRAM cache than other storage products in our market segment, but we thought that we could extend that cache further by using SSDs. In fact, the invention of the L2ARC followed a slightly different thought process: seeing the empty drive bays in the front of our system (just two were used as our boot disks) and the piles of SSDs lying around, I stuck the SSDs in the empty bays and figured out how we’d use them.

It was again STEC who stepped up to provide our Readzilla, a 100GB 2.5″ SATA SSD using SLC flash.

Next Generation

Logzilla and Readzilla are important features of the Hybrid Storage Pool. For the next generation, expect the 7000 series to move away from SLC NAND flash. It was great for the first generation, but other technologies provide better $/IOPS for Logzilla and better $/GB for Readzilla (while maintaining low latency). For Logzilla we think that NV-DRAM is a better solution (I reviewed one such solution here), and for Readzilla MLC flash has sufficient performance at much lower cost, and ZFS will be able to ensure its longevity.

I’ve been expecting this automated mail for a while now, but it was disheartening nonetheless:

List:       dtrace-discuss
Member:     bryan.cantrill@eng.sun.com
Action:     Subscription disabled.
Reason:     Excessive or fatal bounces.
Bryan Cantrill, VP of Engineering at Joyent, earning $15.

As one of the moderators of the DTrace discussion list, I see people subscribe and unsubscribe. Bryan has, of course, left Oracle and joined Joyent to be their VP of engineering.

Bryan is a terrific engineer, and I count myself lucky to have worked with him for the past nine years, first on DTrace and then on Fishworks. He taught me many things, but perhaps most important was his holistic view of engineering that encompasses all aspects of making a product successful, including docs, pricing, talks, papers, and, of course, excellent code. Now Bryan is off to cut through the layers of software that make up the cloud. Far from leaving the DTrace community, he’s going to take DTrace to new places, and I look forward to seeing the fruits of his labor as he sinks his teeth into a new onion of abstractions.

… and, Robin, Bryan’s certainly a smart guy, but “the smart guy behind Dtrace [sic]”?? Just don’t refer to me and Mike as “the dumb guys behind DTrace” okay?

The mission of ZFS was to simplify storage and to construct an enterprise level of quality from volume components by building smarter software — indeed that notion is at the heart of the 7000 series. An important piece of that puzzle was eliminating the expensive RAID card used in traditional storage and replacing it with high-performance, software RAID. To that end, Jeff invented RAID-Z; its key innovation over other software RAID techniques was to close the “RAID-5 write hole” by using variable width stripes. RAID-Z, however, is definitely not RAID-5 despite that being the most common comparison.

RAID levels

Last year I wrote about the need for triple-parity RAID, and in that article I summarized the various RAID levels as enumerated by Gibson, Katz, and Patterson, along with Peter Chen, Edward Lee, and myself:

  • RAID-0 Data is striped across devices for maximal write performance. It is an outlier among the other RAID levels as it provides no actual data protection.
  • RAID-1 Disks are organized into mirrored pairs and data is duplicated on both halves of the mirror. This is typically the highest-performing RAID level, but at the expense of lower usable capacity.
  • RAID-2 Data is protected by memory-style ECC (error correcting codes). The number of parity disks required is proportional to the log of the number of data disks.
  • RAID-3 Protection is provided against the failure of any disk in a group of N+1 by carving up blocks and spreading them across the disks — bitwise parity. Parity resides on a single disk.
  • RAID-4 A group of N+1 disks is maintained such that the loss of any one disk would not result in data loss. A single disk is designated as the dedicated parity disk. Not all disks participate in reads (the dedicated parity disk is not read except in the case of a failure). Typically parity is computed simply as the bitwise XOR of the other blocks in the row (see the sketch following this list).
  • RAID-5 N+1 redundancy as with RAID-4, but with distributed parity so that all disks participate equally in reads.
  • RAID-6 This is like RAID-5, but employs two parity blocks, P and Q, for each logical row of N+2 disk blocks.
  • RAID-7 Generalized M+N RAID with M data disks protected by N parity disks (without specifications regarding layout, parity distribution, etc).
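
To make the parity arithmetic concrete, here is a minimal sketch, in C, of the bitwise XOR parity used by RAID-4 and RAID-5; the function name and buffer layout are mine, purely for illustration, and not drawn from any particular implementation. Reconstructing a single lost block uses the same XOR, applied to the surviving blocks together with the parity block.

#include <stddef.h>
#include <stdint.h>

/*
 * Compute the parity block for one row: parity = d[0] ^ d[1] ^ ... ^ d[n-1].
 * The same routine recovers any single missing block by XORing the parity
 * block with the surviving data blocks.
 */
static void
xor_parity(uint8_t *parity, uint8_t **data, size_t ndisks, size_t blocksize)
{
        for (size_t off = 0; off < blocksize; off++) {
                uint8_t p = 0;
                for (size_t d = 0; d < ndisks; d++)
                        p ^= data[d][off];
                parity[off] = p;
        }
}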

RAID-Z: RAID-5 or RAID-3?

Initially, ZFS supported just one parity disk (raidz1), and later added two (raidz2) and then three (raidz3) parity disks. But raidz1 is not RAID-5, and raidz2 is not RAID-6. RAID-Z avoids the RAID-5 write hole by distributing logical blocks among disks, whereas RAID-5 aggregates unrelated blocks into fixed-width stripes protected by a parity block. This actually means that RAID-Z is far more similar to RAID-3, where blocks are carved up and distributed among the disks; whereas RAID-5 puts a single block on a single disk, RAID-Z and RAID-3 must access all disks to read a single block, thus reducing the effective IOPS.
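
To put rough, illustrative numbers on that (assuming 7,200 RPM disks each capable of around 100 small random reads per second, a ballpark figure rather than a measurement): a RAID-5 group of eight such disks can serve unrelated small reads from each disk independently, for roughly 8 × 100 = 800 IOPS, whereas a RAID-Z group of the same width must touch every data disk for each logical block and so delivers on the order of 100 IOPS for the whole group.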

RAID-Z takes a significant step forward by enabling software RAID, but at the cost of backtracking on the evolutionary hierarchy of RAID. Now with advances like flash pools and the Hybrid Storage Pool, the IOPS from a single disk may be of less importance. But a RAID variant that shuns specialized hardware like RAID-Z and yet is economical with disk IOPS like RAID-5 would be a significant advancement for ZFS.

Double-parity RAID, or RAID-6, is the de facto industry standard for storage; when I started talking about triple-parity RAID for ZFS earlier this year, the need wasn’t always immediately obvious. Double-parity RAID, of course, provides protection from up to two failures (data corruption or the whole drive) within a RAID stripe. The necessity of triple-parity RAID arises from the observation that while hard drive capacity has roughly followed Kryder’s law, doubling annually, hard drive throughput has improved far more modestly. Accordingly, the time to populate a replacement drive in a RAID stripe is increasing rapidly. Today, a 1TB SAS drive takes about 4 hours to fill at its theoretical peak throughput; in a real-world environment that number can easily double, and the 2TB and 3TB drives expected this year and next won’t move data much faster. Those long periods spent in a degraded state increase the exposure to the bit errors and other drive failures that would in turn lead to data loss. The industry moved to double-parity RAID because one parity disk was insufficient; longer resilver times mean that we’re spending more and more time back at single-parity. From that it was obvious that double-parity would soon become insufficient. (I’m working on an article that examines these phenomena quantitatively so stay tuned… update Dec 21, 2009: you can find the article here)
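
As a back-of-the-envelope check (assuming a sustained rate of roughly 70MB/s for such a drive, an illustrative figure rather than a measurement): 1TB ÷ 70MB/s ≈ 14,000 seconds, or about 4 hours; halve the effective rate under a competing workload and the rebuild takes 8 hours, and doubling or tripling the capacity at similar throughput pushes it toward a full day.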

Last week I integrated triple-parity RAID into ZFS. You can take a look at the implementation and the details of the algorithm here, but rather than describing the specifics, I wanted to describe its genesis. For double-parity RAID-Z, we drew on the work of Peter Anvin which was also the basis of RAID-6 in Linux. This work was more or less a tutorial for systems programmers, simplifying some of the more subtle underlying mathematics with an eye towards optimization. While a systems programmer by trade, I have a background in mathematics, so I was interested to understand the foundational work. James S. Plank’s paper A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems describes a technique for generalized N+M RAID. Not only was it simple to implement, but it could easily be made to perform well. I struggled for far too long trying to make the code work before discovering trivial flaws with the math itself. A bit more digging revealed that the author himself had published Note: Correction to the 1997 Tutorial on Reed-Solomon Coding eight years later addressing those same flaws.

Predictably, the mathematically accurate version was far harder to optimize, stifling my enthusiasm for the generalized case. My more serious concern was that the double-parity RAID-Z code suffered from some similar systemic flaw. This fear was quickly assuaged as I verified that the RAID-6 algorithm was sound. Further, from this investigation I was able to find a related method for doing triple-parity RAID-Z that was nearly as simple as its double-parity cousin. The math is a bit dense; but the key observation was that, given that 3 is the smallest factor of 255 (the largest value representable by an unsigned byte), it was possible to find exactly 3 different seed or generator values after which there were collections of failures that formed uncorrectable singularities. Using that technique I was able to implement a triple-parity RAID-Z scheme that performed nearly as well as the double-parity version.
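
To give a flavor of the arithmetic, here is a simplified sketch rather than the actual ZFS vdev_raidz code: the function names are mine, and the correctability properties of the chosen generators are exactly the subtlety glossed over here. P is the plain XOR parity, while Q and R accumulate each data byte multiplied by powers of two distinct generators in GF(2^8), using the field polynomial commonly used for RAID-6.

#include <stddef.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8) with the polynomial x^8 + x^4 + x^3 + x^2 + 1. */
static inline uint8_t
gf_mul2(uint8_t a)
{
        return ((uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0)));
}

/* Multiply by 4, i.e. by 2 twice. */
static inline uint8_t
gf_mul4(uint8_t a)
{
        return (gf_mul2(gf_mul2(a)));
}

/*
 * Compute one byte column of the P, Q, and R parity over n data bytes using
 * Horner's rule. Each data disk ends up multiplied by a distinct power of the
 * generator: powers of 1 for P (plain XOR), of 2 for Q, and of 4 for R.
 */
static void
pqr_column(const uint8_t *d, size_t n, uint8_t *p, uint8_t *q, uint8_t *r)
{
        uint8_t pp = 0, qq = 0, rr = 0;

        for (size_t i = 0; i < n; i++) {
                pp ^= d[i];
                qq = gf_mul2(qq) ^ d[i];
                rr = gf_mul4(rr) ^ d[i];
        }
        *p = pp;
        *q = qq;
        *r = rr;
}

With three such parity rows, a combination of up to three missing data or parity bytes in a column can, within the limits those generator choices impose, be recovered by solving a small linear system over the field.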

As far as generic N-way RAID-Z goes, it’s still something I’d like to add to ZFS. Triple-parity will suffice for quite a while, but we may want more parity sooner for a variety of reasons. Plank’s revised algorithm is an excellent start. The test will be if it can be made to perform well enough or if some new clever algorithm will need to be devised. Now, as for what to call these additional RAID levels, I’m not sure. RAID-7 or RAID-8 seem a bit ridiculous and RAID-TP and RAID-QP aren’t any better. Fortunately, in ZFS triple-parity RAID is just raidz3.

A little over three years ago, I integrated double-parity RAID-Z into ZFS, a feature expected of enterprise class storage. This was in the early days of Fishworks when much of our focus was on addressing functional gaps. The move to triple-parity RAID-Z comes in the wake of a number of our unique advancements to the state of the art such as DTrace-powered Analytics and the Hybrid Storage Pool as the Sun Storage 7000 series products meet and exceed the standards set by the industry. Triple-parity RAID-Z will, of course, be a feature included in the next major software update for the 7000 series (2009.Q3).
