Adam Leventhal's blog

Tag: ZFS

This week it was my pleasure to welcome my former Sun colleagues Matt Ahrens and George Wilson to Delphix. Matt and I studied computer science together at Brown and then joined Sun in 2001. Matt joined Jeff Bonwick to start ZFS while I worked on DTrace. George joined Sun in 1996 and worked in a variety of roles before joining the ZFS team in 2006 (just as I was leaving to help start Fishworks).

George and Matt bring an amazing knowledge of ZFS — the lower and upper halves, respectively — and are also just great engineers who are already contributing tangibly to the success of Delphix. You can take a look at Matt’s old blog, or watch George in a bunch of videos (including one of him being interviewed by a muppet).

Chris George from DDRdrive put together a great presentation at the OpenStorage Summit looking at the ZFS intent log (ZIL), and how their product is particularly well-suited as a ZIL device. Chris did an especially interesting analysis of the I/O pattern ZFS generates to ZIL devices (using DTrace, of course). With activity to a single ZFS dataset, writes are almost 100% sequential, but with activity to multiple datasets, writes become significantly less sequential. The ZIL was initially designed to accelerate performance with a dedicated hard drive, but the Hybrid Storage Pool found a significantly better ZIL device in write-optimized, flash SSDs.
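
As a rough illustration of that sort of analysis (not Chris's actual methodology, and the trace format here is made up), a minimal C sketch that post-processes a log of per-device write offsets and tallies what fraction of writes pick up exactly where the previous write to the same device left off:

    /*
     * Hypothetical post-processing of a write trace: a write counts as
     * sequential if it begins where the previous write to the same device
     * ended. The sample trace is invented for illustration.
     */
    #include <stdio.h>
    #include <stdint.h>

    struct wr {
        int dev;        /* device index */
        uint64_t off;   /* byte offset of the write */
        uint64_t len;   /* length of the write in bytes */
    };

    int main(void)
    {
        struct wr trace[] = {
            { 0, 0, 4096 },         /* first write to dev 0 */
            { 0, 4096, 4096 },      /* contiguous with the previous write */
            { 1, 0, 4096 },         /* first write to dev 1 */
            { 0, 8192, 4096 },      /* still contiguous on dev 0 */
            { 0, 65536, 4096 },     /* a seek on dev 0: non-sequential */
        };
        uint64_t next[2] = { 0, 0 };    /* expected next offset per device */
        int i, seq = 0, total = sizeof (trace) / sizeof (trace[0]);

        for (i = 0; i < total; i++) {
            if (trace[i].off == next[trace[i].dev])
                seq++;
            next[trace[i].dev] = trace[i].off + trace[i].len;
        }
        printf("%d of %d writes sequential (%.0f%%)\n",
            seq, total, 100.0 * seq / total);
        return 0;
    }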

In the 7000 series, the performance of these SSDs — called Logzillas — isn’t particularly sensitive to random write patterns. Less sophisticated, cheaper SSDs are significantly more affected by randomness: both performance and longevity can suffer.

Chris concludes that NV-DRAM is better suited than flash for the ZIL (Oracle’s Logzilla built by STEC actually contains a large amount of NV-DRAM). I completely agree; further, if HDDs and commodity SSDs continue to be targeted as ZIL devices, ZFS could and should do more to ensure that writes are sequential.

I had the chance to speak at the OpenStorage Summit a couple of weeks ago about RAID-Z (the ZFS implementation of RAID). The talk was an accumulation of blog posts and articles written by me and others, as well as quite a bit of new material that’s been building up. It covered the history of RAID-Z, the strengths and weaknesses that have emerged, and the challenges ahead for ZFS and RAID, along with some possible solutions and mitigating factors. Thanks to Nexenta for putting the conference together; questions or comments are very welcome.

This year’s Flash Memory Summit got me thinking about our use of SSDs over the years at Fishworks. The picture on the left is a visual history of SSD evals in rough chronological order from the oldest at the bottom to the newest at the top (including some that have yet to see the light of day).

Early Days

When we started Fishworks, we were inspired by the possibilities presented by ZFS and Thumper. Those components would be key building blocks in the enterprise storage solution that became the 7000 series. An immediate deficiency we needed to address was how to deliver competitive performance using 7,200 RPM disks. Folks like NetApp and EMC use PCI-attached NV-DRAM as a write accelerator. We evaluated something similar, but found the solution lacking because it had limited scalability (the biggest NV-DRAM cards at the time were 4GB), consumed our limited PCIe slots, and required a high-speed connection between nodes in a cluster (e.g. IB, further eating into our PCIe slot budget).

The idea we had was to use flash. None of us had any experience with flash beyond cell phones and USB sticks, but we had the vague notion that flash was fast and getting cheaper. By luck, flash SSDs were just about to be where we needed them. In late 2006 I started evaluating SSDs on behalf of the group, looking for what we would eventually call Logzilla. At that time, SSDs were getting affordable, but were designed primarily for environments such as military use where ruggedness was critical. The performance of those early SSDs was typically awful.

Logzilla

STEC — still Simpletech in those days — realized that their early samples didn’t really suit our needs, but they had a new device (partly due to the acquisition of Gnutech) that would be a good match. That first sample was fibre-channel and took some finagling to get working (memorably, it required metric screws of an odd depth), but the Zeus IOPS, an 18GB 3.5″ SATA SSD using SLC NAND, eventually became our Logzilla (we’ve recently updated it with a SAS version for our updated SAS-2 JBODs). Logzilla addressed write performance economically and scalably, in a way that also simplified clustering; the next challenge was read performance.

Readzilla

Intent on using commodity 7,200 RPM drives, we realized that our random read latency would be about twice that of 15K RPM drives (duh). Fortunately, most users don’t access all of their data randomly (regardless of how certain benchmarks are designed). We already had much more DRAM cache than other storage products in our market segment, but we thought that we could extend that cache further by using SSDs. In fact, the invention of the L2ARC followed a slightly different thought process: seeing the empty drive bays in the front of our system (just two were used as our boot disks) and the piles of SSDs lying around, I stuck the SSDs in the empty bays and figured out how we’d use them.
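
The factor of two falls out of rotational latency alone: on average the head waits half a revolution for the target sector. A back-of-the-envelope sketch (ignoring seek time, which also favors 15K RPM drives):

    /* Average rotational latency is half a revolution; seek time ignored. */
    #include <stdio.h>

    static double avg_rotational_ms(double rpm)
    {
        return (60.0 * 1000.0 / rpm) / 2.0;   /* ms per revolution, halved */
    }

    int main(void)
    {
        printf("7,200 RPM:  %.2f ms\n", avg_rotational_ms(7200));   /* ~4.17 ms */
        printf("15,000 RPM: %.2f ms\n", avg_rotational_ms(15000));  /* ~2.00 ms */
        return 0;
    }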

It was again STEC who stepped up to provide our Readzilla, a 100GB 2.5″ SATA SSD using SLC flash.

Next Generation

Logzilla and Readzilla are important features of the Hybrid Storage Pool. For the next generation expect the 7000 series to move away from SLC NAND flash. It was great for the first generation, but other technologies provide better $/IOPS for Logzilla and better $/GB for Readzilla (while maintaining low latency). For Logzilla we think that NV-DRAM is a better solution (I reviewed one such solution here), and for Readzilla MLC flash has sufficient performance at much lower cost, and ZFS will be able to ensure its longevity.

The mission of ZFS was to simplify storage and to construct an enterprise level of quality from volume components by building smarter software — indeed that notion is at the heart of the 7000 series. An important piece of that puzzle was eliminating the expensive RAID card used in traditional storage and replacing it with high-performance software RAID. To that end, Jeff invented RAID-Z; its key innovation over other software RAID techniques was to close the “RAID-5 write hole” by using variable width stripes. RAID-Z, however, is definitely not RAID-5 despite that being the most common comparison.

RAID levels

Last year I wrote about the need for triple-parity RAID, and in that article I summarized the various RAID levels as enumerated by Gibson, Katz, and Patterson, along with Peter Chen, Edward Lee, and me:

  • RAID-0 Data is striped across devices for maximal write performance. It is an outlier among the other RAID levels as it provides no actual data protection.
  • RAID-1 Disks are organized into mirrored pairs and data is duplicated on both halves of the mirror. This is typically the highest-performing RAID level, but at the expense of lower usable capacity.
  • RAID-2 Data is protected by memory-style ECC (error correcting codes). The number of parity disks required is proportional to the log of the number of data disks.
  • RAID-3 Protection is provided against the failure of any disk in a group of N+1 by carving up blocks and spreading them across the disks — bitwise parity. Parity resides on a single disk.
  • RAID-4 A group of N+1 disks is maintained such that the loss of any one disk would not result in data loss. A single disk is designated as the dedicated parity disk. Not all disks participate in reads (the dedicated parity disk is not read except in the case of a failure). Typically parity is computed simply as the bitwise XOR of the other blocks in the row (see the sketch after this list).
  • RAID-5 N+1 redundancy as with RAID-4, but with distributed parity so that all disks participate equally in reads.
  • RAID-6 This is like RAID-5, but employs two parity blocks, P and Q, for each logical row of N+2 disk blocks.
  • RAID-7 Generalized M+N RAID with M data disks protected by N parity disks (without specifications regarding layout, parity distribution, etc).
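
To make the parity schemes above concrete, here is a minimal sketch of the N+1 XOR parity used by RAID-4 and RAID-5 (block size and contents are arbitrary): the parity block is the XOR of the data blocks in a row, and any single missing block can be rebuilt by XORing the survivors.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define NDATA 3     /* data disks in the row */
    #define BLKSZ 8     /* tiny block size for illustration */

    int main(void)
    {
        uint8_t data[NDATA][BLKSZ] = { "block-0", "block-1", "block-2" };
        uint8_t parity[BLKSZ] = { 0 };
        uint8_t rebuilt[BLKSZ];
        int d, i;

        /* Parity block: XOR of all data blocks in the row. */
        for (d = 0; d < NDATA; d++)
            for (i = 0; i < BLKSZ; i++)
                parity[i] ^= data[d][i];

        /* Pretend disk 1 died: rebuild it from the survivors plus parity. */
        for (i = 0; i < BLKSZ; i++)
            rebuilt[i] = parity[i] ^ data[0][i] ^ data[2][i];

        printf("rebuilt disk 1 %s the original\n",
            memcmp(rebuilt, data[1], BLKSZ) == 0 ? "matches" : "does not match");
        return 0;
    }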

RAID-Z: RAID-5 or RAID-3?

Initially, ZFS supported just one parity disk (raidz1), and later added two (raidz2) and then three (raidz3) parity disks. But raidz1 is not RAID-5, and raidz2 is not RAID-6. RAID-Z avoids the RAID-5 write hole by distributing logical blocks among disks, whereas RAID-5 aggregates unrelated blocks into fixed-width stripes protected by a parity block. This actually means that RAID-Z is far more similar to RAID-3, in which blocks are carved up and distributed among the disks; whereas RAID-5 puts a single block on a single disk, RAID-Z and RAID-3 must access all disks to read a single block, thus reducing the effective IOPS.
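
A rough worked example of that IOPS penalty, using numbers I'm assuming for illustration (roughly 150 random-read IOPS per 7,200 RPM drive and a group of 8 disks), not measurements:

    /*
     * Crude model: RAID-Z must read every data disk in the group to return
     * one block, so the group delivers roughly one disk's worth of random
     * reads; RAID-5 reads a block from a single disk, so random reads scale
     * with the number of disks. The per-disk figure is assumed.
     */
    #include <stdio.h>

    int main(void)
    {
        const double disk_iops = 150.0;  /* assumed random-read IOPS per 7,200 RPM disk */
        const int ndisks = 8;            /* disks in the group */

        printf("RAID-Z group of %d: ~%.0f random-read IOPS\n", ndisks, disk_iops);
        printf("RAID-5 group of %d: ~%.0f random-read IOPS\n", ndisks, disk_iops * ndisks);
        return 0;
    }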

RAID-Z takes a significant step forward by enabling software RAID, but at the cost of backtracking on the evolutionary hierarchy of RAID. Now with advances like flash pools and the Hybrid Storage Pool, the IOPS from a single disk may be of less importance. But a RAID variant that shuns specialized hardware like RAID-Z and yet is economical with disk IOPS like RAID-5 would be a significant advancement for ZFS.

A key component of the ZFS Hybrid Storage Pool is Logzilla, a very fast device to accelerate synchronous writes. This component hides the write latency of disks to enable the use of economical, high-capacity drives. In the Sun Storage 7000 series, we use some very fast SAS and SATA SSDs from STEC as our Logzilla — the devices are great and STEC continues to be a terrific partner. The most important attribute of a good Logzilla device is that it have very low latency for sequential, uncached writes. The STEC part gives us about 100μs latency for a 4KB write — much, much lower than most SSDs. Using SAS-attached SSDs rather than the more traditional PCI-attached, non-volatile DRAM enables a much simpler and more reliable clustering solution since the intent-log devices are accessible to both nodes in the cluster, but SAS is much slower than PCIe…
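
For a sense of scale (my arithmetic, and the single-stream assumption is mine), a stream that waits for each 4KB synchronous write to complete before issuing the next is capped by 100μs of latency at about 10,000 writes per second, or roughly 40 MB/s:

    /* One outstanding 4KB write at a time: latency bounds the single-stream rate. */
    #include <stdio.h>

    int main(void)
    {
        const double latency_s = 100e-6;    /* 100 microseconds per write */
        const double bytes = 4096.0;        /* 4KB per write */
        double wps = 1.0 / latency_s;

        printf("~%.0f sync writes/sec per stream\n", wps);                    /* 10000 */
        printf("~%.1f MB/s per stream\n", wps * bytes / (1024.0 * 1024.0));   /* ~39.1 */
        return 0;
    }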

DDRdrive X1

Christopher George, CTO of DDRdrive, was kind enough to provide me with a sample of the X1, a 4GB NV-DRAM card with flash as a backing store. The card contains 4 DIMM slots populated with 1GB DIMMs; it’s a full-height card, which limits its use in Sun/Oracle systems (typically half-height only), but there are many systems that can accommodate the card. The X1 employs a novel backup power solution; our Logzilla used in the 7000 series protects its DRAM write cache with a large super-capacitor, and many NV-DRAM cards use a battery. Supercaps can be limiting because of their physical size, and batteries have a host of problems including leaking and exploding. Instead, the DDRdrive solution puts a DC power connector on the PCIe faceplate and relies on an external source of backup power (a UPS for example).

Performance

I put the DDRdrive X1 in our fastest prototype system to see how it performed. A 4K write takes about 51μs — better than our SAS Logzilla — but the SSD outperformed the X1 at transfer sizes over 32KB. The performance results on the X1 are already quite impressive, and since I ran those tests the firmware and driver have undergone several revisions to improve performance even more.

As a Logzilla

While the 7000 series won’t be employing the X1, for uses of ZFS that don’t involve clustering and for which external backup power is an option, the X1 is a great and economical Logzilla accelerator. Many users of ZFS have already started hunting for accelerators, and have tested out a wide array of SSDs. The X1 is a far more targeted solution, and is a compelling option. And if write performance has been a limiting factor in deploying ZFS, the X1 is a good reason to give ZFS another look.
