Casting the shadow of the Hybrid Storage Pool

The debate, calmly waged, over the best use of flash in the enterprise can be summarized as whether flash should be a replacement for disk, acting as primary storage, or whether it should be regarded as a new and complementary tier in the storage hierarchy, acting as a massive read cache. The market leaders in storage have weighed in on the issue, and have declared incontrovertibly that, yes, both are the right answer, but there’s some bias underlying that equanimity. Chuck Hollis, EMC’s Global Marketing CTO, writes that “flash as cache will eventually become less interesting as part of the overall discussion… Flash as storage? Well, that’s going to be really interesting.” Standing boldly with a foot in each camp, Dave Hitz, founder and EVP at Netapp, thinks that “Flash is too expensive to replace disk right away, so first we’ll see a new generation of storage systems that combine the two: flash for performance and disk for capacity.” So what are these guys really talking about, what does the landscape look like, and where does Sun fit in all this?

Flash as primary storage (a.k.a. tier 0)

Integrating flash efficiently into a storage system isn’t obvious; the simplest way is as a direct replacement for disks. This is why most of the flash we use today in enterprise systems comes in units that look and act just like hard drives: SSDs are designed to be drop-in replacements. Now, a flash SSD is quite different from a hard drive — rather than a servo spinning platters while a head chatters back and forth, an SSD has floating gates arranged in blocks… actually, it’s probably simpler to list what they have in common, and that’s just the form factor and interface (SATA, SAS, FC). Hard drives have all kinds of properties that don’t make sense in the world of SSDs (e.g. I’ve seen an SSD that reports its RPM telemetry as 1), and SSDs have their own quirks with no direct analog (read/write asymmetry, limited write cycles, etc.). SSD vendors, however, manage to pound these round pegs into their square holes, and produce something that can stand in for an existing hard drive. Array vendors are all too happy to attain buzzword compliance by stuffing these SSDs into their products.
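One of those quirks — limited write cycles — matters less than the raw number suggests once you do the arithmetic. Here’s a back-of-envelope sketch; every figure in it is an illustrative assumption, not the spec of any particular drive:

```shell
# Rough SSD endurance estimate: assuming perfect wear leveling, a drive
# can absorb roughly (capacity x rated write cycles) worth of writes.
# All numbers below are assumptions for illustration, not real specs.
capacity_gb=32         # hypothetical SLC SSD capacity
write_cycles=100000    # hypothetical rated erase/program cycles
writes_gb_per_day=100  # hypothetical sustained write workload

days=$(( capacity_gb * write_cycles / writes_gb_per_day ))
echo "~$(( days / 365 )) years until wear-out"
```

Under these assumptions even a heavy sustained write load takes decades to exhaust a small SLC drive; it’s pathological rewrite workloads, and cheaper MLC parts with far fewer cycles, that make endurance a real design constraint.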


Storage vendors already know how to deal with a caste system for disks: they stratify them in tiers, with fast, expensive 15K RPM disks as tier 1 and slower, cheaper disks filling out the chain down to tape. What to do with these even faster, more expensive SSDs? Tier 0, of course! An astute Netapp blogger asks, “when the industry comes up with something even faster… are we going to have tier -1” — great question. What’s wrong with that approach? Nothing. It works; it’s simple; and we (the computing industry) basically know how to manage a bunch of tiers of storage with something called hierarchical storage management. The trouble with HSM is the burden of the M. This solution kicks the problem down the road, leaving administrators to figure out where to put data, which applications should have priority, and when to migrate data.

Flash as a cache

The other school of thought around flash is to use it not as a replacement for hard drives, but rather as a massive cache for reading frequently accessed data. As I wrote back in June for CACM, “this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management.” Tersely, HSM without the M. This idea forms a major component of what we at Sun are calling the Hybrid Storage Pool (HSP), a mechanism for integrating flash with disk and DRAM to form a new, and — I argue — superior storage solution.

Let’s set aside the specifics of how we implement the HSP in ZFS — you can read about that elsewhere. Rather, I’ll compare the use of flash as a cache to flash as a replacement for disk independent of any specific solution.

The case for cache

It’s easy to see why using flash as primary storage is attractive. Flash is faster than the fastest disks by at least a factor of 10 for writes and a factor of 100 for reads, measured in IOPS. Replacing disks with flash, though, isn’t without nuance; there are several inhibitors, chief among them cost. The cost of flash continues to drop, but it’s still much more expensive than cheap disks, and will continue to be for quite a while. With flash as primary storage, you still need data redundancy — SSDs can and do fail — and while we could use RAID with single- or double-device redundancy, that would cleave the available IOPS by a factor of the stripe width. The reason to migrate to flash is performance, so it wouldn’t make much sense to hand the majority of that performance back with RAID. The remaining option, therefore, is to mirror SSDs, doubling the already high cost.
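To make that trade-off concrete, here’s a quick sketch with hypothetical per-device numbers (the IOPS figure is an assumption, not a benchmark):

```shell
# Parity RAID across SSDs: a small random write touches every device in
# the stripe, so deliverable write IOPS divide by the stripe width.
device_iops=10000   # assumed write IOPS for a single SSD
stripe_width=8      # devices per RAID stripe

echo "parity RAID: $(( device_iops / stripe_width )) write IOPS per stripe"
echo "mirroring:   $device_iops write IOPS, at 2x the device count"
```

With these assumed numbers, an 8-wide stripe surrenders seven-eighths of the write performance you paid for; mirroring keeps the performance but doubles the bill.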

It’s hard to argue with results: all-flash solutions do rip. If money were no object, that may well be the best solution (but if cost truly weren’t a factor, everyone would strap batteries to DRAM and call it a day).

Can flash as a cache do better? Say we need to store 50TB of data. With an all-flash pool, we’d need to buy SSDs that can hold roughly 100TB of data if we want to mirror for optimal performance, and maybe 60TB if we’re willing to accept a far more modest performance improvement over conventional hard drives. Since we’re already resigned to cutting a pretty hefty check, we have quite a bit of money to play with to design a hybrid solution. If we were to provision our system with 50TB of flash and 60TB of hard drives, we’d have enough cache to retain every byte of active data in flash while the disks provide the necessary redundancy. As writes come in, the filesystem would populate the flash while writing the data persistently to disk. The performance of this system would be epsilon away from the mirrored flash solution, since read requests would only go to disk in the case of faults from the flash devices. Note that we never rely on the flash for correctness; it’s the hard drives that provide reliability.
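In ZFS terms, provisioning a pool along those lines is straightforward. A sketch — the pool and device names here are hypothetical:

```shell
# Mirrored hard drives provide capacity and the authoritative copy.
zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0

# SSDs join as L2ARC cache devices; no redundancy is needed, since a
# cache-device failure just redirects reads to the mirrored disks.
zpool add tank cache c1t0d0 c1t1d0

# Inspect the pool layout, including the cache devices.
zpool status tank
```

The asymmetry is the point: the disks are trusted, the flash is disposable, so the cache devices can be cheap and unmirrored.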


The hybrid solution is cheaper, and it’s also far more flexible. If a smaller working set accounted for a disproportionately large number of reads, the total IOPS capacity of the all-flash solution could go underused. With flash as a cache, data could be migrated dynamically to distribute load, and additional cache could be used to enhance the performance of the working set. It would be possible to use some of the same techniques with an all-flash storage pool, but it could be tricky. The luxury of a cache is that its looser constraints allow for more aggressive data manipulation.

Building on the idea of concentrating the use of flash on hot data, it’s easy to see how flash as a cache can improve performance even without every byte present in the cache. Most data doesn’t require 50μs random access latency over the entire dataset; users would see a significant performance improvement with just the active subset in a flash cache. Of course, this means that software needs to be able to anticipate what data is in use, which probably inspired this comment from Chuck Hollis: “cache is cache — we all know what it can and can’t do.” That may be so, but comparing an ocean of flash for primary storage to a thimbleful of cache reflects fairly obtuse thinking. Caching algorithms will always be imperfect, but the massive scale to which we can grow a flash cache radically alters the landscape.
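The effect of a partial cache is easy to quantify. A sketch of the average-latency arithmetic, with assumed latencies of 50μs for flash and 5ms for a disk seek:

```shell
# Average read latency = hit_rate * flash_latency + miss_rate * disk_latency.
# The latency and hit-rate figures are illustrative assumptions.
awk 'BEGIN {
  flash = 50      # microseconds (assumed)
  disk  = 5000    # microseconds (assumed)
  h     = 0.90    # cache hit rate on the active subset (assumed)
  printf "%.0f microseconds average\n", h * flash + (1 - h) * disk
}'
```

Under these assumptions a 90% hit rate on the active subset already cuts average latency by nearly a factor of 10 — no need to cache the whole dataset.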

Even when a working set is too large to be cached, it’s possible for a hybrid solution to pay huge dividends. Over at Facebook, Jason Sobel (a colleague of mine in college) produced an interesting presentation on their use of storage (take a look at Jason’s penultimate slide for his take on SSDs). Their datasets are so vast and so sporadically accessed that the latency of actually loading a picture, say, off of hard drives isn’t the biggest concern; rather, it’s the time it takes to read the indirect blocks — the metadata. At Facebook, they’ve taken great pains to reduce the number of dependent disk accesses from fifteen down to about three. In a case such as theirs, it would never be economical to store or cache the full dataset on flash, and the working set is similarly too large, since data access can be quite unpredictable. It could, however, be possible to cache all of their metadata in flash. This would reduce the latency of an infrequently accessed image by nearly a factor of three. Today in ZFS this is a manual per-filesystem setting, but it would be possible to evolve a caching algorithm to detect a condition where this was the right policy and make the adjustment dynamically.
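That per-filesystem knob is ZFS’s secondarycache property, which controls what the L2ARC will hold. A sketch — the pool and filesystem names are hypothetical:

```shell
# Cache only metadata -- e.g. the indirect blocks on the path to an
# image -- in the flash L2ARC; data blocks bypass the cache entirely.
zfs set secondarycache=metadata tank/photos

# Confirm the policy; valid values are all, none, and metadata.
zfs get secondarycache tank/photos
```

(The companion primarycache property applies the same policy to the DRAM-resident ARC.)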

Using flash as a cache offers the potential to do better — to make more efficient and more economical use of flash. Sun, and the industry as a whole, have only just started to build the software designed to realize that potential.

Putting products before words

At Sun, we’ve just released our first line of products that offer complete flash integration with the Hybrid Storage Pool; you can read about that in my blog post on the occasion of our product launch. On the eve of that launch, Netapp announced their own offering: a flash-laden PCI card that plays much the same part as their DRAM-based Performance Acceleration Module (PAM). This will apparently be available sometime in 2009. EMC offers a tier 0 solution that employs very fast and very expensive flash SSDs.

What we have in ZFS today isn’t perfect. Indeed, the Hybrid Storage Pool casts the state of the art forward, and we’ll be catching up with solutions to the hard questions it raises for at least a few years. Only then will we realize the full potential of flash as a cache. What we have today, though, integrates flash in a way that changes the landscape of storage economics and delivers cost efficiencies that haven’t been seen before. If the drive manufacturers don’t already, it can’t be long until they hear the death knell for 15K RPM drives loud and clear. Perhaps it’s cynical or solipsistic to conclude that the timing of Dave Hitz’s and Chuck Hollis’ blog posts was designed to coincide with the release of our new product and perhaps take some of the wind out of our sails, but I will — as the commenters on Dave’s blog have — take it as a sign that we’re on the right track. For the moment, I’ll put my faith in this bit of marketing material enigmatically referenced in a number of Netapp blogs on the subject of flash:

In today’s competitive environment, bringing a product or service to market faster than the competition can make a significant difference. Releasing a product to market in a shorter time can give you first-mover advantage and result in larger market share and higher revenues.

Posted on December 1, 2008 at 11:33 pm by ahl · Permalink
In: Fishworks

8 Responses

Subscribe to comments via RSS

  1. Written by saf
    on December 3, 2008 at 1:07 pm
    Permalink

    Nice article.
    I’m wondering what has been changed in ZFS to adapt it to SSDs?
    And when will Sun sell servers with SSDs?

  2. Written by Adam Leventhal
    on December 4, 2008 at 12:51 pm
    Permalink

    @saf To build the hybrid storage pool, we added the L2ARC to ZFS. We also use the ability to have separate intent-log devices, but that was already part of ZFS. We already sell the Sun Storage 7000 series with flash drives; I don’t know when we’ll be selling SSDs in general-purpose servers.

  3. Written by Ts
    on December 5, 2008 at 5:54 pm
    Permalink

    Well, whether or not people should use flash SSDs should depend on how fast the data gets updated. Too fast, and flash will fail prematurely, driving up the cost. If data is written once and never updated, then even MLC SSD devices can be used, even with 10K-cycle endurance. Things like historical tick data are perfect for MLC flash. Things like 2K-sized images that get deleted once in a while should be cached in the L2ARC with SLC SSDs. Things that update too fast you have to handle with either memcache or pure brute-force 15K SAS drives.
    What is interesting is that with the Intel SSD on the market now, other Samsung MLCs are going for 150 dollars per 64GB of MLC flash. At that price ZFS should try to use it as a massive L2ARC anyway.

  4. Written by Adam Leventhal
    on December 9, 2008 at 4:38 pm
    Permalink

    @Ts I think you’re still thinking in terms of flash as primary storage. With the Hybrid Storage Pool, you can easily limit the rate at which you write to flash. Further, the primary ARC serves data under heavy churn — transient data is unlikely to survive long enough to ever be written to flash. Again, this is like HSM without the M.

  5. Written by Storage Guy
    on December 11, 2008 at 12:43 am
    Permalink

    These days most NAS devices are being built as N-way clusters, like Isilon, ONTAP-GX, and HP StorageWorks. These generally require an N-way clustered filesystem as well.
    When will the Storage 7000 series from Sun be N-way clustered? Is it because ZFS is not a clustered filesystem?

  6. Written by Adam Leventhal
    on December 11, 2008 at 9:50 am
    Permalink

    @Storage Guy I’m not sure; you might find a blog post about clustering and ask there…

  7. Written by Dave Nicholson
    on December 18, 2008 at 1:36 am
    Permalink

    Where do you see flash as an "extension of system memory" in servers and RDMA (remote direct memory access) fitting in to your vision of the future? Flash is obviously a game-changing technology that makes all of the assumptions upon which current storage systems are based a bit out of whack. Leveraging ZFS and creating a Hybrid Pool doesn’t change the plumbing between the server and the storage. Nor does it change the nature of the applications that all of this machinery is designed to support. If every node in a grid of servers had TBs of on-board flash at its disposal, the future for storage vendors may include monster farms of "50TB SATA-V" drives instead of the sexy flash they crave.
    On another note, delivering flash as part of a legacy architecture in the form of stuffing it into a disk can is simply a method of accomplishing the first mover advantage you reference above. No one sees it as the optimal use of flash. Least of all the folks who currently do the stuffing. :-)
    Great discussion.

  8. Written by TS
    on December 18, 2008 at 2:46 pm
    Permalink

    "Further, the primary ARC serves data under heavy churn — transient data is unlikely to survive long enough to ever be written to flash. Again, this is like HSM without the M."
    I know. But you do have to consider the cases where transient data doesn’t fit in memory. We are at an age where data is so massive, we need more memory than maxing out the 64 DIMM slots in the servers with the biggest DIMMs you can find, at exponential price steps. Take a case where, in a web 2.0 site, when a user logs in, you have to show on the web site that they are logged in, and a timestamp. That is one small transient IOPS of a few bytes at most. Right now it is stored in a cluster of memcache boxes (if you wanted to reduce database IO write load). In other words, it is not persistent. You can have redundant memcache boxes, but you can’t preserve that transient data if power suddenly goes out, which means you can only store transient data in memory if you can afford to lose it.
    There are use cases where the churn rate is extremely high (real-time stock ticks, for example) such that right now we use a delayed, non-persistent model (in-memory cache), which is incorrect from a database ACID perspective (specifically durability). You cannot assume that ZFS can handle all of the high-churn-rate data in the ARC from a practical perspective.
    On the other hand, for slow-churn data such as profile pictures on dating sites (which once uploaded will hardly be updated), using non-RAIDed MLC flash as L2ARC devices makes perfect sense. If they break, you can still fetch the data from the RAID10 hard drives below the L2ARC.
    I don’t think that even SLC SSDs can be used as primary storage, the reason being that you need a minimum of 2x for RAID. RAID1 or RAID10, which was really good for hard drives, cannot be used for flash, because the data are mirrored, which means SSDs with similar write cycles will fail at about the same time, even with a really good wear-leveling algorithm. RAID5/6 has a problem where one IO is spread across multiple SSDs, which is a write IOPS amplification factor that effectively reduces the SSD life cycle by the number of drives in the array.
    We will see. ZFS kicks ass.
