ZFS+10: illumos meetup

ZFS recently celebrated its informal 10th anniversary; to mark the occasion, Delphix hosted a ZFS-themed meetup for the illumos community (sponsored generously by Joyent). Many thanks to Deirdre Straughan, the new illumos community manager, for helping to organize and for filming the event. Three of my colleagues at Delphix presented work they’ve been doing in the ZFS ecosystem.

Matt Ahrens, who (with Jeff Bonwick) invented ZFS back in 2001, started the program with a discussion of a new stable interface for ZFS. Initially libzfs had been designed as a set of helper functions in support of the zfs(1M) and zpool(1M) commands; since then, it has outgrown those humble ambitions and a new, simple, stable interface is needed for programmatic consumers of ZFS. In Matt’s talk and blog post, he lays out a series of guiding principles for the new libzfs_core library; he’s already started to implement these ideas for new ZFS features in development at Delphix.

John Kennedy has been working on a relatively neglected part of illumos: automated testing. At the meetup John spoke about the work he’s been doing to revitalize the ZFS test suite, and to build a unit testing framework for illumos at large. I found the questions and enthusiasm from the people in the room particularly encouraging — everyone knows that we need to be doing more testing, but until John stepped up, no one was leading the charge. The ZFS test suite is available on GitHub. Take a look at John’s blog post to see how to execute the ZFS test suite, and how you can contribute to illumos by helping him diagnose and fix the 60+ outstanding failures.

Chris Siden has been at Delphix only since he graduated from Brown University this past spring, but he’s already made a tremendous impact on ZFS. Chris presented the work he’s done to finish the project started by Basil Crow (also of Brown, and soon full-time at Delphix) on ZFS feature flags (originally presented to the ZFS community by Matt back in May). Previously, ZFS features followed a single, linear versioning scheme; with Chris and Basil’s work there is no longer a land-grab for the next version number: each feature can be enabled discretely. Chris also implemented the world’s first flagged ZFS feature, Async Destroy (known to ZFS feature flags as com.delphix:async_destroy), which allows datasets to be destroyed asynchronously in the background — a huge boon when destroying gigantic ZFS datasets. Chris also presented some work he’s been doing on backward compatibility testing for ZFS; check out his blog post on both subjects.

The illumos meetup was a great success. Thank you to everyone who attended in person or on the web. To get involved with the ZFS community, join the illumos ZFS mailing list, and for information on the next illumos meetup, join the group.

Posted on January 20, 2012 at 9:39 pm by ahl
In: Delphix, illumos

The case of the un-unmountable tmpfs

Every once in a rare while our development machines would encounter a fatal error during boot because we couldn’t unmount tmpfs. This weekend I cracked the case, so I thought I’d share my uses of boot-time DTrace, and the musty corners of the operating system that I encountered along the way. First I should explain a little bit about what happens during boot and why we were unmounting a tmpfs filesystem.

Upgrade and boot

When you upgrade your Delphix system from one version to the next, we preserve your configuration. Part of that system configuration lives in SMF, the Service Management Facility, which is stored in the filesystem at /etc/svc. We keep /etc/svc in its own ZFS dataset so that we can snapshot and clone it for an upgrade to save the old data (in case we need to roll back) and keep the settings. This gets a tad tricky because the kernel mounts tmpfs on /etc/svc/volatile to provide scratch space for early processes like init(1); before we can mount on /etc/svc, we have to unmount /etc/svc/volatile. Here’s what that part of our boot script looks like:

#
# The kernel mounts tmpfs on /etc/svc/volatile so we need to
# unmount that before mounting the svc dataset on /etc/svc.
#
umount /etc/svc/volatile
mount -F zfs $base/running/svc /etc/svc
mount -F tmpfs /etc/svc/volatile

The problem

Infrequently — but not never — we’d see that unmount of /etc/svc/volatile fail with EBUSY. The boot script would stop and report the error; subsequent attempts to unmount /etc/svc/volatile would succeed. So there wasn’t much to go on. The tmpfs code did reveal this:

static int
tmp_unmount(struct vfs *vfsp, int flag, struct cred *cr)
{
        ...
        tnp = tm->tm_rootnode;
        if (TNTOV(tnp)->v_count > 1) {
                mutex_exit(&tm->tm_contents);
                return (EBUSY);
        }
        for (tnp = tnp->tn_forw; tnp; tnp = tnp->tn_forw) {
                if ((vp = TNTOV(tnp))->v_count > 0) {
                        ...
                        return (EBUSY);
                }
                VN_HOLD(vp);
        }
        ...

So someone had an additional hold on either the root of the filesystem or a file in it. I looked at the contents of /etc/svc/volatile and found one file: init.state. Digging through the code for init(1) I was surprised to find that init(1) keeps a state file around with a list of processes of interest. It doesn’t keep the file descriptor open (which would prevent us from unmounting the filesystem), but it does rewrite the file from time to time. I was worried that init(1) might be racing with our script. I didn’t want to understand the brutal complexity of the code, so I amended our boot script to do the following:

pstop 1 # stop init
umount /etc/svc/volatile
mount -F zfs $base/running/svc /etc/svc
mount -F tmpfs /etc/svc/volatile
prun 1 # resume init

If the unmount failed, I’d be able to use pfiles(1) to see if init(1) did, in fact, have something open in /etc/svc/volatile. I was convinced that in trying to observe the problem, I’d chase it away — a Heisenbug — but after a short while of running a reboot loop, we hit the problem, and init(1) didn’t have anything open in /etc/svc/volatile. What next…

Boot-time DTrace

The problem was that by the time I’d get to the system, the conditions that caused the error had resolved themselves. What I wanted to do was panic the system when tmp_unmount() returned EBUSY so that I could poke around with a debugger. On many systems that would entail compiling in debug logic, but fortunately a DTrace-enabled system has a better option. My former colleague Bryan Cantrill invented anonymous DTrace for looking at boot-time issues — earlier in boot than when one could execute the dtrace(1M) command-line utility. To use boot-time DTrace, specify the D program as usual, but add the -A option to add the D program to the DTrace kernel module’s boot-time configuration. After rebooting, DTrace will enable the program, whose output can later be retrieved with dtrace -a. In my case, I wanted to drop into the kernel debugger when tmp_unmount() returned EBUSY, so I ran DTrace like this:

dtrace -A -w -n 'fbt:tmpfs:tmp_unmount:return/arg1 == EBUSY/{ panic(); }'

Again after many reboots, we hit the problem and dropped into the debugger. Thanks to infrastructure put into the kernel by my colleague Eric Schrock, I was able to quickly see the identity of the file whose reference count prevented us from unmounting tmpfs:

[5]> ffffff0197321b00::print vnode_t v_path
v_path = 0xffffff0197125d60 "/etc/svc/volatile/init-next.state"

It’s worth noting that v_path isn’t guaranteed to be correct, but may reflect some stale state. In this case, I examined the directory structure of tmpfs and found that the filename was actually /etc/svc/volatile/init.state — v_path isn’t updated on a rename. But I couldn’t for the life of me figure out who had that file open. None of the (few) other processes were touching the file. I looked through the fsflush code which flushes cached data back to disk, but that didn’t make a lot of sense, and didn’t seem to be causing problems. The pageout thread isn’t supposed to run unless the system is low on memory. I used kmdb’s ::kgrep command to find places where the vnode_t or its associated page were referenced. There were many, and I quickly got lost in the bowels of the VM system. Rather than groveling through the kernel’s structures, I decided to turn back to DTrace. The next question I wanted to answer was this: after tmp_unmount() returns EBUSY, who is it that releases the reference on that tmpfs vnode_t? To answer it, I wrote this D script:

fbt:tmpfs:tmp_unmount:entry
{
        self->vfs = args[0];
}

fbt:tmpfs:tmp_unmount:return
/arg1 == EBUSY/
{
        gotit = self->vfs;
}

fbt:tmpfs:tmp_unmount:return
{
        self->vfs = NULL;
}

fbt:genunix:vn_rele:entry
/gotit != NULL && args[0]->v_vfsp == gotit/
{
        panic();
}

I installed that as my anonymous DTrace enabling, rebooted, and waited.

Who dunnit

Like the Mystery Inc. gang unmasking the criminal, helplessly caught by the elaborate trap, I used the kernel debugger to identify the guilty subsystem and found that it was none other than harmless old Mr. Pageout. Gasp! Why was pageout running at all? The system had plenty of memory so it wouldn’t normally be running, except that there’s an exception made very very early in boot (it turns out). In the first second after boot, pageout will execute exactly four times in order to fill in certain performance-related parameters that let it predict how long it will take to page out memory in the future. When it executes, pageout will identify unused pages and take a temporary hold on them — this is exactly the pathology at the root of our problem!

Solution

I’m still working on exactly how to solve this. The simplest approach would be to sleep for a second before trying the unmount. Slightly more complicated would be to try unmounting in a loop until a second had passed (checking $SECONDS in bash). More complicated still would be to do a rethink of pageout — I still don’t fully understand how it works, but it really seems like it’s making assumptions that have been invalidated in the past decade and contains this gem of a comment:

For now, this thing is in very rough form.

Note that “now” in this case referred to 1987 or possibly earlier — as Roger Faulkner would say, “it came from New Jersey.”
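
The retry-in-a-loop option described above might look something like the following in bash. This is only a sketch under some assumptions: retry_for is a hypothetical helper name (it is not part of our actual boot script), the 0.1-second pause assumes a sleep(1) that accepts fractional arguments, and the limit of 2 is used because $SECONDS only has whole-second resolution, so a limit of 1 could expire almost immediately.

```shell
#!/bin/bash
# retry_for LIMIT CMD [ARGS...]
# Re-run CMD until it succeeds or roughly LIMIT seconds have elapsed.
# (Hypothetical helper, for illustration only.)
retry_for() {
    local limit=$1
    shift
    # SECONDS counts whole seconds since the shell started.
    local deadline=$(( SECONDS + limit ))
    while ! "$@"; do
        if (( SECONDS >= deadline )); then
            return 1        # give up: CMD kept failing until the deadline
        fi
        sleep 0.1           # assumes sleep accepts fractional seconds
    done
    return 0
}

# Possible use in the boot script (not run here):
#   retry_for 2 umount /etc/svc/volatile || exit 1
```

The appeal over a plain one-second sleep is that the common case, where the first unmount succeeds, pays no delay at all.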

Conclusion

Pageout would have gotten away with it if it hadn’t been for these meddling tools! DTrace during boot is awesome — when you need it, it’s a life saver. There are some places so early in boot that DTrace can’t help; for that VProbes can give you some DTrace-like functionality. And mature systems can have some musty corners so your tools had better be up to the task.

Posted on December 12, 2011 at 4:39 pm by ahl
In: DTrace

ZFS 10th anniversary

Exactly 10 years ago today, Jeff Bonwick and Matt Ahrens got their first ZFS prototype working in user-land. Jeff had scrapped his previous attempt at reinventing filesystems, working through the established filesystem management and engineering channels at Sun, and this time started with a clean sheet of paper. Matt had joined Sun that June shortly after graduating from Brown University. Both prodigious coders, the duo, in remarkably short order, showed us a glint of what ZFS would be. A year later, the master and apprentice had ZFS working in the kernel, moving data from end to end. Three years after that, standing in front of a team of a dozen engineers, Matt typed ‘putback’ to integrate ZFS into Solaris. The distance ZFS has traveled these past 10 years has been monumental, and ZFS has indelibly impacted the industry. ZFS is one of the load-bearing pillars here at Delphix; without it, our task would have been too ambitious to even begin. Congratulations to our own Matt Ahrens on this milestone, as well as to Jeff, and everyone else who has contributed to ZFS over the last 10 years including the growing community building new products around ZFS and illumos.

Update: Check out Matt’s blog post on the subject.

Posted on November 1, 2011 at 5:43 am by ahl
In: ZFS

illumos hackathon on October 24

On Monday, the Delphix systems crew is rolling down the 101 to the illumos hackathon in San Jose. Anyone who’s working on illumos, developing illumos-derived technologies like ZFS or DTrace, or who wants to cut some OS code, should drop by. Here’s the sign up.

What’s a hackathon? Not exactly sure, but we’re hoping to cut a bunch of code, and hopefully build some neat stuff in a short period of time. The basic plan is that we’ll meet at the Wyndham at 10am — if you can’t track us down, message me on twitter. Bring your ideas for cool, fun projects, that a few people might be able to bang out in a marathon day. At Delphix, we’ve been thinking about time-correlated DTrace, libzfs v2, tab-completion for mdb, some build system improvements, and something goofy we call tarfs. We’ll throw the ideas at the whiteboard, break into project teams, and hack them up until we have something to demo (and Joyent has offered up build servers if you need one).

For now, think about what you’d want to build (think fun but small — we only have the day), pull down the illumos source, and hoard your favorite source of caffeine. Got a good idea? Post it in the comments. And I hope to see you Monday.

Posted on October 21, 2011 at 5:21 am by ahl
In: illumos

Welcome Matt Amdur

It’s my pleasure to welcome Matt Amdur to Delphix, to the world of DTrace, and — just today — to the blogosphere. Matt joined Delphix about two months ago after 10 years of software engineering, most recently at VMware. Matt and I met at Brown University in 1997 where we worked together closely for all four years. We’ve had in the back of our minds that our professional lives would converge at some point; I couldn’t be happier to have my good friend onboard.

Matt’s first blog post is a great war story, describing his first use of DTrace on a customer system. It was vindicating to witness first hand, both in how productive an industry vet could be with DTrace after a short period, and in what a great hire Matt was for Delphix. Working with him has been evocative of our collaboration in college — while making all the more apparent the significant distance we’ve both come. Welcome, Matt!

Posted on October 12, 2011 at 10:40 pm by ahl
In: Delphix

Oracle’s port: this is not DTrace

After writing about Oracle’s port of DTrace to OEL, I wanted to take it for a spin. Following the directions that Wim Coekaerts spelled out, I installed and configured a VM to run OEL with Oracle’s nascent DTrace port. Setting up the system was relatively painless.

Here’s my first DTrace invocation on OEL:

[root@screven ~]# uname -a
Linux screven 2.6.32-201.0.4.el6uek.x86_64 #1 SMP Tue Oct 4 16:47:00 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@screven ~]# dtrace -n 'BEGIN{ trace("howdy from linux"); }'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN   howdy from linux
^C

Then I wanted to see what was on the system:

[root@screven ~]# dtrace -l | wc -l
574

Are you kidding me? For comparison, my Mac has 154,918 probes available and our illumos-derived Delphix OS has 77,320 (Mac OS X has many probes pre-created for each process). It looks like this beta only has the syscall provider, but digging around I can see that the profile provider, which Wim didn’t mention, is also there:

[root@screven ~]# modprobe profile
[root@screven ~]# dtrace -l | wc -l
587

Sweet.

[root@screven ~]# dtrace -n profile:::profile-997
dtrace: failed to enable 'profile:::profile-997': Failed to enable probe

Not that sweet.

At least I can run my favorite DTrace script:

[root@screven ~]# dtrace -n syscall:::entry'{ @[execname] = count(); }'
dtrace: description 'syscall:::entry' matched 285 probes
^C
pickup                                                            9
abrtd                                                            11
qmgr                                                             17
rsyslogd                                                         25
rs:main Q:Reg                                                    35
master                                                           52
tty                                                              60
dircolors                                                        80
hostname                                                         92
tput                                                             92
id                                                              198
unix_chkpwd                                                     550
auditd                                                          599
dtrace                                                          760
bash                                                           1515
sshd                                                           8327

I wanted to trace activity when I connected to the system using ssh… but ssh logins fail with all probes enabled. To repeat: ssh logins fail with DTrace probes enabled. I’d try to debug it, but I’m too dejected.

Evaluation

While I’d like to give this obviously nascent port the benefit of the doubt, its current state is frankly embarrassing. It’s very clear now why Oracle wasn’t demonstrating this at OpenWorld last week: it doesn’t stand up to the mildest level of scrutiny. It’s fine that Oracle has embarked on a port of DTrace to the so-called unbreakable kernel, but this is months away from being usable. Announcing a product of this low quality and value calls into question Oracle’s credibility as a technology provider. Further, this was entirely avoidable; there were other DTrace ports to Linux that Oracle could have used as a starting point to produce something much closer to functional.

This is not DTrace

So, OEL users, know that this is not DTrace. This is no better than one of the DTrace knockoffs and in many ways much worse. What Oracle released is worse than worthless by violating perhaps the most fundamental tenet of DTrace: don’t damage the system. And, to the OEL folks, I’m sure you’ll get there, but how about you take down your beta until it’s ready? As it is, people might get the wrong impression about what DTrace is.

Posted on October 10, 2011 at 10:43 pm by ahl
In: DTrace

DTrace for Linux

Yesterday (October 4, 2011) Oracle made the surprising announcement that they would be porting some key Solaris features, DTrace and Zones, to Oracle Enterprise Linux. As one of the original authors, the news about DTrace was particularly interesting to me, so I started digging.

I should note that this isn’t the first time I’ve written about DTrace for Linux. Back in 2005, I worked on Linux-branded Zones, Solaris containers that contained a Linux user environment. I wrote a coyly-titled blog post about examining Linux applications using DTrace. The subject was honest — we used precisely the same techniques to bring the benefits of DTrace to Linux applications — but the title wasn’t completely accurate. That wasn’t exactly “DTrace for Linux”, it was more precisely “The Linux user-land for Solaris where users can reap the benefits of DTrace”; I chose the snappier title.

I also wrote about DTrace knockoffs in 2007 to examine the Linux counter-effort. While the project is still in development, it hasn’t achieved the functionality or traction of DTrace. Suggesting that Linux was inferior brought out the usual NIH reactions which led me to write a subsequent blog post about a theoretical port of DTrace to Linux. While a year later Paul Fox started exactly such a port, my assumption at the time was that the primary copyright holder of DTrace wouldn’t be the one porting DTrace to Linux. Now that Oracle is claiming a port, the calculus may change a bit.

What is Oracle doing?

Even among Oracle employees, there’s uncertainty about what was announced. Ed Screven gave us just a couple of bullet points in his keynote; Sergio Leunissen, the product manager for OEL, didn’t have further details in his OpenWorld talk beyond it being a beta of limited functionality; and the entire Solaris team seemed completely taken by surprise.

What is in the port?

Leunissen stated that only the kernel components of DTrace are part of the port. It’s unclear whether that means just fbt or includes sdt and the related providers. It sounds certain, though, that it won’t pass the DTrace test suite which is the deciding criterion between a DTrace port and some sort of work in progress.

What is the license?

While I abhor GPL v. CDDL discussions, this is a pretty interesting case. According to the release manager for OEL, some small kernel components and header files will be dual-licensed while the bulk of DTrace — the kernel modules, libraries, and commands — will use the CDDL as they had under (the now defunct) OpenSolaris (and to the consternation of Linux die-hards I’m sure). Oracle already faces an interesting conundrum with their CDDL-licensed files: they can’t take the fixes that others have made to, for example, ZFS without needing to release their own fixes. The DTrace port to Linux is interesting in that Oracle apparently thinks that the CDDL license will make DTrace too toxic for other Linux vendors to touch.

Conclusion

Regardless of how Oracle brings DTrace to Linux, it will be good for DTrace and good for its users — and perhaps best of all for the author of the DTrace book. I’m cautiously optimistic about what this means for the DTrace development community if Oracle does, in fact, release DTrace under the CDDL. While this won’t mean much for the broader Linux community, we in the illumos community will happily accept anything of value Oracle adds. The Solaris lover in me was worried when it appeared that OEL was raiding the Solaris pantry, but if this is Oracle’s model for porting, then I — and the entire illumos community I’m sure — hope that more and more of Solaris is open sourced under the aegis of OEL differentiation.

10/10/2011 follow-up post, Oracle’s port: this is not DTrace.

Posted on October 5, 2011 at 9:51 pm by ahl
In: DTrace

On RAID-6 recovery

RAID algorithms have become a particular fascination of mine, and I recently read a very interesting paper that describes an optimization for RAID reconstruction (by Xiang, Xu, Lui, Chang, Pan, and Li). Before writing double- and triple-parity RAID algorithms for ZFS, I spent a fair bit of time researching the subject and have stayed interested since. Before describing the reconstruction optimization, some preamble is required. RAID algorithms can be divided into two buckets: one-dimensional algorithms, and multi-dimensional algorithms (terms of my own choosing; I haven’t seen this distinction discussed in the literature).

One-dimensional RAID

A one-dimensional algorithm is one in which all data in a single RAID stripe is used to compute all parities. The RAID algorithm used by ZFS falls into this category as do most algorithms derived from Reed-Solomon coding. For a given RAID stripe’s set of data blocks, D, we can compute the nth parity block with some function p(D, n). For example, ZFS, roughly, uses the formula

    \[ p(D,n) = \sum_{i=1}^{\left|{D}\right|} 2^{(i-1)(n-1)} \cdot D_{i} \]

Here, addition and multiplication are defined over a Galois Field – the explanation would be far longer than it would be interesting or relevant so I’ll omit it from this post. It is worth noting that this particular approach only works for three parity disks or fewer, but that too is an entirely different subject (albeit an interesting one). Reconstruction of a missing block in a one-dimensional algorithm requires reading the available data, and performing some computation; each stripe may be reconstructed separately (and thus, in parallel).
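
To make the formula a little more concrete, here is how it expands for the first three parities. This sketch assumes a field of characteristic 2 (such as GF(2^8)), where addition is bitwise XOR, written here as \oplus; the coefficients 2, 4, 16 and so on are field elements, and the products are field multiplications rather than ordinary integer ones:

    \[ p(D,1) = D_1 \oplus D_2 \oplus D_3 \oplus \cdots \]

    \[ p(D,2) = D_1 \oplus 2 \cdot D_2 \oplus 4 \cdot D_3 \oplus \cdots \]

    \[ p(D,3) = D_1 \oplus 4 \cdot D_2 \oplus 16 \cdot D_3 \oplus \cdots \]

The first parity is plain XOR of the data blocks; the second and third differ only in the coefficient applied to each block, 2^{i-1} and 4^{i-1} respectively.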

Multi-dimensional RAID

A multi-dimensional algorithm is one in which parts of multiple logical RAID stripes may contribute to parity calculation. Examples of this include IBM’s EVENODD and NetApp’s slight simplification, Row-Diagonal Parity (RDP). These are most easily conveyed through diagrams:

[Figure: EVENODD]

[Figure: RDP]

With both EVENODD and RDP, the first parity block is simply the XOR of the data blocks in that RAID stripe. The second parity block is calculated by a simple XOR of data values across RAID stripes more or less diagonally. Both of these techniques place constraints on the width of a RAID stripe.
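
The row parity described above is simple enough to sketch in a few lines of bash. Everything here is illustrative: xor_all is a made-up helper, and the four one-byte "blocks" are arbitrary values rather than anything from EVENODD or RDP themselves. The diagonal parity uses the same XOR operation, just over a different set of blocks.

```shell
#!/bin/bash
# xor_all V1 V2 ...: print the XOR of all integer arguments.
# (Hypothetical helper, for illustration only.)
xor_all() {
    local acc=0 v
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
}

# One row (stripe) of four made-up data blocks, one byte each.
row=(0x3c 0xa5 0x0f 0x71)

# First (row) parity: XOR of every data block in the stripe.
p=$(xor_all "${row[@]}")

# Pretend the block at index 2 is lost: XOR the survivors with the parity.
recovered=$(xor_all "${row[0]}" "${row[1]}" "${row[3]}" "$p")

printf 'parity=0x%02x recovered=0x%02x\n' "$p" "$recovered"
# prints: parity=0xe7 recovered=0x0f
```

Because XOR is its own inverse, any single missing block in a row can be rebuilt from the remaining blocks plus the parity, which is what the "row" reconstruction relies on.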

Optimizing RAID reconstruction for fewer reads

The paper, A Hybrid Approach of Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation, describes an optimization for reconstruction under multi-dimensional RAID algorithms. The key insight is that with parity calculations that effectively overlap, a clever reconstruction algorithm can use fewer blocks, thus incurring fewer disk reads. As described in the paper, normally when a given disk fails, all remaining data blocks and blocks from the first parity disk are used for reconstruction:

[Figure: simple recovery]

It is, however, possible to read fewer total blocks by taking advantage of the fact that certain blocks can be multiply used. In the reconstruction below, blocks with circles are used for “row” reconstruction, and blocks with squares are used for “diagonal” reconstruction.

[Figure: optimized recovery]

Note that the simple approach requires reading 36 blocks (none from disk 7) whereas the reconstruction described in the paper requires reading only 27 blocks. This applies generally: the new approach requires 25% fewer blocks to perform the same reconstruction. And the paper includes a method of balancing the reads between disks.
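
As a quick check of the figures above, the saving in this example works out as:

    \[ \frac{36 - 27}{36} = \frac{9}{36} = 25\% \]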

Disappointing results

Unfortunately, optimizing for fewer reads didn’t translate to significant performance improvements in the overall recovery. For RDP it was about 12% better in the best case, but typically closer to 7%. For EVENODD the improvement was less than 5% in all cases. Why? The naïve recovery algorithm streams data off the healthy hard drives, performs a simple computation, and streams good data onto the replacement drive. Streaming is what drives do best – 3.5” or 2.5”; 7200, 10K, or 15K RPM; SATA, SAS, or FC: they all stream pretty well. There may be some contention for I/O resources, but either that contention isn’t severe or the “skips” in the I/O patterns interrupt the normal streaming efficiency.

Applicability in flash

There’s another medium, however, that has throughput and IOPS to spare where this RAID reconstruction approach could be highly effective: flash. With SSDs, it’s possible to see throughput that strains the limits of I/O systems; reading less data could be a significant improvement, and the non-contiguous read patterns wouldn’t degrade performance as they do with HDDs. For all-flash arrays, this sort of optimization may be one of many in its class; with a surplus of IOPS, compute, and memory, the RAID algorithms designed for slow disks, slow CPUs, and a dearth of DRAM, may need to be scrapped and rethought.

Posted on September 21, 2011 at 10:33 pm by ahl
In: Software

Flash news I wish I could read

For a short while, I ran the flash memory strategy at Sun and then Oracle, so I still keep my ear to the ground regarding flash news. That news is often frustratingly light — journalists in the space who are fully capable of providing analysis end up brushing the surface. With a tip of the hat to the FJM crew, here’s my commentary on a recent article.
NetApp has Hybrid Aggregate drives coming, with data moved automatically in real time between flash located next to the spinning disks. The company now says that this is a better technology than PCIe flash approaches.
Sounds interesting. NetApp had previously stacked its chips on a PCIe approach for flash called the performance acceleration module (PAM); I read about it in the same publication. This apparent change of strategy is significant, and I wish that the article would have explored the issue, but it was never mentioned.
NetApp, presenting at an Analyst Day event in New York on 30 June, said that having networked storage move as it were into the host server environment was disadvantageous. This was according to Stifel Nicolaus analyst Aaron Rakers.
1. So is this a quote from NetApp or a quote from an analyst or a quote from NetApp quoting an analyst? I’m confused.
2. This is a dense and interesting statement so allow me to unpack it. Moving storage to the host server is code for Fusion-io. These guys make a flash-laden PCIe card that you put in your compute node for super-fast local data access, and they connect a bunch of them together with an IB backplane to share the contents of different cards between hosts. They recently went public, and customers love the performance they offer over traditional SANs. I assume the term “disadvantageous” was left intentionally vague as those being disadvantaged may be NTAP shareholders rather than customers implementing such a solution.
Manish Goel, NetApp’s product ops EVP, said SSDs used as hard disk drive replacements were not as interesting as using flash at the disk layer in a Hybrid Aggregate drive approach – and this was coming.
An Aggregate is the term NetApp uses for a collection of drives. A Hybrid Aggregate — presumably — is some new thing that mixes HDDs and SSDs. Maybe it’s like Sun’s hybrid storage pool. I would have liked to see Manish Goel’s statement vetted or explained, but that’s all we get.
Flash Cache in the controller is a straightforward array read I/O accelerator. PCIe flash in host servers is a complementary technology but will not decentralise the storage market and move networked storage back into the host servers.
Is this still the NetApp announcement or is this back to the journalism? It’s a new paragraph so I guess it’s the latter. Fusion-io will be happy to learn that it only took a couple of lines to be upgraded from “disadvantageous” to “complementary”. And you may be interested to know why NetApp says that host-based flash is complementary. There’s a vendor out there working with NetApp on a host-based flash PCIe card that NetApp will treat as part of its caching tier, pushing data to the card for fast access by the host. I’d need to dig up my notes from the many vendor roadmaps I saw to recall who is building this, but in the context of a public blog post it’s probably better that I don’t.
NetApp has a patent in this Hybrid Aggregate disk drive area called “Mechanisms for moving data in a Hybrid Aggregate”.
I won’t bore you by reposting the excerpt from the patent, but the broad language of the patent does call to mind the many recently invalidated NetApp patents…
Surely this is what we all understand as auto-placement of data in a virtual storage pool comprising SSD and fast disk tiers, such as Compellent’s block-level Data Progression? Not so, according to a person close to the situation: “It’s much more automatic, real-time and granular. Compellent needs policies and is not real-time. [NetApp] will be automatic and always move data real-time, rather than retroactively.”
What could have followed this — but didn’t — was a response from a representative from Compellent or someone familiar with their technology. Compellent, EMC, Oracle, and others all have strategies that involve mixing flash memory with conventional hard drives. It’s the rare article that discusses those types of connections. Oracle’s ZFS products use flash as a caching tier, automatically populating it with useful data. Compellent has a clever technique of moving data between storage tiers seamlessly — and customers seem to love it. EMC just hucks a bunch of SSDs into an array — and customers seem to grin and bear it. NetApp’s approach? It’s hard to decipher what it would mean to “move data in real-time, rather than retroactively.” Does that mean that data is moved when it’s written and then never moved again? That doesn’t sound better. My guess is that NetApp’s approach is very much like Compellent’s — something they should be touting rather than parrying. And I’d love to read that article.

Posted on July 1, 2011 at 6:46 pm by ahl
In: Flash

Engineers and customers

Today I took the train out to Long Island to meet up with our New York sales team for a visit with a prospective customer. You never know with an initial meeting, but this one was great. I thought I’d share a bit about what made these guys so excited which is the same stuff that gets me excited about what we’re doing at Delphix.

First though, there are some engineers who have never spoken with a customer. There are some engineering organizations in which requirements are collected from customers, correlated by product managers, handed to engineering managers, and given to engineers. It’s a fine workflow, but this needs to be balanced against engineers engaging directly with customers, hearing their issues, and brainstorming solutions technologist to technologist. Engineers talking to a small number of customers may miss broad trends or fail to connect certain dots, but it’s a complementary activity and part of being a holistic engineer.

I’ve heard software engineers groan that the right technical decision was trumped by business concerns. Those people might be good engineers, but they aren’t great ones. Engineering can’t stop at the boundaries of software; it must necessarily consider the whole ecosystem of the product and the company. Yes, we might not have architected the feature this way if we didn’t have legacy customers to support, but we do (and we should be happy for it). (And, of course, this logic can be taken to the other extreme with equally bad results.) This doesn’t mean that a great engineer collects all data first hand, but the whole system must be considered, and walking into a customer’s office from time to time is a reality check.

In today’s meeting, the customer was learning about Delphix for the first time. And they got it right away. As with many enterprises, they have an initiative around virtualization to enable more self-service and more empowerment of their developers. The data in their relational databases is a big anchor weighing down those efforts; the time and effort required to copy and provision databases is a huge drag. Smart guys, they oscillated between how Delphix works — a super-smart, database-optimized storage gateway — and what Delphix does — virtualizing their Oracle databases, bringing the agility and cost-savings of other virtualization technologies. And the slide-ware made real through a demo of the product GUI elicited a terse expression of comprehension: “That’s cool.”

And maybe the best reason for engineers to get into the field is to witness customers who get how cool the product is.

Posted on June 3, 2011 at 2:03 pm by ahl
In: Delphix, Software