More from the storage anarchist

In my last blog post I responded to Barry Burke, author of the Storage Anarchist blog. I was under the perhaps naive impression that Barry was an independent voice in the blogosphere. In fact, he’s merely the Storage Anarchist by night; by day he’s the mild-mannered chief strategy officer for EMC’s Symmetrix Products Group — a fact notable for its absence from Barry’s blog. In my post, I observed that Barry had apparently picked his horse in the flash race, and Chris Caldwell commented that “it would appear that not only has he chosen his horse, but that he’s planted squarely on its back wearing an EMC jersey.” Indeed.

While looking for some mention of his employment with EMC, I found this petard from Barry Burke, chief strategy officer for EMC’s Symmetrix Products Group:

And [the "enterprise" differentiation] does matter – recall this video of a Fishworks JBOD suffering a 100x impact on response times just because the guy yells at a drive. You wouldn’t expect that to happen with an enterprise class disk drive, and with enterprise-class drives in an enterprise-class array, it won’t.

Barry, we wondered the same thing, so we got some time on what you’d consider an enterprise-class disk drive in an enterprise-class array from an enterprise-class vendor. The results were nearly identical (of course, measuring latency on other enterprise-class solutions isn’t nearly as easy). It turns out drives don’t like being shouted at (it’s shock, not the rotational vibration that drives traditionally compensate for). That enterprise-class rig was not an EMC Symmetrix, though I’d salivate over the opportunity to shout at one.

Posted on March 2, 2009 at 8:16 am by ahl · Permalink
In: Fishworks

6 Responses

Subscribe to comments via RSS

  1. Written by TimC
    on March 2, 2009 at 9:41 am

    Adam: How exactly did you measure latency on the enterprise-class array? Just wondering, in case someone independent perhaps has a Symm they’d be willing to test with ;)
    As you said, doing it without Fishworks tends to be a bit trickier.

  2. Written by Bryan Cantrill
    on March 2, 2009 at 10:42 am

    TimC: we gathered statistics that were proprietary to the unit. While nowhere near as detailed (nor as visual nor as powerful) as analytics, they did at least show average latency on a per-second basis — and we were able to confirm that the shouting-induced latency wasn’t unique to our disks or their speed or their protocol. (As an aside: gathering even this crude amount of data was such an excruciating experience that we viscerally felt why customers have such a strong positive reaction when they first see analytics.) My recommendation for measuring this with other machines: create a LUN that has a single disk associated with it, export the LUN to an OS that has rich analysis tools (e.g. DTrace on Solaris), get the disk loaded up, and then shout at it. (And if you wish to challenge Brendan’s celebrity in this domain, video yourself doing same…)
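    Something like the following would be a starting point — an untested sketch, not the tool we used, assuming a Solaris box with the stock io provider and root privileges (the aggregation name and output format are purely illustrative):

    ```d
    #!/usr/sbin/dtrace -s
    /*
     * Untested sketch: print the average disk I/O latency once per
     * second while you load up (and shout at) the exported LUN.
     * Assumes the Solaris io provider.
     */
    #pragma D option quiet

    io:::start
    {
            /* arg0 is the buf pointer; use it to pair start with done */
            ts[arg0] = timestamp;
    }

    io:::done
    /ts[arg0]/
    {
            @lat["avg I/O latency (ns)"] = avg(timestamp - ts[arg0]);
            ts[arg0] = 0;
    }

    profile:::tick-1sec
    {
            printa(@lat);
            trunc(@lat);
    }
    ```

    A latency spike in that per-second average while someone is yelling at the drives is the crude, host-side version of what the Analytics heat map shows directly.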

  3. Written by mikee
    on March 2, 2009 at 11:00 am

    I’ll bite… Bryan, does it make sense for you to put together a "shoutAnalyzer.d" script that somewhat follows what you did for the exercise you just described?
    Customers could run this on a variety of devices and perhaps share their experiences. (Since they’d all go against the same source base, the results might be, at least on the surface, comparable.)
    ((Although I think caches in some of the larger arrays are going to mask what’s going on from the host perspective, unless driven to exhaustion… Sadly, as you mentioned, getting this level of detail directly out of a proprietary frame is going to be a challenge at best…))
    – MikeE

  4. Written by Brendan Gregg
    on March 3, 2009 at 9:33 pm

    Tim, to add to what Bryan said: you want the disks to have a heavy write workload (I’ve been told they are more sensitive when writing data). My disks were also about 80% busy when I shouted at them – which is very busy. If you find a statistic for average disk I/O time, consider what happens if you vibrate (or, as we’ve learned, shock) only a few out of many disks: you’ll still see a change, it just won’t be as obvious as it could be, since it has been averaged out. In fact, this is exactly why we implemented disk I/O latency in Analytics as a heat map (as the video showed) – so that data wasn’t lost by averaging.

  5. Written by Paul Murphy
    on March 10, 2009 at 4:15 pm

    One of my friends works in a place with a couple of T2000s in one room, a fairly thick door, and an EMC/HP setup running Informix under HP-UX 11i in the next (heavily air-conditioned!) room.
    After watching the shouting demo I asked him to repeatedly run and time a script I wrote many years ago that does a lot of serial I/O from Informix to disk (now 1E8+ records) – it sets up the source files from which the data warehouse table inversion gets done in Perl, with the result then queued in another serial file for transmission.
    If he stands there rapidly opening and closing that door, the script takes a few seconds longer to complete than if he stays out and thus keeps the door closed.
    Not exactly a proper test, and not a highly precise measure, but suggestive.

  6. Written by Adam Leventhal
    on March 10, 2009 at 5:01 pm

    @Paul: Pretty neat. I’m still surprised that we haven’t seen a video, given the mountain of anecdotal evidence I have for EMC customers screaming at their systems…
