DTrace, node.js and the Robinson Projection

When I joined Joyent, I mentioned that I was seeking to apply DTrace to the cloud, and that I was particularly excited about the development of node.js — leaving it implicit that the intersection of the two technologies would be naturally interesting. As it turns out, we have had an early opportunity to show the potential here: as you might have seen, the Node Knockout programming contest was held over the weekend. When I first joined Joyent (but four weeks ago!), Ryan was very interested in potentially using DTrace to provide a leaderboard for the competition, so I got to work adding USDT probes to node.js. To be fair, this still has some disabled overhead (namely, getting into and out of the node addon that has the true USDT probe), but it’s sufficiently modest to deploy DTrace-enabled nodes in production.

And thanks to incredibly strong work by Joyent engineers, we were able to make available a new node.js service that allocated a container per user. This service allowed us to make available a DTrace-enabled node to contestants — and then observe all of that from the global zone.

As an example of the DTrace provider for node.js, here’s a simple enabling to print out HTTP requests as zones handle them (running on one of the Node Knockout machines):

# dtrace -n 'node*:::http-server-request{printf("%s: %s of %s\n", \
    zonename, args[0]->method, args[0]->url)}' -q
nodelay: GET of /poll6759.479651377309
nodelay: GET of /poll6148.392275444794
nodebodies: GET of /latest/
nodebodies: GET of /latest/
nodebodies: GET of /count/
nodebodies: GET of /count/
nodelay: GET of /poll8973.863890386003
nodelay: GET of /poll2097.9667574643568
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: POST of /graphs/4c7a650eba12e9c41d000005/appendValue
awesometown: GET of /graphs/4c7acd5ca121636840000002.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7b2408546a64b81f000001.js
awesometown: POST of /faye
awesometown: POST of /faye

I added probes around both HTTP request and HTTP response; treating the file descriptor as a token that uniquely identifies a request while it is pending (an assumption that would be invalid only in the presence of HTTP pipelining) allows one to determine the latency of each request:

# cat http.d
#pragma D option quiet

node*:::http-server-request
{
        ts[this->fd = args[1]->fd] = timestamp;
        vts[this->fd] = vtimestamp;
}

node*:::http-server-response
/this->ts = ts[this->fd = args[0]->fd]/
{
        @t[zonename] = quantize(timestamp - this->ts);
        @v[zonename] = quantize(vtimestamp - vts[this->fd]);
        ts[this->fd] = 0;
        vts[this->fd] = 0;
}

tick-1sec
{
        printf("Wall time:\n");
        printa(@t);

        printf("CPU time:\n");
        printa(@v);
}
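The bookkeeping at the heart of the script is easy to picture in JavaScript. Below is an illustrative sketch of the fd-as-token idea, not code from the provider (the function names are mine):

```javascript
// Sketch: the file descriptor uniquely identifies a pending request, so it
// can key a map from request arrival to response completion.
const pending = new Map();

function httpServerRequest(fd) {
  pending.set(fd, process.hrtime.bigint());       // arrival time, in ns
}

function httpServerResponse(fd) {
  const start = pending.get(fd);
  if (start === undefined) {
    return null;                                  // no request seen for this fd
  }
  pending.delete(fd);                             // the fd may now be reused
  return Number(process.hrtime.bigint() - start); // request latency, in ns
}
```

As with the D script, this assumes at most one outstanding request per file descriptor, which is exactly the assumption that HTTP pipelining would violate.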

This script makes the distinction between wall time and CPU time; in the wall time, you can see the effect of long polling (the values are in nanoseconds):

           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |                                         4
          131072 |@@@@@                                    52
          262144 |@@@@@@@@@@@@@@@@@@                       183
          524288 |@@@@@                                    55
         1048576 |@@@                                      27
         2097152 |@                                        9
         4194304 |                                         5
         8388608 |@                                        8
        16777216 |@                                        6
        33554432 |@                                        9
        67108864 |@                                        7
       134217728 |@                                        12
       268435456 |@                                        11
       536870912 |                                         1
      1073741824 |                                         4
      2147483648 |                                         1
      4294967296 |                                         5
      8589934592 |                                         0
     17179869184 |                                         1
     34359738368 |                                         1
     68719476736 |                                         0
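The histogram above is the output of DTrace’s quantize() aggregating action, which buckets values by power of two; each row counts the values that fell into that bucket. A rough JavaScript illustration of the bucketing (not DTrace’s implementation, which also handles zero and negative values):

```javascript
// Count values into power-of-two buckets: a value lands in the bucket named
// by the largest power of two at or below it.
function quantize(values) {
  const buckets = new Map();
  for (const v of values) {
    const key = v < 1 ? 0 : 2 ** Math.floor(Math.log2(v));
    buckets.set(key, (buckets.get(key) || 0) + 1);
  }
  return buckets;
}
```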

You can also look at the CPU time to see which zones are doing more actual work. For example, here is one zone with interesting CPU time outliers:

           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@                             57
        16777216 |@@@@                                     21
        33554432 |@@@@                                     18
        67108864 |@@@@@@@                                  34
       134217728 |@@@@@@@@@@@                              54
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |                                         0
      2147483648 |                                         0
      4294967296 |@                                        3
      8589934592 |@                                        4
     17179869184 |                                         0

Note that because node has a single thread doing all processing, we cannot assume that the requests themselves are inducing the work — only that CPU work was done between request and response. Still, this data would probably be interesting to the nodebodies team…

I also added probes around connection establishment; so here’s a simple way of looking at new connections by zone:

# dtrace -n 'node*:::net-server-connection{@[zonename] = count()}'
dtrace: description 'node*:::net-server-connection' matched 44 probes

  explorer-sox                                                      1
  nodebodies                                                        1
  anansi                                                           69
  nodelay                                                         102
  awesometown                                                     146

Or if we wanted to see which IP addresses were connecting to, say, our good friends at awesometown (with actual addresses
in the output elided):

# dtrace -n 'node*:::net-server-connection \
    /zonename == "awesometown"/{@[args[0]->remoteAddress] = count()}'
dtrace: description 'node*:::net-server-connection' matched 44 probes
  XXX.XXX.XXX.XXX                                                   1
  XX.XXX.XX.XXX                                                     1
  XX.XXX.XXX.XXX                                                    1
  XX.XXX.XXX.XX                                                     1
  XXX.XXX.XX.XXX                                                    1
  XXX.XXX.XX.XX                                                     2
  XXX.XXX.XXX.XX                                                    8

Ryan saw the DTrace support I had added, and had a great idea: what if we took the IPs of incoming connections and geolocated them, throwing them on a world map and coloring them by team name? This was an idea that was just too exciting not to take a swing at, so we got to work. For the backend, the machinery was begging to be written in node itself, so I did a libdtrace addon for node and started building a scalable backend for processing the DTrace data from the different Node Knockout machines. Meanwhile, Joni came up with some mockups that had everyone drooling, and Mark contacted Brian from Nitobi about working on the front-end. Brian and crew were as excited about it as we were, and they put front-end engineer extraordinaire Yohei on the case — who worked with Rob on the Joyent side to pull it all together. Among Rob’s other feats, he managed to implement in JavaScript the logic for plotting longitude and latitude in the beautiful Robinson projection — a brutally complicated transformation. It was an incredible team, and we were pulling it off in such a short period of time and with such a firm deadline that we often felt like contestants ourselves!
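Rob’s front-end code isn’t reproduced here, but the shape of the problem can be sketched. The Robinson projection is defined not by a closed-form equation but by a table of coefficients at every five degrees of latitude, with interpolation in between; the sketch below uses the published coefficient tables with simple linear interpolation (the canonical treatment interpolates more smoothly) and yields coordinates for a unit-radius globe:

```javascript
// Published Robinson coefficients: relative parallel length (X) and relative
// distance from the equator (Y), tabulated at 0, 5, 10, ..., 90 degrees.
const X = [1.0000, 0.9986, 0.9954, 0.9900, 0.9822, 0.9730, 0.9600,
           0.9427, 0.9216, 0.8962, 0.8679, 0.8350, 0.7986, 0.7597,
           0.7186, 0.6732, 0.6213, 0.5722, 0.5322];
const Y = [0.0000, 0.0620, 0.1240, 0.1860, 0.2480, 0.3100, 0.3720,
           0.4340, 0.4958, 0.5571, 0.6176, 0.6769, 0.7346, 0.7903,
           0.8435, 0.8936, 0.9394, 0.9761, 1.0000];

function robinson(latDeg, lonDeg) {
  const abslat = Math.min(Math.abs(latDeg), 90);
  const i = Math.min(Math.floor(abslat / 5), 17); // table row at or below lat
  const t = abslat / 5 - i;                       // fraction toward next row
  const xc = X[i] + (X[i + 1] - X[i]) * t;        // interpolated coefficients
  const yc = Y[i] + (Y[i + 1] - Y[i]) * t;
  return {
    x: 0.8487 * xc * (lonDeg * Math.PI / 180),    // scaled by longitude
    y: 1.3523 * yc * Math.sign(latDeg),           // signed by hemisphere
  };
}
```

Scaling x and y to the pixel dimensions of the map image is then all that remains to plot a geolocated IP.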

The result — which it must be said works best in Safari and Chrome — is at http://leaderboard.no.de. In keeping with the spirit of both node and DTrace, the leaderboard is updated in real-time; from the time you connect to one of the Joyent-hosted (no.de) contestants, you should see yourself show up on the map in no more than 700 milliseconds (plus your network latency). For crowded areas like the Bay Area, it can be hard to see yourself — but try moving to Cameroon for best results. It’s fun to watch as certain contestants go viral (try both hovering over a particular data point and clicking on the team name in the leaderboard) — and you can know which continent you’re cursing at in http://saber-tooth-moose-lion.no.de (now known to the world as Swarmation).

Enjoy both the leaderboard and the terrific Node Knockout entries (be sure to vote for your favorites!) — and know that we’ve only scratched the surface of what DTrace and node.js can do together!

Posted on August 30, 2010 at 2:55 am by bmc · Permalink · 4 Comments
In: Uncategorized

The liberation of OpenSolaris

As many have seen, Oracle has elected to stop contributing to OpenSolaris. This decision is, to put it bluntly, stupid. Indeed, I would (and did) liken it to L. Paul Bremer‘s decision to disband the Iraqi military after the fall of Saddam Hussein: beyond merely a foolish decision borne out of a distorted worldview, it has created combatants unnecessarily. As with Bremer’s infamous decision, the bitter irony is that the new combatants were formerly the strongest potential allies — and in Oracle’s case, it is the community itself.

As it apparently needs to be said, one cannot close an open source project — one can only fork it. So contrary to some reports, Oracle has not decided to close OpenSolaris; they have actually decided to fork it. That is, they have (apparently) decided that it is more in their interest to compete with the community than to cooperate with it — that they can in fact out-innovate the community. This confidence is surprising (and ironic) given that it comes exactly at the moment that the historic monopoly on Solaris talent has been indisputably and irrevocably broken — as most recently demonstrated by the departure of my former colleague, Adam Leventhal.

Adam’s case is instructive: Adam is a brilliantly creative engineer — one with whom it was my pleasure to work closely over nearly a decade. Time and time again, I saw Adam not only come up with innovative solutions to tough problems, but run those innovations through the punishing gauntlet that separates idea from product. One does not replace an engineer like Adam; one can only hope to grow another. And given his nine years of experience at the company and in the guts of the system, one cannot expect to grow a replacement quickly — if at all. Oracle’s loss, however, is the community’s gain; I hope I’m not tipping his hand too much to say that Adam will continue to be deeply engaged in the system, leading a new generation of engineers — but this time within a larger community that spans multiple companies and interests.

And in this way, odd as it may be, Oracle’s decision to fork is actually a relief to those of us whose businesses depend on OpenSolaris: instead of waiting for Oracle to engage the community, we can be secure in the knowledge that no engagement is forthcoming — and we can invest and plan accordingly. So instead of waiting for Oracle to fix a nagging driver bug or address a critical request for enhancement (a wait that has more often than not ended in disappointment anyway), we can tap our collective expertise as a community. And where that expertise doesn’t exist or is otherwise unavailable, those of us who are invested in the system can explicitly invest in building it — and then use it to give back to the community and contribute.

Speaking for Joyent, all of this has been tangibly liberating: just the knowledge that we are going to be cranking our own builds has allowed us to start thinking along new dimensions of innovation, giving us a renewed sense of control over our stack and our fate. I have already seen this shift in our engineers, who have begun to conceive of ideas that might not have been thought practical in a world in which Oracle’s engagement was so uncertain. Yes, hard problems lie ahead — but ideas are flowing, and the future feels alive with possibility; in short, innovation is afoot!

Posted on August 19, 2010 at 9:49 pm by bmc · Permalink · 12 Comments
In: Uncategorized

The node.js demographic

I went to the node.js meetup last night in Palo Alto, and it was an interesting affair on several levels. First (and least surprisingly), it was packed, with the Sencha folks joking that they would need to move to a bigger space just to be able to host the event. Second, the technical content itself was intriguing, with fellow Joyeur (and node BDFL) Ryan on dealing with flow control in node, Jed on (fab), future fellow Joyeur Isaac on npm, and Tim demo’ing some Connect-based apps, including a simple web-based shared world app in which the room could (and did) participate. Not surprisingly, the performance of this last demo was snappy under load — so much so that it merits repeating an observation that many are currently making: it is increasingly clear that an early space — if not the first — in which we are going to see broad deployment of node-based apps is online social gaming, a space in which node represents a decisive competitive advantage by offering the potential for much more interactive (and more social) gameplay, and one in which there is substantial code churn to begin with. (And of course, speaking from Joyent’s perspective, this is a fortunate confluence: online gaming is also a space that sorely needs the elasticity that the cloud alone can provide.)

So the attendance and content were certainly notable, but most interesting of all to me was the demographic: given that node has become something of the latest hotness (and especially given that it being in JavaScript gives it a pretty wide net), one might expect node’s enthusiasts to be amateurs or novices. That this was emphatically not the case was clear to me shortly after arriving, when I had the unexpected pleasure of reuniting with fellow CS169 head TA Peter Griess. Not to be overly chummy or clubby, but walking into a meetup and seeing one of the tribe tells you something immediate about not just the room, but the technology itself: that it is not mere syntactic sugar or iconoclasm for its own sake, but rather a true revolution in the way certain classes of systems are designed and built. And indeed, over the course of the evening, it became clear that within the room there was an impressive amount of actual experience deploying real systems, with seasoned technologists like Matt Ranney who aren’t merely writing new apps in node, they are rewriting old apps in node. This is a key point, and it goes to the fact that node is not just an easier way of doing things (though that too, certainly) but rather that it offers such a vastly improved runtime that it merits reevaluation of systems that one has already built and deployed.

To me, the systems experience in the room offered an implicit rebuttal to some of the inane criticism of node — criticism that essentially amounts to discrediting node merely because of its newness or its popularity. (And even more enlightened criticism ultimately disappoints with what essentially amounts to an attack on the basis of style, not substance.) To be sure, node is still a young technology, and there is much engineering work still to be done. (For a concrete example of this, see Paul‘s description of the SSL problem.) But with so much deep systems experience in the community — and with the healthy, collaborative vibe that was on display last night — it’s hard to be anything but optimistic!

Posted on August 11, 2010 at 10:16 am by bmc · Permalink · 3 Comments
In: Uncategorized

OpenSolaris and the power to fork

Back when Solaris was initially open sourced, there was a conscious effort to be mindful of the experiences of other projects. In particular — even though it was somewhat of a paradox — it was understood how important it was for the community to have the power to fork the operating system. As I wrote in January, 2005:

If there’s one thing we’ve learned from watching Linux, it’s to not become forkophobic. Paradoxically, in an environment where forks are actively encouraged (e.g. Linux) forking seems to be less of a problem than in environments where forking is viewed as apostasy (e.g. BSD).

Unfortunately — and now in hindsight — we know that OpenSolaris didn’t go far enough: even though the right to fork was understood, there was not enough attention paid to the power to fork. As a result, the operating system never quite got to being 100% open: there remained some annoying (but essential) little bits that could not be opened for one historical (i.e., legal) reason or another. When coupled with the fact that Sun historically had a monopoly or near-monopoly on Solaris engineering talent, the community was entirely deprived of the oxygen that it would have needed to exercise its right to fork.

But change is afoot: over the last six months, the monopoly over Solaris engineering talent has been broken. And now today, we as a community have turned an important corner with the announcement of the Illumos project. Thanks to the hard work of Garrett D’Amore and his band of co-conspirators, we have the beginning of open sourced variants of those final bits that will allow for not just the right but the power to fork. Not that anyone wants to set out to fork the system, of course, but that power is absolutely essential for the vitality of any open source community — and so will be for ours. Kudos to Garrett and crew; on behalf of all of us in the community, thank you!

Posted on August 3, 2010 at 10:34 am by bmc · Permalink · 8 Comments
In: Uncategorized

Hello Joyent!

As I mentioned in my farewell to Sun, I am excited by the future; as you may have seen, that future is joining Joyent as their VP of Engineering.

So, why Joyent? I have known Joyeurs like Jason, Dave, Mark and Ben since back when the “cloud” was still just something that you drew up on a whiteboard as a placeholder for the in-between crap that someone else was going to build and operate. But what Joyent was doing was very much what we now call cloud computing — it was just that in describing Joyent in those pre-cloud days, I found it difficult to convey exactly why what they were doing was exciting (even though to me it clearly was). I found that my conversations with others about Joyent always ended up in the ditch of “virtual hosting”, a label that grossly diminished the level of innovation that I saw in Joyent. Fortunately for my ability to explain the company, “cloud” became infused with much deeper meaning — one that matched Joyent’s vision for itself.

So Joyent was cloud before there was cloud, but so what? When I started to consider what was next for me, one of the problems that I kept coming back to was DTrace for the cloud. What does dynamic instrumentation look like in the cloud? How do you make data aggregation and retention scale across many nodes? How do you support the ad hoc capabilities that make DTrace so powerful? And how do you visualize the data in a way that allows those ad hoc queries to be visually phrased? To me, these are very interesting questions — but looking around the industry, it didn’t seem that too many of the cloud providers were really interested in tackling these problems. However, in a conversation at my younger son’s third birthday party with Joyeur (and friend) Rod Boothby, it became clear that Joyent very much shared my enthusiasm for this problem — and more importantly, that they had made the right architectural decisions to allow for solving it.

My conversation with Rod kicked off more conversations, and I quickly learned that this was not the Joyent that I had known — that the company was going through a very important inflection point whereby they sought a leadership position in innovating in the cloud. To match this lofty rhetoric, the company has a very important proof point: the hiring of Ryan Dahl, inventor and author of node.js.

Before getting into the details of node.js, one should know that I am a JavaScript lover. (If you didn’t already know this about me, you might be somewhat surprised by this — and indeed, there was a time when such a confession would have to be whispered, if it could be said at all — but times have changed, and I’m loud and proud!) My affection for the language came about over a number of years, and crescendoed at Fishworks when I realized that I needed to rewrite our CLI in JavaScript. And while I’m not sure if I’m the only person or even the first to write JavaScript that was designed to be executed over a 9600 baud TTY, it sure as hell felt like I was a pioneer in some perverse way…

Given my history, I clearly have a natural predisposition towards server-side JavaScript — but node.js is much more than that: its event-driven model coupled with the implicit single-threadedness of JavaScript constrains the programmer into a model that allows for highly scalable control logic, but only with sequential code. (For more on this, see Ryan’s recent Google tech talk — though I have no idea what was meant when Ryan was introduced as “controversial”.) This idea — that one can (and should!) build a concurrent system out of sequential components — is one that Jeff and I discussed in our ACM Queue article on real-world concurrency:

To make this concrete, in a typical MVC (model-view-controller) application, the view (typically implemented in environments such as JavaScript, PHP, or Flash) and the controller (typically implemented in environments such as J2EE or Ruby on Rails) can consist purely of sequential logic and still achieve high levels of concurrency, provided that the model (typically implemented in terms of a database) allows for parallelism. Given that most don’t write their own databases (and virtually no one writes their own operating systems), it is possible to build (and indeed, many have built) highly concurrent, highly scalable MVC systems without explicitly creating a single thread or acquiring a single lock; it is concurrency by architecture instead of by implementation.
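The same point can be made concrete with a toy node sketch: each handler below is straight-line sequential code, yet multiple requests are in flight at once because the event loop interleaves their completions. (Here setImmediate stands in for asynchronous I/O.)

```javascript
// Two "requests" handled by purely sequential code; the event loop
// interleaves them, so both are in flight before either completes.
const order = [];

function handle(id) {
  order.push(`start ${id}`);     // straight-line handler logic begins...
  setImmediate(() => {           // ...yields to the event loop for "I/O"...
    order.push(`done ${id}`);    // ...and completes on a later turn
  });
}

handle(1);
handle(2);
setImmediate(() => {
  console.log(order.join(', ')); // prints: start 1, start 2, done 1, done 2
});
```

No threads are created and no locks are acquired, yet both requests overlap: concurrency by architecture instead of by implementation.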

But Ryan says all that much more concisely at 21:40 in the talk: “there’s this great thing in Unix called ‘processes.’” Amen! So node.js to me represents a confluence of many important ideas — and it’s clean, well-implemented, and just plain fun to work with.

While I am excited about node.js, it’s more than just a great idea that’s well executed — it also represents Joyent’s vision for itself as an innovator up and down the stack. One can view node.js as being to Joyent what Java was to Sun: transforming the company from one confined to a certain layer into a true systems company that innovates up and down the stack. Heady enough, but if anything this analogy understates the case: Joyent’s development of node.js is not merely an outgrowth of an innovative culture, but also a reflection of a singular focus to deliver on the economies of scale that are the great promise of cloud computing.

Add it all up — the history in the cloud space, the disposition to solving tough cloud problems that I want to solve like instrumentation and observability, and the exciting development of node.js — and you have a company in Joyent that I believe could be the next great systems company and I’m deeply honored (and incredibly excited) to be a part of it!

Posted on July 30, 2010 at 1:46 pm by bmc · Permalink · 9 Comments
In: Uncategorized

Good-bye, Sun

In February 1996, I came out to Sun Microsystems to interview for a job knowing only two things: that I wanted to do operating systems kernel development — and that I didn’t particularly want to work for Sun. I was right on the first count, but knew I was wrong on the second just moments into my first conversation with Jeff. He was emphatic that I should join him in forging the future, sharing both my enthusiasm for what was possible and my disdain for the broken, busted and boogered-up. Fourteen years later, I don’t for a moment regret my decision to join Jeff and Sun: we fostered an environment where the OS was viewed not as a regrettable drag on progress, but rather as a nexus of innovation — incubating technologies that today make a real difference in people’s lives.

In 2006, itching to try something new, Mike and I talked the company into taking the risk of allowing several of us to start Fishworks. That Sun supported our endeavor so enthusiastically was the company at its finest: empowering engineers to tackle hard problems, and inspiring them to bring innovative solutions to market. And with the budding success of the 7000 Series, I would like to believe that we made good on the company’s faith in us — and more generally on its belief in innovation as differentiator.

Now the time has come for me to venture again into something new — but this time it is to be beyond the company’s walls. This is obviously with mixed emotion; while I am excited about the future, it is very difficult for me personally to leave a company in which I have had such close relationships with so many. One of Sun’s greatest strengths was that we technologists were never discouraged from interacting directly and candidly with our customers and users, and many of our most important innovations came from these relationships. This symbiosis was critically important at several junctures of my own career, and I owe many of you a profound debt of gratitude — both for your counsel over the years, and for your willingness to bet your own business and livelihood on the technologies that I helped develop. You, like us, are innovators who love nothing more than great technology, and your steadfast faith in us means more to me than I can express; thank you.

As for my virtual address, it too is changing. This post will be my last at blogs.sun.com; in the future, you can find my blog at its new (permanent) home: http://dtrace.org/blogs/bmc (where comments on this entry will be open). As for e-mail, you can find me at the first letter of my first name concatenated with my last name at acm.org.

Thank you again for everything; take care — and stay in touch!

Posted on July 25, 2010 at 5:17 pm by bmc · Permalink · 46 Comments
In: Fishworks, Solaris

Turning the corner

It’s a little hard to believe that it’s been only fifteen months since we shipped our first product. It’s been a hell of a ride; there is nothing as exhilarating nor as exhausting as having a newly developed product that is both intricate and wildly popular. Especially in the domain of enterprise storage — where perfection is not just the standard but (entirely reasonably) the expectation — this makes for some seriously spiked punch.

For my own part, I have had my head down for the last six months as the Technical Lead for our latest software release, 2010.Q1, which is publicly available as of today. In my experience, I have found that in software (if not in life), one may only ever pick two of quality, features and schedule — and for 2010.Q1, we very much picked quality and features. (As for schedule, let it be only said that this release was once known as “2009.Q4”…)

2010.Q1 Quality

You don’t often see enterprise storage vendors touting quality improvements for a very simple reason: if the product was perfect when you sold it to me, why are you talking about how much you’ve improved it? So I’m going to break a little bit with established tradition and acknowledge that the product has not been perfect, though not without good reason. With our initial development of the product, we were pushing many new technologies very aggressively: not only did we seek to build enterprise-grade storage on commodity components (a deceptively daunting challenge in its own right), we were also building on entirely new elements like flash — and then topped it all off with an ambitious, from-scratch management stack. What were we possibly thinking by making so many bets at once? We made these bets not out of recklessness, but rather because they were essential elements of our Big Bet: that customers were sick of paying monopoly rents for enterprise storage, and that we could deliver a quantum leap in price-performance. (And if nothing else, let it be said that we got that one very, very right — seemingly too right, at times.) As for the specific technology bets, some have proven to be unblemished winners, while others have been more of a struggle. Sometimes the struggle was because the problem was hard, sometimes it was because the software was immature, and sometimes it was because a component that was assumed to have known failure modes had several (or many) unanticipated (or byzantine) failure modes. And in the worst cases, of course, it was all three…

I’m pleased to report that in 2010.Q1, we turned the corner on all fronts: in addition to just fixing a boatload of bugs in key areas like clustering and networking, we engaged in fundamental work like Dave‘s rearchitecture of remote replication, adapted to new device failure modes as with Greg‘s rearchitecture around resilience to HBA logic failure, and — perhaps most importantly — integrated critical firmware upgrades to each of the essential components of the I/O path (HBAs, SIM cards and disks). Also in 2010.Q1, we changed the way that we run the evaluation of the software, opening the door to many in our rapidly growing customer base. As a result, this release is already running on more customer production systems than any of its predecessors were at the time that they shipped — and on many more eval and production machines within our own walls.

2010.Q1 Features

But as important as quality is to this release, it’s not the full story: the release is also packed with major features like deduplication, iSER/SRP support, Kerberized NFS support and Fibre Channel support. Of these, the last is of particular interest to me because, in addition to my role as the Technical Lead for 2010.Q1, I was also responsible for the integration of FC support into the product. There was a lot of hard work here, but much of it was borne by John Forte and his COMSTAR team, who did a terrific job not only on the SCSI Target Management facility (STMF) but also on the base ALUA support necessary to allow proper FC operation in a cluster. As for my role, it was fun to cut the code to make all of this stuff work. Thanks to some great design work by Todd Patrick, along with some helpful feedback from field-facing colleagues like Ryan Matthews, I think we came up with a clean, functional interface. And working closely with both John and our test team, we have developed a rock-solid FC product. But of course (and as one might imagine), for me personally, the really gratifying bit was adding FC support to analytics. With just a pinch of DTrace and a bit of glue code, we now have visibility into FC operations by LUN, by project, by target, by initiator, by operation, by SCSI command, by size, by offset and by latency — and by any combination thereof.

As I was developing FC analytics, I would use as my source of load a silly disk benchmark I wrote back in the day when Adam and I were evaluating SSDs. Here, for example, is that benchmark running against a LUN that I named “thicktail-bench”:

The initiator here is the machine “thicktail”; it’s interesting to break down by initiator and see the paths by which thicktail is accessing the LUN:

(These names are human readable because I have added aliases for each of thicktail’s two HBA ports. Had I not added those aliases, we would see WWNs here.) The above shows us that thicktail is accessing the LUN through both of its paths, which is what we would expect (but good to visually confirm). Let’s see how it’s accessing the LUN in terms of operations:

Nothing too surprising here — this is the write phase of the benchmark and we have no log devices on this system, so we fully expect this. But let’s break down by offset:

The first time I saw this, I was surprised. Not because of what it shows — I wrote this benchmark, and I know what it does — but rather because it was so eye-popping to really see its behavior for the first time. In particular, this captures an odd phase I added to this benchmark: it does random writes across an increasingly large range. I did this because we had discovered that some SSDs did fine when the writes were confined to a small logical region, but broke down — badly — when the writes were over a larger region. And no, I don’t know why this was the case (presumably the firmware was in fragmented/wear-leveling/cache-busting hell); all I know is that we rejected any further exploration once the writes to the SSD were of a higher latency than that of my first hard drive: the IBM PC XT’s 10 MB ST-412, which had roughly 95 ms writes! (We felt that expecting an SSD to have better write latency than a hard drive from the first Reagan Administration was tough but fair…)

What now?

As part of our ongoing maturity as a product, we have developed a new role here at Fishworks: starting in 2010.Q1, the Technical Lead for the release will, as the release ships, transition to become the full-time Support Lead for that release in the field. This means many things for the way we support the product, but for our customers, it means that if and when you do have an issue on 2010.Q1, you should know that the buck on your support call will ultimately stop with me. We are establishing an unprecedented level of engineering integration with our support teams, and we believe that it will show in the support experience. So welcome to 2010.Q1 — and happy upgrading!

Posted on March 10, 2010 at 2:24 pm by admin · Permalink · 20 Comments
In: Fishworks

John Birrell

It is with a heavy heart that I announce that we in the DTrace community have lost one of our own: the indomitable John Birrell, who ported DTrace to FreeBSD, suffered a stroke and passed away on Friday, November 20, 2009.

We on Team DTrace knew John to be a remarkably talented and determined software engineer. As those who have attempted ports can attest, DTrace passes through rough country, and a port to a foreign system is a significant undertaking that requires mastery of both DTrace and (particularly) the target system. And in being the first to attempt a port, John’s challenge was that much greater — and his success in the endeavor a tribute to both his ability and (especially) his tenacity. For example, in performing the port, John decided that DTrace’s dependency on the cyclic subsystem was such that it, too, needed to be ported. He didn’t need to do this (and indeed, other ports have decided that an arbitrary resolution profile provider is not worth the significant trouble), but that he undertook this additional technical challenge anyway — even when any victory would remain hidden to all but the most expert eye — says a lot about John as both an engineer and a man. Later, when the port ran into some frustrating licensing issues, John once again did not give up. Rather, he backed up, and found a path forward that would satisfy all parties — even though it required significant technical reworking on his part. I have long believed that the mark of a great engineer is not how frequently they get knocked down, but rather how quickly they get back up — and in this regard, John was indisputably a giant.

John, you will be missed — not only by the FreeBSD community upon which you made an indelible mark, but by those of us in the DTrace community who only had the opportunity to work with you more recently. And while your legacy might remain anonymous to the future generations that will benefit from the fruits of your long labor, we will always know that it never would have happened without you. Thank you, and farewell.

(Those who wish to memorialize John may want to do as I did and make a donation in his memory to the FreeBSD Foundation.)

Posted on November 26, 2009 at 12:36 pm by bmc · Permalink · Comments Closed
In: Fishworks

Queue, CACM, and the rebirth of the ACM

As I have mentioned before (if in passing), I sit on the Editorial Advisory Board of ACM Queue, ACM’s flagship publication for practitioners. In the past year, Queue has undergone a significant transformation, and now finds itself at the vanguard of a much broader shift within the ACM — one that I confess to once thinking impossible.

My story with respect to the ACM is like that of many practitioners, I suspect: I first became aware of the organization as an undergraduate computer science student, when it appeared to me as the embodiment of academic computer science. This perception was cemented by its flagship publication, Communications of the ACM, a magazine which, to a budding software engineer longing for the world beyond academia, seemed to be adrift in dreamy abstraction. So when I decided at the end of my undergraduate career to practice my craft professionally, I didn’t for a moment consider joining the ACM: it clearly had no interest in the practitioner, and I had no interest in it.

Several years into my career, my colleague David Brown mentioned that he was serving on the Editorial Board of a new ACM publication aimed at the practitioner, dubbed ACM Queue. The idea of the ACM focussing on the practitioner brought to mind a piece of Sun engineering lore from the old Mountain View days. Sometime in the early 1990s, the campus engaged itself in a water fight that pitted one building against the next. The researchers from the Sun Labs building built an elaborate catapult to launch water-filled missiles at their adversaries, while the gritty kernel engineers in legendary MTV05 assembled surgical tubing into simple but devastatingly effective three-person water balloon slingshots. As one might guess, the Labs folks never got their catapult to work — and the engineers doused them with volley after volley of water balloons. So when David first mentioned that the ACM was aiming a publication at the practitioner, my mental image was of lab-coated ACM theoreticians, soddenly tinkering with an overcomplicated contraption. I chuckled to myself at this picture, wished David good luck on what I was sure was going to be a fruitless endeavor, and didn’t think any more of it.

Several months after it launched, I happened to come across an issue of the new ACM Queue. With skepticism, I read a few of the articles. I found them to be surprisingly useful — almost embarrassingly so. I sheepishly subscribed, and I found that even the articles that I disagreed with — like this interview with an apparently insane Alan Kay — were more thought-provoking than enraging. And in soliciting articles on sordid topics like fault management from engineers like my long-time co-conspirator Mike Shapiro, the publication proved itself to be interested in both abstract principles and their practical application. So when David asked me to be a guest expert for their issue on system performance, I readily accepted. I put together an issue that I remain proud of today, with articles from Bart Smaalders on performance anti-patterns, Phil Beevers on development methodologies for high-performance software, me on DTrace — and topped off with an interview between Kirk McKusick and Jarod Jenson that, among its many lessons, warns us of the subtle perils of Java’s notifyAll.

Two years later, I was honored to be asked to join Queue’s Editorial Advisory Board, where my eyes were opened to a larger shift within the ACM: the organization — led by both its executive leadership in CEO John White and COO Pat Ryan and its past and present elected ACM leadership like Steve Bourne, Dave Patterson, Stu Feldman and Wendy Hall — was earnestly and deliberately seeking to bring the practitioner into the ACM fold. And I quickly learned that I was not alone in my undergraduate dismissal of Communications of the ACM: CACM was broadly viewed within the ACM as being woefully out of touch with academic and practitioner alike, with one past president confessing that he himself couldn’t stomach reading it — even when his name was on the masthead. There was an active reform movement within the ACM to return the publication to its storied past, and this trajectory intersected with the now-proven success of Queue: it was decided that the in-print vehicle for Queue would shift to become the Practice section of a new, revitalized CACM. I was elated by this change, for it meant that our superlative practitioner-authored content would at last enter the walled garden of the larger academic community. And for practitioners, a newly relevant CACM would also serve to expose us to a much broader swathe of computer science.

After much preparation, the new CACM launched in July 2008. Nearly a year later, I think it can safely be called a success. To wit, I point to two specific (if personal) examples from that first issue alone: thanks to the new CACM, my colleague Adam Leventhal’s work on flash memory and our integration of it in ZFS found a much broader readership than it would have otherwise — and Adam was recently invited to join an otherwise academic symposium on flash. And thanks to the new CACM, I — and thousands of other practitioners — were treated to David Shaw’s incredible Anton, the kind of work that gives engineers an optimistic excitement uniquely induced by such moon shots. By bringing together the academic and the practitioner, the new CACM is effecting a new ACM.

So, to my fellow practitioners: I strongly encourage you to join me as a member of the ACM. While CACM is clearly a compelling and tangible benefit, it is not the only reason to join the ACM. As professionals, I believe that we have a responsibility to our craft: to learn from our peers, to offer whatever we might have to teach, and to generally leave the profession better than we found it. In other professions — in law, in medicine, and in more traditional engineering domains — this professional responsibility is indoctrinated to the point of expectation. But our discipline perhaps shows its youth in our ignorance of this kind of professional service. To be fair, this cannot be laid entirely at the practitioner’s feet: the organizations that have existed for computer scientists have simply not been interested in attracting, cultivating, or retaining the practitioner. But with the shift within the ACM embodied by the new CACM, this is changing. The ACM now aspires to be the organization that represents all computer scientists — not just those who teach students, perform research and write papers, but also those of us who cut code, deliver product and deploy systems for a living. Joining the ACM helps it make good on this aspiration; we practitioners cannot effect this essential change from outside its membership. And we must not stop at membership: if there is an article that you might like to write for the broader ACM audience, or an article that you’d like to see written, or a suggestion you might have for a CTO roundtable or a practitioner you think should be interviewed, or, for that matter, any other change that you might like to see in the ACM to further appeal to the practitioner, do not stay silent; the ACM has given us practitioners a new voice — but it is only good if we use it!

Posted on May 15, 2009 at 12:58 am by bmc · Permalink · 6 Comments
In: Fishworks

Moore’s Outlaws

My blog post eulogizing SPEC SFS has elicited quite a bit of reaction, much of it from researchers and industry observers who have drawn similar conclusions. While these responses were very positive, my polemic garnered a different reaction from SPEC SFS stalwart NetApp, where, in his response defending SPEC SFS, my former colleague Mike Eisler concocted this Alice-in-Wonderland defense of the lack of a pricing disclosure in the benchmark:

Like many industries, few storage companies have fixed pricing. As much as heads of sales departments would prefer to charge the same highest price to every customer, it isn’t going to happen. Storage is a buyers’ market. And for storage devices that serve NFS and now CIFS, the easily accessible numbers on spec.org are yet another tool for buyers. I just don’t understand why a storage vendor would advocate removing that tool.

Mike’s argument — and I’m still not sure that I’m parsing it correctly — appears to be that the infamously opaque pricing in the storage business somehow helps customers because they don’t have to pay a single “highest price”! That is, that the lack of transparent pricing somehow reflects the “buyers’ market” in storage. If that is indeed Mike’s argument, someone should let the buyers know how great they have it — those silly buyers don’t seem to realize that the endless haggling over software licensing and support contracts is for them!

And if that argument isn’t contorted enough for you, Mike takes a second tack:

In storage, the cost of the components to build the device falls continuously. Just as our customers have a buyers’ market, we storage vendors are buyers of components from our suppliers and also enjoy a buyers’ market. Re-submitting numbers after a hunk of sheet metal declines in price is silly.

His ludicrous “sheet metal” example aside (what enterprise storage product contains more than a few hundred bucks of sheet metal?), Mike’s argument appears to be that technology curves like Moore’s Law and Kryder’s Law lead to enterprise storage prices that are falling with such alarming speed that they’re wrong by the time they are so much as written down! If it needs to be said, this argument is absurd on many levels. First, the increases in transistor density and areal storage density tend to result in more computing bandwidth and more storage capacity per dollar, not lower absolute prices. (After all, your laptop is three orders of magnitude more powerful than a personal computer circa 1980 — but it’s certainly not a thousandth of the price.)

Second, has anyone ever accused the enterprise storage vendors of dropping their prices in pace with these laws — or even abiding by them in the first place? The last time I checked, the single-core Mobile Celeron that NetApp currently ships in their FAS2020 and FAS2050 — a CPU with a criminally small 256K of L2 cache — is best described as a Moore’s Outlaw: a CPU that, even when it first shipped six (six!) years ago, was off the curve. (A single-core CPU with 256K of L2 cache was abiding by Moore’s Law circa 1997.) Though it’s no wonder that NetApp sees plummeting component costs when they’re able to source their CPUs by dumpster diving…

Getting back to SPEC SFS: even if the storage vendors were consistently reflecting technology improvements, SPEC SFS is (as I discussed) a drive latency benchmark that doesn’t realize the economics of these curves anyway; drives are not rotating any faster year-over-year, having leveled out at 15K RPM some years ago due to some nasty physical constraints (like, the sound barrier). So there’s no real reason to believe that the 2,016 15K RPM drives used in NetApp’s opulent 1,032,461 op submission are any cheaper today than when this configuration was first submitted three years ago. Yes, those same drives would likely have more capacity (being 146GB or 300GB and not the 72GB in the submission), but recall that these drives are being short-stroked to begin with — so insofar as any additional capacity is used by the benchmark at all, it will only be used to assure even less head movement!

Finally, even if Mike were correct that technology advances result in ever falling absolute prices, it still should not prohibit price disclosures. We all understand that prices reflect a moment in time, and if natural inflation does not dissuade us from price disclosures, nor should any technology-induced deflation.

So to be clear: SPEC SFS needs pricing disclosures. TPC has them, SPC has them — and SFS needs them if the benchmark has any aspiration to enduring relevance. While SPEC SFS’s flaws run deeper than the missing price disclosure, the disclosure would at least keep the more egregious misbehaviors in check — and it would also (I believe) show storage buyers the degree to which the systems measured by SPEC SFS do not in fact correspond to the systems that they purchase and deploy.

One final note: in his blog entry, Mike claims that “SPEC SFS performance is the minimum bar for entry into the NAS business.” If he genuinely believes this, Mike may want to write a letter to the editors of InfoWorld: in their recent review of our Sun Storage 7210, they had the audacity to ignore the lack of SPEC SFS results for the appliance, instead running their own benchmarks. Their rating for the product’s performance? 10 out of 10. What heresy!

Posted on February 19, 2009 at 9:28 pm by bmc · Permalink · 11 Comments
In: Fishworks