Standing up SmartDataCenter

Over the past year, we at Joyent have been weaving some threads that may have at times seemed disconnected: last fall, we developed a prototype of DTrace-based cloud instrumentation as part of hosting the first Node Knockout competition; in the spring, we launched our no.de service with full-fledged real-time cloud analytics; and several weeks ago, we announced the public availability of SmartOS, the first operating system to combine ZFS, DTrace, OS-based virtualization and hardware virtualization in the same operating system kernel. These activities may have seemed disjoint, but there was in fact a unifying theme: our expansion at Joyent from a public cloud operator to a software company — from a company that operates cloud software to one that also makes software for cloud operators.

This transition — from service to product — is an arduous one, for the simple reason that a service can make up for any shortcomings with operational heroics that are denied to a product. Some of these gaps may appear superficial (e.g., documentation), but many are deep and architectural; that one is unable to throw meat at a system changes many fundamental engineering decisions. (It should be said that some products don’t internalize these differences and manage to survive despite their labor-intensive shortcomings, but this is thanks to the caulking called “professional services” — and is usually greased by an enterprise software sales rep with an easy rapport and a hell of a golf handicap.) As the constraints of being a product affect engineering decisions, so do the constraints of operating as a service; if a product has no understanding or empathy for how it will be deployed into production, it will be at best difficult to operate, and at worst unacceptable for operation.

So at Joyent, we have long believed that we need both a product that can be used to provide a kick-ass service and a kick-ass service to prove to ourselves that we hit the mark. While our work over the past year has been to develop that product — SmartDataCenter — we have all known that we wouldn’t be done until that product was itself operating not just emerging services like no.de, but also the larger Joyent public cloud, and over the summer we began to plan a broad-based stand-up of SmartDataCenter. With our SmartOS-based KVM solid and on schedule, and with the substantial enabling work up the stack falling into place, we set for ourselves an incredibly ambitious target: to not only ship this revolutionary new version of the software, but to also — at the same time — stand up that product in production in the Joyent public cloud. Any engineer’s natural conservatism would resist such a plan — conflating shipping a product with a production deployment feels chancy at best and foolish at worst — but we saw an opportunity to summon a singular, company-wide focus that would force us to deliver a much better product, and would give us and our software customers the satisfaction of knowing that SmartDataCenter as shipped runs Joyent. More viscerally, we were simply too hot on getting SmartOS-based hardware virtualization in production — and oh, those DTrace-based KVM analytics! — to contemplate any path less aggressive.

Of course, it wasn’t easy; there were many intense late-night Jabber sessions as we chased bugs and implemented solutions. But as time went on — and as we cut release candidates and began to pound on the product in staging environments — it became clear that the product was converging, and a giddy optimism began to spread. Especially for our operations team, many of whom are long-time Joyent vets, this was a special kind of homecoming, as they saw their vision and wisdom embodied in the design decisions of the product. This sentiment was epitomized by one of the senior members of the Joyent public cloud operations team — an engineer who has tremendous respect in the company, and who is known for his candor — as he wrote:

It seems the past ideas of a better tomorrow have certainly come to light. I think I speak for all of the JPC when I say “Bravo” to the dev team for a solid version of the software. We have many rivers (hopefully small creeks) to cross but I am ultimately confident in what I have seen and deployed so far. It’s incredible to see a unified convergence on what has been the Joyent dream since I have been here.

For us in development, these words carried tremendous weight; we had endeavored to build the product that our team would want to deploy and operate, and it was singularly inspiring to see them so enthusiastic about it! (Though I will confess that this kind of overwhelmingly positive sentiment from such a battle-hardened operations engineer is so rare as to almost make me nervous — and yes, I am knocking on wood as I write this.)

With that, I’m pleased to announce that the (SmartDataCenter-based) Joyent public cloud is open for business! You can provision (via web portal or API) either virtual OS instances (SmartMachines) or full hardware-virtual VMs; they’ll provision like a rocket (ZFS clones FTW!), and you’ll get a high-performing, highly reliable environment with world-beating real-time analytics. And should you wish to stand up your own cloud with the same advantages — either a private cloud within your enterprise or a public cloud to serve your customers — know that you can own and operate your very own SmartDataCenter!

Posted on September 15, 2011 at 2:03 am by bmc

KVM on illumos

A little over a year ago, I came to Joyent because of a shared belief that systems software innovation matters — that there is no level of the software stack that should be considered off-limits to innovation. Specifically, we ardently believe in innovating in the operating system itself — that extending and leveraging technologies like OS virtualization, DTrace and ZFS can allow one to build unique, differentiated value further up the software stack. My arrival at Joyent also coincided with the birth of illumos, an effort that Joyent was thrilled to support, as we saw that rephrasing our SmartOS as an illumos derivative would allow us to engage in a truly open community with whom we could collaborate and collectively innovate.

In terms of the OS technology itself, as powerful as OS virtualization, DTrace and ZFS are, there was clearly a missing piece to the puzzle: hardware virtualization. In particular, with advances in microprocessor technology over the last five years (and specifically with the introduction of technologies like Intel’s VT-x), one could (in principle, anyway) much more easily build highly performing hardware virtualization. The technology was out there: Fabrice Bellard‘s cross-platform QEMU had been extended to utilize the new microprocessor support by Avi Kivity and his team with their Linux kernel virtual machine (KVM).

In the fall of last year, the imperative became clear: we needed to port KVM to SmartOS. This notion is almost an oxymoron, as KVM is not “portable” in any traditional sense: it interfaces between QEMU, Linux and the hardware with (rightful) disregard for other systems. And as KVM makes heavy use of Linux-specific facilities that don’t necessarily have clean analogs in other systems, the only way to fully scope the work was to start it in earnest. However, we knew that just getting it to compile in any meaningful form would take significant effort, and with our team still ramping up and with lots of work to do elsewhere, the reality was that this would have to start as a small project. Fortunately, Joyent kernel internals veteran Max Bruning was up to the task, and late last year, he copied the KVM bits from the stable Linux 2.6.34 source, took out a sharpened machete, and started whacking back jungle…

Through the winter holidays and early spring (and when he wasn’t dispatched on other urgent tasks), Max made steady progress — but it was clear that the problem was even more grueling than we had anticipated: because KVM is so tightly integrated into the Linux kernel, it was difficult to determine dependencies — and hard to know when to pull in the Linux implementation of a particular facility or function versus writing our own or making a call to the illumos equivalent (if any). By the middle of March, we were able to execute three instructions in a virtual machine before taking an unresolved EPT violation. To the uninitiated, this might sound pathetic — but it meant that a tremendous amount of software and hardware were working properly. Shortly thereafter, with the EPT problem fixed, the virtual machine would happily run indefinitely — but clearly not making any progress booting. But with this milestone, Max had gotten to the point where the work could be more realistically parallelized, and we had gotten to the point as a team and a product where we could dedicate more engineers to this work. In short, it was time to add to the KVM effort — and Robert Mustacchi and I roped up and joined Max in the depths of KVM.

With three of us on the project and with the project now more parallelizable, the pace accelerated, and in particular we could focus on tooling: we added an mdb dmod to help us debug, 16-bit disassembler support that was essential for us to crack the case on the looping in real-mode, and DTrace support for reading the VMCS. (And we even learned some surprising things about GNU C along the way!) On April 18th, we finally broke through into daylight: we were able to boot into KMDB running on a SmartOS guest! The pace quickened again, as we now had the ability to interactively ascertain state as seen by the guest and much more quickly debug issues. Over the next month (and after many intensive debugging sessions), we were able to boot Windows, Linux, FreeBSD, Haiku, QNX, Plan 9 and every other OS we could get our hands on; by May 13th, we were ready to send the company-wide “hello world” mail that has been the hallmark of bringup teams since time immemorial.

The wind was now firmly at our backs, but there was still much work to be done: we needed to get configurations working with large numbers of VCPUs, with many VMs, over long periods of time and significant stress and so on. Over the course of the next few months, the bugs became rarer and rarer (if nastier!), and we were able to validate that performance was on par with Linux KVM — to the point that today we feel confident releasing illumos KVM to the greater community!

In terms of starting points, if you just want to take it for a spin, grab a SmartOS live image ISO. If you’re interested in the source, check out the illumos-kvm github repo for the KVM driver itself, the illumos-kvm-cmd github repo for our QEMU 0.14.1 tree, and illumos itself for some of our KVM-specific tooling.

One final point of validation: when we went to integrate these changes into illumos, we wanted to be sure that they built cleanly in a non-SmartOS (that is, stock illumos) environment. However, we couldn’t dedicate a machine for the activity, so we (of course) installed OpenIndiana running as a KVM guest on SmartOS and cranked the build to verify our KVM deltas! As engineers, we all endeavor to build useful things, and with KVM on illumos, we can assert with absolute confidence that we’ve hit that mark!

Posted on August 15, 2011 at 7:00 am by bmc

In defense of intrapreneurialism

Trevor pointed me to a ChubbyBrain article that derided “intrapreneurs” as having nothing in common with their entrepreneurial brethren. While I don’t necessarily love the term, what we did at Fishworks is probably most accurately described as “intrapreneurial” — and based on both our experiences there and the caustic tone of the Chubby polemic, I feel a responsibility to respond. (And so, I hasten to add, did Trevor.) In terms of my history: I spent ten years in the bowels of Bigco, four years engaged in an intrapreneurial effort within Bigco, and the past year at an actual, venture-funded startup. Based on my experience, I take strong exception to the Chubby piece; to take their criticisms of intrapreneurs one-by-one:

Of course, there are differences between a startup within a large organization and an actual startup. When I was pitching Fishworks to the executives at Sun, I thought of our activity as both giving up some of the upside and removing some of the downside of a startup. After living through Fishworks and now being at a startup, I would be more precise about the downside that the intrapreneur doesn’t have to endure: it’s not insecurity (or at least, not as much as I believed) — it’s the grueling task of raising money. At Fishworks, we had exactly one investor, they knew us and we knew them — and that was a blessing that gave us the luxury of focus.

On balance, I don’t regret for a moment doing it the way we did it — but I also don’t regret having left for a startup. I don’t doubt that my career will likely oscillate between smaller organizations and larger ones — I see strengths and weaknesses to both, much of which comes down to specific opportunities and conditions. In short: there may be differences between intrapreneurialism and entrepreneurialism — but I see little difference between the intrapreneur and the entrepreneur!

Posted on July 12, 2011 at 2:26 pm by bmc

When magic collides

We had an interesting issue come up the other day, one that ended up being a pretty nasty bug in DTrace. It merits a detailed explanation, but first, a cautionary note: we’re headed into rough country; in the immortal words of Sixth Edition, you are not expected to understand this.

The problem we noticed was this: when running node and instrumenting certain hot functions (memcpy:entry was noted in particular), after some period of activity, the node process would die on a strange seg fault. And looking at the core dumps, we were (usually) lost in some code path in memcpy() with registers or a stack that was self-inconsistent. In debugging this, it was noted (with DTrace, natch) that the node process was under somewhat intense signal processing activity (we’ll get to why in a bit), and that (in particular) the instrumented function activity and the signal activity seemed to be interacting “shortly” before the fatal failure.
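
For concreteness, the kind of enabling at issue is ordinary pid provider instrumentation of a hot function in the running node process; a minimal sketch follows (the process ID is a placeholder, and pinning memcpy to libc.so.1 is my assumption, not something from the original report):

# dtrace -n 'pid$target:libc.so.1:memcpy:entry { @ = count(); }' -p 12345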

Before wading any deeper, it’s worth taking an aside to explain how we deal with signals in DTrace — a discussion that in turn requires some background on the way loss-less user-level tracing works in DTrace. One of the important constraints that we placed upon ourselves (albeit one that we didn’t talk about too much) is that DTrace is loss-less: if multiple threads on multiple CPUs hit the same point of instrumentation, no thread will see an uninstrumented code path. This may seem obvious, but it’s not the way traditional debuggers work: normally, a debugger stops a process (and all of its threads) when it hits a breakpoint — which it induces by modifying program text with a synchronous trap instruction (e.g., an “int 3” in x86). Then, the program text is modified to contain the original instruction, the thread that hit the breakpoint (and only that thread) is single-stepped (often using hardware support) over the original instruction, and the original instruction is again replaced with a breakpoint instruction. We felt this to be unacceptable for DTrace, and imposed a constraint that tracing always be loss-less.

For user-level instrumentation (that is, the pid provider), magic is required: we need to execute the instrumented instruction out-of-line. For an overview of the mechanism for this, see Adam‘s block comment in fasttrap_isa.c. The summary is that we have a per-thread region of memory at user-level to which we copy both the instrumented instruction and a control transfer instruction that gets us back past the point of instrumentation. Then, when we hit the probe (and after we’ve done in-kernel probe processing), we point the instruction pointer to be in the per-thread region and return to user-level; back in user-land, we execute the instruction and transfer control back to the rest of the program.

Okay, so easy, right? Well, not so fast — enter signals: if we take a signal while we’re executing this little user-level trampoline, we could corrupt program state. At first, it may not be obvious how this is the case; if we execute a signal before we execute the instruction in the trampoline we will simply execute the signal handler (whatever it may be) and return to the trampoline, finish that off, and then get back to the business of executing the program. Where’s the problem? The problem is that if the signal handler executes something that itself is instrumented with DTrace: if this is the case, we’ll hit another probe, and that probe will copy its instruction to the same trampoline. This is bad: it means that upon return from the signal, we will return to the trampoline and execute the wrong instruction. Only one of them, of course. Which, if you’re lucky, will induce an immediate crash. But more likely, the program will execute tens, hundreds or even thousands of instructions before dying on some “impossible” problem. Sound nasty to debug? It is — and next time you see Adam, show some respect.

So how do we solve this problem? For this, Adam made an important observation: the kernel is in charge of signal delivery, and it knows about all of this insanity, so can’t it prevent this from happening? Adam solved this by extending the trampoline after the control transfer instruction to consist of the original instruction (again) followed by an explicit trap into the kernel. If the kernel detects that we’re about to deliver a signal while in this window (that is, with the program counter on the original instruction in the trampoline), it moves the program counter to the second half of the trampoline and returns to user-level with the signal otherwise unhandled. This leaves the signal pending, but only allows the program to execute the one instruction before bringing it back downtown on the explicit trap in the second half of the trampoline. There, we will see that we are in this situation, update the trapped program counter to be the instruction after the instrumented instruction (telling the program a small but necessary lie about where signal delivery actually occurred), and bounce back to user-level; the kernel will see the bits set on the thread structure indicating that a signal is pending, and the Right Thing will happen.

And it all works! Well, almost — and that brings us to the bug. After some intensive DTrace/mdb sessions and a long Jabber conversation with Adam, it became clear that while signals were involved, the basic mechanism was actually functioning as designed. More experimenting (and more DTrace, of course) revealed that the problem was even nastier than the lines along which we were thinking: yes, we were taking a signal in the trampoline — but that logic was working as designed, with the program counter being moved correctly into the second half of the trampoline. The problem was that we were taking a second asynchronous signal, this time in the second half of the trampoline, and this brought us to this logic in dtrace_safe_defer_signal:

        /*
         * If we've executed the original instruction, but haven't performed
         * the jmp back to t->t_dtrace_npc or the clean up of any registers
         * used to emulate %rip-relative instructions in 64-bit mode, do that
         * here and take the signal right away. We detect this condition by
         * seeing if the program counter is the range [scrpc + isz, astpc).
         */
        if (t->t_dtrace_astpc - rp->r_pc <
            t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
                ...
                rp->r_pc = t->t_dtrace_npc;
                t->t_dtrace_ft = 0;
                return (0);
        }

Now, the reason for this clause (the meat of which I’ve elided) is the subject of its own discussion, but it is not correct as written: it works (due to some implicit use of unsigned math) when the program counter is in the first half of the trampoline, but it fails if the program counter is on the instrumented instruction in the second half of the trampoline (denoted with t_dtrace_astpc). If we hit this case — first one signal on the first instruction of the trampoline followed by another signal on the first instruction on the second half of the trampoline — we will redirect control to be the instruction after the instrumentation (t_dtrace_npc) without having executed the instruction. This, if it needs to be said, is bad — and leads to exactly the kind of mysterious death that we’re seeing.

Fortunately, the fix is simple — here’s the patch. All of this leaves just one more mystery: why are we seeing all of these signals on node, anyway? They are all SIGPROF signals, and when I pinged Ryan about this yesterday, he was as surprised as I was that node seemed to be doing them when otherwise idle. A little digging on both of our parts revealed that the signals we were seeing are an artifact of some V8 magic: the use of SIGPROF is part of the feedback-driven optimization in the new Crankshaft compilation infrastructure for V8. So the signals are goodness — and that DTrace can now correctly handle them with respect to arbitrary instrumentation that much more so!
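
If you want to watch this signal activity yourself, one simple approach (a sketch, not necessarily how we went about it) is the proc provider, counting signals by number as node handles them; SIGPROF is signal 29 on illumos:

# dtrace -n 'proc:::signal-handle /execname == "node"/ { @[args[0]] = count(); }'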

Posted on March 9, 2011 at 11:27 pm by admin

Log/linear quantizations in DTrace

As I alluded to when I first joined Joyent, we are using DTrace as a foundation to tackle the cloud observability problem. And as you may have seen, we’re making some serious headway on this problem. Among the gritty details of scalably instrumenting latency is a dirty little problem around data aggregation: when we instrument the system to record latency, we are (of course) using DTrace to aggregate that data in situ, but with what granularity? For some operations, visibility into microsecond resolution is critical, while for other (longer) operations, millisecond resolution is sufficient – and some classes of operations may take hundreds or thousands (!) of milliseconds to complete. If one makes the wrong choice for the resolution of aggregation, one either ends up clogging the system with unnecessarily fine-grained data, or discarding valuable information in overly coarse-grained data. One might think that one could infer the interesting latency range from the nature of the operation, but this is unfortunately not the case: I/O operations (a typically-millisecond resolution operation) can take but microseconds, and CPU scheduling operations (a typically-microsecond resolution operation) can take hundreds of milliseconds. Indeed, the problem is that one often does not know what order of magnitude of resolution is interesting for a particular operation until one has a feel for the data for the specific operation — one needs to instrument the system to best know how to instrument the system!

We had similar problems at Fishworks, of course, and my solution there was wholly dissatisfying: when instrumenting the latency of operations, I put in place not one but many (five) lquantize()‘d aggregations that (linearly) covered disjoint latency ranges at different orders of magnitude. As we approached this problem again at Joyent, Dave and Robert both balked at the clumsiness of the multiple aggregation solution (with any olfactory reaction presumably heightened by the fact that they were going to be writing the pungent code to deal with this). As we thought about the problem in the abstract, this really seemed like something that DTrace itself should be providing: an aggregation that is neither logarithmic (i.e., quantize()) nor linear (i.e., lquantize()), but rather something in between…
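
To make the clumsiness concrete, here is a sketch of the shape of that multiple-aggregation approach, using read(2) latency as a stand-in for the operation being measured (the probes and bucket boundaries are illustrative, not the actual Fishworks code):

syscall::read:entry
{
        self->ts = timestamp;
}

syscall::read:return
/self->ts/
{
        this->us = (timestamp - self->ts) / 1000;

        /* One aggregation per order of magnitude of latency */
        @fine = lquantize(this->us, 0, 100, 1);               /* 0us to 100us, 1us buckets */
        @medium = lquantize(this->us, 100, 10000, 100);       /* 100us to 10ms, 100us buckets */
        @coarse = lquantize(this->us, 10000, 1000000, 10000); /* 10ms to 1s, 10ms buckets */
        self->ts = 0;
}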

After playing around with some ideas and exploring some dead-ends, we came to an aggregating action that logarithmically aggregates by order of magnitude, but then linearly aggregates within an order of magnitude. The resulting aggregating action — dubbed “log/linear quantization” and denoted with the new llquantize() aggregating action — is parameterized with a factor, a low magnitude, a high magnitude and the number of linear steps per magnitude. With the caveat that these bits are still hot from the oven, the patch for llquantize() (which should patch cleanly against illumos) is here.

To understand the problem we’re solving, it’s easiest to see llquantize() in action:

# cat > llquant.d
tick-1ms
{
	/*
	 * A log-linear quantization with factor of 10 ranging from magnitude
	 * 0 to magnitude 6 (inclusive) with twenty steps per magnitude
	 */
	@ = llquantize(i++, 10, 0, 6, 20);
}

tick-1ms
/i == 1500/
{
	exit(0);
}
^D
# dtrace -V
dtrace: Sun D 1.7
# dtrace -qs ./llquant.d

           value  ------------- Distribution ------------- count
             < 1 |                                         1
               1 |                                         1
               2 |                                         1
               3 |                                         1
               4 |                                         1
               5 |                                         1
               6 |                                         1
               7 |                                         1
               8 |                                         1
               9 |                                         1
              10 |                                         5
              15 |                                         5
              20 |                                         5
              25 |                                         5
              30 |                                         5
              35 |                                         5
              40 |                                         5
              45 |                                         5
              50 |                                         5
              55 |                                         5
              60 |                                         5
              65 |                                         5
              70 |                                         5
              75 |                                         5
              80 |                                         5
              85 |                                         5
              90 |                                         5
              95 |                                         5
             100 |@                                        50
             150 |@                                        50
             200 |@                                        50
             250 |@                                        50
             300 |@                                        50
             350 |@                                        50
             400 |@                                        50
             450 |@                                        50
             500 |@                                        50
             550 |@                                        50
             600 |@                                        50
             650 |@                                        50
             700 |@                                        50
             750 |@                                        50
             800 |@                                        50
             850 |@                                        50
             900 |@                                        50
             950 |@                                        50
            1000 |@@@@@@@@@@@@@                            500
            1500 |                                         0

As you can see in the example, at each order of magnitude (determined by the factor — 10, in this case), we have 20 steps (or, at the lowest orders of magnitude where there are fewer values than steps, just the factor): up to 10, the buckets each span 1 value; from 10 through 100, each spans 5 values; from 100 to 1000, each spans 50; from 1000 to 10000, each spans 500; and so on. The result is not necessarily easy on the eye (and indeed, it seems unlikely that llquantize() will find much use on the command line), but is easy on the system: it generates data that has the resolution of linearly quantized data with the efficiency of logarithmically quantized data.
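
To connect this back to the latency problem that motivated it, a single llquantize() can stand in for a pile of disjoint lquantize() aggregations. For example (again using read(2) latency purely as an illustration), a factor of 10 from magnitude 3 through magnitude 9 with 20 steps per magnitude covers everything from about a microsecond up through seconds, in nanoseconds, in one aggregation:

syscall::read:entry
{
        self->ts = timestamp;
}

syscall::read:return
/self->ts/
{
        /* roughly 1us through seconds of latency, in a single aggregation */
        @ = llquantize(timestamp - self->ts, 10, 3, 9, 20);
        self->ts = 0;
}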

Of course, you needn’t appreciate the root system here to savor the fruit: if you are a Joyent customer, you will soon be benefitting from llquantize() without ever having to wade into the details, as we bring real-time analytics to the cloud. Stay tuned!

Posted on February 8, 2011 at 9:48 pm by bmc

The DIRT on JSConf.eu and Surge

I just got back from an exciting (if exhausting) conference double-header: JSConf.eu and Surge. These conferences made for an interesting pairing, together encompassing much of the enthusiasm and challenges in today’s systems. It started in Berlin with JSConf.eu, a conference that reflects both JavaScript’s energy and diversity. Ryan introduced Node.js to the world at JSConf.eu a year ago, and enthusiasm for Node at the conference was running high; I would estimate that at least half of the attendees were implementing server-side JavaScript in one fashion or another. Presentations from the server-side were epitomized by Felix and Philip, each of whom presented real things done with Node — expanding on the successes, limitations and futures of what they had developed. The surge of interest in the server-side is of course changing the complexion of not only JSconf but of JavaScript itself, a theme that Chris Williams hit with fervor in his closing plenary, in which he cautioned of the dangers for a technology and a community upon whom a wave of popularity is breaking. Having been near the epicenter of Java’s irrational exuberance in the mid-1990s (UltraJava, anyone?), I certainly share some of Chris’s concerns — and I think that Ryan’s talk on the challenges in Node reflects a humility that will become essential in the coming year. That said, I am also looking forward to seeing the JavaScript and Node communities expand to include more of the systems vets that have differentiated Node to date.

Speaking of systems vets, a highlight of JSConf.eu was hanging out with Erik Corry from the V8 team. JavaScript and Node owe a tremendous debt of gratitude to V8, without whom we would have neither the high-performance VM itself, nor the broader attention to JavaScript performance so essential to the success of JavaScript on the server-side. Erik and I brainstormed ideas for deeper support of DTrace within V8, which is, of course, an excruciatingly hard problem: as I have observed in the past, magic does not layer well — and DTrace and a sophisticated VM like V8 are both conjuring deep magic that frequently operate at cross purposes. Still, it was an exciting series of conversations, and we at Joyent certainly look forward to taking a swing at this problem.

After JSConf.eu wrapped up, I spent a few days brainstorming and exploring Berlin (and its Trabants and Veyrons) with Mark, Pedro and Filip. I then headed to the inaugural Surge Conference in Baltimore, and given my experience there, I would put the call out to the systems tribe: Surge promises to be the premier conference for systems practitioners. We systems practitioners have suffered acutely at the hands of academic computer science’s foolish obsession with conferences as a publishing vehicle, having seen conference after conference of ours becoming casualties on their bloody path to tenure. Surge, by contrast, was everything a systems conference should be: exciting talks on real systems by those who designed, built and debugged them — and a terrific hallway track besides.

To wit: I was joined at Surge by Joyent VP of Operations Ryan Nelson; for us, Surge more than paid for itself during Rod Cope’s presentation on his ten lessons for Hadoop deployment. Rod’s lessons were of the ilk that only come the hard way; his was a presentation fired in a kiln of unspeakable pain. For example, he noted that his cluster had been plagued by a wide variety of seemingly random low-level problems: lost interrupts, networking errors, I/O errors, etc. Apparently acting on a tip in a Broadcom forum, Rod disabled C-states on all of his Dell R710s and 2970s — and the problems all lifted (and have been absent for four months). Given that we have recently lived through the same (and came to the same conclusion, albeit more recently and therefore much more tentatively), I tweeted what Rod had just reported, knowing that Ryan was in John Allspaw’s concurrent session and would likely see the tweet. Sure enough, I saw him retweet it minutes later — and he practically ran in after the end of Rod’s talk to see if I was kidding. (When it comes to firmware- and BIOS-level issues, I assure you: I never kid.) Not only was Rod’s confirmation of our experience tremendously valuable, it goes to the kind of technologist at Surge: this is a conference with practitioners who are comfortable not only at lofty architectural heights, but also at the grittiest of systems’ depths.

Beyond the many thought-provoking sessions, I had great conversations with many people, including Paul, Rod, Justin, Chris, Geir, Stephen, Wez, Mark, Joe, Kate, Tom, Theo (natch) — and a reunion (if sodden) with surprise stowaway Ted Leung.

As for my formal role at Surge, I had the honor of giving one of the opening keynotes. In preparing for it, I tried to put it all together: what larger trends are responsible for the tremendous interest in Node? What are the implications of these trends for other elements of the systems we build? From Node Knockout, it is clear that a major use case for Node is web-based real-time applications. At the moment these are primarily games — but as practitioners like Matt Ranney can attest, games are by no means the only real-time domain in which Node is finding traction. In part based on my experience developing the backend for the Node Knockout leaderboard, the argument that I made in the keynote is that these web-based real-time applications have a kind of orthogonal data consistency: instead of being on the ACID/BASE axis, the “consistency” of the data is its temporality. (That is, if the data is late, it’s effectively inconsistent.) This is not necessarily new (market-facing trading systems have long had these properties), but I believe we’re about to see these systems brought to a much larger audience — and developed by a broader swath of practitioners. And of course — like AJAX before it — this trend was begging for an acronym. Struggling to come up with something, I went back to what I was trying to describe: these are real-time systems — but they are notable because they are data-intensive. A ha: “data-intensive real-time” — DIRT! I dig DIRT! I don’t know if the acronym will stick or not, but I did note that it was mentioned in each of Theo’s and Justin’s talks — and this was just minutes after it was coined. (Video of this keynote will be available soon; slides are here.)

Beyond the keynote, I also gave a talk on the perils of building enterprise systems from commodity components. This blog entry has gone on long enough, so I won’t go into depth on this topic here, saying only this: fear the octopus!

Thanks to the organizers of both JSconf and Surge for two great conferences. I arrive home thoroughly exhausted (JSconf parties longer, but Surge parties harder; pick your poison), but also exhilarated and excited about the future. Thanks again — and see you next year!

Posted on October 2, 2010 at 3:33 pm by bmc

A physician’s son

My father is an emergency medical physician, a fact that has had a subtle but discernable influence on my career as a software engineer. Emergency medicine and software engineering are of course very different problems, and even though there are times when a major and cascading software systems failure can make a datacenter feel like an urban emergency room on a busy Saturday night, the reality is that if we software engineers screw up, people don’t generally die. (And while code and machine can both be metaphorically uncooperative, we don’t really have occasion to make a paramedic sandwich.) Not only are the consequences less dire, but the underlying systems themselves are at opposite ends of a kind of spectrum: one is deterministic yet brittle, the other sloppy yet robust. Despite these differences, I believe that medicine has much to teach us in software engineering; two specific (if small) artifacts of medicine that I have incorporated into my career:

M&M and Journal Club are two ways that medicine has influenced software engineering (or mine, anyway); what of the influences of information systems on the way medicine is practiced? To explore this, we at ACM Queue sought to put together an issue on computing in healthcare. This is a brutally tough subject for us to tackle — what in the confluence of healthcare and computing is interesting for the software practitioner? — but we felt it to be a national priority and too important to ignore. It should be said that I was originally introduced to computing through medicine: in addition to being an attending physician, my father — who started writing code as an undergraduate in the 1960s on an IBM System/360 Model 50 — also developed the software system that ran his emergency department; the first computer we owned (a very early IBM PC-XT) was bought to allow him to develop software at home (Microsoft Pascal 1.0 FTW!). I had intended to keep my personal history out of the ACM discussion, but at some point, someone on the Board said that what we really needed was the physician’s perspective — someone who could speak to the practical difficulties of integrating modern information systems into our healthcare system. At that, I couldn’t hold my tongue — I obviously knew of someone who could write such an article…

Dad graciously agreed, and the resulting article, Computers in Patient Care: the Promise and the Challenge appears as the cover story on this month’s Communications of the ACM. I hope you enjoy this article as much as I did — and who knows: if you are developing software that aspires to be used in healthcare, perhaps it could be the subject of your own Journal Club. (Just be sure to slip the kids a soda or two!)

Posted on September 24, 2010 at 2:23 am by bmc

DTrace, node.js and the Robinson Projection

When I joined Joyent, I mentioned that I was seeking to apply DTrace to the cloud, and that I was particularly excited about the development of node.js — leaving it implicit that the intersection of the two technologies would be naturally interesting. As it turns out, we have had an early opportunity to show the potential here: as you might have seen, the Node Knockout programming contest was held over the weekend; when I first joined Joyent (but four weeks ago!), Ryan was very interested in potentially using DTrace to provide a leaderboard for the competition. I got to work, adding USDT probes to node.js. To be fair, this still has some disabled overhead (namely, getting into and out of the node addon that has the true USDT probe), but it’s sufficiently modest to deploy DTrace-enabled nodes in production.
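
(If you have a DTrace-enabled node running, the quickest way to see what the provider offers is simply to list its probes; the enablings below use a few of them. Output elided here.)

# dtrace -l -n 'node*:::'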

And thanks to incredibly strong work by Joyent engineers, we were able to make available a new node.js service that allocated a container per user. This service allowed us to offer a DTrace-enabled node to contestants — and then observe all of that from the global zone.

As an example of the DTrace provider for node.js, here’s a simple enabling to print out HTTP requests as zones handle them (running on one of the Node Knockout machines):

# dtrace -n 'node*:::http-server-request{printf("%s: %s of %s\n", \
    zonename, args[0]->method, args[0]->url)}' -q
nodelay: GET of /poll6759.479651377309
nodelay: GET of /poll6148.392275444794
nodebodies: GET of /latest/
nodebodies: GET of /latest/
nodebodies: GET of /count/
nodebodies: GET of /count/
nodelay: GET of /poll8973.863890386003
nodelay: GET of /poll2097.9667574643568
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: POST of /graphs/4c7a650eba12e9c41d000005/appendValue
awesometown: GET of /graphs/4c7acd5ca121636840000002.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7b2408546a64b81f000001.js
awesometown: POST of /faye
awesometown: POST of /faye
...

I added probes around both HTTP request and HTTP response; treating the file descriptor as a token that uniquely describes a request while it is pending (an assumption that would only be invalid in the presence of HTTP pipelining) allows one to actually determine the latency for requests:

# cat http.d
#pragma D option quiet

http-server-request
{
        ts[this->fd = args[1]->fd] = timestamp;
        vts[this->fd] = vtimestamp;
}

http-server-response
/this->ts = ts[this->fd = args[0]->fd]/
{
        @t[zonename] = quantize(timestamp - this->ts);
        @v[zonename] = quantize(vtimestamp - vts[this->fd]);
        ts[this->fd] = 0;
        vts[this->fd] = 0;
}

tick-1sec
{
        printf("Wall time:\n");
        printa(@t);

        printf("CPU time:\n");
        printa(@v);
}

This script makes the distinction between wall time and CPU time; for wall-time, you can see the effect of long-polling, e.g. (the values are nanoseconds):

    nodelay
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |                                         4
          131072 |@@@@@                                    52
          262144 |@@@@@@@@@@@@@@@@@@                       183
          524288 |@@@@@                                    55
         1048576 |@@@                                      27
         2097152 |@                                        9
         4194304 |                                         5
         8388608 |@                                        8
        16777216 |@                                        6
        33554432 |@                                        9
        67108864 |@                                        7
       134217728 |@                                        12
       268435456 |@                                        11
       536870912 |                                         1
      1073741824 |                                         4
      2147483648 |                                         1
      4294967296 |                                         5
      8589934592 |                                         0
     17179869184 |                                         1
     34359738368 |                                         1
     68719476736 |                                         0

You can also look at the CPU time to see those that are doing more actual work. For example, one zone with interesting CPU time outliers:

  nodebodies
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@                             57
        16777216 |@@@@                                     21
        33554432 |@@@@                                     18
        67108864 |@@@@@@@                                  34
       134217728 |@@@@@@@@@@@                              54
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |                                         0
      2147483648 |                                         0
      4294967296 |@                                        3
      8589934592 |@                                        4
     17179869184 |                                         0

Note that because node has a single thread do all processing, we cannot assume that the requests themselves are inducing the work — only that CPU work was done between request and response. Still, this data would probably be interesting to the nodebodies team…

I also added probes around connection establishment; so here’s a simple way of looking at new connections by zone:

# dtrace -n 'node*:::net-server-connection{@[zonename] = count()}'
dtrace: description 'node*:::net-server-connection' matched 44 probes
^C

  explorer-sox                                                      1
  nodebodies                                                        1
  anansi                                                           69
  nodelay                                                         102
  awesometown                                                     146

Or if we wanted to see which IP addresses were connecting to, say, our good friends at awesometown (with actual addresses in the output elided):

# dtrace -n 'node*:::net-server-connection \
    /zonename == "awesometown"/{@[args[0]->remoteAddress] = count()}'
dtrace: description 'node*:::net-server-connection' matched 44 probes
  XXX.XXX.XXX.XXX                                                   1
  XX.XXX.XX.XXX                                                     1
  XX.XXX.XXX.XXX                                                    1
  XX.XXX.XXX.XX                                                     1
  XXX.XXX.XX.XXX                                                    1
  XXX.XXX.XX.XX                                                     2
  XXX.XXX.XXX.XX                                                    8

Ryan saw the DTrace support I had added, and had a great idea: what if we took the IPs of incoming connections and geolocated them, throwing them on a world map and coloring them by team name? This was an idea that was just too exciting not to take a swing at, so we got to work. For the backend, the machinery was begging to be written in node itself, so I did a libdtrace addon for node and started building a scalable backend for processing the DTrace data from the different Node Knockout machines. Meanwhile, Joni came up with some mockups that had everyone drooling, and Mark contacted Brian from Nitobi about working on the front-end. Brian and crew were as excited about it as we were, and they put front-end engineer extraordinaire Yohei on the case — who worked with Rob on the Joyent side to pull it all together. Among Rob’s other feats, he managed to implement in JavaScript the logic for plotting longitude and latitude in the beautiful Robinson projection — which is a brutally complicated transformation. It was an incredible team, and we were pulling it off in such a short period of time and with such a firm deadline that we often felt like contestants ourselves!

The result — which it must be said works best in Safari and Chrome — is at http://leaderboard.no.de. In keeping with both the spirit of node and DTrace, the leaderboard is updated in real-time; from the time you connect to one of the Joyent-hosted (no.de) contestants, you should see yourself show up in the map in no more than 700 milliseconds (plus your network latency). For crowded areas like the Bay Area, it can be hard to see yourself — but try moving to Cameroon for best results. It’s fun to watch as certain contestants go viral (try both hovering over a particular data point and clicking on the team name in the leaderboard) — and you can know which continent you’re cursing at in http://saber-tooth-moose-lion.no.de (now known to the world as Swarmation).

Enjoy both the leaderboard and the terrific Node Knockout entries (be sure to vote for your favorites!) — and know that we’ve only scratched the surface of what DTrace and node.js can do together!

Posted on August 30, 2010 at 2:55 am by bmc

The liberation of OpenSolaris

As many have seen, Oracle has elected to stop contributing to OpenSolaris. This decision is, to put it bluntly, stupid. Indeed, I would (and did) liken it to L. Paul Bremer‘s decision to disband the Iraqi military after the fall of Saddam Hussein: beyond merely a foolish decision borne out of a distorted worldview, it has created combatants unnecessarily. As with Bremer’s infamous decision, the bitter irony is that the new combatants were formerly the strongest potential allies — and in Oracle’s case, it is the community itself.

As it apparently needs to be said, one cannot close an open source project — one can only fork it. So contrary to some reports, Oracle has not decided to close OpenSolaris, they have actually decided to fork it. That is, they have (apparently) decided that it is more in their interest to compete with the community than to cooperate with it — that they can in fact out-innovate the community. This confidence is surprising (and ironic) given that it comes exactly at the moment that the historic monopoly on Solaris talent has been indisputably and irrevocably broken — as most recently demonstrated by the departure of my former colleague, Adam Leventhal.

Adam’s case is instructive: Adam is a brilliantly creative engineer — one with whom it was my pleasure to work closely over nearly a decade. Time and time again, I saw Adam not only come up with innovative solutions to tough problems, but run those innovations through the punishing gauntlet that separates idea from product. One does not replace an engineer like Adam; one can only hope to grow another. And given his nine years of experience at the company and in the guts of the system, one cannot expect to grow a replacement quickly — if at all. Oracle’s loss, however, is the community’s gain; I hope I’m not tipping his hand too much to say that Adam will continue to be deeply engaged in the system, leading a new generation of engineers — but this time within a larger community that spans multiple companies and interests.

And in this way, odd as it may be, Oracle’s decision to fork is actually a relief to those of us whose businesses depend on OpenSolaris: instead of waiting for Oracle to engage the community, we can be secure in the knowledge that no engagement is forthcoming — and we can invest and plan accordingly. So instead of waiting for Oracle to fix a nagging driver bug or address a critical request for enhancement (a wait that has more often than not ended in disappointment anyway), we can tap our collective expertise as a community. And where that expertise doesn’t exist or is otherwise unavailable, those of us who are invested in the system can explicitly invest in building it — and then use it to give back to the community and contribute.

Speaking for Joyent, all of this has been tangibly liberating: just the knowledge that we are going to be cranking our own builds has allowed us to start thinking along new dimensions of innovation, giving us a renewed sense of control over our stack and our fate. I have already seen this shift in our engineers, who have begun to conceive of ideas that might not have been thought practical in a world in which Oracle’s engagement was so uncertain. Yes, hard problems lie ahead — but ideas are flowing, and the future feels alive with possibility; in short, innovation is afoot!

Posted on August 19, 2010 at 9:49 pm by bmc

The node.js demographic

I went to the node.js meetup last night in Palo Alto, and it was an interesting affair on several levels. First (and least surprisingly), it was packed, with the Sencha folks joking that they would need to move to a bigger space just to be able to host the event. Second, the technical content itself was intriguing, with fellow Joyeur (and node BDFL) Ryan on dealing with flow control in node, Jed on (fab), future fellow Joyeur Issac on npm, and Tim demo’ing some Connect-based apps, including a simple web-based shared world app in which the room could (and did) participate. Not surprisingly, the performance of this last demo was snappy under load — so much so that it merits repeating an observation that many are currently making: it is increasingly clear that an early space — if not the first — in which we are going to see broad deployment of node-based apps is online social gaming, a space in which node represents a decisive competitive advantage by offering the potential for much more interactive (and more social) gameplay, and one in which there is substantial code churn to begin with. (And of course, speaking from Joyent’s perspective, this is a fortunate confluence: online gaming is also a space that sorely needs the elasticity that the cloud alone can provide.)

So the attendance and content were certainly notable, but most interesting of all to me was the demographic: given that node has become something of the latest hotness (and especially given that it being in JavaScript gives it a pretty wide net), one might expect node’s enthusiasts to be amateurs or novices. That this was emphatically not the case was clear to me shortly after arriving, when I had the unexpected pleasure of reuniting with fellow CS169 head TA Peter Griess. Not to be overly chummy or clubby, but walking into a meetup and seeing one of the tribe tells you something immediate about not just the room, but the technology itself: that it is not mere syntactic sugar or iconoclasm for its own sake, but rather a true revolution in the way certain classes of systems are designed and built. And indeed, over the course of the evening, it became clear that within the room there was an impressive amount of actual experience deploying real systems, with seasoned technologists like Matt Ranney who aren’t merely writing new apps in node, they are rewriting old apps in node. This is a key point, and it goes to the fact that node is not just an easier way of doing things (though that too, certainly) but rather that it offers such a vastly improved runtime that it merits reevaluation of systems that one has already built and deployed.

To me, the systems experience in the room offered an implicit rebuttal to some of the inane criticism of node — criticism that essentially amounts to discrediting node merely because of its newness or its popularity. (And even more enlightened criticism ultimately disappoints with what essentially amounts to an attack on the basis of style, not substance.) To be sure, node is still a young technology, and there is much engineering work still to be done. (For a concrete example of this, see Paul‘s description of the SSL problem.) But with so much deep systems experience in the community — and with the healthy, collaborative vibe that was on display last night — it’s hard to be anything but optimistic!

Posted on August 11, 2010 at 10:16 am by bmc