Debugging node.js memory leaks

Part of the value of dynamic and interpreted environments is that they handle the complexities of dynamic memory allocation. In particular, one needn’t explicitly free memory that is no longer in use: objects that are no longer referenceable are found automatically and destroyed via garbage collection. While garbage collection simplifies the program’s relationship with memory, it does not mean the end of all memory-based pathologies: if an application retains a reference to an object that is ultimately rooted in a global scope, that object won’t be considered garbage and the memory associated with it will not be returned to the system. If enough such objects build up, allocation will ultimately fail (memory is, after all, finite) and the program will (usually) fail along with it. While this is not strictly — from the native code perspective, anyway — a memory leak (the application has not leaked memory so much as neglected to unreference a semantically unused object), the effect is nonetheless the same and the same nomenclature is used.
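
To make that failure mode concrete, here is a minimal sketch (the cache and its shape are invented for illustration, not taken from any particular application) of a node.js server that accretes objects reachable from a global root:

var http = require('http');

// A module-level object is reachable from a global root for the life of
// the process; anything hanging off of it can never be garbage collected.
var cache = {};

http.createServer(function (req, res) {
	// Every request adds an entry, but nothing ever removes one -- so
	// memory grows with the request count until allocation ultimately fails.
	cache[req.url + ':' + Date.now()] = { url: req.url, headers: req.headers };
	res.end('hello\n');
}).listen(8080);

Nothing here is incorrect from V8’s perspective, since every entry remains reachable; that is precisely why the collector can’t help.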

While all garbage-collected environments create the potential for such leaks, they can be particularly easy to create in JavaScript: closures create implicit references to variables within their scopes — references that might not be immediately obvious to the programmer. And node.js adds a new dimension of peril with its strictly asynchronous interface with the system: if backpressure from slow upstream services (I/O, networking, database services, etc.) isn’t carefully passed to downstream consumers, memory will begin to fill with the intermediate state. (That is, what one gains in concurrency of operations one may pay for in memory.) And of course, node.js is on the server — where the long-running nature of services means that the effect of a memory leak is much more likely to be felt and to affect service. Take all of these together, and you can easily see why virtually anyone who has stood up node.js in production will identify memory leaks as their most significant open issue.
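
The closure trap in particular can be subtle. The sketch below (the names and sizes are invented, and the exact retention behavior depends on V8’s context allocation) shows the classic variant: two closures created in the same scope share that scope’s context, so a long-lived listener can retain a large buffer it never touches, simply because a sibling closure referenced it.

var http = require('http');
var EventEmitter = require('events').EventEmitter;

var emitter = new EventEmitter();		// long-lived and globally rooted

http.createServer(function (req, res) {
	var body = new Buffer(1024 * 1024);	// large per-request state

	var log = function () {			// references "body" ...
		console.log('handling %d bytes', body.length);
	};

	// ... which forces "body" into the handler's context.  This listener
	// never touches "body", but it closes over that same context -- and
	// because it is never removed, the context (and the megabyte hanging
	// off of it) is retained for as long as "emitter" is.
	emitter.on('reconfigure', function () {
		res.writeHead(503);
		res.end();
	});

	log();
	res.end('ok\n');
}).listen(8080);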

The state of the art for node.js memory leak detection — as concisely described by Felix in his node memory leak tutorial — is to use the v8-profiler and Danny Coates’ node-inspector. While this method is a lot better than nothing, it’s very much oriented to the developer in development. This is a debilitating shortcoming because memory leaks are often observed only after days (hours, if you’re unlucky) of running in production. At Joyent, we have long wanted to tackle the problem of providing the necessary tooling to identify node.js memory leaks in production. When we are so lucky as to have these in native code (that is, node.js add-ons), we can use libumem and ::findleaks — a technology that we have used to find thousands of production memory leaks over the years. But this is (unfortunately) not the common case: the common case is a leak in JavaScript — be it node.js or application code — for which our existing production techniques are useless.

So for node.js leaks in production, one has been left with essentially a Gedanken experiment: consider your load and peruse source code looking for a leak. Given how diabolical these leaks can be, it’s amazing to me that anyone ever finds these leaks. (And it’s less surprising that they take an excruciating amount of effort over a long period of time.) That’s been the (sad) state of things for quite some time, but recently, this issue boiled over for us: Voxer, a Joyent customer and long in the vanguard of node.js development, was running into a nasty memory leak: a leak sufficiently acute that the service in question was having to be restarted on a regular basis, but not so much that it was reproducible in development. With the urgency high on their side, we again kicked around this seemingly hopeless problem, focussing on the concrete data that we had: core dumps exhibiting the high memory usage, obtained at our request via gcore(1) before service restart. Could we do anything with those?

As an aside, a few months ago, Joyent’s Dave Pacheco added some unbelievable postmortem debugging support for node. (If postmortem debugging is new to you, you might be interested in checking out my presentation on debugging production systems from QCon San Francisco. I had the privilege of demonstrating Dave’s work in that presentation — and if you listen very closely at 43:26, you can hear the audience gasp when I demonstrate the amazing ::jsprint.) But Dave hadn’t built the infrastructure for walking the data structures of the V8 heap from an MDB dmod — and it was clear that doing so would be brittle and error-prone.

As Dave, Brendan and I were kicking around increasingly desperate ideas (including running strings(1) on the dump — an idea not totally without merit — and some wild visualization ideas), a much simpler idea collectively occurred to us: given that we understood via Dave’s MDB V8 support how a given object is laid out, and given that an object needed quite a bit of self-consistency with referenced but otherwise orthogonal structures, what about just iterating over all of the anonymous memory in the core and looking for objects? That is, iterate over every single address, and see if that address could be interpreted as an object. On the one hand, it was terribly brute force — but given the level of consistency required of an object in V8, it seemed that it wouldn’t pick up too many false positives. The more we discussed it, the more plausible it became, but with Dave fully occupied (on another saucy project we have cooking at Joyent — more on that later), I roped up and headed into his MDB support for V8…

The result — ::findjsobjects — takes only a few minutes to run on large dumps, and sheds some important new light on the problem. The output of ::findjsobjects consists of representative objects, the number of instances of each object and the number of properties on the object — followed by the constructor and first few properties of the objects. For example, here is the dump on a gcore of a (non-pathological) Joyent node-based facility:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 20
findjsobjects:                   heap objects => 6079488
findjsobjects:             JavaScript objects => 4097
findjsobjects:              processed objects => 1734
findjsobjects:                 unique objects => 161
OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
fc4671fd        1      1 Object: flags
fe68f981        1      1 Object: showVersion
fe8f64d9        1      1 Object: EventEmitter
fe690429        1      1 Object: Credentials
fc465fa1        1      1 Object: lib
fc46300d        1      1 Object: free
fc4efbb9        1      1 Object: close
fc46c2f9        1      1 Object: push
fc46bb21        1      1 Object: uncaughtException
fe8ea871        1      1 Object: _idleTimeout
fe8f3ed1        1      1 Object: _makeLong
fc4e7c95        1      1 Object: types
fc46bae9        1      1 Object: allowHalfOpen
...
fc45e249       12      4 Object: type, min, max, default
fc4f2889       12      4 Object: index, fields, methods, name
fd2b8ded       13      4 Object: enumerable, writable, configurable, value
fc0f68a5       14      1 SlowBuffer: length
fe7bac79       18      3 Object: path, fn, keys
fc0e9d21       20      5 Object: _onTimeout, _idleTimeout, _idlePrev, ...
fc45facd       21      4 NativeModule: loaded, id, exports, filename
fc45f571       23      8 Module: paths, loaded, id, parent, exports, ...
fc4607f9       35      1 Object: constructor
fc0f86c9       56      3 Buffer: length, offset, parent
fc0fc92d       57      2 Arguments: length, callee
fe696f59       91      3 Object: index, fields, name
fc4f3785       91      4 Object: fields, name, methodIndex, classIndex
fe697289      246      2 Object: domain, name
fc0f87d9      308      1 Buffer:

Now, any one of those objects can be printed with ::jsprint. For example, let’s take fc45e249 from the above output:

> fc45e249::jsprint
{
    type: number,
    min: 10,
    max: 1000,
    default: 300,
}

Note that that’s only a representative object — there are (in the above case) 12 objects that have that same property signature. ::findjsobjects can get you all of them when you specify the address of the representative object:

> fc45e249::findjsobjects
fc45e249
fc46fd31
fc467ae5
fc45ecb5
fc45ec99
fc45ec11
fc45ebb5
fc45eb41
fc45eb25
fc45e3d1
fc45e3b5
fc45e399

And because MDB is the debugger Unix was meant to have, that output can be piped to ::jsprint:

> fc45e249::findjsobjects | ::jsprint
{
    type: number,
    min: 10,
    max: 1000,
    default: 300,
}
{
    type: number,
    min: 0,
    max: 5000,
    default: 5000,
}
{
    type: number,
    min: 0,
    max: 5000,
    default: undefined,
}
...

Okay, fine — but where are these objects referenced? ::findjsobjects has an option for that:

> fc45e249::findjsobjects -r
fc45e249 referred to by fc45e1e9.height

This tells us (or tries to) who is referencing that first (representative) object. Printing that out (with the “-a” option to show the addresses of the objects):

> fc45e1e9::jsprint -a
fc45e1e9: {
    ymax: fe78e061: undefined,
    hues: fe78e061: undefined,
    height: fc45e249: {
        type: fe78e361: number,
        min: 14: 10,
        max: 7d0: 1000,
        default: 258: 300,
    },
    selected: fc45e3fd: {
        type: fe7a2465: array,
        default: fc45e439: [...],
    },
    ...

So if we want to learn where all of these objects are referenced, we can again use a pipeline within MDB:

> fc45e249::findjsobjects | ::findjsobjects -r
fc45e249 referred to by fc45e1e9.height
fc46fd31 referred to by fc46b159.timeout
fc467ae5 is not referred to by a known object.
fc45ecb5 referred to by fc45eadd.ymax
fc45ec99 is not referred to by a known object.
fc45ec11 referred to by fc45eadd.nbuckets
fc45ebb5 referred to by fc45eadd.height
fc45eb41 referred to by fc45eadd.ymin
fc45eb25 referred to by fc45eadd.width
fc45e3d1 referred to by fc45e1e9.nbuckets
fc45e3b5 referred to by fc45e1e9.ymin
fc45e399 referred to by fc45e1e9.width

Of course, the proof of a debugger is in the debugging; would ::findjsobjects actually be of use on the Voxer dumps that served as its motivation? Here is the (elided) output from running it on a big Voxer dump:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 292
findjsobjects:                   heap objects => 8624128
findjsobjects:             JavaScript objects => 112501
findjsobjects:              processed objects => 100424
findjsobjects:                 unique objects => 241
OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
fe806139        1      1 Object: Queue
fc424131        1      1 Object: Credentials
fc424091        1      1 Object: version
fc4e3281        1      1 Object: message
fc404f6d        1      1 Object: uncaughtException
...
fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd     1037      3 Object: aborted, data, end
 8045475     1060      1 Object:
fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...
fafc4245     1673      1 Object: slab
fafc44d5     1702      5 Object: search, query, path, href, pathname
fafc440d     1784     14 Client: buffered_writes, name, timing, ...
fafc41c5     1796      3 Object: error, timeout, close
fafc4469     1811      3 Object: address, family, port
fafc42a1     2197      2 Object: connection, host
fbe10625     2389      2 Arguments: callee, length
fafc4251     2759     15 IncomingMessage: statusCode, httpVersionMajor, ...
fafc42ad     3652      0 Object:
fafc6785    11746      1 Object: oncomplete
fb7abc29    15155      1 Object: buffer
fb7a6081    15155      3 Object: , oncomplete, cb
fb121439    15378      3 Buffer: offset, parent, length

This immediately confirmed a hunch that Matt had had that this was a buffer leak. And for Isaac — who had been working this issue from the Gedanken side and was already zeroing in on certain subsystems — this data was surprising inasmuch as it was so confirming: he was already on the right path. In short order, he nailed it, and the fix is in node 0.6.17.

The fix was low risk, so Voxer redeployed with it immediately — and for the first time in quite some time, memory utilization was flat. This was a huge win — and was the reason for Matt’s tantalizing tweet. The advantages of this approach are that it requires absolutely no modification to one’s node.js programs — no special flags and no different options. And it operates purely postmortem. Thanks to help from gcore(1), core dumps can be taken over time for a single process, and those dumps can then be analyzed off-line.

Even with ::findjsobjects, debugging node.js memory leaks is still tough. And there are certainly lots of improvements to be made here — there are currently some objects that we do not know how to correctly interpret, and we know that we can improve our algorithm for finding object references — but this shines a bright new light into what had previously been a black hole!

If you want to play around with this, you’ll need SmartOS or your favorite illumos distro (which, it must be said, you can get simply by provisioning on the Joyent cloud). You’ll need an updated v8.so — which you can either build yourself from illumos-joyent or you can download a binary. From there, follow Dave’s instructions. Don’t hesitate to ping me if you get stuck or have requests for enhancement — and here’s to flat memory usage on your production node.js service!

Posted on May 5, 2012 at 1:07 pm by bmc · Permalink · Comments Closed
In: Uncategorized

Standing up SmartDataCenter

Over the past year, we at Joyent have been weaving some threads that may have at times seemed disconnected: last fall, we developed a prototype of DTrace-based cloud instrumentation as part of hosting the first Node Knockout competition; in the spring, we launched our no.de service with full-fledged real-time cloud analytics; and several weeks ago, we announced the public availability of SmartOS, the first operating system to combine ZFS, DTrace, OS-based virtualization and hardware virtualization in the same operating system kernel. These activities may have seemed disjoint, but there was in fact a unifying theme: our expansion at Joyent from a public cloud operator to a software company — from a company that operates cloud software to one that also makes software for cloud operators.

This transition — from service to product — is an arduous one, for the simple reason that a service can make up for any shortcomings with operational heroics that are denied to a product. Some of these gaps may appear superficial (e.g., documentation), but many are deep and architectural; that one is unable to throw meat at a system changes many fundamental engineering decisions. (It should be said that some products don’t internalize these differences and manage to survive despite their labor-intensive shortcomings, but this is thanks to the caulking called “professional services” — and is usually greased by an enterprise software sales rep with an easy rapport and a hell of a golf handicap.) As the constraints of being a product affect engineering decisions, so do the constraints of operating as a service; if a product has no understanding or empathy for how it will be deployed into production, it will be at best difficult to operate, and at worst unacceptable for operation.

So at Joyent, we have long believed that we need both a product that can be used to provide a kick-ass service and a kick-ass service to prove to ourselves that we hit the mark. While our work over the past year has been to develop that product — SmartDataCenter — we have all known that we wouldn’t be done until that product was itself operating not just emerging services like no.de, but also the larger Joyent public cloud, and over the summer we began to plan a broad-based stand-up of SmartDataCenter. With our SmartOS-based KVM solid and on schedule, and with the substantial enabling work up the stack falling into place, we set for ourselves an incredibly ambitious target: to not only ship this revolutionary new version of the software, but to also — at the same time — stand up that product in production in the Joyent public cloud. Any engineer’s natural conservatism would resist such a plan — conflating shipping a product with a production deployment feels chancy at best and foolish at worst — but we saw an opportunity to summon a singular, company-wide focus that would force us to deliver a much better product, and would give us and our software customers the satisfaction of knowing that SmartDataCenter as shipped runs Joyent. More viscerally, we were simply too hot on getting SmartOS-based hardware virtualization in production — and oh, those DTrace-based KVM analytics! — to contemplate any path less aggressive.

Of course, it wasn’t easy; there were many intense late-night Jabber sessions as we chased bugs and implemented solutions. But as time went on — and as we cut release candidates and began to pound on the product in staging environments — it became clear that the product was converging, and a giddy optimism began to spread. Especially for our operations team, many of whom are long-time Joyent vets, this was a special kind of homecoming, as they saw their vision and wisdom embodied in the design decisions of the product. This sentiment was epitomized by one of the senior members of the Joyent public cloud operations team — an engineer who commands tremendous respect in the company, and who is known for his candor — as he wrote:

It seems the past ideas of a better tomorrow have certainly come to light. I think I speak for all of the JPC when I say “Bravo” to the dev team for a solid version of the software. We have many rivers (hopefully small creeks) to cross but I am ultimately confident in what I have seen and deployed so far. It’s incredible to see a unified convergence on what has been the Joyent dream since I have been here.

For us in development, these words carried tremendous weight; we had endeavored to build the product that our team would want to deploy and operate, and it was singularly inspiring to see them so enthusiastic about it! (Though I will confess that this kind of overwhelmingly positive sentiment from such a battle-hardened operations engineer is so rare as to almost make me nervous — and yes, I am knocking on wood as I write this.)

With that, I’m pleased to announce that the (SmartDataCenter-based) Joyent public cloud is open for business! You can provision (via web portal or API) either virtual OS instances (SmartMachines) or full hardware-virtual VMs; they’ll provision like a rocket (ZFS clones FTW!), and you’ll get a high-performing, highly reliable environment with world-beating real-time analytics. And should you wish to stand up your own cloud with the same advantages — either a private cloud within your enterprise or a public cloud to serve your customers — know that you can own and operate your very own SmartDataCenter!

Posted on September 15, 2011 at 2:03 am by bmc · Permalink · Comments Closed
In: Uncategorized

KVM on illumos

A little over a year ago, I came to Joyent because of a shared belief that systems software innovation matters — that there is no level of the software stack that should be considered off-limits to innovation. Specifically, we ardently believe in innovating in the operating system itself — that extending and leveraging technologies like OS virtualization, DTrace and ZFS can allow one to build unique, differentiated value further up the software stack. My arrival at Joyent also coincided with the birth of illumos, an effort that Joyent was thrilled to support, as we saw that rephrasing our SmartOS as an illumos derivative would allow us to engage in a truly open community with whom we could collaborate and collectively innovate.

In terms of the OS technology itself, as powerful as OS virtualization, DTrace and ZFS are, there was clearly a missing piece to the puzzle: hardware virtualization. In particular, with advances in microprocessor technology over the last five years (and specifically with the introduction of technologies like Intel’s VT-x), one could (in principle, anyway) much more easily build highly performing hardware virtualization. The technology was out there: Fabrice Bellard‘s cross-platform QEMU had been extended to utilize the new microprocessor support by Avi Kivity and his team with their Linux kernel virtual machine (KVM).

In the fall of last year, the imperative became clear: we needed to port KVM to SmartOS. This notion is almost an oxymoron, as KVM is not “portable” in any traditional sense: it interfaces between QEMU, Linux and the hardware with (rightful) disregard for other systems. And as KVM makes heavy use of Linux-specific facilities that don’t necessarily have clean analogs in other systems, the only way to fully scope the work was to start it in earnest. However, we knew that just getting it to compile in any meaningful form would take significant effort, and with our team still ramping up and with lots of work to do elsewhere, the reality was that this would have to start as a small project. Fortunately, Joyent kernel internals veteran Max Bruning was up to the task, and late last year, he copied the KVM bits from the stable Linux 2.6.34 source, took out a sharpened machete, and started whacking back jungle…

Through the winter holidays and early spring (and when he wasn’t dispatched on other urgent tasks), Max made steady progress — but it was clear that the problem was even more grueling than we had anticipated: because KVM is so tightly integrated into the Linux kernel, it was difficult to determine dependencies — and hard to know when to pull in the Linux implementation of a particular facility or function versus writing our own or making a call to the illumos equivalent (if any). By the middle of March, we were able to execute three instructions in a virtual machine before taking an unresolved EPT violation. To the uninitiated, this might sound pathetic — but it meant that a tremendous amount of software and hardware were working properly. Shortly thereafter, with the EPT problem fixed, the virtual machine would happily run indefinitely — but clearly not making any progress booting. But with this milestone, Max had gotten to the point where the work could be more realistically parallelized, and we had gotten to the point as a team and a product where we could dedicate more engineers to this work. In short, it was time to add to the KVM effort — and Robert Mustacchi and I roped up and joined Max in the depths of KVM.

With three of us on the project and with the project now more parallelizable, the pace accelerated, and in particular we could focus on tooling: we added an mdb dmod to help us debug, 16-bit disassembler support that was essential for us to crack the case on the looping in real-mode, and DTrace support for reading the VMCS. (And we even learned some surprising things about GNU C along the way!) On April 18th, we finally broke through into daylight: we were able to boot into KMDB running on a SmartOS guest! The pace quickened again, as we now had the ability to interactively ascertain state as seen by the guest and much more quickly debug issues. Over the next month (and after many intensive debugging sessions), we were able to boot Windows, Linux, FreeBSD, Haiku, QNX, Plan 9 and every other OS we could get our hands on; by May 13th, we were ready to send the company-wide “hello world” mail that has been the hallmark of bringup teams since time immemorial.

The wind was now firmly at our backs, but there was still much work to be done: we needed to get configurations working with large numbers of VCPUs, with many VMs, over long periods of time and significant stress and so on. Over the course of the next few months, the bugs became rarer and rarer (if nastier!), and we were able to validate that performance was on par with Linux KVM — to the point that today we feel confident releasing illumos KVM to the greater community!

In terms of starting points, if you just want to take it for a spin, grab a SmartOS live image ISO. If you’re interested in the source, check out the illumos-kvm github repo for the KVM driver itself, the illumos-kvm-cmd github repo for our QEMU 0.14.1 tree, and illumos itself for some of our KVM-specific tooling.

One final point of validation: when we went to integrate these changes into illumos, we wanted to be sure that they built cleanly in a non-SmartOS (that is, stock illumos) environment. However, we couldn’t dedicate a machine for the activity, so we (of course) installed OpenIndiana running as a KVM guest on SmartOS and cranked the build to verify our KVM deltas! As engineers, we all endeavor to build useful things, and with KVM on illumos, we can assert with absolute confidence that we’ve hit that mark!

Posted on August 15, 2011 at 7:00 am by bmc · Permalink · 44 Comments
In: Uncategorized

In defense of intrapreneurialism

Trevor pointed me to a ChubbyBrain article that derided “intrapreneurs” as having nothing in common with their entrepreneurial brethren. While I don’t necessarily love the term, what we did at Fishworks is probably most accurately described as “intrapreneurial” — and based on both our experiences there and the caustic tone of the Chubby polemic, I feel a responsibility to respond. (And so, I hasten to add, did Trevor.) In terms of my history: I spent ten years in the bowels of Bigco, four years engaged in an intrapreneurial effort within Bigco, and the past year at an actual, venture-funded startup. Based on my experience, I take strong exception to the Chubby piece; to take their criticisms of intrapreneurs one-by-one:

Of course, there are differences between a startup within a large organization and an actual startup. When I was pitching Fishworks to the executives at Sun, I thought of our activity as both giving up some of the upside and removing some of the downside of a startup. After living through Fishworks and now being at a startup, I would be more precise about the downside that the intrapreneur doesn’t have to endure: it’s not insecurity (or at least, not as much as I believed) — it’s the grueling task of raising money. At Fishworks, we had exactly one investor, they knew us and we knew them — and that was a blessing that gave us the luxury of focus.

On balance, I don’t regret for a moment doing it the way we did it — but I also don’t regret having left for a startup. I don’t doubt that my career will likely oscillate between smaller organizations and larger ones — I see strengths and weaknesses to both, much of which comes down to specific opportunities and conditions. In short: there may be differences between intrapreneurialism and entrepreneurialism — but I see little difference between the intrapreneur and the entrepreneur!

Posted on July 12, 2011 at 2:26 pm by bmc · Permalink · 4 Comments
In: Uncategorized

When magic collides

We had an interesting issue come up the other day, one that ended up being a pretty nasty bug in DTrace. It merits a detailed explanation, but first, a cautionary note: we’re headed into rough country; in the immortal words of Sixth Edition, you are not expected to understand this.

The problem we noticed was this: when running node and instrumenting certain hot functions (memcpy:entry was noted in particular), after some period of activity, the node process would die on a strange seg fault. And looking at the core dumps, we were (usually) lost in some code path in memcpy() with registers or a stack that was self-inconsistent. In debugging this, it was noted (with DTrace, natch) that the node process was under somewhat intense signal processing activity (we’ll get to why in a bit), and that (in particular) the instrumented function activity and the signal activity seemed to be interacting “shortly” before the fatal failure.

Before wading any deeper, it’s worth taking an aside to explain how we deal with signals in DTrace — a discussion that in turn requires some background on the way loss-less user-level tracing works in DTrace. One of the important constraints that we placed upon ourselves (albeit one that we didn’t talk about too much) is that DTrace is loss-less: if multiple threads on multiple CPUs hit the same point of instrumentation, no thread will see an uninstrumented code path. This may seem obvious, but it’s not the way traditional debuggers work: normally, a debugger stops a process (and all of its threads) when it hits a breakpoint — which it induces by modifying program text with a synchronous trap instruction (e.g., an “int 3” in x86). Then, the program text is modified to contain the original instruction, the thread that hit the breakpoint (and only that thread) is single-stepped (often using hardware support) over the original instruction, and the original instruction is again replaced with a breakpoint instruction. We felt this to be unacceptable for DTrace, and imposed a constraint that tracing always be loss-less.

For user-level instrumentation (that is, the pid provider), magic is required: we need to execute the instrumented instruction out-of-line. For an overview of the mechanism for this, see Adam‘s block comment in fasttrap_isa.c. The summary is that we have a per-thread region of memory at user-level to which we copy both the instrumented instruction and a control transfer instruction that gets us back past the point of instrumentation. Then, when we hit the probe (and after we’ve done in-kernel probe processing), we point the instruction pointer to be in the per-thread region and return to user-level; back in user-land, we execute the instruction and transfer control back to the rest of the program.

Okay, so easy, right? Well, not so fast — enter signals: if we take a signal while we’re executing this little user-level trampoline, we could corrupt program state. At first, it may not be obvious how this is the case; if we execute a signal before we execute the instruction in the trampoline we will simply execute the signal handler (whatever it may be) and return to the trampoline, finish that off, and then get back to the business of executing the program. Where’s the problem? The problem is that if the signal handler executes something that itself is instrumented with DTrace: if this is the case, we’ll hit another probe, and that probe will copy its instruction to the same trampoline. This is bad: it means that upon return from the signal, we will return to the trampoline and execute the wrong instruction. Only one of them, of course. Which, if you’re lucky, will induce an immediate crash. But more likely, the program will execute tens, hundreds or even thousands of instructions before dying on some “impossible” problem. Sound nasty to debug? It is — and next time you see Adam, show some respect.

So how do we solve this problem? For this, Adam made an important observation: the kernel is in charge of signal delivery, and it knows about all of this insanity, so can’t it prevent this from happening? Adam solved this by extending the trampoline after the control transfer instruction to consist of the original instruction (again) followed by an explicit trap into the kernel. If the kernel detects that we’re about to deliver a signal while in this window (that is, with the program counter on the original instruction in the trampoline), it moves the program counter to the second half of the trampoline and returns to user-level with the signal otherwise unhandled. This leaves the signal pending, but only allows the program to execute the one instruction before bringing it back downtown on the explicit trap in the second half of the trampoline. There, we will see that we are in this situation, update the trapped program counter to be the instruction after the instrumented instruction (telling the program a small but necessary lie about where signal delivery actually occurred), and bounce back to user-level; the kernel will see the bits set on the thread structure indicating that a signal is pending, and the Right Thing will happen.

And it all works! Well, almost — and that brings us to the bug. After some intensive DTrace/mdb sessions and a long Jabber conversation with Adam, it became clear that while signals were involved, the basic mechanism was actually functioning as designed. More experimenting (and more DTrace, of course) revealed that the problem was even nastier than the lines along which we were thinking: yes, we were taking a signal in the trampoline — but that logic was working as designed, with the program counter being moved correctly into the second half of the trampoline. The problem was that we were taking a second asynchronous signal, this time in the second half of the trampoline, and this brought us to this logic in dtrace_safe_defer_signal:

        /*
         * If we've executed the original instruction, but haven't performed
         * the jmp back to t->t_dtrace_npc or the clean up of any registers
         * used to emulate %rip-relative instructions in 64-bit mode, do that
         * here and take the signal right away. We detect this condition by
         * seeing if the program counter is the range [scrpc + isz, astpc).
         */
        if (t->t_dtrace_astpc - rp->r_pc <
            t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
                ...
                rp->r_pc = t->t_dtrace_npc;
                t->t_dtrace_ft = 0;
                return (0);
        }

Now, the reason for this clause (the meat of which I’ve elided) is the subject of its own discussion, but it is not correct as written: it works (due to some implicit use of unsigned math) when the program counter is in the first half of the trampoline, but it fails if the program counter is on the instrumented instruction in the second half of the trampoline (denoted with t_dtrace_astpc). If we hit this case — first one signal on the first instruction of the trampoline followed by another signal on the first instruction on the second half of the trampoline — we will redirect control to be the instruction after the instrumentation (t_dtrace_npc) without having executed the instruction. This, if it needs to be said, is bad — and leads to exactly the kind of mysterious death that we’re seeing.

Fortunately, the fix is simple — here’s the patch. All of this leaves just one more mystery: why are we seeing all of these signals on node, anyway? They are all SIGPROF signals, and when I pinged Ryan about this yesterday, he was as surprised as I was that node seemed to be doing them when otherwise idle. A little digging on both of our parts revealed that the signals we were seeing are an artifact of some V8 magic: the use of SIGPROF is part of the feedback-driven optimization in the new Crankshaft compilation infrastructure for V8. So the signals are goodness — and that DTrace can now correctly handle them with respect to arbitrary instrumentation that much more so!

Posted on March 9, 2011 at 11:27 pm by admin · Permalink · 9 Comments
In: Uncategorized

Log/linear quantizations in DTrace

As I alluded to when I first joined Joyent, we are using DTrace as a foundation to tackle the cloud observability problem. And as you may have seen, we’re making some serious headway on this problem. Among the gritty details of scalably instrumenting latency is a dirty little problem around data aggregation: when we instrument the system to record latency, we are (of course) using DTrace to aggregate that data in situ, but with what granularity? For some operations, visibility into microsecond resolution is critical, while for other (longer) operations, millisecond resolution is sufficient – and some classes of operations may take hundreds or thousands (!) of milliseconds to complete. If one makes the wrong choice as the resolution of aggregation, one either ends up clogging the system with unnecessarily fine-grained data, or discarding valuable information in overly coarse-grained data. One might think that one could infer the interesting latency range from the nature of the operation, but this is unfortunately not the case: I/O operations (a typically-millisecond resolution operation) can take but microseconds, and CPU scheduling operations (a typically-microsecond resolution operation) can take hundreds of milliseconds. Indeed, the problem is that one often does not know what order of magnitude of resolution is interesting for a particular operation until one has a feel for the data for the specific operation — one needs to instrument the system to best know how to instrument the system!

We had similar problems at Fishworks, of course, and my solution there was wholly dissatisfying: when instrumenting the latency of operations, I put in place not one but many (five) lquantize()‘d aggregations that (linearly) covered disjoint latency ranges at different orders of magnitude. As we approached this problem again at Joyent, Dave and Robert both balked at the clumsiness of the multiple aggregation solution (with any olfactory reaction presumably heightened by the fact that they were going to be writing the pungent code to deal with this). As we thought about the problem in the abstract, this really seemed like something that DTrace itself should be providing: an aggregation that is neither logarithmic (i.e., quantize()) nor linear (i.e., lquantize()), but rather something in between…

After playing around with some ideas and exploring some dead-ends, we came to an aggregating action that logarithmically aggregates by order of magnitude, but then linearly aggregates within an order of magnitude. The resulting aggregating action — dubbed “log/linear quantization” and denoted with the new llquantize() aggregating action — is parameterized with a factor, a low magnitude, a high magnitude and the number of linear steps per magnitude. With the caveat that these bits are still hot from the oven, the patch for llquantize() (which should patch cleanly against illumos) is here.

To understand the problem we’re solving, it’s easiest to see llquantize() in action:

# cat > llquant.d
tick-1ms
{
	/*
	 * A log-linear quantization with factor of 10 ranging from magnitude
	 * 0 to magnitude 6 (inclusive) with twenty steps per magnitude
	 */
	@ = llquantize(i++, 10, 0, 6, 20);
}

tick-1ms
/i == 1500/
{
	exit(0);
}
^D
# dtrace -V
dtrace: Sun D 1.7
# dtrace -qs ./llquant.d

           value  ------------- Distribution ------------- count
             < 1 |                                         1
               1 |                                         1
               2 |                                         1
               3 |                                         1
               4 |                                         1
               5 |                                         1
               6 |                                         1
               7 |                                         1
               8 |                                         1
               9 |                                         1
              10 |                                         5
              15 |                                         5
              20 |                                         5
              25 |                                         5
              30 |                                         5
              35 |                                         5
              40 |                                         5
              45 |                                         5
              50 |                                         5
              55 |                                         5
              60 |                                         5
              65 |                                         5
              70 |                                         5
              75 |                                         5
              80 |                                         5
              85 |                                         5
              90 |                                         5
              95 |                                         5
             100 |@                                        50
             150 |@                                        50
             200 |@                                        50
             250 |@                                        50
             300 |@                                        50
             350 |@                                        50
             400 |@                                        50
             450 |@                                        50
             500 |@                                        50
             550 |@                                        50
             600 |@                                        50
             650 |@                                        50
             700 |@                                        50
             750 |@                                        50
             800 |@                                        50
             850 |@                                        50
             900 |@                                        50
             950 |@                                        50
            1000 |@@@@@@@@@@@@@                            500
            1500 |                                         0

As you can see in the example, at each order of magnitude (determined by the factor — 10, in this case), we have 20 steps (or only as many steps as the factor allows, at orders of magnitude where the factor is less than the number of steps): up to 10, the buckets each span 1 value; from 10 through 100, each spans 5 values; from 100 to 1000, each spans 50; from 1000 to 10000 each spans 500, and so on. The result is not necessarily easy on the eye (and indeed, it seems unlikely that llquantize() will find much use on the command line), but is easy on the system: it generates data that has the resolution of linearly quantized data with the efficiency of logarithmically quantized data.

Of course, you needn’t appreciate the root system here to savor the fruit: if you are a Joyent customer, you will soon be benefitting from llquantize() without ever having to wade into the details, as we bring real-time analytics to the cloud. Stay tuned!

Posted on February 8, 2011 at 9:48 pm by bmc · Permalink · 3 Comments
In: Uncategorized

The DIRT on JSConf.eu and Surge

I just got back from an exciting (if exhausting) conference double-header: JSConf.eu and Surge. These conferences made for an interesting pairing, together encompassing much of the enthusiasm and challenges in today’s systems. It started in Berlin with JSConf.eu, a conference that reflects both JavaScript’s energy and diversity. Ryan introduced Node.js to the world at JSConf.eu a year ago, and enthusiasm for Node at the conference was running high; I would estimate that at least half of the attendees were implementing server-side JavaScript in one fashion or another. Presentations from the server-side were epitomized by Felix and Philip, each of whom presented real things done with Node — expanding on the successes, limitations and futures of what they had developed. The surge of interest in the server-side is of course changing the complexion of not only JSconf but of JavaScript itself, a theme that Chris Williams hit with fervor in his closing plenary, in which he cautioned of the dangers for a technology and a community upon whom a wave of popularity is breaking. Having been near the epicenter of Java’s irrational exuberance in the mid-1990s (UltraJava, anyone?), I certainly share some of Chris’s concerns — and I think that Ryan’s talk on the challenges in Node reflects a humility that will become essential in the coming year. That said, I am also looking forward to seeing the JavaScript and Node communities expand to include more of the systems vets that have differentiated Node to date.

Speaking of systems vets, a highlight of JSConf.eu was hanging out with Erik Corry from the V8 team. JavaScript and Node owe a tremendous debt of gratitude to V8, without which we would have neither the high-performance VM itself, nor the broader attention to JavaScript performance so essential to the success of JavaScript on the server-side. Erik and I brainstormed ideas for deeper support of DTrace within V8, which is, of course, an excruciatingly hard problem: as I have observed in the past, magic does not layer well — and DTrace and a sophisticated VM like V8 are both conjuring deep magic that frequently operates at cross purposes. Still, it was an exciting series of conversations, and we at Joyent certainly look forward to taking a swing at this problem.

After JSConf.eu wrapped up, I spent a few days brainstorming and exploring Berlin (and its Trabants and Veyrons) with Mark, Pedro and Filip. I then headed to the inaugural Surge Conference in Baltimore, and given my experience there, I would put the call out to the systems tribe: Surge promises to be the premier conference for systems practitioners. We systems practitioners have suffered acutely at the hands of academic computer science’s foolish obsession with conferences as a publishing vehicle, having seen conference after conference of ours become casualties on their bloody path to tenure. Surge, by contrast, was everything a systems conference should be: exciting talks on real systems by those who designed, built and debugged them — and a terrific hallway track besides.

To wit: I was joined at Surge by Joyent VP of Operations Ryan Nelson; for us, Surge more than paid for itself during Rod Cope’s presentation on his ten lessons for Hadoop deployment. Rod’s lessons were of the ilk that only come the hard way; his was a presentation fired in a kiln of unspeakable pain. For example, he noted that his cluster had been plagued by a wide variety of seemingly random low-level problems: lost interrupts, networking errors, I/O errors, etc. Apparently acting on a tip in a Broadcom forum, Rod disabled C-states on all of his Dell R710s and 2970s — and the problems all lifted (and have been absent for four months). Given that we have recently lived through the same (and came to the same conclusion, albeit more recently and therefore much more tentatively), I tweeted what Rod had just reported, knowing that Ryan was in John Allspaw’s concurrent session and would likely see the tweet. Sure enough, I saw him retweet it minutes later — and he practically ran in after the end of Rod’s talk to see if I was kidding. (When it comes to firmware- and BIOS-level issues, I assure you: I never kid.) Not only was Rod’s confirmation of our experience tremendously valuable, it speaks to the kind of technologist at Surge: this is a conference with practitioners who are comfortable not only at lofty architectural heights, but also at the grittiest of systems’ depths.

Beyond the many thought-provoking sessions, I had great conversations with many people, including Paul, Rod, Justin, Chris, Geir, Stephen, Wez, Mark, Joe, Kate, Tom, Theo (natch) — and a reunion (if sodden) with surprise stowaway Ted Leung.

As for my formal role at Surge, I had the honor of giving one of the opening keynotes. In preparing for it, I tried to put it all together: what larger trends are responsible for the tremendous interest in Node? What are the implications of these trends for other elements of the systems we build? From Node Knockout, it is clear that a major use case for Node is web-based real-time applications. At the moment these are primarily games — but as practitioners like Matt Ranney can attest, games are by no means the only real-time domain in which Node is finding traction. In part based on my experience developing the backend for the Node Knockout leaderboard, the argument that I made in the keynote is that these web-based real-time applications have a kind of orthogonal data consistency: instead of being on the ACID/BASE axis, the “consistency” of the data is its temporality. (That is, if the data is late, it’s effectively inconsistent.) This is not necessarily new (market-facing trading systems have long had these properties), but I believe we’re about to see these systems brought to a much larger audience — and developed by a broader swath of practitioners. And of course — like AJAX before it — this trend was begging for an acronym. Struggling to come up with something, I went back to what I was trying to describe: these are real-time systems — but they are notable because they are data-intensive. Aha: “data-intensive real-time” — DIRT! I dig DIRT! I don’t know if the acronym will stick or not, but I did note that it was mentioned in each of Theo’s and Justin’s talks — and this was just minutes after it was coined. (Video of this keynote will be available soon; slides are here.)

Beyond the keynote, I also gave a talk on the perils of building enterprise systems from commodity components. This blog entry has gone on long enough, so I won’t go into depth on this topic here, saying only this: fear the octopus!

Thanks to the organizers of both JSconf and Surge for two great conferences. I arrive home thoroughly exhausted (JSconf parties longer, but Surge parties harder; pick your poison), but also exhilarated and excited about the future. Thanks again — and see you next year!

Posted on October 2, 2010 at 3:33 pm by bmc · Permalink · 3 Comments
In: Uncategorized

A physician’s son

My father is an emergency medical physician, a fact that has had a subtle but discernable influence on my career as a software engineer. Emergency medicine and software engineering are of course very different problems, and even though there are times when a major and cascading software systems failure can make a datacenter feel like an urban emergency room on a busy Saturday night, the reality is that if we software engineers screw up, people don’t generally die. (And while code and machine can both be metaphorically uncooperative, we don’t really have occasion to make a paramedic sandwich.) Not only are the consequences less dire, but the underlying systems themselves are at opposite ends of a kind of spectrum: one is deterministic yet brittle, the other sloppy yet robust. Despite these differences, I believe that medicine has much to teach us in software engineering; two specific (if small) artifacts of medicine that I have incorporated into my career:

M&M and Journal Club are two ways that medicine has influenced software engineering (or mine, anyway); what of the influences of information systems on the way medicine is practiced? To explore this, we at ACM Queue sought to put together an issue on computing in healthcare. This is a brutally tough subject for us to tackle — what in the confluence of healthcare and computing is interesting for the software practitioner? — but we felt it to be a national priority and too important to ignore. It should be said that I was originally introduced to computing through medicine: in addition to being an attending physician, my father — who started writing code as an undergraduate in the 1960s on an IBM System/360 Model 50 — also developed the software system that ran his emergency department; the first computer we owned (a very early IBM PC-XT) was bought to allow him to develop software at home (Microsoft Pascal 1.0 FTW!). I had intended to keep my personal history out of the ACM discussion, but at some point, someone on the Board said that what we really needed was the physician’s perspective — someone who could speak to the practical difficulties of integrating modern information systems into our healthcare system. At that, I couldn’t hold my tongue — I obviously knew of someone who could write such an article…

Dad graciously agreed, and the resulting article, Computers in Patient Care: the Promise and the Challenge appears as the cover story on this month’s Communications of the ACM. I hope you enjoy this article as much as I did — and who knows: if you are developing software that aspires to be used in healthcare, perhaps it could be the subject of your own Journal Club. (Just be sure to slip the kids a soda or two!)

Posted on September 24, 2010 at 2:23 am by bmc · Permalink · One Comment
In: Uncategorized

DTrace, node.js and the Robinson Projection

When I joined Joyent, I mentioned that I was seeking to apply DTrace to the cloud, and that I was particularly excited about the development of node.js — leaving it implicit that the intersection of the two technologies would be naturally interesting. As it turns out, we have had an early opportunity to show the potential here: as you might have seen, the Node Knockout programming contest was held over the weekend; when I first joined Joyent (but four weeks ago!), Ryan was very interested in potentially using DTrace to provide a leaderboard for the competition. I got to work, adding USDT probes to node.js. To be fair, this still has some disabled overhead (namely, getting into and out of the node addon that has the true USDT probe), but it’s sufficiently modest to deploy DTrace-enabled nodes in production.

And thanks to incredibly strong work by Joyent engineers, we were able to make available a new node.js service that allocated a container per user. This service allowed us to offer a DTrace-enabled node to contestants — and then observe all of it from the global zone.

As an example of the DTrace provider for node.js, here’s a simple enabling to print out HTTP requests as zones handle them (running on one of the Node Knockout machines):

# dtrace -n 'node*:::http-server-request{printf("%s: %s of %s\n", \
    zonename, args[0]->method, args[0]->url)}' -q
nodelay: GET of /poll6759.479651377309
nodelay: GET of /poll6148.392275444794
nodebodies: GET of /latest/
nodebodies: GET of /latest/
nodebodies: GET of /count/
nodebodies: GET of /count/
nodelay: GET of /poll8973.863890386003
nodelay: GET of /poll2097.9667574643568
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: POST of /graphs/4c7a650eba12e9c41d000005/appendValue
awesometown: GET of /graphs/4c7acd5ca121636840000002.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7a650eba12e9c41d000005.js
awesometown: GET of /graphs/4c7b2408546a64b81f000001.js
awesometown: POST of /faye
awesometown: POST of /faye
...

I added probes around both HTTP request and HTTP response; treating the file descriptor as a token that uniquely describes the request while it is pending (an assumption that would only be invalid in the presence of HTTP pipelining) allows one to actually determine the latency for requests:

# cat http.d
#pragma D option quiet

http-server-request
{
        ts[this->fd = args[1]->fd] = timestamp;
        vts[this->fd] = vtimestamp;
}

http-server-response
/this->ts = ts[this->fd = args[0]->fd]/
{
        @t[zonename] = quantize(timestamp - this->ts);
        @v[zonename] = quantize(vtimestamp - vts[this->fd]);
        ts[this->fd] = 0;
        vts[this->fd] = 0;
}

tick-1sec
{
        printf("Wall time:\n");
        printa(@t);

        printf("CPU time:\n");
        printa(@v);
}

This script makes the distinction between wall time and CPU time; for wall-time, you can see the effect of long-polling, e.g. (the values are nanoseconds):

    nodelay
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |                                         4
          131072 |@@@@@                                    52
          262144 |@@@@@@@@@@@@@@@@@@                       183
          524288 |@@@@@                                    55
         1048576 |@@@                                      27
         2097152 |@                                        9
         4194304 |                                         5
         8388608 |@                                        8
        16777216 |@                                        6
        33554432 |@                                        9
        67108864 |@                                        7
       134217728 |@                                        12
       268435456 |@                                        11
       536870912 |                                         1
      1073741824 |                                         4
      2147483648 |                                         1
      4294967296 |                                         5
      8589934592 |                                         0
     17179869184 |                                         1
     34359738368 |                                         1
     68719476736 |                                         0

You can also look at the CPU time to see which zones are doing more actual work. For example, here is one zone with interesting CPU time outliers:

  nodebodies
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@                             57
        16777216 |@@@@                                     21
        33554432 |@@@@                                     18
        67108864 |@@@@@@@                                  34
       134217728 |@@@@@@@@@@@                              54
       268435456 |                                         0
       536870912 |                                         0
      1073741824 |                                         0
      2147483648 |                                         0
      4294967296 |@                                        3
      8589934592 |@                                        4
     17179869184 |                                         0

Note that because node does all of its processing on a single thread, we cannot assume that the requests themselves are inducing the work — only that CPU work was done between request and response. Still, this data would probably be interesting to the nodebodies team…
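
For a coarser but less ambiguous view of which zones’ node processes are actually consuming CPU, one possibility (just a sketch, using the standard profile provider rather than the node provider) is to sample on-CPU node processes by zone:

# dtrace -n 'profile-997/execname == "node"/{@[zonename] = count()}'

Comparing those sample counts with the per-request CPU quantizations above gives a rough sense of how much of the on-CPU time corresponds to request handling at all.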

I also added probes around connection establishment; so here’s a simple way of looking at new connections by zone:

# dtrace -n 'node*:::net-server-connection{@[zonename] = count()}'
dtrace: description 'node*:::net-server-connection' matched 44 probes
^C

  explorer-sox                                                      1
  nodebodies                                                        1
  anansi                                                           69
  nodelay                                                         102
  awesometown                                                     146

Or if we wanted to see which IP addresses were connecting to, say, our good friends at awesometown (with actual addresses in the output elided):

# dtrace -n 'node*:::net-server-connection \
    /zonename == "awesometown"/{@[args[0]->remoteAddress] = count()}'
dtrace: description 'node*:::net-server-connection' matched 44 probes
  XXX.XXX.XXX.XXX                                                   1
  XX.XXX.XX.XXX                                                     1
  XX.XXX.XXX.XXX                                                    1
  XX.XXX.XXX.XX                                                     1
  XXX.XXX.XX.XXX                                                    1
  XXX.XXX.XX.XX                                                     2
  XXX.XXX.XXX.XX                                                    8

Ryan saw the DTrace support I had added, and had a great idea: what if we took the IPs of incoming connections and geolocated them, throwing them on a world map and coloring them by team name? This was an idea that was just too exciting not to take a swing at, so we got to work. For the backend, the machinery was begging to be written in node itself, so I wrote a libdtrace addon for node and started building a scalable backend for processing the DTrace data from the different Node Knockout machines. Meanwhile, Joni came up with some mockups that had everyone drooling, and Mark contacted Brian from Nitobi about working on the front-end. Brian and crew were as excited about it as we were, and they put front-end engineer extraordinaire Yohei on the case — who worked with Rob on the Joyent side to pull it all together. Among Rob’s other feats, he managed to implement in JavaScript the logic for plotting longitude and latitude in the beautiful Robinson projection — which is a brutally complicated transformation. It was an incredible team, and we were pulling it off in such a short period of time and with such a firm deadline that we often felt like contestants ourselves!
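
In terms of raw data, what the backend needed amounts to roughly what the one-liners above already demonstrate: new connections by zone (that is, by team) and by remote address. A sketch of that enabling (the actual backend consumed the data programmatically via the libdtrace addon rather than by running dtrace(1M)) might look like this:

# dtrace -n 'node*:::net-server-connection{@[zonename, args[0]->remoteAddress] = count()}'

The geolocation and the plotting onto the Robinson projection then happen entirely downstream of DTrace.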

The result — which, it must be said, works best in Safari and Chrome — is at http://leaderboard.no.de. In keeping with the spirit of both node and DTrace, the leaderboard is updated in real time; from the time you connect to one of the Joyent-hosted (no.de) contestants, you should see yourself show up on the map in no more than 700 milliseconds (plus your network latency). For crowded areas like the Bay Area, it can be hard to see yourself — but try moving to Cameroon for best results. It’s fun to watch as certain contestants go viral (try both hovering over a particular data point and clicking on the team name in the leaderboard) — and you can know which continent you’re cursing at in http://saber-tooth-moose-lion.no.de (now known to the world as Swarmation).

Enjoy both the leaderboard and the terrific Node Knockout entries (be sure to vote for your favorites!) — and know that we’ve only scratched the surface of what DTrace and node.js can do together!

Posted on August 30, 2010 by bmc

The liberation of OpenSolaris

As many have seen, Oracle has elected to stop contributing to OpenSolaris. This decision is, to put it bluntly, stupid. Indeed, I would (and did) liken it to L. Paul Bremer’s decision to disband the Iraqi military after the fall of Saddam Hussein: beyond merely a foolish decision borne out of a distorted worldview, it has created combatants unnecessarily. As with Bremer’s infamous decision, the bitter irony is that the new combatants were formerly the strongest potential allies — and in Oracle’s case, it is the community itself.

As it apparently needs to be said, one cannot close an open source project — one can only fork it. So contrary to some reports, Oracle has not decided to close OpenSolaris; they have actually decided to fork it. That is, they have (apparently) decided that it is more in their interest to compete with the community than to cooperate with it — that they can in fact out-innovate the community. This confidence is surprising (and ironic) given that it comes exactly at the moment that the historic monopoly on Solaris talent has been indisputably and irrevocably broken — as most recently demonstrated by the departure of my former colleague, Adam Leventhal.

Adam’s case is instructive: Adam is a brilliantly creative engineer — one with whom it was my pleasure to work closely over nearly a decade. Time and time again, I saw Adam not only come up with innovative solutions to tough problems, but run those innovations through the punishing gauntlet that separates idea from product. One does not replace an engineer like Adam; one can only hope to grow another. And given his nine years of experience at the company and in the guts of the system, one cannot expect to grow a replacement quickly — if at all. Oracle’s loss, however, is the community’s gain; I hope I’m not tipping his hand too much to say that Adam will continue to be deeply engaged in the system, leading a new generation of engineers — but this time within a larger community that spans multiple companies and interests.

And in this way, odd as it may be, Oracle’s decision to fork is actually a relief to those of us whose businesses depend on OpenSolaris: instead of waiting for Oracle to engage the community, we can be secure in the knowledge that no engagement is forthcoming — and we can invest and plan accordingly. So instead of waiting for Oracle to fix a nagging driver bug or address a critical request for enhancement (a wait that has more often than not ended in disappointment anyway), we can tap our collective expertise as a community. And where that expertise doesn’t exist or is otherwise unavailable, those of us who are invested in the system can explicitly invest in building it — and then use it to give back to the community and contribute.

Speaking for Joyent, all of this has been tangibly liberating: just the knowledge that we are going to be cranking our own builds has allowed us to start thinking along new dimensions of innovation, giving us a renewed sense of control over our stack and our fate. I have already seen this shift in our engineers, who have begun to conceive of ideas that might not have been thought practical in a world in which Oracle’s engagement was so uncertain. Yes, hard problems lie ahead — but ideas are flowing, and the future feels alive with possibility; in short, innovation is afoot!

Posted on August 19, 2010 by bmc