A systems software double-header: Surge and GOTO

I recently returned from a systems software double-header, presenting at two of the industry’s best conferences: Surge in Baltimore and GOTO in Aarhus, Denmark. These conferences have much in common: they are both expertly run; they both seek out top technical content; they both attract top technologists; and (it must be said) they both take excellent care of their speakers!

At Surge, I presented with Brendan Gregg on DIRT in production. This was a dense presentation, but it was fun for us to recount a sampling of the kinds of issues that we have seen in the new breed latency-sensitive data-intensive application. I think Brendan and I both agreed that this presentation could have easily been two or three times the length; we have plenty of scars from finding latency outliers in DIRTy systems! Beyond presenting, the lightning talks were a highlight for me at this year’s Surge. I am an inveterate disaster porn addict, so it should come as no surprise that I particularly enjoyed Chris Burroughs‘ retelling of how his day that he had set aside to prepare his lightning talk was beset by disaster (including a PDU failure at the moment that it seemed things couldn’t get any worse), and then Scott Sanders on the surprisingly nasty failure modes that one can see in a mobile app when the root-cause is a botched datacenter upgrade in your mobile provider. For my own lightning talk, I dug into the perverse history of what may well be the most obscure two-letter Unix command: the “old and rarely used” ta; hopefully others enjoyed learning about the history of this odd little wart as much as I did!

Update: This lightning talk is now online and can be viewed here.

Dave and me with Lars Bak, Kasper Lund and the Dart (né V8) team

After Surge, Dave and I headed east, to Denmark. There, we presented on dynamic languages in production. Our purposes for going to Aarhus were more than just GOTO, however: we also went to meet with Lars Bak, Kasper Lund, Erik Corry, the inimitable Mr. Aleph and the other engineers behind V8. As the birthplace of V8, Aarhus has always had a Dagobah-like appeal to me: I feel that when on the freezing ice planet that is mdb_v8.c, Dave and I might as well have seen Obi-Wan Kenobi as an apparition, telling us we must go to Aarhus if we wanted to truly understand objects-inl.h. Dave and I immensely enjoyed our time with Lars and his team; it is rare to be in the presence of a team that has single-handedly changed the trajectory of software — and it is rarer still to be able bring one’s own technology to such an august table. We were thrilled to be able to demonstrate the confluence of our technology (specifically, MDB and DTrace) with theirs — and to brainstorm ways that we might better solve the abstract problem of production debuggability of dynamic environments. This is a problem that we’ve been after in one way or another for nearly a decade, but this was the first time that we were having the conversation early enough in the VM implementation phase (specifically, of Dart) that one can hope to get ahead of the problem instead of chasing it. In particular, Lars’s team has a very interesting idea for debuggability of Dart that — if successful — may allow it to achieve our nirvana of total postmortem debuggability in a dynamic environment. That, coupled with its enlightened (C-based!) native interface, left Dave and me optimistic about the long-term prospects of Dart on the server side. It’s obviously early days still for Dart, but given the pedigree and talent of the team and their disposition towards making it both highly performing and highly debuggable, it’s clearly something to pay attention to on the server side.

Presentation and meeting accomplished, I still had a burning question: what was all of this terrific VM talent doing in the second city of a small country with an excruciatingly difficult language to pronounce? (Aside: ask a Dane to say “dead red red-eyed rotten smoked trout” in Danish.) In talking to Lars, Kasper and Erik about how they ended up in Aarhus a common theme emerged: all had — in one way or another — been drawn to work with Professor Ole Lehrmann Madsen on the VM for the BETA language. (No, I hadn’t heard of BETA either, and yes, it’s in all caps — the name is apparently meant to be screamed.) Everyone I spoke with had such fond memories of BETA (“you know, it was actually a very interesting system…”) that I have taken to calling this school of VM engineers the BETA School. The work and influence of the engineers of the BETA School extend beyond Google, with its members having done foundational VM work at VMware, Sun and a slew of startups.

I suppose I shouldn’t be surprised that this cluster of engineers has a professor and a university at its root; Brown’s Tom Doeppner has served a similar role for us, with his CS169 giving rise to an entire generation of OS innovation. The two schools of engineers share much in common: both are highly technical and close-knit — but with a history of being open to newcomers and an emphasis on developing their junior members. (Of course, the difference is that we of the CS169 School don’t all live in Providence.) And the intersection of the two schools is particularly interesting: VMware’s seminal ASPLOS paper is a joint product of Keith Adams of the CS169 School and Ole Agesen of the BETA School.

Curiosity about Aarhus and its origins sated, I arrived home from my systems software double header exhausted but also inspired: there seems to be no limit of hard, interesting problems in our domain — and many profoundly talented engineers to tackle them!

Posted on October 8, 2012 at 12:00 am by bmc · Permalink · One Comment
In: Uncategorized

Post-revolutionary open source

I spent last week in Porto Alegre, Brazil at FISL (the Fórum Internacional Software Livre — which Americans will pronounce “fizzle” but which the Brazilians pronounce more like “fees-lay”). I was excited to go to FISL based on the past experiences of former Sun Microsystems folks like Eric Schrock and Deirdré Straughan — and I thought that it would provide an excellent opportunity to present what I’ve learned over the past decade or so about how companies screw up open source. My presentation, corporate open source anti-patterns, was well received — surprisingly so, in fact. In particular, I had expected controversy around my claim that anti-collaborative strong-copyleft licenses — which is to say, the GPLv2 — are an open source anti-pattern. As I noted in my presentation, I believe that the GPLv2 serves more to erect walls between open source projects than it does to liberate proprietary ones. To support this assertion, I pointed to the fact that, despite significant demand, Linux does not have ports of DTrace and ZFS generally available — even though they have been open source for the better part of a decade and have been ported to many other operating systems. But as I made the case for moving away from the GPLv2 (and towards collaboration-friendly licenses like MPLv2), I was surprised to see heads bobbing in agreement. And when it came time for questions, the dissent (such as it was) was not regarding my assertion about GPLv2, but rather about copyright assignment: I claimed (and claim) that one must be very careful about having copyright assignment in an open source project because it relies on the good behavior of a single actor. But one of the questioners pointed out that lack of copyright assignment has the very important drawback of making changing the license essentially impossible. The questioner’s background is relevant here: it was Jean-Baptiste Kempf, one of the core contributors to VLC. JB’s question came from his own experience: he had just led the VLC team through relicensing libVLC, an exercise made excruciating by their lack of copyright assignment. Why were they relicensing their core software? Ironically, because they had come to my same conclusion with respect to anti-collaborative licensing: they relicensed libVLC from GPLv2 to LGPLv2 to allow it to be more readily shared with other projects. And of course, VLC’s not alone — meteor.js is another high-profile project that came out as GPLv2 but has since relicensed to MIT.

So what’s going on here? I think the short answer is that open source is firmly in its post-revolutionary phase: it is so throughly mainstream that it is being entirely de-radicalized. While some extremists may resist this trend, it actually reflects the depth of victory for open source: the triumph is so complete that the young software engineers of today simply don’t see proprietary software as a threat. Like an aspiring populace after a long period of conflict, the software engineers behind today’s top open source projects are looking much more to a future of collaboration than to the past of internecine rivalries. While there are many indicators of this new era — from the license breakdown of Github’s top projects to the rise of Github itself — I think that the MPLv2 seems to particularly capture this zeitgeist: when it was designed, it was done so to be explicitly compatible with both GPL (GPLv2 and GPLv3) and Apache. So the open source revolution may be over, but that’s because the war has been won — may we not squander its spoils!

Posted on August 1, 2012 at 9:49 pm by bmc · Permalink · One Comment
In: Uncategorized

DTrace in the zone

When the zones facility was originally developed many years ago, our focus on integrating it with respect to DTrace was firmly on the global zone: we wanted to be sure that we could use DTrace from the global zone to instrument the entire system, across all zones. This was (and remains) unique functionality — no other implementation of OS-based virtualization has the whole-system dynamic instrumentation to go with it. But as that focus implies, the ability to use DTrace in anything but the global zone was not a consideration. Fortunately, in 2006, Dan Price on the zones team took it upon himself to allow DTrace to be used in the non-global zone. This was a tricky body of work that had to carefully leverage the DTrace privilege model developed by Adam, and it allowed root users in the non-global zone to have access to the syscall provider and any user-level providers in their zone (both USDT-provided and pid-provided).

As great as this initial work was, when I started at Joyent nearly two years ago, DTrace support in the zone hadn’t really advanced beyond it — and there were still significant shortcomings. In particular, out of (reasonable) paranoia, nothing in the kernel could be instrumented or recorded by an enabling in the non-global zone. The problem is that many commonly used D variables — most glaringly, cpu, curlwpsinfo, curpsinfo, and fds[] — rely on access to kernel structures. Worse, because of our success in abstracting away the kernel with these variables, users didn’t even know that they were trying to record illegal kernel data — which meant that the error message when it failed was entirely unintuitive:

  [my-non-global-zone ~]# dtrace -n BEGIN'{trace(curpsinfo->pr_psargs)}'
  dtrace: description 'BEGIN' matched 1 probe
  dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44

The upshot of this was, as one Joyent customer put it to me bluntly, “whenever I type in a script from Brendan’s book, it doesn’t work.” And just to make sure I got the message, the customer sharpened the point considerably: “I’m deployed on SmartOS because I want to be able to use DTrace, but you won’t let me have it!” At that point, I was pretty much weeping and begging for forgiveness — we clearly had to do better.

As I considered this problem, though, a somewhat obvious solution to at least a portion of the problem became clear. The most painful missing variables — curpsinfo and curlwpsinfo — are abstractions on the kernel’s proc_t and kthread_t structures, respectively. There are two important observations to be made about these structures. First, looking at these structures, there is nothing about one’s own proc_t or kthread_t structures that one can use for privilege escalation: it’s either stuff you already can figure out about yourself via /proc, or kernel addresses that you can’t do anything with anyway. Second (and essentially), we know that these structures cannot disappear or otherwise change identity while in probe context: these structures are only freed after the process or thread that they represent has exited — and we know by merits of our execution in probe context that we have not, in fact, exited. So we could solve a good portion of the problem just by allowing access to these structures for processes for which one has the dtrace_proc privilege. This change was straightforward (and the highlight of it for me personally was that it reminded me of some clever code that Adam had written that dynamically instruments DIF to restrict loads as needed). So this took care of curpsinfo, curlwpsinfo, and cpu — but it also brought into stark relief an even tougher problem: fds[].

The use of fds[] in the non-global zone has long been a problem, and it’s a brutal one, as fds[] requires grovelling around in kernel data structures. For context, here is the definition of fds[] for illumos, as defined in io.d:

inline fileinfo_t fds[int fd] = xlate  (
    fd >= 0 && fd t_procp->p_user.u_finfo.fi_nfiles ?
    curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

Clearly, one cannot simply allow such arbitrary kernel pointer chasing from the non-global zone without risk of privilege escalation. At dtrace.conf, I implied (mistakenly, as it turns out) that this problem depends on (or would be otherwise solved by) dynamic translators. As I thought about this problem after the conference, I realized that I was wrong: even if the logic to walk those structures were in-kernel (as it would be, for example, with the addition of a new subroutine), it did not change the fundamental problem that the structures that needed to be referenced could themselves disappear (and worse, change identity) during the execution of an enabling in probe context. As I considered the problem, I realized that I had been too hung up on making this arbitrary — and not focussed enough on the specifics of getting fds[] to work.

With the focus sharpening to fds[], I realized that some details of the the kernel’s implementation could actually make this reasonable to implement. In particular, Jeff‘s implementation of allocating slots in fi_list means that an fi_list never goes away (it is doubled but the old fi_list is not freed) and that fi_nfiles is never greater than the memory referred to by fi_list. (This technique — memory retiring — can be used to implement a highly concurrent table; the interested reader is pointed to both the ACM Queue article that Jeff and I wrote that describes it in detail and to the implementation itself.) So as long as one is only able to get the file_t for one’s own file descriptor, we could know that the array of file_t pointers would not itself be freed over the course of probe context. That brings us to the file_t itself: unlike the fi_list, the file_t can be freed while one is in probe context because one thread could be closing a file while another is in probe context referencing fds[]. Solving this problem required modifying the kernel, if one slightly: an added a hook to the closef() path that DTrace can optionally use to issue a dtrace_sync(). (A dtrace_sync() issues a synchronous cross-call to all CPUs; because DTrace disables interrupts in probe context, this can be used as a barrier with respect to probe context.)

Adding a hook of this nature requires an understanding of the degree to which the underlying code path is performance-critical. That is, to contemplate adding this hook, we needed to ask: how hot is closef(), anyway? Historically in my career as a software engineer, this kind of question would be answered with a combination of ass scratching and hand waving. Post-DTrace, of course, we can answer those questions directly — but only if we have a machine that has a representative workload (which we only had to a very limited degree at Sun). But one of my favorite things about working at Joyent is that we don’t have to guess when we have a question like this: we can just fire up DTrace on every node in the public cloud for a couple of seconds and get the actual data from (many) actual production machines. And in this case, the data was at least a little bit surprising: we had (many) nodes that were calling closef() thousands of times per second. (The hottest node saw nearly 5,000 closef()’s per second over a ten second period — and there were many that saw on the order of 2,000 per second.) This is way too hot to always be executing a dtrace_sync() when DTrace is loaded (as it effectively always is on our systems), so we had to be sure that we only executed the dtrace_sync() in the presence of an enabling that actually used fds[] with reduced privileges (e.g., from a non-global zone).

This meant that fds[] needed to be implemented in terms of a subroutine such that we could know when one of these is enabled. That is, we needed a D subroutine to translate from a file descriptor to a file_t pointer within the current context. Given this, the name of the subroutine was obvious: it had to be getf() — the Unix routine that does exactly this in the kernel, and has since Fourth Edition Unix, circa 1973. (Aside: one of the reasons I love illumos is because of this heritage: compare the block comment above the Seventh Edition implementation of getf() to that above getf() currently in illumos — yes, we still have comments written by Ken!) So the change to allow fds[] in the non-global zone
ended up being more involved, but in the end, it wasn’t too bad — especially given the functionality that it unlocked.

With these two changes made, Brendan and I brainstormed about what remained out of reach for the non-global zone — and the only other thing that we could come up with as commonly wanting in the non-global zone would be access to the sched and proc providers. Allowing use of these in-kernel providers would require allowing enabling of the probes but not firing them when outside the context of one’s own zone (similar to the syscall provider) and further explicitly forbidding getting anything about privileged context (specifically, register state). Fortunately, I had had to do exactly this to fix a somewhat nasty zones-related issue with the profile provider, and it seemed that it wouldn’t be too bad to extend this with these slightly new semantics. This proved to be the case, and the change to allow sched, proc, vminfo and sysinfo providers in the non-global zone ended up being very modest.

So with these three changes, I am relieved to report that DTrace is now completely usable in the non-global zone — and all without sacrificing the security model of zones! If you are a Joyent cloud customer, we will be rolling out a platform with this modification across the cloud (it necessitates a reboot, so don’t expect it before your next scheduled maintenance window); if you are a SmartOS user, look for this in our next SmartOS release; and if you are using another illumos-based distro (e.g., OpenIndiana or OmniOS) look for it in an upcoming release — we will be integrating these changes into illumos, so you can expect them to be in downstream distros soon. And here’s to DTrace in the zone!

Posted on June 7, 2012 at 12:18 am by bmc · Permalink · 5 Comments
In: Uncategorized

Debugging node.js memory leaks

Part of the value of dynamic and interpreted environments is that they handle the complexities of dynamic memory allocation. In particular, one needn’t explicitly free memory that is no longer in use: objects that are no longer referenceable are found automatically and destroyed via garbage collection. While garbage collection simplifies the program’s relationship with memory, it not mean the end of all memory-based pathologies: if an application retains a reference to an object that is ultimately rooted in a global scope, that object won’t be considered garbage and the memory associated with it will not be returned to the system. If enough such objects build up, allocation will ultimately fail (memory is, after all, finite) and the program will (usually) fail along with it. While this is not strictly — from the native code perspective, anyway — a memory leak (the application has not leaked memory so much as neglected to unreference a semantically unused object), the effect is nonetheless the same and the same nomenclature is used.

While all garbage collected environments create the potential to create such leaks, it can be particularly easy in JavaScript: closures create implicit references to variables within their scopes — references that might not be immediately obvious to the programmer. And node.js adds a new dimension of peril with its strictly asynchronous interface with the system: if backpressure from slow upstream services (I/O, networking, database services, etc.) isn’t carefully passed to downstream consumers, memory will begin to fill with the intermediate state. (That is, what one gains in concurrency of operations one may pay for in memory.) And of course, node.js is on the server — where the long-running nature of services means that the effect of a memory leak is much more likely to be felt and to affect service. Take all of these together, and you can easily see why virtually anyone who has stood up node.js in production will identify memory leaks as their most significant open issue.

The state of the art for node.js memory leak detection — as concisely described by Felix in his node memory leak tutorial — is to use the v8-profiler and Danny Coates’ node-inspector. While this method is a lot better than nothing, it’s very much oriented to the developer in development. This is a debilitating shortcoming because memory leaks are often observed only after days (hours, if you’re unlucky) of running in production. At Joyent, we have long wanted to tackle the problem of providing the necessary tooling to identify node.js memory leaks in production. When we are so lucky as to have these in native code (that is, node.js add-ons), we can use libumem and ::findleaks — a technology that we have used to find thousands of production memory leaks over the years. But this is (unfortunately) not the common case: the common case is a leak in JavaScript — be it node.js or application code — for which our existing production techniques are useless.

So for node.js leaks in production, one has been left with essentially a Gedanken experiment: consider your load and peruse source code looking for a leak. Given how diabolical these leaks can be, it’s amazing to me that anyone ever finds these leaks. (And it’s less surprising that they take an excruciating amount of effort over a long period of time.) That’s been the (sad) state of things for quite some time, but recently, this issue boiled over for us: Voxer, a Joyent customer and long in the vanguard of node.js development, was running into a nasty memory leak: a leak sufficiently acute that the service in question was having to be restarted on a regular basis, but not so much that it was reproducible in development. With the urgency high on their side, we again kicked around this seemingly hopeless problem, focussing on the concrete data that we had: core dumps exhibiting the high memory usage, obtained at our request via gcore(1) before service restart. Could we do anything with those?

As an aside, a few months ago, Joyent’s Dave Pacheco added some unbelievable postmortem debugging support for node. (If postmortem debugging is new to you, you might be interested in checking out my presentation on debugging production systems from QCon San Francisco. I had the privilege of demonstrating Dave’s work in that presentation — and if you listen very closely at 43:26, you can hear the audience gasp when I demonstrate the amazing ::jsprint.) But Dave hasn’t built the infrastructure for walking the data structures of the V8 heap from an MDB dmod — and it was clear that doing so would be brittle and error-prone.

As Dave, Brendan and I were kicking around increasingly desperate ideas (including running strings(1) on the dump — an idea not totally without merit — and some wild visualization ideas), a much simpler idea collectively occurred to us: given that we understood via Dave’s MDB V8 support how a given object is laid out, and given that an object needed quite a bit of self-consistency with referenced but otherwise orthogonal structures, what about just iterating over all of the anonymous memory in the core and looking for objects? That is, iterate over every single address, and see if that address could be interpreted as an object. On the one hand, it was terribly brute force — but given the level of consistency required of an object in V8, it seemed that it wouldn’t pick up too many false positives. The more we discussed it, the more plausible it became, but with Dave fully occupied (on another saucy project we have cooking at Joyent — more on that later), I roped up and headed into his MDB support for V8…

The result — ::findjsobjects — takes only a few minutes to run on large dumps, but provides some important new light on the problem. The output of ::findjsobjects consists of representative objects, the number of instances of that object and the number of properties on the object — followed by the constructor and first few properties of the objects. For example, here is the dump on a gcore of a (non-pathological) Joyent node-based facility:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 20
findjsobjects:                   heap objects => 6079488
findjsobjects:             JavaScript objects => 4097
findjsobjects:              processed objects => 1734
findjsobjects:                 unique objects => 161
fc4671fd        1      1 Object: flags
fe68f981        1      1 Object: showVersion
fe8f64d9        1      1 Object: EventEmitter
fe690429        1      1 Object: Credentials
fc465fa1        1      1 Object: lib
fc46300d        1      1 Object: free
fc4efbb9        1      1 Object: close
fc46c2f9        1      1 Object: push
fc46bb21        1      1 Object: uncaughtException
fe8ea871        1      1 Object: _idleTimeout
fe8f3ed1        1      1 Object: _makeLong
fc4e7c95        1      1 Object: types
fc46bae9        1      1 Object: allowHalfOpen
fc45e249       12      4 Object: type, min, max, default
fc4f2889       12      4 Object: index, fields, methods, name
fd2b8ded       13      4 Object: enumerable, writable, configurable, value
fc0f68a5       14      1 SlowBuffer: length
fe7bac79       18      3 Object: path, fn, keys
fc0e9d21       20      5 Object: _onTimeout, _idleTimeout, _idlePrev, ...
fc45facd       21      4 NativeModule: loaded, id, exports, filename
fc45f571       23      8 Module: paths, loaded, id, parent, exports, ...
fc4607f9       35      1 Object: constructor
fc0f86c9       56      3 Buffer: length, offset, parent
fc0fc92d       57      2 Arguments: length, callee
fe696f59       91      3 Object: index, fields, name
fc4f3785       91      4 Object: fields, name, methodIndex, classIndex
fe697289      246      2 Object: domain, name
fc0f87d9      308      1 Buffer:

Now, any one of those objects can be printed with ::jsprint. For example, let’s take fc45e249 from the above output:

> fc45e249::jsprint
    type: number,
    min: 10,
    max: 1000,
    default: 300,

Note that that’s only a representative object — there are (in the above case) 12 objects that have that same property signature. ::findjsobjects can get you all of them when you specify the address of the reference object:

> fc45e249::findjsobjects

And because MDB is the debugger Unix was meant to have, that output can be piped to ::jsprint:

> fc45e249::findjsobjects | ::jsprint
    type: number,
    min: 10,
    max: 1000,
    default: 300,
    type: number,
    min: 0,
    max: 5000,
    default: 5000,
    type: number,
    min: 0,
    max: 5000,
    default: undefined,

Okay, fine — but where are these objects referenced? ::findjsobjects has an option for that:

> fc45e249::findjsobjects -r
fc45e249 referred to by fc45e1e9.height

This tells us (or tries to) who is referencing that first (representative) object. Printing that out (with the “-a” option to show the addresses of the objects):

> fc45e1e9::jsprint -a
fc45e1e9: {
    ymax: fe78e061: undefined,
    hues: fe78e061: undefined,
    height: fc45e249: {
        type: fe78e361: number,
        min: 14: 10,
        max: 7d0: 1000,
        default: 258: 300,
    selected: fc45e3fd: {
        type: fe7a2465: array,
        default: fc45e439: [...],

So if we want to learn where all of these objects are referenced, we can again use a pipeline within MDB:

> fc45e249::findjsobjects | ::findjsobjects -r
fc45e249 referred to by fc45e1e9.height
fc46fd31 referred to by fc46b159.timeout
fc467ae5 is not referred to by a known object.
fc45ecb5 referred to by fc45eadd.ymax
fc45ec99 is not referred to by a known object.
fc45ec11 referred to by fc45eadd.nbuckets
fc45ebb5 referred to by fc45eadd.height
fc45eb41 referred to by fc45eadd.ymin
fc45eb25 referred to by fc45eadd.width
fc45e3d1 referred to by fc45e1e9.nbuckets
fc45e3b5 referred to by fc45e1e9.ymin
fc45e399 referred to by fc45e1e9.width

Of course, the proof of a debugger is in the debugging; would ::findjsobjects actually be of use on the Voxer dumps that served as its motivation? Here is the (elided) output from running it on a big Voxer dump:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 292
findjsobjects:                   heap objects => 8624128
findjsobjects:             JavaScript objects => 112501
findjsobjects:              processed objects => 100424
findjsobjects:                 unique objects => 241
fe806139        1      1 Object: Queue
fc424131        1      1 Object: Credentials
fc424091        1      1 Object: version
fc4e3281        1      1 Object: message
fc404f6d        1      1 Object: uncaughtException
fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd     1037      3 Object: aborted, data, end
 8045475     1060      1 Object:
fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...
fafc4245     1673      1 Object: slab
fafc44d5     1702      5 Object: search, query, path, href, pathname
fafc440d     1784     14 Client: buffered_writes, name, timing, ...
fafc41c5     1796      3 Object: error, timeout, close
fafc4469     1811      3 Object: address, family, port
fafc42a1     2197      2 Object: connection, host
fbe10625     2389      2 Arguments: callee, length
fafc4251     2759     15 IncomingMessage: statusCode, httpVersionMajor, ...
fafc42ad     3652      0 Object:
fafc6785    11746      1 Object: oncomplete
fb7abc29    15155      1 Object: buffer
fb7a6081    15155      3 Object: , oncomplete, cb
fb121439    15378      3 Buffer: offset, parent, length

This immediately confirmed a hunch that Matt had had that this was a buffer leak. And for Isaac — who had been working this issue from the Gedanken side and was already zeroing in on certain subsystems — this data was surprising in as much as it was so confirming: he was already on the right path. In short order, he nailed it, and the fix is in node 0.6.17.

The fix was low risk, so Voxer redeployed with it immediately — and for the first time in quite some time, memory utilization was flat. This was a huge win — and was the reason for Matt’s tantalizing tweet. The advantages of this approach are that it requires absolutely no modification to one’s node.js programs — no special flags and no different options. And it operates purely postmortem. Thanks to help from gcore(1), core dumps can be taken over time for a single process, and those dumps can then be analyzed off-line.

Even with ::findjsobjects, debugging node.js memory leaks is still tough. And there are certainly lots of improvements to be made here — there are currently some objects that we do not know how to correctly interpret, and we know that we know that we can improve our algorithm for finding object references — but this shines a bright new light into what had previously been a black hole!

If you want to play around with this, you’ll need SmartOS or your favorite illumos distro (which, it must be said, you can get simply by provisioning on the Joyent cloud). You’ll need an updated v8.so — which you can either build yourself from illumos-joyent or you can download a binary. From there, follow Dave’s instructions. Don’t hesitate to ping me if you get stuck or have requests for enhancement — and here’s to flat memory usage on your production node.js service!

Posted on May 5, 2012 at 1:07 pm by bmc · Permalink · Comments Closed
In: Uncategorized

Standing up SmartDataCenter

Over the past year, we at Joyent have been weaving some threads that may have at times seemed disconnected: last fall, we developed a prototype of DTrace-based cloud instrumentation as part of hosting the first Node Knockout competition; in the spring, we launched our no.de service with full-fledged real-time cloud analytics; and several weeks ago, we announced the public availability of SmartOS, the first operating system to combine ZFS, DTrace, OS-based virtualization and hardware virtualization in the same operating system kernel. These activities may have seemed disjoint, but there was in fact a unifying theme: our expansion at Joyent from a public cloud operator to a software company — from a company that operates cloud software to one that also makes software for cloud operators.

This transition — from service to product — is an arduous one, for the simple reason that a service can make up for any shortcomings with operational heroics that are denied to a product. Some of these gaps may appear superficial (e.g., documentation), but many are deep and architectural; that one is unable to throw meat at a system changes many fundamental engineering decisions. (It should be said that some products don’t internalize these differences and manage to survive despite their labor-intensive shortcomings, but this is thanks to the caulking called “professional services” — and is usually greased by an enterprise software sales rep with an easy rapport and a hell of a golf handicap.) As the constraints of being a product affect engineering decisions, so do the constraints of operating as a service; if a product has no understanding or empathy for how it will be deployed into production, it will be at best difficult to operate, and at worst unacceptable for operation.

So at Joyent, we have long believed that we need both a product that can be used to provide a kick-ass service and a kick-ass service to prove to ourselves that we hit the mark. While our work over the past year has been to develop that product — SmartDataCenter — we have all known that we wouldn’t be done until that product was itself operating not just emerging services like no.de, but also the larger Joyent public cloud, and over the summer we began to plan a broad-based stand-up of SmartDataCenter. With our SmartOS-based KVM solid and on schedule, and with the substantial enabling work up the stack falling into place, we set for ourselves an incredibly ambitious target: to not only ship this revolutionary new version of the software, but to also — at the same time — stand up that product in production in the Joyent public cloud. Any engineer’s natural conservatism would resist such a plan — conflating shipping a product with a production deployment feels chancey at best and foolish at worst — but we saw an opportunity to summon a singular, company-wide focus that would force us to deliver a much better product, and would give us and our software customers the satisfaction of knowing that SmartDataCenter as shipped runs Joyent. More viscerally, we were simply too hot on getting SmartOS-based hardware virtualization in production — and oh, those DTrace-based KVM analytics! — to contemplate any path less aggressive.

Of course, it wasn’t easy; there were many intense late-night Jabber sessions as we chased bugs and implemented solutions. But as time went on — and as we cut release candidates and began to pound on the product in staging envirionments — it became clear that the product was converging, and a giddy optimism began to spread. Especially for our operations team, many of whom are long-time Joyent vets, this was a special kind of homecoming, as they saw their vision and wisdom embodied in the design decisions of the product. This sentiment was epitomized by one of the senior members of the Joyent public cloud operations team — an engineer who has tremendous respect in the company, and who is known for his candor — as he wrote:

It seems the past ideas of a better tomorrow have certainly come to light. I think I speak for all of the JPC when I say “Bravo” to the dev team for a solid version of the software. We have many rivers (hopefully small creeks) to cross but I am ultimately confident in what I have seen and deployed so far. It’s incredible to see a unified convergence on what has been the Joyent dream since I have been here.

For us in development, these words carried tremendous weight; we had endeavored to build the product that our team would want to deploy and operate, and it was singularly inspiring to see them so enthusiastic about it! (Though I will confess that this kind of overwhelmingly positive sentiment from such a battle-hardened operations engineer is so rare as to almost make me nervous — and yes, I am knocking on wood as I write this.)

With that, I’m pleased to announce that the (SmartDataCenter-based) Joyent public cloud is open for business! You can provision (via web portal or API) either virtual OS instances (SmartMachines) or full hardware-virtual VMs; they’ll provision like a rocket (ZFS clones FTW!), and you’ll get a high-performing, highly reliable environment with world-beating real-time analytics. And should you wish to stand up your own cloud with the same advantages — either a private cloud within your enterprise or a public cloud to serve your customers — know that you can own and operate your very own SmartDataCenter!

Posted on September 15, 2011 at 2:03 am by bmc · Permalink · Comments Closed
In: Uncategorized

KVM on illumos

A little over a year ago, I came to Joyent because of a shared belief that systems software innovation matters — that there is no level of the software stack that should be considered off-limits to innovation. Specifically, we ardently believe in innovating in the operating system itself — that extending and leveraging technologies like OS virtualization, DTrace and ZFS can allow one to build unique, differentiated value further up the software stack. My arrival at Joyent also coincided with the birth of illumos, an effort that Joyent was thrilled to support, as we saw that rephrasing our SmartOS as an illumos derivative would allow us to engage in a truly open community with whom we could collaborate and collectively innovate.

In terms of the OS technology itself, as powerful as OS virtualization, DTrace and ZFS are, there was clearly a missing piece to the puzzle: hardware virtualization. In particular, with advances in microprocessor technology over the last five years (and specifically with the introduction of technologies like Intel’s VT-x), one could (in principle, anyway) much more easily build highly performing hardware virtualization. The technology was out there: Fabrice Bellard‘s cross-platform QEMU had been extended to utilize the new microprocessor support by Avi Kivity and his team with their Linux kernel virtual machine (KVM).

In the fall of last year, the imperative became clear: we needed to port KVM to SmartOS. This notion is almost an oxymoron, as KVM is not “portable” in any traditional sense: it interfaces between QEMU, Linux and the hardware with (rightful) disregard for other systems. And as KVM makes heavy use of Linux-specific facilities that don’t necessarily have clean analogs in other systems, the only way to fully scope the work was to start it in earnest. However, we knew that just getting it to compile in any meaningful form would take significant effort, and with our team still ramping up and with lots of work to do elsewhere, the reality was that this would have to start as a small project. Fortunately, Joyent kernel internals veteran Max Bruning was up to the task, and late last year, he copied the KVM bits from the stable Linux 2.6.34 source, took out a sharpened machete, and started whacking back jungle…

Through the winter holidays and early spring (and when he wasn’t dispatched on other urgent tasks), Max made steady progress — but it was clear that the problem was even more grueling than we had anticipated: because KVM is so tightly integrated into the Linux kernel, it was difficult to determine dependencies — and hard to know when to pull in the Linux implementation of a particular facility or function versus writing our own or making a call to the illumos equivalent (if any). By the middle of March, we were able to execute three instructions in a virtual machine before taking an unresolved EPT violation. To the uninitiated, this might sound pathetic — but it meant that a tremendous amount of software and hardware were working properly. Shortly thereafter, with the EPT problem fixed, the virtual machine would happily run indefinitely — but clearly not making any progress booting. But with this milestone, Max had gotten to the point where the work could be more realistically parallelized, and we had gotten to the point as a team and a product where we could dedicate more engineers to this work. In short, it was time to add to the KVM effort — and Robert Mustacchi and I roped up and joined Max in the depths of KVM.

With three of us on the project and with the project now more parallelizable, the pace accelerated, and in particular we could focus on tooling: we added an mdb dmod to help us debug, 16-bit disassembler support that was essential for us to crack the case on the looping in real-mode, and DTrace support for reading the VMCS. (And we even learned some surprising things about GNU C along the way!) On April 18th, we finally broke through into daylight: we were able to boot into KMDB running on a SmartOS guest! The pace quickened again, as we now had the ability to interactively ascertain state as seen by the guest and much more quickly debug issues. Over the next month (and after many intensive debugging sessions), we were able to boot Windows, Linux, FreeBSD, Haiku, QNX, Plan 9 and every other OS we could get our hands on; by May 13th, we were ready to send the company-wide “hello world” mail that has been the hallmark of bringup teams since time immemorial.

The wind was now firmly at our backs, but there was still much work to be done: we needed to get configurations working with large numbers of VCPUs, with many VMs, over long periods of time and significant stress and so on. Over the course of the next few months, the bugs became rarer and rarer (if nastier!), and we were able to validate that performance was on par with Linux KVM — to the point that today we feel confident releasing illumos KVM to the greater community!

In terms of starting points, if you just want to take it for a spin, grab a SmartOS live image ISO. If you’re interested in the source, check out the illumos-kvm github repo for the KVM driver itself, the illumos-kvm-cmd github repo for our QEMU 0.14.1 tree, and illumos itself for some of our KVM-specific tooling.

One final point of validation: when we went to integrate these changes into illumos, we wanted to be sure that they built cleanly in a non-SmartOS (that is, stock illumos) environment. However, we couldn’t dedicate a machine for the activity, so we (of course) installed OpenIndiana running as a KVM guest on SmartOS and cranked the build to verify our KVM deltas! As engineers, we all endeavor to build useful things, and with KVM on illumos, we can assert with absolute confidence that we’ve hit that mark!

Posted on August 15, 2011 at 7:00 am by bmc · Permalink · 44 Comments
In: Uncategorized

In defense of intrapreneurialism

Trevor pointed me to a ChubbyBrain article that derided “intrapreneurs” as having nothing in common with their entrepreneurial brethren. While I don’t necessary love the term, what we did at Fishworks is probably most accurately described as “intrapreneurial” — and based on both our experiences there and the caustic tone of the Chubby polemic, I feel a responsibility to respond. (And so, I hasten to add, did Trevor.) In terms of my history: I spent ten years in the bowels of Bigco, four years engaged in an intrapreneurial effort within Bigco, and the past year at an actual, venture-funded startup. Based on my experience, I take strong exception to the Chubby piece; to take their criticisms of intrapreneurs one-by-one:

Of course, there are differences between a startup within a large organization and an actual startup. When I was pitching Fishworks to the executives at Sun, I thought of our activity as both giving up some of the upside and removing some of the downside of a startup. After living through Fishworks and now being at a startup, I would be more precise about the downside that the intrapreneur doesn’t have to endure: it’s not insecurity (or at least, not as much as I believed) — it’s the grueling task of raising money. At Fishworks, we had exactly one investor, they knew us and we knew them — and that was a blessing that gave us the luxury of focus.

On balance, I don’t regret for a moment doing it the way we did it — but I also don’t regret having left for a startup. I don’t doubt that my career will likely oscillate between smaller organizations and larger ones — I see strengths and weaknesses to both, much of which comes down to specific opportunities and conditions. In short: there may be differences between intrapreneurialism and entrepreneurialism — but I see little difference between the intrapreneur and the entrepreneur!

Posted on July 12, 2011 at 2:26 pm by bmc · Permalink · 4 Comments
In: Uncategorized

When magic collides

We had an interesting issue come up the other day, one that ended up being a pretty nasty bug in DTrace. It merits a detailed explanation, but first, a cautionary note: we’re headed into rough country; in the immortal words of Sixth Edition, you are not expected to understand this.

The problem we noticed was this: when running node and instrumenting certain hot functions (memcpy:entry was noted in particular), after some period of activity, the node process would die on a strange seg fault. And looking at the core dumps, we were (usually) lost in some code path in memcpy() with registers or a stack that was self-inconsistent. In debugging this, it was noted (with DTrace, natch) that the node process was under somewhat intense signal processing activity (we’ll get to why in a bit), and that (in particular) the instrumented function activity and the signal activity seemed to be interacting “shortly” before the fatal failure.

Before wading any deeper, it’s worth taking an aside to explain how we deal with signals in DTrace — a discussion that in turn requires some background on the way loss-less user-level tracing works in DTrace. One of the important constraints that we placed upon ourselves (albeit one that we didn’t talk about too much) is that DTrace is loss-less: if multiple threads on multiple CPUs hit the same point of instrumentation, no thread will see an uninstrumented code path. This may seem obvious, but it’s not the way traditional debuggers work: normally, a debugger stops a process (and all of its threads) when it hits a breakpoint — which it induces by modifying program text with a synchronous trap instruction (e.g., an “int 3” in x86). Then, the program text is modified to contain the original instruction, the thread that hit the breakpoint (and only that thread) is single-stepped (often using hardware support) over the original instruction, and the original instruction is again replaced with a breakpoint instruction. We felt this to be unacceptable for DTrace, and imposed a constraint that tracing always be loss-less.

For user-level instrumentation (that is, the pid provider), magic is required: we need to execute the instrumented instruction out-of-line. For an overview of the mechanism for this, see Adam‘s block comment in fasttrap_isa.c. The summary is that we have a per-thread region of memory at user-level to which we copy both the instrumented instruction and a control transfer instruction that gets us back past the point of instrumentation. Then, when we hit the probe (and after we’ve done in-kernel probe processing), we point the instruction pointer to be in the per-thread region and return to user-level; back in user-land, we execute the instruction and transfer control back to the rest of the program.

Okay, so easy, right? Well, not so fast — enter signals: if we take a signal while we’re executing this little user-level trampoline, we could corrupt program state. At first, it may not be obvious how this is the case; if we execute a signal before we execute the instruction in the trampoline we will simply execute the signal handler (whatever it may be) and return to the trampoline, finish that off, and then get back to the business of executing the program. Where’s the problem? The problem is that if the signal handler executes something that itself is instrumented with DTrace: if this is the case, we’ll hit another probe, and that probe will copy its instruction to the same trampoline. This is bad: it means that upon return from the signal, we will return to the trampoline and execute the wrong instruction. Only one of them, of course. Which, if you’re lucky, will induce an immediate crash. But more likely, the program will execute tens, hundreds or even thousands of instructions before dying on some “impossible” problem. Sound nasty to debug? It is — and next time you see Adam, show some respect.

So how do we solve this problem? For this, Adam made an important observation: the kernel is in charge of signal delivery, and it knows about all of this insanity, so can’t it prevent this from happening? Adam solved this by extending the trampoline after the control transfer instruction to consist of the original instruction (again) followed by an explicit trap into the kernel. If the kernel detects that we’re about to deliver a signal while in this window (that is, with the program counter on the original instruction in the trampoline), it moves the program counter to the second half of the trampoline and returns to user-level with the signal otherwise unhandled. This leaves the signal pending, but only allows the program to execute the one instruction before bringing it back downtown on the explicit trap in the second half of the trampoline. There, we will see that we are in this situation, update the trapped program counter to be the instruction after the instrumented instruction (telling the program a small but necessary lie about where signal delivery actually occured), and bounce back to user-level; the kernel will see the bits set on the thread structure indicating that a signal is pending, and the Right Thing will happen.

And it all works! Well, almost — and that brings us to the bug. After some intensive DTrace/mdb sessions and a long Jabber conversation with Adam, it became clear that while signals were involved, the basic mechanism was actually functioning as designed. More experimenting (and more DTrace, of course) revealed that the problem was even nastier than the lines along which we were thinking: yes, we were taking a signal in the trampoline — but that logic was working as designed, with the program counter being moved correctly into the second half of the trampoline. The problem was that we were taking a second asynchronous signal, this time in the second half of the trampoline, and this brought us to this logic in dtrace_safe_defer_signal:

         * If we've executed the original instruction, but haven't performed
         * the jmp back to t->t_dtrace_npc or the clean up of any registers
         * used to emulate %rip-relative instructions in 64-bit mode, do that
         * here and take the signal right away. We detect this condition by
         * seeing if the program counter is the range [scrpc + isz, astpc).
        if (t->t_dtrace_astpc - rp->r_pc <
            t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
                rp->r_pc = t->t_dtrace_npc;
                t->t_dtrace_ft = 0;
                return (0);

Now, the reason for this clause (the meat of which I’ve elided) is the subject of its own discussion, but it is not correct as written: it works (due to some implicit use of unsigned math) when the program counter is in the first half of the trampoline, but it fails if the program counter is on the instrumented instruction in the second half of the trampoline (denoted with t_dtrace_astpc). If we hit this case — first one signal on the first instruction of the trampoline followed by another signal on the first instruction on the second half of the trampoline — we will redirect control to be the instruction after the instrumentation (t_dtrace_npc) without having executed the instruction. This, if it needs to be said, is bad — and leads to exactly the kind of mysterious death that we’re seeing.

Fortunately, the fix is simple — here’s the patch. All of this leaves just one more mystery: why are we seeing all of these signals on node, anyway? They are all SIGPROF signals, and when I pinged Ryan about this yesterday, he was as surprised as I was that node seemed to be doing them when otherwise idle. A little digging on both of our parts revealed that the signals we were seeing are an artifact of some V8 magic: the use of SIGPROF is part of the feedback-driven optimization in the new Crankshaft compilation infrastructure for V8. So the signals are goodness — and that DTrace can now correctly handle them with respect to arbitrary instrumentation that much more so!

Posted on March 9, 2011 at 11:27 pm by admin · Permalink · 9 Comments
In: Uncategorized

Log/linear quantizations in DTrace

As I alluded to when I first joined Joyent, we are using DTrace as a foundation to tackle the cloud observability problem. And as you may have seen, we’re making some serious headway on this problem. Among the gritty details of scalably instrumenting latency is a dirty little problem around data aggregation: when we instrument the system to record latency, we are (of course) using DTrace to aggregate that data in situ, but with what granularity? For some operations, visibility into microsecond resolution is critical, while for other (longer) operations, millisecond resolution is sufficient – and some classes of operations may take hundreds or thousands (!) of milliseconds to complete. If one makes the wrong choice as the resolution of aggregation, one either ends up clogging the system with unnecessarily fine-grained data, or discarding valuable information in overly coarse-grained data. One might think that one could infer the interesting latency range from the nature of the operation, but this is unfortunately not the case: I/O operations (a typically-millisecond resolution operation) can take but microseconds, and CPU scheduling operations (a typically-microsecond resolution operation) can take hundreds of milliseconds. Indeed, the problem is that one often does not know what order of magnitude of resolution is interesting for a particular operation until one has a feel for the data for the specific operation — one needs to instrument the system to best know how to instrument the system!

We had similar problems at Fishworks, of course, and my solution there was wholly dissatisfying: when instrumenting the latency of operations, I put in place not one but many (five) lquantize()‘d aggregations that (linearly) covered disjoint latency ranges at different orders of magnitude. As we approached this problem again at Joyent, Dave and Robert both balked at the clumsiness of the multiple aggregation solution (with any olfactory reaction presumably heightened by the fact that they were going to be writing the pungent code to deal with this). As we thought about the problem in the abstract, this really seemed like something that DTrace itself should be providing: an aggregation that is neither logarithmic (i.e., quantize()) nor linear (i.e., lquantize()), but rather something in between…

After playing around with some ideas and exploring some dead-ends, we came to an aggregating action that logarithmically aggregates by order of magnitude, but then linearly aggregates within an order of magnitude. The resulting aggregating action — dubbed “log/linear quantization” and denoted with the new llquantize() aggregating action — is paramaterized with a factor, a low magnitude, a high magnitude and the number of linear steps per magnitude. With the caveat that these bits are still hot from the oven, the patch for llquantize() (which should patch cleanly against illumos) is here.

To understand the problem we’re solving, it’s easiest to see llquantize() in action:

# cat > llquant.d
	 * A log-linear quantization with factor of 10 ranging from magnitude
	 * 0 to magnitude 6 (inclusive) with twenty steps per magnitude
	@ = llquantize(i++, 10, 0, 6, 20);

/i == 1500/
# dtrace -V
dtrace: Sun D 1.7
# dtrace -qs ./llquant.d

           value  ------------- Distribution ------------- count
             < 1 |                                         1
               1 |                                         1
               2 |                                         1
               3 |                                         1
               4 |                                         1
               5 |                                         1
               6 |                                         1
               7 |                                         1
               8 |                                         1
               9 |                                         1
              10 |                                         5
              15 |                                         5
              20 |                                         5
              25 |                                         5
              30 |                                         5
              35 |                                         5
              40 |                                         5
              45 |                                         5
              50 |                                         5
              55 |                                         5
              60 |                                         5
              65 |                                         5
              70 |                                         5
              75 |                                         5
              80 |                                         5
              85 |                                         5
              90 |                                         5
              95 |                                         5
             100 |@                                        50
             150 |@                                        50
             200 |@                                        50
             250 |@                                        50
             300 |@                                        50
             350 |@                                        50
             400 |@                                        50
             450 |@                                        50
             500 |@                                        50
             550 |@                                        50
             600 |@                                        50
             650 |@                                        50
             700 |@                                        50
             750 |@                                        50
             800 |@                                        50
             850 |@                                        50
             900 |@                                        50
             950 |@                                        50
            1000 |@@@@@@@@@@@@@                            500
            1500 |                                         0

As you can see in the example, at each order of magnitude (determined by the factor — 10, in this case), we have 20 steps (or the factor at orders of magnitude less than the number of steps): up to 10, the buckets each span 1 value; from 10 through 100, each spans 5 values; from 100 to 1000, each spans 50; from 1000 to 10000 each spans 500, and so on. The result is not necessarily easy on the eye (and indeed, it seems unlikely that llquantize() will find much use on the command line), but is easy on the system: it generates data that has the resolution of linearly quantized data with the efficiency of logarithmically quantized data.

Of course, you needn’t appreciate the root system here to savor the fruit: if you are a Joyent customer, you will soon be benefitting from llquantize() without ever having to wade into the details, as we bring real-time analytics to the cloud. Stay tuned!

Posted on February 8, 2011 at 9:48 pm by bmc · Permalink · 3 Comments
In: Uncategorized

The DIRT on JSConf.eu and Surge

I just got back from an exciting (if exhausting) conference double-header: JSConf.eu and Surge. These conferences made for an interesting pairing, together encompassing much of the enthusiasm and challenges in today’s systems. It started in Berlin with JSConf.eu, a conference that reflects both JavaScript’s energy and diversity. Ryan introduced Node.js to the world at JSConf.eu a year ago, and enthusiasm for Node at the conference was running high; I would estimate that at least half of the attendees were implementing server-side JavaScript in one fashion or another. Presentations from the server-side were epitomized by Felix and Philip, each of whom presented real things done with Node — expanding on the successes, limitations and futures of what they had developed. The surge of interest in the server-side is of course changing the complexion of not only JSconf but of JavaScript itself. a theme that Chris Williams hit with fervor in his closing plenary, in which he cautioned of the dangers for a technology and a community upon whom a wave of popularity is breaking. Having been been near the epicenter of Java’s irrational exuberance in the mid-1990s (UltraJava, anyone?), I certainly share some of Chris’s concerns — and I think that Ryan’s talk on the challenges in Node reflects a humility that will become essential in the coming year. That said, I am also looking forward to seeing the JavaScript and Node communities expand to include more of the systems vets that have differentiated Node to date.

Speaking of systems vets, a highlight of JSConf.eu was hanging out with Erik Corry from the V8 team. JavaScript and Node owe a tremendous debt of gratitude to V8, without whom we would have neither the high-performance VM itself, nor the broader attention to JavaScript performance so essential to the success of JavaScript on the server-side. Erik and I brainstormed ideas for deeper support of DTrace within V8, which is, of course, an excruciatingly hard problem: as I have observed in the past, magic does not layer well — and DTrace and a sophisticated VM like V8 are both conjuring deep magic that frequently operate at cross purposes. Still, it was an exciting series of conversations, and we at Joyent certainly look forward to taking a swing at this problem.

After JSConf.eu wrapped up, I spent a few days brainstorming and exploring Berlin (and its Trabants and Veyrons) with Mark, Pedro and Filip. I then headed to the inaugural Surge Conference in Baltimore, and given my experience there, I would put the call out to the systems tribe: Surge promises to be the premier conference for systems practitioners. We systems practitioners have suffered acutely at the hands of academic computer science’s foolish obsession with conferences as publishing vehicle, having seen conference after conference of ours becoming casualties on their bloody path to tenure. Surge, by contrast, was everything a systems conference should be: exciting talks on real systems by those who designed, built and debugged them — and a terrific hallway track besides.

To wit: I was joined at Surge by Joyent VP of Operations Ryan Nelson; for us, Surge more than paid for itself during Rod Cope‘s presentation on his ten lessons for Hadoop deployment. Rod’s lessons were of the ilk that only come the hard way; his was a presentation fired in a kiln of unspeakable pain. For example, he noted that his cluster had been plagued by a wide variety of seemingly random low-level problems: lost interrupts, networking errors, I/O errors, etc. Acting apparently operating on a tip in a Broadcom forum, Rod disabled C-states on all of his Dell R710s and 2970s — and the problems all lifted (and have been absent for four months). Given that we have recently lived through same (and came to the same conclusion, albeit more recently and therefore much more tentatively), I tweeted what Rod had just reported, knowing that Ryan was in John Allspaw‘s concurrent sesssion and would likely see the tweet. Sure enough, I saw him retweet it minutes later — and he practically ran in after the end of Rod’s talk to see if I was kidding. (When it comes to firmare- and BIOS-level issues, I assure you: I never kid.) Not only was Rod’s confirmation of our experience tremendously valuable, it goes to the kind of technologist at Surge: this is a conference with practitioners who are comfortable not only at lofty architectural heights, but also at the grittiest of systems’ depths.

Beyond the many thought-provoking sessions, I had great conversations with many people, including Paul, Rod, Justin, Chris, Geir, Stephen, Wez, Mark, Joe, Kate, Tom, Theo (natch) — and a reunion (if sodden) with surprise stowaway Ted Leung.

As for my formal role at Surge, I had the honor of giving one of the opening keynotes. In preparing for it, I tried to put it all together: what larger trends are responsible for the tremendous interest in Node? What are the implications of these trends for other elements of the systems we build? From Node Knockout, it is clear that a major use case for Node is web-based real-time applications. At the moment these are primarily games — but as practitioners like Matt Ranney can attest, games are by no means the only real-time domain in which Node is finding traction. In part based on my experience developing the backend for the Node knockout leaderboard, the argument that I made in the keynote is that these web-based real-time applications have a kind of orthogonal data consistency: instead of being on the ACID/BASE axis, the “consistency” of the data is its temporality. (That is, if the data is late, it’s effectively inconsistent.) This is not necessarily new (market-facing trading systems have long had these properties), but I believe we’re about to see these systems brought to a much larger audience — and developed by a broader swath of practitioners. And of course — like AJAX before it — this trend was begging for an acronym. Struggling to come up with something, I went back to what I was trying to describe: these are real-time systems — but they are notable because they are data-intensive. A ha: “data-intensive real-time” — DIRT! I dig DIRT! I don’t know if the acronym will stick or not, but I did note that it was mentioned in each of Theo‘s and Justin‘s talks — and this was just minutes after it was coined. (Video of this keynote will be available soon; slides are here.)

Beyond the keynote, I also gave a talk on the perils of building enterprise systems from commodity components. This blog entry has gone on long enough, so I won’t go into depth on this topic here, saying only this: fear the octopus!

Thanks to the organizers of both JSconf and Surge for two great conferences. I arrive home thoroughly exhausted (JSconf parties longer, but Surge parties harder; pick your poison), but also exhilarated and excited about the future. Thanks again — and see you next year!

Posted on October 2, 2010 at 3:33 pm by bmc · Permalink · 3 Comments
In: Uncategorized