Serving up disaster porn with Manta

Years ago, Ben Fried liberated me by giving me the words to describe myself: I am a disaster porn addict. The sordid specifics of my addiction would surely shock the unafflicted: there are well-thumbed NTSB final accident reports hidden under my mattress; I prowl the internet late at night for pictures of dermoid cysts; and I routinely binge on the vexed Soviet missions to Mars.

In terms of my software work, my predilection for (if not addiction to) systems failure has manifested itself in an acute interest in postmortem debugging and debuggability. You can see this throughout my career, be it postmortem memory leak detection in 1999, postmortem object type identification in 2002, postmortem DTrace in 2003, or postmortem debugging of JavaScript in 2012.

That said, postmortem debugging does have a dark side — or at least, a tedious side: the dumps themselves are annoyingly large artifacts to deal with. Small ones are only tens of megabytes, but a core dump from a modestly large VM — and certainly a crash dump of a running operating system — will easily be gigabytes or tens of gigabytes in size. This problem is not new, and my career has been pockmarked by centralized dump servers that were constantly running low on space[1]. To free up space in the virtual morgue, dumps from resolved issues are inevitably torched on a regular basis. This act strikes me as a kind of desecration; every dump — even one whose immediate failure is understood — is a sacred snapshot of a running system, and we can’t know what questions we might have in the future that may be answerable by the past. There have been many times in my career when debugging a new problem has led to my asking new questions of old dumps.

At Joyent, we have historically gotten by with the traditional approach to the dump management problem — centralized dump servers managed by creaky shell scripts — but it wasn’t pretty, and I lived in fear of dumps slipping through the cracks. Ironically, it was the development of Manta that made our kludged-together solution entirely untenable: while in the Joyent public cloud we may have the luxury of (broadly) ignoring core dumps that may correspond to errant user code, in Manta we care about every core dump from every service — we always want to understand why any component fails. But because it’s a distributed service, a single bug could cause many components to fail — and generate many, many dumps. In the earlier stages of Manta development, we quickly found ourselves buried under an avalanche of hundreds of dumps. Many of them were due to known issues, of course, but I was concerned that a hitherto unseen issue would remain hidden among the rubble — and that we would lose our one shot to debug a problem that would return to haunt us in production.

A sexy solution

Of course, the answer to the Manta dump problem was clear: Manta may have been presenting us with a big data problem, but it also provided us a big data solution. Indeed, Manta is perfectly suited for the dump problem: dumps are big, but they also require computation (if in modest measure) to make any real use of them, and Manta’s compute abstraction of the operating system is perfect for running the debugger. There were just two small wrinkles, the first being that Manta as designed only allowed computation access to an object via a (non-interactive) job, posing a problem for the fundamentally interactive act of debugging. This wasn’t a technological limitation per se (after all, the operating system naturally includes interactive elements like the shell and so on), but allowing an interactive shell to be easily created from a Manta job was going to require some new plumbing. Fortunately, some lunchtime moaning on my part managed to convince Joshua Clulow to write this (if only to quiet my bleating), and the unadulterated awesomeness that is mlogin was born.

The second wrinkle was even smaller: for historical (if not antiquated) reasons, kernel crash dumps consist not of one file but two — the crash dump itself and a “name list” that contains the symbol table. But for over a decade we have also had the symbol table in the dump itself, and it was clear that the vestigial appendix that was unix.0 needed to be removed and libkvm modified to dig the symbol table out of the dump — a straightforward fix.

Wrinkles smoothed, we had the necessary foundation in Manta to implement Thoth, a Manta-based system for dump management. Using the node-manta SDK to interact with Manta, Thoth allows for uploading dumps to Manta, tracking auxiliary information about each dump, debugging a dump (natch), querying dumps, and (most interestingly) automatically analyzing many dumps in parallel. (If you are in the SmartOS or broader illumos community and you want to use Thoth, it’s open source and available on GitHub.)

While the problem Thoth solves may have but limited appeal, some of the patterns it uses may be more broadly applicable to those building Manta-based systems. As you might imagine, Thoth uploads a dump by calculating a unique hash for the dump on the client[2]. Once the hash is calculated, a Manta directory is created that contains the dump itself. For the metadata associated with the dump (machine, application, datacenter and so on), I took a quick-and-dirty approach: Thoth stores the metadata about a dump in a JSON payload that lives beside the dump in the same directory. Here is what a JSON payload might look like:

% thoth info 6433c1ccfb41929d
{
  "name": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea",
  "dump": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea/core.svc.startd.70377",
  "pid": "70377",
  "cmd": "svc.startd",
  "psargs": "/lib/svc/bin/svc.startd",
  "platform": "joyent_20130625T221319Z",
  "node": "eb9ca020-77a6-41fd-aabb-8f95305f9aa6",
  "version": "1",
  "time": 1373566304,
  "stack": [ ... ],
  "type": "core"
}
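The upload flow described above can be sketched in shell. This is a sketch under stated assumptions: the hash function (sha1sum, truncated to 32 characters) and the path layout are illustrative stand-ins, not necessarily what Thoth actually computes, and the node-manta commands that would follow are shown only as comments:

```shell
# Derive a dump's Manta directory name from a client-side hash,
# in the spirit of Thoth's upload. The hash function and layout
# here are assumptions for illustration.
dump=/tmp/example.core
printf 'stand-in for a real dump\n' > $dump
hash=$(sha1sum $dump | awk '{ print substr($1, 1, 32) }')
dir=/thoth/stor/thoth/$hash
echo $dir
# mmkdir -p $dir                         # create the dump's directory
# mput -f $dump $dir/$(basename $dump)   # upload the dump itself
# ... then generate info.json and mput it beside the dump
```

The point of the hash-named directory is that the dump, its metadata, and any derived artifacts all live together under one content-derived name.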

To actually debug a dump, one can use the “debug” command, which simply uses mlogin to allow interactive debugging of the dump via mdb:

% thoth debug 6433c1ccfb41929d
thoth: debugging 6433c1ccfb41929dedf3257a8d9160ea
 * created interactive job -- 8ea7ce47-a58d-4862-a070-a220ae23e7ce
 * waiting for session... - established
thoth: dump info in $THOTH_INFO
Loading modules: [ svc.startd ]

Importantly, this allows you to share a dump with others without moving the dump and without needing to grant ssh access or punch holes in firewalls — they need only have access to the object. For an open source system like ours, this alone is huge: if we see a panic in (say) ZFS, I would like to enlist (say) Matt or George to help debug it — even if it’s only to verify a hypothesis that we have developed. Transporting tens of gigs around becomes so painful that we simply don’t do it — and the quality of software suffers.

To query dumps, Thoth runs simple Manta compute jobs on the (small) metadata objects. For example, if you wanted to find all of the dumps from a particular machine, you would run:

% thoth ls node=eb9ca020-77a6-41fd-aabb-8f95305f9aa6
thoth: creating job to list
thoth: created job 2e8af859-ffbf-4757-8b54-b7865307d4d9
thoth: waiting for completion of job 2e8af859-ffbf-4757-8b54-b7865307d4d9
DUMP             TYPE  TIME                NODE/CMD         TICKET
0ccf27af56fe18b9 core  2013-07-11T18:11:43 svc.startd       OS-2359
6433c1ccfb41929d core  2013-07-11T18:11:44 svc.startd       OS-2359

This kicks off a job that looks at all of the metadata objects in parallel in a map phase, pulls out those that match the specified machine, and passes them to a reduce phase that simply serves to coalesce the results into a larger JSON object. We can use the mjob command along with Trent Mick’s excellent json tool to see the first phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[0].exec
mls /thoth/stor/thoth | awk '{ printf("/thoth/stor/thoth/%s/info.json\n", $1) }' | xargs mcat

So this, as it turns out, is a tad fancy: in the context of a job, it runs mls on the thoth directory, formats that as an absolute path name to the JSON object, and then uses the venerable xargs to send that output as arguments to mcat. mcat interprets its arguments as Manta objects, and runs the next phase of a job in parallel on those objects. We could — if we wanted — do the mls on the client and then pass in the results, but it would require an unnecessary roundtrip between server and client for each object; the power of Manta is that you can get the server to do work for you! Now let’s take a look at the next phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[1].exec
json -c 'node=="eb9ca020-77a6-41fd-aabb-8f95305f9aa6"'

This is very simple: it just uses json’s -c option to pass the entire JSON blob if the “node” property matches the specified value. Finally, the last phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[2].exec
json -g

This uses json (again) to aggregate all of the JSON blobs into one array that contains all of them — and this array is what is returned as the output of the job. Now, as one might imagine, all of this is a slight abuse of Manta, as it’s using Manta to implement a database (albeit crudely) — but Manta is high-performing enough that this works well, and it’s much simpler than having to spin up infrastructure dedicated to Thoth (which would require sizing a database, making it available, backing it up, etc). Manta loves it quick and dirty!
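The three phases can be simulated locally without Manta at all. In this sketch, /tmp/thoth stands in for the /thoth/stor/thoth directory, plain grep stands in for json -c, and the node values are made up:

```shell
# Simulate the list/filter phases of `thoth ls` locally: each
# hash-named directory holds an info.json metadata object.
mkdir -p /tmp/thoth/0ccf27af56fe18b9 /tmp/thoth/6433c1ccfb41929d
echo '{"name":"0ccf27af56fe18b9","node":"node-1"}' \
    > /tmp/thoth/0ccf27af56fe18b9/info.json
echo '{"name":"6433c1ccfb41929d","node":"node-2"}' \
    > /tmp/thoth/6433c1ccfb41929d/info.json

# Map phase: list the directory and format metadata-object paths,
# just as the mls | awk | xargs mcat pipeline does in the job;
# then filter (grep standing in for `json -c`).
ls /tmp/thoth | awk '{ printf("/tmp/thoth/%s/info.json\n", $1) }' \
    | xargs cat \
    | grep '"node":"node-1"'
```

In the real job the filtering runs in parallel next to the objects; here the pipeline just makes the shape of the computation visible.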

Getting kinky

Leveraging Manta’s unique strengths, Thoth also supports a notion of analyzers, simple shell scripts that run in the context of a Manta job. For example, here is an analyzer that determines if a dump is a duplicate of a known svc.startd issue:

# This analyzer only applies to core files
if [[ "$THOTH_TYPE" != "core" ]]; then
	exit 0
fi

# This is only relevant for svc.startd
if [[ `cat $THOTH_INFO | json cmd` != "svc.startd" ]]; then
	exit 0
fi

# This is only OS-2359 if we have utmpx_postfork in our stack
if ( ! echo ::stack | mdb $THOTH_DUMP | grep utmpx_postfork > /dev/null ); then
	exit 0
fi

# We have a winner! Set the ticket.
thoth_ticket OS-2359
echo $THOTH_NAME: successfully diagnosed as OS-2359

Thoth’s “analyze” command can then be used to run this against one dump, any dumps that match a particular specification, or all dumps.

The money shot

Here’s a more involved analyzer that pulls an SMF FMRI out of the dump and — looking at the time of the dump — finds the log (in Manta!) for that service at that hour, then searches it for a string (“PANIC”) that indicates service failure. This log snippet is then attached to the dump via a custom property (“paniclog”) that can then be queried by other analyzers:

if [[ "$THOTH_TYPE" != "core" ]]; then
	exit 0
fi

if ( ! pargs -e $THOTH_DUMP | grep -w SMF_FMRI > /dev/null ); then
	exit 0
fi

fmri=`pargs -e $THOTH_DUMP | grep -w SMF_FMRI | cut -d= -f2-`
time=`cat $THOTH_INFO | json time`
node=`cat $THOTH_INFO | json node`

path=`basename $fmri | cut -d: -f1`/`date -d @$time +%Y/%m/%d/%H`/${node}.log

echo === $THOTH_NAME ===

echo log: $path

if ( ! mget /$MANTA_USER/stor/logs/$path > log.out ); then
	echo "  (not found)"
	exit 0
fi

grep -B 10 -A 10 PANIC log.out > panic.out

echo paniclog:
cat panic.out

thoth_set paniclog < panic.out
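The log-path computation in the middle of that analyzer can be tried standalone. The FMRI and node values here are made up, and GNU date’s -d @epoch syntax is assumed (with -u added so the result is deterministic; on illumos the date invocation differs):

```shell
# Compute the Manta log path for a service at the hour of the dump,
# as the analyzer does. All values below are sample data.
fmri='svc:/system/filesystem/minimal:default'
time=1373566304
node='eb9ca020-77a6-41fd-aabb-8f95305f9aa6'
svc=`basename $fmri | cut -d: -f1`       # instance name sans ":default"
hour=`date -u -d @$time +%Y/%m/%d/%H`    # hour bucket of the dump
path=$svc/$hour/${node}.log
echo $path
```

For this timestamp the result is minimal/2013/07/11/18/eb9ca020-77a6-41fd-aabb-8f95305f9aa6.log — the same hour that appears in the thoth ls listing above.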

In a distributed system, logs are a critical component of system state — being able to quickly (and automatically) couple those logs with core dumps has been huge for us in terms of quickly diagnosing issues.

Pillow talk

Thoth is definitely a quick-and-dirty system — I was really designing it to satisfy our own immediate needs (and yes, addictions). If the usage patterns changed (if, for example, one were going to have millions of dumps instead of thousands), one would clearly want a proper database fronting Manta. Of course, it must also be said that one would then need to deal with annoying infrastructure issues like sizing that database, making it available, backing it up, etc. And if (when?) this needs to be implemented, my first approach would be to keep the structure broadly as it is, but add a Manta job that iterates over all metadata and adds the necessary rows to the appropriate tables of a centralized reporting database. This lets Manta store the canonical data, leveraging the in situ compute to populate caching tiers and/or reporting databases.
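That backfill job could be as simple as a map phase that turns each metadata object into a row. A crude sketch — the "dumps" table is hypothetical, and sed-based field extraction is a deliberate simplification (a real job would use the json tool):

```shell
# Turn an info.json metadata object into an INSERT statement for a
# hypothetical reporting table. Sample metadata stands in for the
# object the map phase would receive on stdin.
echo '{"name":"6433c1ccfb41929d","node":"eb9ca020","time":1373566304}' \
    > /tmp/info.json
row=$(sed -e 's/.*"name":"\([^"]*\)".*"node":"\([^"]*\)".*"time":\([0-9]*\).*/INSERT INTO dumps (name, node, time) VALUES ('\''\1'\'', '\''\2'\'', \3);/' /tmp/info.json)
echo "$row"
```

Run across all metadata objects as a Manta job, the emitted statements could be reduced into a single file and loaded into the reporting database in one pass.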

As an engineer, it’s always gratifying to build something that you yourself want to use. Strange as it may sound to some, Thoth epitomizes that for me: it is the dump management and analysis framework that I never dreamed possible — and thanks to Manta, it was amazingly easy (and fun!) to build. Most importantly, in just the short while that we’ve had Thoth fully wired up, we have already caught a few issues that would have likely escaped notice before. (Thanks to a dump harvested by Thoth, we caught a pretty nasty OS bug that had managed to escape detection for many years.) Even though Thoth’s purpose is admittedly narrow, it’s an example of the kinds of things that you can build on top of Manta. We have many more such ideas in the works — and we’re not the only ones; Manta is inspiring many with their own big data problems — from e-commerce companies to scientific computing and much in between!

[1] Pouring one out for my dead homie, cores2.central

[2] Shout-out to Robert’s excellent node-ctype, which allows binary dumps to be easily parsed client-side

Posted on July 23, 2013 at 5:05 pm by bmc

Manta: From revelation to product

If you haven’t seen it, you should read Mark’s blog entry on Manta, the revolutionary new object storage service we announced today that features in situ compute. The idea for this is beautifully simple: couple a ZFS-based distributed object storage system with the OS-level virtualization in SmartOS to deliver a system that allows arbitrary compute to be spun up where objects live. That is, not only can you store and retrieve objects (as you can with any internet-facing object store), you can also specify compute jobs to be operated upon those objects without requiring data motion. (If you still need to be convinced that this represents a new paradigm of object storage, check out this screencast.)

Sometimes simple ideas can seem obvious in hindsight — especially as the nuance of historical context is lost to time — so for the record let me be clear on this point: this idea wasn’t obvious. I say this unequivocally because I myself was trying to think about how we could use the technological differentiators in SmartOS to yield a better object store — and it was very hard to not seek that differentiator in ZFS. As myopic as it may seem now in retrospect, I simply couldn’t look beyond ZFS — it just had to hold the key to a next generation object store!

But while ZFS is essential to Manta, it is ultimately an implementation detail; the technology that serves as the essential enabler is the OS-level virtualization provided by Zones, which allows us to easily, quickly and securely spin up on-the-metal, multi-tenant compute on storage nodes without fear of compromising the integrity of the system. Zones hit the sweet spot: hardware virtualization (e.g. KVM) is at too low a level of abstraction to allow this efficiently, and higher levels of virtualization (e.g. PaaS offerings) sacrifice expressive power and introduce significant multi-tenancy complexity and risk.

Of course, all of this was obvious once Mark had the initial insight to build on Zones; what we needed to build was instantly self-evident. This flash of total clarity is rare in a career; I have only felt it a handful of times and it’s such an intense moment that it becomes locked in memory. I remember exactly where I was when Bonwick described to me the first ideas for what became ZFS (my dimly lit office in MPK17, circa 2000) — and I remember exactly where I was when I described to Bonwick my first ideas for what became DTrace (in Bart’s old blue van on Willow Road crossing the 101, February 1996). Given this, there was one thing about Manta that troubled me: I couldn’t remember where I was when Mark described the idea to me. I knew that it came from Mark, and I knew that it was sometime in the fall of 2011, but I couldn’t remember the details of the conversation. In talking to Mark about this, he couldn’t remember either — so I decided to go through old IM logs to determine when we first started talking about it to help us both date it.

And in going through my logs, it became clear why I couldn’t remember that initial conversation — because there wasn’t one, at least not in the traditional sense: it happened over IM. This (accidentally) captured for posterity a moment of which one has so few: having one’s brain blasted by enlightenment. (I know that it will disappoint my mother that I dealt with this essentially by swearing, so let me pause to explain that this isn’t her fault; as she herself points out from time to time, I wasn’t raised this way.) So here is my initial conversation with Mark, with some sensitive details redacted:

Of course, a flash of insight is a long way from a product — and that conversation over a year and a half ago was just the beginning of a long journey. As Mark mentioned, shortly after this conversation, he was joined by Dave who led the charge on Manta computation. To this duo, we added Yunong, Nate and Fred. Beyond that core Manta team, Keith developed the hardware architecture for Manta, Jerry developed the first actual Manta code with hyprlofs, and Bill and Matt wrote a deployment management system for Manta — and Matt further developed the DNS service that is at the heart of the system.

As Mark mentioned, Manta is built on top of SmartDataCenter, and as such the engineering work behind it was crucial to Manta: Orlando developed the compute node service that is involved with the provision of every Manta service, Pedro built the workflow engine that actually implements provisioning and Kevin developed the operator console that you’ll have to just trust me is awesome. John developed the auth mechanism that many first-time users will use to create their Joyent accounts today, and Andrés developed the name service replication that assures that those new users will be able to store to Manta.

In terms of SDKs, Marsell developed the ruby-manta SDK, Trent developed both the Python SDK (including mantash!) and Bunyan — the node.js logging library that has proven indispensable time and time again when debugging Manta issues. Speaking of node.js, no one will be surprised to learn that Manta is largely implemented in node — and that TJ and Isaac were both clutch in helping us debug some nasty issues down the stretch, reinforcing our conviction that the entire stack should be under one (virtual) roof!

Josh Wilsdon developed the vmadm that lies at the heart of SmartOS provisioning — and deserves special mention for the particularly heavy hand he applied to Manta in some of our final load testing; any system that can survive Josh is production-ready! Robert and Rob both jumped in on countless networking issues — they were both critical for implementing some necessary but complicated changes to Manta’s physical networking topology. Brendan provided the performance analysis that is his hallmark, joining up with Robert to form Team Tunable — from whom no TCP tunable is safe!

Jonathan developed the multi-architecture package support which became necessary for the Manta implementation and Filip made sure all sorts of arcane software ran on SmartOS in the Manta compute zone. (When you find that what you’re looking for is already installed in a Manta compute zone, you have Filip to thank!) Finally, Josh Clulow developed mlogin — which really must be tried to be believed. If you’re trying to understand the Manta compute model or if you just want to play around, give mlogin a whirl!

From Mark’s initial flash of insight (and my barrage of swearing) to finished product, it has been a long road, but we are proud to deliver it to you today; welcome to Manta — and dare we say to the next generation of object storage!

Posted on June 25, 2013 at 6:16 am by bmc

A systems software double-header: Surge and GOTO

I recently returned from a systems software double-header, presenting at two of the industry’s best conferences: Surge in Baltimore and GOTO in Aarhus, Denmark. These conferences have much in common: they are both expertly run; they both seek out top technical content; they both attract top technologists; and (it must be said) they both take excellent care of their speakers!

At Surge, I presented with Brendan Gregg on DIRT in production. This was a dense presentation, but it was fun for us to recount a sampling of the kinds of issues that we have seen in the new breed of latency-sensitive, data-intensive applications. I think Brendan and I both agreed that this presentation could have easily been two or three times the length; we have plenty of scars from finding latency outliers in DIRTy systems! Beyond presenting, the lightning talks were a highlight for me at this year’s Surge. I am an inveterate disaster porn addict, so it should come as no surprise that I particularly enjoyed Chris Burroughs’ retelling of how the day he had set aside to prepare his lightning talk was beset by disaster (including a PDU failure at the moment it seemed things couldn’t get any worse), and then Scott Sanders on the surprisingly nasty failure modes that one can see in a mobile app when the root cause is a botched datacenter upgrade at your mobile provider. For my own lightning talk, I dug into the perverse history of what may well be the most obscure two-letter Unix command: the “old and rarely used” ta; hopefully others enjoyed learning about the history of this odd little wart as much as I did!

Update: This lightning talk is now online and can be viewed here.

Dave and me with Lars Bak, Kasper Lund and the Dart (né V8) team

After Surge, Dave and I headed east, to Denmark. There, we presented on dynamic languages in production. Our purposes for going to Aarhus were more than just GOTO, however: we also went to meet with Lars Bak, Kasper Lund, Erik Corry, the inimitable Mr. Aleph and the other engineers behind V8. As the birthplace of V8, Aarhus has always had a Dagobah-like appeal to me: I feel that when on the freezing ice planet that is mdb_v8.c, Dave and I might as well have seen Obi-Wan Kenobi as an apparition, telling us we must go to Aarhus if we wanted to truly understand objects-inl.h. Dave and I immensely enjoyed our time with Lars and his team; it is rare to be in the presence of a team that has single-handedly changed the trajectory of software — and it is rarer still to be able to bring one’s own technology to such an august table. We were thrilled to be able to demonstrate the confluence of our technology (specifically, MDB and DTrace) with theirs — and to brainstorm ways that we might better solve the abstract problem of production debuggability of dynamic environments. This is a problem that we’ve been after in one way or another for nearly a decade, but this was the first time that we were having the conversation early enough in the VM implementation phase (specifically, of Dart) that one can hope to get ahead of the problem instead of chasing it. In particular, Lars’s team has a very interesting idea for debuggability of Dart that — if successful — may allow it to achieve our nirvana of total postmortem debuggability in a dynamic environment. That, coupled with its enlightened (C-based!) native interface, left Dave and me optimistic about the long-term prospects of Dart on the server side. It’s obviously early days still for Dart, but given the pedigree and talent of the team and their disposition towards making it both highly performing and highly debuggable, it’s clearly something to pay attention to on the server side.

Presentation and meeting accomplished, I still had a burning question: what was all of this terrific VM talent doing in the second city of a small country with an excruciatingly difficult language to pronounce? (Aside: ask a Dane to say “dead red red-eyed rotten smoked trout” in Danish.) In talking to Lars, Kasper and Erik about how they ended up in Aarhus a common theme emerged: all had — in one way or another — been drawn to work with Professor Ole Lehrmann Madsen on the VM for the BETA language. (No, I hadn’t heard of BETA either, and yes, it’s in all caps — the name is apparently meant to be screamed.) Everyone I spoke with had such fond memories of BETA (“you know, it was actually a very interesting system…”) that I have taken to calling this school of VM engineers the BETA School. The work and influence of the engineers of the BETA School extend beyond Google, with its members having done foundational VM work at VMware, Sun and a slew of startups.

I suppose I shouldn’t be surprised that this cluster of engineers has a professor and a university at its root; Brown’s Tom Doeppner has served a similar role for us, with his CS169 giving rise to an entire generation of OS innovation. The two schools of engineers share much in common: both are highly technical and close-knit — but with a history of being open to newcomers and an emphasis on developing their junior members. (Of course, the difference is that we of the CS169 School don’t all live in Providence.) And the intersection of the two schools is particularly interesting: VMware’s seminal ASPLOS paper is a joint product of Keith Adams of the CS169 School and Ole Agesen of the BETA School.

Curiosity about Aarhus and its origins sated, I arrived home from my systems software double-header exhausted but also inspired: there seems to be no limit of hard, interesting problems in our domain — and many profoundly talented engineers to tackle them!

Posted on October 8, 2012 at 12:00 am by bmc

Post-revolutionary open source

I spent last week in Porto Alegre, Brazil at FISL (the Fórum Internacional Software Livre — which Americans will pronounce “fizzle” but which the Brazilians pronounce more like “fees-lay”). I was excited to go to FISL based on the past experiences of former Sun Microsystems folks like Eric Schrock and Deirdré Straughan — and I thought that it would provide an excellent opportunity to present what I’ve learned over the past decade or so about how companies screw up open source. My presentation, corporate open source anti-patterns, was well received — surprisingly so, in fact. In particular, I had expected controversy around my claim that anti-collaborative strong-copyleft licenses — which is to say, the GPLv2 — are an open source anti-pattern. As I noted in my presentation, I believe that the GPLv2 serves more to erect walls between open source projects than it does to liberate proprietary ones. To support this assertion, I pointed to the fact that, despite significant demand, Linux does not have ports of DTrace and ZFS generally available — even though they have been open source for the better part of a decade and have been ported to many other operating systems. But as I made the case for moving away from the GPLv2 (and towards collaboration-friendly licenses like MPLv2), I was surprised to see heads bobbing in agreement. And when it came time for questions, the dissent (such as it was) was not regarding my assertion about GPLv2, but rather about copyright assignment: I claimed (and claim) that one must be very careful about having copyright assignment in an open source project because it relies on the good behavior of a single actor. But one of the questioners pointed out that lack of copyright assignment has the very important drawback of making changing the license essentially impossible. The questioner’s background is relevant here: it was Jean-Baptiste Kempf, one of the core contributors to VLC. 

JB’s question came from his own experience: he had just led the VLC team through relicensing libVLC, an exercise made excruciating by their lack of copyright assignment. Why were they relicensing their core software? Ironically, because they had come to my same conclusion with respect to anti-collaborative licensing: they relicensed libVLC from GPLv2 to LGPLv2 to allow it to be more readily shared with other projects. And of course, VLC’s not alone — meteor.js is another high-profile project that came out as GPLv2 but has since relicensed to MIT.

So what’s going on here? I think the short answer is that open source is firmly in its post-revolutionary phase: it is so thoroughly mainstream that it is being entirely de-radicalized. While some extremists may resist this trend, it actually reflects the depth of victory for open source: the triumph is so complete that the young software engineers of today simply don’t see proprietary software as a threat. Like an aspiring populace after a long period of conflict, the software engineers behind today’s top open source projects are looking much more to a future of collaboration than to the past of internecine rivalries. While there are many indicators of this new era — from the license breakdown of Github’s top projects to the rise of Github itself — I think that the MPLv2 seems to particularly capture this zeitgeist: when it was designed, it was done so to be explicitly compatible with both GPL (GPLv2 and GPLv3) and Apache. So the open source revolution may be over, but that’s because the war has been won — may we not squander its spoils!

Posted on August 1, 2012 at 9:49 pm by bmc

DTrace in the zone

When the zones facility was originally developed many years ago, our focus in integrating it with DTrace was firmly on the global zone: we wanted to be sure that we could use DTrace from the global zone to instrument the entire system, across all zones. This was (and remains) unique functionality — no other implementation of OS-based virtualization has the whole-system dynamic instrumentation to go with it. But as that focus implies, the ability to use DTrace in anything but the global zone was not a consideration. Fortunately, in 2006, Dan Price on the zones team took it upon himself to allow DTrace to be used in the non-global zone. This was a tricky body of work that had to carefully leverage the DTrace privilege model developed by Adam, and it allowed root users in the non-global zone to have access to the syscall provider and any user-level providers in their zone (both USDT-provided and pid-provided).

As great as this initial work was, when I started at Joyent nearly two years ago, DTrace support in the zone hadn’t really advanced beyond it — and there were still significant shortcomings. In particular, out of (reasonable) paranoia, nothing in the kernel could be instrumented or recorded by an enabling in the non-global zone. The problem is that many commonly used D variables — most glaringly, cpu, curlwpsinfo, curpsinfo, and fds[] — rely on access to kernel structures. Worse, because of our success in abstracting away the kernel with these variables, users didn’t even know that they were trying to record illegal kernel data — which meant that the error message when it failed was entirely unintuitive:

  [my-non-global-zone ~]# dtrace -n BEGIN'{trace(curpsinfo->pr_psargs)}'
  dtrace: description 'BEGIN' matched 1 probe
  dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44

The upshot of this was, as one Joyent customer put it to me bluntly, “whenever I type in a script from Brendan’s book, it doesn’t work.” And just to make sure I got the message, the customer sharpened the point considerably: “I’m deployed on SmartOS because I want to be able to use DTrace, but you won’t let me have it!” At that point, I was pretty much weeping and begging for forgiveness — we clearly had to do better.

As I considered this problem, though, a somewhat obvious solution to at least a portion of the problem became clear. The most painful missing variables — curpsinfo and curlwpsinfo — are abstractions on the kernel’s proc_t and kthread_t structures, respectively. There are two important observations to be made about these structures. First, there is nothing in one’s own proc_t or kthread_t that one can use for privilege escalation: it’s either stuff you can already figure out about yourself via /proc, or kernel addresses that you can’t do anything with anyway. Second (and essentially), we know that these structures cannot disappear or otherwise change identity while in probe context: they are only freed after the process or thread that they represent has exited — and we know by merit of our execution in probe context that we have not, in fact, exited. So we could solve a good portion of the problem just by allowing access to these structures for processes for which one has the dtrace_proc privilege. This change was straightforward (and the highlight of it for me personally was that it reminded me of some clever code that Adam had written that dynamically instruments DIF to restrict loads as needed). So this took care of curpsinfo, curlwpsinfo, and cpu — but it also brought into stark relief an even tougher problem: fds[].

The use of fds[] in the non-global zone has long been a problem, and it’s a brutal one, as fds[] requires grovelling around in kernel data structures. For context, here is the definition of fds[] for illumos, as defined in io.d:

inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
    fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
    curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

Clearly, one cannot simply allow such arbitrary kernel pointer chasing from the non-global zone without risk of privilege escalation. At dtrace.conf, I implied (mistakenly, as it turns out) that this problem depends on (or would be otherwise solved by) dynamic translators. As I thought about this problem after the conference, I realized that I was wrong: even if the logic to walk those structures were in-kernel (as it would be, for example, with the addition of a new subroutine), it did not change the fundamental problem that the structures that needed to be referenced could themselves disappear (and worse, change identity) during the execution of an enabling in probe context. As I considered the problem, I realized that I had been too hung up on making this arbitrary — and not focussed enough on the specifics of getting fds[] to work.

With the focus sharpening to fds[], I realized that some details of the kernel’s implementation could actually make this reasonable to implement. In particular, Jeff‘s implementation of allocating slots in fi_list means that an fi_list never goes away (it is doubled but the old fi_list is not freed) and that fi_nfiles is never greater than the memory referred to by fi_list. (This technique — memory retiring — can be used to implement a highly concurrent table; the interested reader is pointed to both the ACM Queue article that Jeff and I wrote that describes it in detail and to the implementation itself.) So as long as one is only able to get the file_t for one’s own file descriptor, we could know that the array of file_t pointers would not itself be freed over the course of probe context. That brings us to the file_t itself: unlike the fi_list, the file_t can be freed while one is in probe context, because one thread could be closing a file while another is in probe context referencing fds[]. Solving this problem required modifying the kernel, if only slightly: I added a hook to the closef() path that DTrace can optionally use to issue a dtrace_sync(). (A dtrace_sync() issues a synchronous cross-call to all CPUs; because DTrace disables interrupts in probe context, this can be used as a barrier with respect to probe context.)

Adding a hook of this nature requires an understanding of the degree to which the underlying code path is performance-critical. That is, to contemplate adding this hook, we needed to ask: how hot is closef(), anyway? Historically in my career as a software engineer, this kind of question would be answered with a combination of ass scratching and hand waving. Post-DTrace, of course, we can answer those questions directly — but only if we have a machine that has a representative workload (which we only had to a very limited degree at Sun). But one of my favorite things about working at Joyent is that we don’t have to guess when we have a question like this: we can just fire up DTrace on every node in the public cloud for a couple of seconds and get the actual data from (many) actual production machines. And in this case, the data was at least a little bit surprising: we had (many) nodes that were calling closef() thousands of times per second. (The hottest node saw nearly 5,000 closef()’s per second over a ten second period — and there were many that saw on the order of 2,000 per second.) This is way too hot to always be executing a dtrace_sync() when DTrace is loaded (as it effectively always is on our systems), so we had to be sure that we only executed the dtrace_sync() in the presence of an enabling that actually used fds[] with reduced privileges (e.g., from a non-global zone).

This meant that fds[] needed to be implemented in terms of a subroutine such that we could know when one of these is enabled. That is, we needed a D subroutine to translate from a file descriptor to a file_t pointer within the current context. Given this, the name of the subroutine was obvious: it had to be getf() — the Unix routine that does exactly this in the kernel, and has since Fourth Edition Unix, circa 1973. (Aside: one of the reasons I love illumos is because of this heritage: compare the block comment above the Seventh Edition implementation of getf() to that above getf() currently in illumos — yes, we still have comments written by Ken!) So the change to allow fds[] in the non-global zone ended up being more involved, but in the end, it wasn’t too bad — especially given the functionality that it unlocked.

With these two changes made, Brendan and I brainstormed about what remained out of reach for the non-global zone — and the only other thing that we could come up with as commonly wanting in the non-global zone would be access to the sched and proc providers. Allowing use of these in-kernel providers would require allowing enabling of the probes but not firing them when outside the context of one’s own zone (similar to the syscall provider) and further explicitly forbidding getting anything about privileged context (specifically, register state). Fortunately, I had had to do exactly this to fix a somewhat nasty zones-related issue with the profile provider, and it seemed that it wouldn’t be too bad to extend this with these slightly new semantics. This proved to be the case, and the change to allow sched, proc, vminfo and sysinfo providers in the non-global zone ended up being very modest.

So with these three changes, I am relieved to report that DTrace is now completely usable in the non-global zone — and all without sacrificing the security model of zones! If you are a Joyent cloud customer, we will be rolling out a platform with this modification across the cloud (it necessitates a reboot, so don’t expect it before your next scheduled maintenance window); if you are a SmartOS user, look for this in our next SmartOS release; and if you are using another illumos-based distro (e.g., OpenIndiana or OmniOS) look for it in an upcoming release — we will be integrating these changes into illumos, so you can expect them to be in downstream distros soon. And here’s to DTrace in the zone!

Posted on June 7, 2012 at 12:18 am by bmc · Permalink · 5 Comments
In: Uncategorized

Debugging node.js memory leaks

Part of the value of dynamic and interpreted environments is that they handle the complexities of dynamic memory allocation. In particular, one needn’t explicitly free memory that is no longer in use: objects that are no longer referenceable are found automatically and destroyed via garbage collection. While garbage collection simplifies the program’s relationship with memory, it does not mean the end of all memory-based pathologies: if an application retains a reference to an object that is ultimately rooted in a global scope, that object won’t be considered garbage and the memory associated with it will not be returned to the system. If enough such objects build up, allocation will ultimately fail (memory is, after all, finite) and the program will (usually) fail along with it. While this is not strictly — from the native code perspective, anyway — a memory leak (the application has not leaked memory so much as neglected to unreference a semantically unused object), the effect is nonetheless the same and the same nomenclature is used.
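To make that failure mode concrete, here is a minimal sketch (with entirely hypothetical names — this is an illustration, not code from any of the services discussed here): nothing is "lost" from the program’s perspective; the objects are simply still reachable from a rooted array, so the collector can never reclaim them.

```javascript
// A "leak" in a garbage-collected language: every completed task's
// state remains reachable from a module-scoped (rooted) array, so
// none of it is ever eligible for collection.
var completed = [];  // rooted in module scope -- a global root

function runTask(id) {
  var state = { id: id, scratch: new Array(10000).join('.') };
  // The task is done, but pushing its state onto a rooted array
  // keeps every byte of `scratch` reachable -- and thus alive.
  completed.push(state);
  return state.id;
}

for (var i = 0; i < 1000; i++) {
  runTask(i);
}

// As far as V8 is concerned, none of these 1000 objects is garbage:
console.log(completed.length); // prints 1000
```

The fix in a case like this is not to free anything, but to unreference: evict entries from `completed` once they are semantically dead.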

While all garbage-collected environments leave room for such leaks, they are particularly easy to create in JavaScript: closures create implicit references to variables within their scopes — references that might not be immediately obvious to the programmer. And node.js adds a new dimension of peril with its strictly asynchronous interface with the system: if backpressure from slow upstream services (I/O, networking, database services, etc.) isn’t carefully passed to downstream consumers, memory will begin to fill with the intermediate state. (That is, what one gains in concurrency of operations one may pay for in memory.) And of course, node.js is on the server — where the long-running nature of services means that the effect of a memory leak is much more likely to be felt and to affect service. Take all of these together, and you can easily see why virtually anyone who has stood up node.js in production will identify memory leaks as their most significant open issue.
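A contrived sketch of the closure hazard (again, hypothetical names): the stored callback only ever touches `meta`, but because a closure captures its enclosing scope, an engine may keep that scope — potentially including `payload` — alive for as long as the callback itself is referenced. Pile on unbounded producers with no backpressure, and the intermediate state accumulates.

```javascript
// Long-lived callbacks whose enclosing scopes pin per-request state.
var listeners = [];

function onRequest(payload) {
  var meta = { bytes: payload.length };
  // This closure references only `meta`, yet it closes over the
  // scope in which `payload` also lives -- so `payload` may be
  // retained until the listener itself is unreferenced.
  listeners.push(function () { return meta.bytes; });
  // With no backpressure, nothing ever drains `listeners`.
}

// Simulate a burst of requests, each carrying a ~4KB payload:
for (var i = 0; i < 500; i++) {
  onRequest(new Array(4096).join('x'));
}

console.log(listeners.length); // prints 500
```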

The state of the art for node.js memory leak detection — as concisely described by Felix in his node memory leak tutorial — is to use the v8-profiler and Danny Coates’ node-inspector. While this method is a lot better than nothing, it’s very much oriented to the developer in development. This is a debilitating shortcoming because memory leaks are often observed only after days (hours, if you’re unlucky) of running in production. At Joyent, we have long wanted to tackle the problem of providing the necessary tooling to identify node.js memory leaks in production. When we are so lucky as to have these in native code (that is, node.js add-ons), we can use libumem and ::findleaks — a technology that we have used to find thousands of production memory leaks over the years. But this is (unfortunately) not the common case: the common case is a leak in JavaScript — be it node.js or application code — for which our existing production techniques are useless.

So for node.js leaks in production, one has been left with essentially a Gedanken experiment: consider your load and peruse source code looking for a leak. Given how diabolical these leaks can be, it’s amazing to me that anyone ever finds these leaks. (And it’s less surprising that they take an excruciating amount of effort over a long period of time.) That’s been the (sad) state of things for quite some time, but recently, this issue boiled over for us: Voxer, a Joyent customer and long in the vanguard of node.js development, was running into a nasty memory leak: a leak sufficiently acute that the service in question was having to be restarted on a regular basis, but not so much that it was reproducible in development. With the urgency high on their side, we again kicked around this seemingly hopeless problem, focussing on the concrete data that we had: core dumps exhibiting the high memory usage, obtained at our request via gcore(1) before service restart. Could we do anything with those?

As an aside, a few months ago, Joyent’s Dave Pacheco added some unbelievable postmortem debugging support for node. (If postmortem debugging is new to you, you might be interested in checking out my presentation on debugging production systems from QCon San Francisco. I had the privilege of demonstrating Dave’s work in that presentation — and if you listen very closely at 43:26, you can hear the audience gasp when I demonstrate the amazing ::jsprint.) But Dave hasn’t built the infrastructure for walking the data structures of the V8 heap from an MDB dmod — and it was clear that doing so would be brittle and error-prone.

As Dave, Brendan and I were kicking around increasingly desperate ideas (including running strings(1) on the dump — an idea not totally without merit — and some wild visualization ideas), a much simpler idea collectively occurred to us: given that we understood via Dave’s MDB V8 support how a given object is laid out, and given that an object needed quite a bit of self-consistency with referenced but otherwise orthogonal structures, what about just iterating over all of the anonymous memory in the core and looking for objects? That is, iterate over every single address, and see if that address could be interpreted as an object. On the one hand, it was terribly brute force — but given the level of consistency required of an object in V8, it seemed that it wouldn’t pick up too many false positives. The more we discussed it, the more plausible it became, but with Dave fully occupied (on another saucy project we have cooking at Joyent — more on that later), I roped up and headed into his MDB support for V8…

The result — ::findjsobjects — takes only a few minutes to run on large dumps, and sheds important new light on the problem. The output of ::findjsobjects consists of representative objects, the number of instances of that object and the number of properties on the object — followed by the constructor and first few properties of the objects. For example, here is the dump on a gcore of a (non-pathological) Joyent node-based facility:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 20
findjsobjects:                   heap objects => 6079488
findjsobjects:             JavaScript objects => 4097
findjsobjects:              processed objects => 1734
findjsobjects:                 unique objects => 161
fc4671fd        1      1 Object: flags
fe68f981        1      1 Object: showVersion
fe8f64d9        1      1 Object: EventEmitter
fe690429        1      1 Object: Credentials
fc465fa1        1      1 Object: lib
fc46300d        1      1 Object: free
fc4efbb9        1      1 Object: close
fc46c2f9        1      1 Object: push
fc46bb21        1      1 Object: uncaughtException
fe8ea871        1      1 Object: _idleTimeout
fe8f3ed1        1      1 Object: _makeLong
fc4e7c95        1      1 Object: types
fc46bae9        1      1 Object: allowHalfOpen
fc45e249       12      4 Object: type, min, max, default
fc4f2889       12      4 Object: index, fields, methods, name
fd2b8ded       13      4 Object: enumerable, writable, configurable, value
fc0f68a5       14      1 SlowBuffer: length
fe7bac79       18      3 Object: path, fn, keys
fc0e9d21       20      5 Object: _onTimeout, _idleTimeout, _idlePrev, ...
fc45facd       21      4 NativeModule: loaded, id, exports, filename
fc45f571       23      8 Module: paths, loaded, id, parent, exports, ...
fc4607f9       35      1 Object: constructor
fc0f86c9       56      3 Buffer: length, offset, parent
fc0fc92d       57      2 Arguments: length, callee
fe696f59       91      3 Object: index, fields, name
fc4f3785       91      4 Object: fields, name, methodIndex, classIndex
fe697289      246      2 Object: domain, name
fc0f87d9      308      1 Buffer:

Now, any one of those objects can be printed with ::jsprint. For example, let’s take fc45e249 from the above output:

> fc45e249::jsprint
    type: number,
    min: 10,
    max: 1000,
    default: 300,

Note that that’s only a representative object — there are (in the above case) 12 objects that have that same property signature. ::findjsobjects can get you all of them when you specify the address of the reference object:

> fc45e249::findjsobjects

And because MDB is the debugger Unix was meant to have, that output can be piped to ::jsprint:

> fc45e249::findjsobjects | ::jsprint
    type: number,
    min: 10,
    max: 1000,
    default: 300,
    type: number,
    min: 0,
    max: 5000,
    default: 5000,
    type: number,
    min: 0,
    max: 5000,
    default: undefined,

Okay, fine — but where are these objects referenced? ::findjsobjects has an option for that:

> fc45e249::findjsobjects -r
fc45e249 referred to by fc45e1e9.height

This tells us (or tries to) who is referencing that first (representative) object. Printing that out (with the “-a” option to show the addresses of the objects):

> fc45e1e9::jsprint -a
fc45e1e9: {
    ymax: fe78e061: undefined,
    hues: fe78e061: undefined,
    height: fc45e249: {
        type: fe78e361: number,
        min: 14: 10,
        max: 7d0: 1000,
        default: 258: 300,
    selected: fc45e3fd: {
        type: fe7a2465: array,
        default: fc45e439: [...],

So if we want to learn where all of these objects are referenced, we can again use a pipeline within MDB:

> fc45e249::findjsobjects | ::findjsobjects -r
fc45e249 referred to by fc45e1e9.height
fc46fd31 referred to by fc46b159.timeout
fc467ae5 is not referred to by a known object.
fc45ecb5 referred to by fc45eadd.ymax
fc45ec99 is not referred to by a known object.
fc45ec11 referred to by fc45eadd.nbuckets
fc45ebb5 referred to by fc45eadd.height
fc45eb41 referred to by fc45eadd.ymin
fc45eb25 referred to by fc45eadd.width
fc45e3d1 referred to by fc45e1e9.nbuckets
fc45e3b5 referred to by fc45e1e9.ymin
fc45e399 referred to by fc45e1e9.width

Of course, the proof of a debugger is in the debugging; would ::findjsobjects actually be of use on the Voxer dumps that served as its motivation? Here is the (elided) output from running it on a big Voxer dump:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 292
findjsobjects:                   heap objects => 8624128
findjsobjects:             JavaScript objects => 112501
findjsobjects:              processed objects => 100424
findjsobjects:                 unique objects => 241
fe806139        1      1 Object: Queue
fc424131        1      1 Object: Credentials
fc424091        1      1 Object: version
fc4e3281        1      1 Object: message
fc404f6d        1      1 Object: uncaughtException
fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd     1037      3 Object: aborted, data, end
 8045475     1060      1 Object:
fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...
fafc4245     1673      1 Object: slab
fafc44d5     1702      5 Object: search, query, path, href, pathname
fafc440d     1784     14 Client: buffered_writes, name, timing, ...
fafc41c5     1796      3 Object: error, timeout, close
fafc4469     1811      3 Object: address, family, port
fafc42a1     2197      2 Object: connection, host
fbe10625     2389      2 Arguments: callee, length
fafc4251     2759     15 IncomingMessage: statusCode, httpVersionMajor, ...
fafc42ad     3652      0 Object:
fafc6785    11746      1 Object: oncomplete
fb7abc29    15155      1 Object: buffer
fb7a6081    15155      3 Object: , oncomplete, cb
fb121439    15378      3 Buffer: offset, parent, length

This immediately confirmed a hunch that Matt had had that this was a buffer leak. And for Isaac — who had been working this issue from the Gedanken side and was already zeroing in on certain subsystems — this data was surprising inasmuch as it was so confirming: he was already on the right path. In short order, he nailed it, and the fix is in node 0.6.17.

The fix was low risk, so Voxer redeployed with it immediately — and for the first time in quite some time, memory utilization was flat. This was a huge win — and was the reason for Matt’s tantalizing tweet. The advantages of this approach are that it requires absolutely no modification to one’s node.js programs — no special flags and no different options. And it operates purely postmortem. Thanks to help from gcore(1), core dumps can be taken over time for a single process, and those dumps can then be analyzed off-line.

Even with ::findjsobjects, debugging node.js memory leaks is still tough. And there are certainly lots of improvements to be made here — there are currently some objects that we do not know how to correctly interpret, and we know that we can improve our algorithm for finding object references — but this shines a bright new light into what had previously been a black hole!

If you want to play around with this, you’ll need SmartOS or your favorite illumos distro (which, it must be said, you can get simply by provisioning on the Joyent cloud). You’ll need an updated — which you can either build yourself from illumos-joyent or download as a binary. From there, follow Dave’s instructions. Don’t hesitate to ping me if you get stuck or have requests for enhancement — and here’s to flat memory usage on your production node.js service!

Posted on May 5, 2012 at 1:07 pm by bmc · Permalink · Comments Closed
In: Uncategorized

Standing up SmartDataCenter

Over the past year, we at Joyent have been weaving some threads that may have at times seemed disconnected: last fall, we developed a prototype of DTrace-based cloud instrumentation as part of hosting the first Node Knockout competition; in the spring, we launched our service with full-fledged real-time cloud analytics; and several weeks ago, we announced the public availability of SmartOS, the first operating system to combine ZFS, DTrace, OS-based virtualization and hardware virtualization in the same operating system kernel. These activities may have seemed disjoint, but there was in fact a unifying theme: our expansion at Joyent from a public cloud operator to a software company — from a company that operates cloud software to one that also makes software for cloud operators.

This transition — from service to product — is an arduous one, for the simple reason that a service can make up for any shortcomings with operational heroics that are denied to a product. Some of these gaps may appear superficial (e.g., documentation), but many are deep and architectural; that one is unable to throw meat at a system changes many fundamental engineering decisions. (It should be said that some products don’t internalize these differences and manage to survive despite their labor-intensive shortcomings, but this is thanks to the caulking called “professional services” — and is usually greased by an enterprise software sales rep with an easy rapport and a hell of a golf handicap.) As the constraints of being a product affect engineering decisions, so do the constraints of operating as a service; if a product has no understanding or empathy for how it will be deployed into production, it will be at best difficult to operate, and at worst unacceptable for operation.

So at Joyent, we have long believed that we need both a product that can be used to provide a kick-ass service and a kick-ass service to prove to ourselves that we hit the mark. While our work over the past year has been to develop that product — SmartDataCenter — we have all known that we wouldn’t be done until that product was itself operating not just emerging services, but also the larger Joyent public cloud, and over the summer we began to plan a broad-based stand-up of SmartDataCenter. With our SmartOS-based KVM solid and on schedule, and with the substantial enabling work up the stack falling into place, we set for ourselves an incredibly ambitious target: to not only ship this revolutionary new version of the software, but to also — at the same time — stand up that product in production in the Joyent public cloud. Any engineer’s natural conservatism would resist such a plan — conflating shipping a product with a production deployment feels chancy at best and foolish at worst — but we saw an opportunity to summon a singular, company-wide focus that would force us to deliver a much better product, and would give us and our software customers the satisfaction of knowing that SmartDataCenter as shipped runs Joyent. More viscerally, we were simply too hot on getting SmartOS-based hardware virtualization in production — and oh, those DTrace-based KVM analytics! — to contemplate any path less aggressive.

Of course, it wasn’t easy; there were many intense late-night Jabber sessions as we chased bugs and implemented solutions. But as time went on — and as we cut release candidates and began to pound on the product in staging environments — it became clear that the product was converging, and a giddy optimism began to spread. Especially for our operations team, many of whom are long-time Joyent vets, this was a special kind of homecoming, as they saw their vision and wisdom embodied in the design decisions of the product. This sentiment was epitomized by one of the senior members of the Joyent public cloud operations team — an engineer who has tremendous respect in the company, and who is known for his candor — as he wrote:

It seems the past ideas of a better tomorrow have certainly come to light. I think I speak for all of the JPC when I say “Bravo” to the dev team for a solid version of the software. We have many rivers (hopefully small creeks) to cross but I am ultimately confident in what I have seen and deployed so far. It’s incredible to see a unified convergence on what has been the Joyent dream since I have been here.

For us in development, these words carried tremendous weight; we had endeavored to build the product that our team would want to deploy and operate, and it was singularly inspiring to see them so enthusiastic about it! (Though I will confess that this kind of overwhelmingly positive sentiment from such a battle-hardened operations engineer is so rare as to almost make me nervous — and yes, I am knocking on wood as I write this.)

With that, I’m pleased to announce that the (SmartDataCenter-based) Joyent public cloud is open for business! You can provision (via web portal or API) either virtual OS instances (SmartMachines) or full hardware-virtual VMs; they’ll provision like a rocket (ZFS clones FTW!), and you’ll get a high-performing, highly reliable environment with world-beating real-time analytics. And should you wish to stand up your own cloud with the same advantages — either a private cloud within your enterprise or a public cloud to serve your customers — know that you can own and operate your very own SmartDataCenter!

Posted on September 15, 2011 at 2:03 am by bmc · Permalink · Comments Closed
In: Uncategorized

KVM on illumos

A little over a year ago, I came to Joyent because of a shared belief that systems software innovation matters — that there is no level of the software stack that should be considered off-limits to innovation. Specifically, we ardently believe in innovating in the operating system itself — that extending and leveraging technologies like OS virtualization, DTrace and ZFS can allow one to build unique, differentiated value further up the software stack. My arrival at Joyent also coincided with the birth of illumos, an effort that Joyent was thrilled to support, as we saw that rephrasing our SmartOS as an illumos derivative would allow us to engage in a truly open community with whom we could collaborate and collectively innovate.

In terms of the OS technology itself, as powerful as OS virtualization, DTrace and ZFS are, there was clearly a missing piece to the puzzle: hardware virtualization. In particular, with advances in microprocessor technology over the last five years (and specifically with the introduction of technologies like Intel’s VT-x), one could (in principle, anyway) much more easily build high-performing hardware virtualization. The technology was out there: Fabrice Bellard‘s cross-platform QEMU had been extended to utilize the new microprocessor support by Avi Kivity and his team with their Linux kernel virtual machine (KVM).

In the fall of last year, the imperative became clear: we needed to port KVM to SmartOS. This notion is almost an oxymoron, as KVM is not “portable” in any traditional sense: it interfaces between QEMU, Linux and the hardware with (rightful) disregard for other systems. And as KVM makes heavy use of Linux-specific facilities that don’t necessarily have clean analogs in other systems, the only way to fully scope the work was to start it in earnest. However, we knew that just getting it to compile in any meaningful form would take significant effort, and with our team still ramping up and with lots of work to do elsewhere, the reality was that this would have to start as a small project. Fortunately, Joyent kernel internals veteran Max Bruning was up to the task, and late last year, he copied the KVM bits from the stable Linux 2.6.34 source, took out a sharpened machete, and started whacking back jungle…

Through the winter holidays and early spring (and when he wasn’t dispatched on other urgent tasks), Max made steady progress — but it was clear that the problem was even more grueling than we had anticipated: because KVM is so tightly integrated into the Linux kernel, it was difficult to determine dependencies — and hard to know when to pull in the Linux implementation of a particular facility or function versus writing our own or making a call to the illumos equivalent (if any). By the middle of March, we were able to execute three instructions in a virtual machine before taking an unresolved EPT violation. To the uninitiated, this might sound pathetic — but it meant that a tremendous amount of software and hardware were working properly. Shortly thereafter, with the EPT problem fixed, the virtual machine would happily run indefinitely — but clearly not making any progress booting. But with this milestone, Max had gotten to the point where the work could be more realistically parallelized, and we had gotten to the point as a team and a product where we could dedicate more engineers to this work. In short, it was time to add to the KVM effort — and Robert Mustacchi and I roped up and joined Max in the depths of KVM.

With three of us on the project and with the project now more parallelizable, the pace accelerated, and in particular we could focus on tooling: we added an mdb dmod to help us debug, 16-bit disassembler support that was essential for us to crack the case on the looping in real-mode, and DTrace support for reading the VMCS. (And we even learned some surprising things about GNU C along the way!) On April 18th, we finally broke through into daylight: we were able to boot into KMDB running on a SmartOS guest! The pace quickened again, as we now had the ability to interactively ascertain state as seen by the guest and much more quickly debug issues. Over the next month (and after many intensive debugging sessions), we were able to boot Windows, Linux, FreeBSD, Haiku, QNX, Plan 9 and every other OS we could get our hands on; by May 13th, we were ready to send the company-wide “hello world” mail that has been the hallmark of bringup teams since time immemorial.

The wind was now firmly at our backs, but there was still much work to be done: we needed to get configurations working with large numbers of VCPUs, with many VMs, over long periods of time and significant stress and so on. Over the course of the next few months, the bugs became rarer and rarer (if nastier!), and we were able to validate that performance was on par with Linux KVM — to the point that today we feel confident releasing illumos KVM to the greater community!

In terms of starting points, if you just want to take it for a spin, grab a SmartOS live image ISO. If you’re interested in the source, check out the illumos-kvm github repo for the KVM driver itself, the illumos-kvm-cmd github repo for our QEMU 0.14.1 tree, and illumos itself for some of our KVM-specific tooling.

One final point of validation: when we went to integrate these changes into illumos, we wanted to be sure that they built cleanly in a non-SmartOS (that is, stock illumos) environment. However, we couldn’t dedicate a machine for the activity, so we (of course) installed OpenIndiana running as a KVM guest on SmartOS and cranked the build to verify our KVM deltas! As engineers, we all endeavor to build useful things, and with KVM on illumos, we can assert with absolute confidence that we’ve hit that mark!

Posted on August 15, 2011 at 7:00 am by bmc

In defense of intrapreneurialism

Trevor pointed me to a ChubbyBrain article that derided “intrapreneurs” as having nothing in common with their entrepreneurial brethren. While I don’t necessarily love the term, what we did at Fishworks is probably most accurately described as “intrapreneurial” — and based on both our experiences there and the caustic tone of the Chubby polemic, I feel a responsibility to respond. (And so, I hasten to add, did Trevor.) In terms of my history: I spent ten years in the bowels of Bigco, four years engaged in an intrapreneurial effort within Bigco, and the past year at an actual, venture-funded startup. Based on my experience, I take strong exception to the Chubby piece; to take their criticisms of intrapreneurs one by one:

Of course, there are differences between a startup within a large organization and an actual startup. When I was pitching Fishworks to the executives at Sun, I thought of our activity as both giving up some of the upside and removing some of the downside of a startup. After living through Fishworks and now being at a startup, I would be more precise about the downside that the intrapreneur doesn’t have to endure: it’s not insecurity (or at least, not as much as I believed) — it’s the grueling task of raising money. At Fishworks, we had exactly one investor: they knew us and we knew them — and that was a blessing that gave us the luxury of focus.

On balance, I don’t regret for a moment doing it the way we did it — but I also don’t regret having left for a startup. I don’t doubt that my career will oscillate between smaller organizations and larger ones — I see strengths and weaknesses in both, much of which comes down to specific opportunities and conditions. In short: there may be differences between intrapreneurialism and entrepreneurialism — but I see little difference between the intrapreneur and the entrepreneur!

Posted on July 12, 2011 at 2:26 pm by bmc

When magic collides

We had an interesting issue come up the other day, one that ended up being a pretty nasty bug in DTrace. It merits a detailed explanation, but first, a cautionary note: we’re headed into rough country; in the immortal words of Sixth Edition, you are not expected to understand this.

The problem we noticed was this: when running node and instrumenting certain hot functions (memcpy:entry was noted in particular), after some period of activity, the node process would die on a strange seg fault. And looking at the core dumps, we were (usually) lost in some code path in memcpy() with registers or a stack that was self-inconsistent. In debugging this, it was noted (with DTrace, natch) that the node process was under somewhat intense signal processing activity (we’ll get to why in a bit), and that (in particular) the instrumented function activity and the signal activity seemed to be interacting “shortly” before the fatal failure.

Before wading any deeper, it’s worth taking an aside to explain how we deal with signals in DTrace — a discussion that in turn requires some background on the way loss-less user-level tracing works in DTrace. One of the important constraints that we placed upon ourselves (albeit one that we didn’t talk about too much) is that DTrace is loss-less: if multiple threads on multiple CPUs hit the same point of instrumentation, no thread will see an uninstrumented code path. This may seem obvious, but it’s not the way traditional debuggers work: normally, a debugger stops a process (and all of its threads) when it hits a breakpoint — which it induces by modifying program text with a synchronous trap instruction (e.g., an “int 3” in x86). Then, the program text is modified to contain the original instruction, the thread that hit the breakpoint (and only that thread) is single-stepped (often using hardware support) over the original instruction, and the original instruction is again replaced with a breakpoint instruction. We felt this to be unacceptable for DTrace, and imposed a constraint that tracing always be loss-less.

For user-level instrumentation (that is, the pid provider), magic is required: we need to execute the instrumented instruction out-of-line. For an overview of the mechanism for this, see Adam’s block comment in fasttrap_isa.c. The summary is that we have a per-thread region of memory at user-level to which we copy both the instrumented instruction and a control transfer instruction that gets us back past the point of instrumentation. Then, when we hit the probe (and after we’ve done in-kernel probe processing), we point the instruction pointer into the per-thread region and return to user-level; back in user-land, we execute the instruction and transfer control back to the rest of the program.

Okay, so easy, right? Well, not so fast — enter signals: if we take a signal while we’re executing this little user-level trampoline, we could corrupt program state. At first, it may not be obvious how this is the case; if we take a signal before we execute the instruction in the trampoline, we will simply execute the signal handler (whatever it may be), return to the trampoline, finish that off, and then get back to the business of executing the program. Where’s the problem? The problem is that the signal handler may execute something that is itself instrumented with DTrace: if it does, we’ll hit another probe, and that probe will copy its instruction to the same trampoline. This is bad: it means that upon return from the signal, we will return to the trampoline and execute the wrong instruction. Only one of them, of course. Which, if you’re lucky, will induce an immediate crash. But more likely, the program will execute tens, hundreds or even thousands of instructions before dying on some “impossible” problem. Sound nasty to debug? It is — and next time you see Adam, show some respect.

So how do we solve this problem? For this, Adam made an important observation: the kernel is in charge of signal delivery, and it knows about all of this insanity, so can’t it prevent this from happening? Adam solved this by extending the trampoline after the control transfer instruction to consist of the original instruction (again) followed by an explicit trap into the kernel. If the kernel detects that we’re about to deliver a signal while in this window (that is, with the program counter on the original instruction in the trampoline), it moves the program counter to the second half of the trampoline and returns to user-level with the signal otherwise unhandled. This leaves the signal pending, but only allows the program to execute the one instruction before bringing it back downtown on the explicit trap in the second half of the trampoline. There, we will see that we are in this situation, update the trapped program counter to be the instruction after the instrumented instruction (telling the program a small but necessary lie about where signal delivery actually occurred), and bounce back to user-level; the kernel will see the bits set on the thread structure indicating that a signal is pending, and the Right Thing will happen.

And it all works! Well, almost — and that brings us to the bug. After some intensive DTrace/mdb sessions and a long Jabber conversation with Adam, it became clear that while signals were involved, the basic mechanism was actually functioning as designed. More experimenting (and more DTrace, of course) revealed that the problem was even nastier than we had been thinking: yes, we were taking a signal in the trampoline — but that logic was working as designed, with the program counter being moved correctly into the second half of the trampoline. The problem was that we were taking a second asynchronous signal, this time in the second half of the trampoline, and this brought us to this logic in dtrace_safe_defer_signal:

        /*
         * If we've executed the original instruction, but haven't performed
         * the jmp back to t->t_dtrace_npc or the clean up of any registers
         * used to emulate %rip-relative instructions in 64-bit mode, do that
         * here and take the signal right away. We detect this condition by
         * seeing if the program counter is in the range [scrpc + isz, astpc).
         */
        if (t->t_dtrace_astpc - rp->r_pc <
            t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
                rp->r_pc = t->t_dtrace_npc;
                t->t_dtrace_ft = 0;
                return (0);
        }

Now, the reason for this clause (the meat of which I’ve elided) is the subject of its own discussion, but it is not correct as written: it works (due to some implicit use of unsigned math) when the program counter is in the first half of the trampoline, but it fails if the program counter is on the instrumented instruction in the second half of the trampoline (denoted with t_dtrace_astpc). If we hit this case — first one signal on the first instruction of the trampoline, followed by another signal on the first instruction of the second half of the trampoline — we will redirect control to the instruction after the instrumentation (t_dtrace_npc) without having executed the instrumented instruction. This, if it needs to be said, is bad — and leads to exactly the kind of mysterious death that we’re seeing.

Fortunately, the fix is simple — here’s the patch. All of this leaves just one more mystery: why are we seeing all of these signals on node, anyway? They are all SIGPROF signals, and when I pinged Ryan about this yesterday, he was as surprised as I was that node seemed to be taking them when otherwise idle. A little digging on both of our parts revealed that the signals we were seeing are an artifact of some V8 magic: the use of SIGPROF is part of the feedback-driven optimization in the new Crankshaft compilation infrastructure for V8. So the signals are goodness — and the fact that DTrace can now correctly handle them with respect to arbitrary instrumentation, that much more so!

Posted on March 9, 2011 at 11:27 pm by admin