Understanding DTrace ustack helpers

or, everything you ever wanted to know about stack traces

I promised this post over a year ago, and now that someone’s actually working on a new ustack helper, I thought it was finally time to write about what ustack helpers are, how they work, and how I went about building one for Node.js. Only a handful of ustack helpers have ever been written: Node, Java, Python, and PHP (the last of which is believed lost to the sands of time), so this post is mainly for a narrow audience of developers, plus anyone who’s interested in how this all works.

This post covers a lot of the background you need to understand the details. For more information, check out my ACM article on Postmortem Debugging in Dynamic Environments. Though this post covers dynamic tracing, the challenges are similar to those for postmortem debugging, since both involve operating on a snapshot of program state without the aid of the VM itself.

The value of stack traces

Quick review: DTrace is a facility for dynamically instrumenting all kinds of operations on a system — particularly systems in production. It’s available on OS X, illumos distributions (including SmartOS and OmniOS), Solaris 10 and later, and BSD.

The hundreds of thousands of probes on a typical system can be combined with various DTrace actions to gather incredibly specific data. Some examples:

This is just a sampling, of course. Among probes that work “out of the box” today are:

With a tiny amount of work, you can also add your own probes to Node.js, Lua, Ruby, and Perl.

With the ability to grab a stack trace when any of these events happens, you can analyze performance (e.g., profiling on-CPU time) or debug a particular problem (e.g., “why does my Node process keep calling gettimeofday?”). But while DTrace users just call ustack() to get a stack trace, under the hood the process of recording a stack trace at an arbitrary point in a process running in production is deceptively tricky, and that’s what this post is about.
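
For example, to chase the gettimeofday question, a one-liner along these lines (a sketch; substitute whatever probe and predicate match the event you care about) counts each distinct user stack that reaches gettimeofday in a target Node process:

# dtrace -n 'pid$target::gettimeofday:entry{ @[ustack()] = count(); }' -p PID

For Node, the JavaScript frames in that output only become meaningful once the ustack helper described below is in the picture.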

Aside: x86 stacks

What exactly is a stack trace? And how are debugging tools like DTrace able to print them out?1

Within a process, each thread has a stack, which keeps track of functions called, their arguments, and their local variables. On Intel systems (including x86 and amd64), there are two important pieces of state related to the stack:

  - the stack pointer (register %esp (32-bit) or %rsp (64-bit)), which points to the most recently pushed value on the stack, and
  - the frame pointer (register %ebp or %rbp), which points to the base of the currently executing function’s stack frame.

There’s also an instruction pointer (register %eip (32-bit) or %rip (64-bit)) that points to the currently executing instruction in memory.

When one function calls another function, here’s what happens with the stack:

  1. The caller pushes the new function’s arguments onto the stack (at least in 32-bit code; 64-bit code passes most arguments in registers).
  2. The call instruction pushes the return address (the address of the instruction after the call) onto the stack and jumps to the called function.
  3. The called function typically2 pushes the caller’s frame pointer (%ebp) onto the stack and copies the current stack pointer into %ebp, so that %ebp now points at its own frame. (That’s the pushl %ebp / movl %esp,%ebp sequence at the top of fork below.)
  4. The called function moves the stack pointer down to make room for its local variables.
  5. When the function returns, it pops the saved frame pointer back into %ebp, and the ret instruction pops the return address off the stack and jumps back to it (the popl %ebp / ret at the end of fork).

This is what the whole thing looks like:

[Stack diagram]

If this is new to you, it’s worth noting how deep this is: this is how control flows between functions in native programs. There’s no magic: the notion of functions in C essentially falls out of a stack and a few basic instructions.

You can see this pattern by disassembling any native function. Here’s an example, looking at the code for fork in libc (the userland function, which calls forkx() to do most of the work and then invoke the fork system call):

$ mdb -p $$
Loading modules: [ ld.so.1 libc.so.1 ]
> fork::dis
libc.so.1`fork:                 pushl  %ebp
libc.so.1`fork+1:               movl   %esp,%ebp
...
libc.so.1`fork+0x19:            call   -0x246   <libc.so.1`forkx>
...
libc.so.1`fork+0x22:            popl   %ebp
libc.so.1`fork+0x23:            ret

The result of these stack manipulations is that at any given time, the frame pointer register points to the head of a linked list that has one entry for every function on the call stack, up to the top of the stack. The frame pointers pushed onto the stack in each function represent the “next” pointers of the linked list, and the return address pointers pushed by the call instructions denote the address that called the next function.

If you have a snapshot of a process’s memory state (i.e., a core dump), you can imagine a simple algorithm for constructing a stack trace:

callstack = [ current_%eip ];           /* the top frame is whatever is executing right now */
frameptr = current_%ebp;

while (frameptr != NULL) {
        callstack.push(value adjacent to frameptr);     /* the saved return address */
        frameptr = *frameptr;                           /* follow the saved frame pointer */
}

In other words, the top frame is denoted by the current instruction pointer. Then we start with the current frame pointer register, and follow the linked list of frame pointers until we get to the top of the stack. Along the way, we record each instruction pointer (return address) that was saved on the stack.

There’s one critical step left. This algorithm gives us a bunch of instruction addresses — but we want human-readable function names, not memory addresses. In a traditional debugger, a memory address is pretty easy to translate to a function name, because the symbol table that’s in the process (or core file) necessarily includes the address and size of each function. In MDB, you can see this with the ::nm command:

> ::nm ! grep -C2 -w fork
...
0xfee02581|0x00000094|FUNC |GLOB |0x3  |15      |execlp
0xfedece61|0x0000015b|FUNC |GLOB |0x3  |15      |_D_cplx_div_ix
0xfee76395|0x00000024|FUNC |GLOB |0x3  |15      |fork
0xfee1b6f3|0x00000057|FUNC |GLOB |0x3  |15      |nss_endent
0xfedef150|0x00000019|FUNC |GLOB |0x3  |15      |atomic_or_ushort_nv

In this case, the fork function is stored at address 0xfee76395 and is 0x24 bytes long. From this, the debugger knows that when the instruction pointer is 0xfee763b3, that’s inside the “fork” function, and is more conveniently printed as fork+0x1e (instruction offset “0x1e” inside the “fork” function).

As I mentioned above, this all sounds highly specific to native code, but as we’ll see, dynamic environments like Node.js do basically the same thing, with some twists that we’ll get to later.

Native stack traces in DTrace

It’s well-understood how to grab stack traces from a debugger, but saving a stack trace in the context of a DTrace probe is a bit more complicated. Recall that DTrace records and buffers data in the kernel for later consumption by the “dtrace” command. This buffering decouples the data source (events of interest) from the consumer (the “dtrace” command), which allows the system to perform much better than traditional tools like “strace” or “truss”, which actually stop the target process for long enough for the tool itself to step in and record data. This decoupling also allows DTrace to instrument extremely delicate contexts, including the kernel’s pagefault handler and other critical interrupt handlers. It wouldn’t be possible to instrument these contexts if the instrumentation itself had to wait for the “dtrace” process to step in and execute.

It might seem like these contexts are uninteresting to application developers, but they’re actually quite relevant for a number of use cases:

In order to support instrumenting these contexts, the actions that you’re allowed to take in a probe are pretty constrained. You can’t do anything that requires servicing a pagefault, for example, or enter an infinite loop. And since DTrace can’t know what loops will be infinite, it doesn’t allow explicit loops at all.

These constraints make saving a userland stack trace from DTrace tricky. Walking the stack as we described above almost always works, because the stack memory is rarely paged out, but that whole bit about accessing the process’s symbol table to convert memory addresses to their human-readable names? That can easily involve a bunch of operations that are at the very least expensive, and at worst impossible in probe context (because the pages may not be available).

To work around this problem, DTrace defers the resolution of symbol names until the data is consumed. That is, when a probe fires and you’ve asked DTrace to print out a stack trace, all that’s done immediately is to record the process id and the list of memory addresses that make up the call stack. When your dtrace process in userland consumes the data, it uses system debugger interfaces to look up the addresses in the process that was traced and translate them to the appropriate function name as described above.3

What about dynamic environments?

All this works well for native code, but what about dynamic environments? These can vary wildly. I’m going to focus on Node.js, since that’s the one I’m very familiar with.

Node.js is based on V8, the JavaScript VM that also powers Google’s Chrome browser. With V8, walking the stack works basically the same way as for a native program (following frame pointers, recording the adjacent instruction pointers), but there’s a huge problem when it comes to resolving instruction pointer addresses to human-readable names: the compiled code for JavaScript functions doesn’t correspond to any symbol in the process’s symbol table!

For a native process, we knew 0xfee76395 was inside the fork function because the process has a symbol table (built when the program was compiled, linked, and loaded) that says that fork starts at 0xfee76395 and is 0x24 bytes long. But in a JavaScript program, we may have a function at address 0x8004039, and that address doesn’t correspond to anything in the process’s symbol table. That’s because that function didn’t exist when the program started: it was dynamically created when some JavaScript code used the function keyword to define a new function, and V8 compiled that to native code and stored those instructions in the heap. There’s no way for a native code debugger to “know” that this corresponds to, say, the fs.readFile JavaScript function.

Suppose you had some way to ask V8: what’s the name of the JavaScript function at 0x8004039? DTrace could do the same thing it does for native stack traces, which is to record just the addresses and resolve these names later, right? Unfortunately, that doesn’t work for dynamic environments because functions themselves are stored on the runtime heap and can actually move around during execution as a result of garbage collection or reoptimization. So the function at 0x8004039 may no longer be at 0x8004039 when the “dtrace” command gets around to resolving the name. We have to resolve the name when we actually record the stack trace.

Enter ustack helpers

So we have these constraints:

  - The function names have to be resolved when the stack trace is recorded, in probe context, because by the time the data is consumed the functions may have moved.
  - Whatever code does that resolution has to be safe to run in probe context, where only a very constrained set of operations is allowed.
  - The logic for getting from a frame to a function name is specific to each VM (and each VM version), and DTrace itself knows nothing about it.

These constraints essentially define the solution: VM implementors write a chunk of code in D that knows how to translate a (frame pointer, instruction pointer) pair into a human-readable function name. The code is safe by virtue of being in D, which can only express operations that are safe in DTrace probes. That code (the helper) gets glued onto the binary during the build process and loaded into the kernel when the VM process starts up. When a user needs to get a stack trace from that process, the kernel executes the helper to resolve the function names.
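
The “glued onto the binary” part works much like building a USDT provider: the helper’s D source is compiled into an object file carrying its DOF, that object is linked into the VM binary, and the DOF is registered with the kernel when the process starts. Roughly, the build steps look like this (a sketch; the exact flags and file names depend on the build system, and v8ustack.d is just what the Node helper’s source happens to be called):

$ dtrace -G -s v8ustack.d -o v8ustack.o    # compile the helper into an object carrying its DOF
$ cc ... v8ustack.o ... -o node            # link that object into the VM binary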

The helper translates a frame pointer and instruction pointer into a human-readable function name. In the example above, it translates 0x8004039 to “fs.readFile”. (Since JavaScript function names are not unique, the Node ustack helper actually translates it to something more complete, like “(anon) as fs.readFile at fs.js line 123”.)

The guts of a ustack helper are totally dependent on the dynamic environment it’s targeted at. The V8 helper uses the fact that when V8 generates the instructions to call a JavaScript function that’s been compiled to native code, it pushes onto the stack a pointer to a C++ object that it uses internally to keep track of the function being called. From that object, we can follow pointers to get the name of the function, the location where it was defined, and so on.

So you want to build a ustack helper

As I mentioned above, a few ustack helpers have been written: for Java, for Python, and for Node.js. I know of two more that people have expressed interest in developing: one for Erlang and one for Ruby. In general, I’d break the process down into a few steps.

  1. Collect some example core files from known programs, where you know what the correct stack trace should look like.
  2. Study the source code and the core files until you can construct the correct stack trace by hand from the core file. That is, it’s not necessarily an automated procedure, but you can at least identify the right frames and for each one, a set of manual steps that will get you to the function names.
  3. Automate printing of the stack trace, based on the manual algorithm you came up with in step 2.
  4. Implement (3) in a D script: that will become the ustack helper.

Step 1: collect some example cores

The process of figuring out how a dynamic environment lays out its stack can be grueling and time-consuming. To explore at your own pace, it’s crucial to have a core file from a known program, where you know the correct stack trace. When I started, I spent enough time reading the V8 source to discover that there were a few different types of frames in V8, including regular JavaScript frames, constructor frames, argument adaptor frames, and a few others. Then I wrote a fairly simple Node program that exercised a bunch of these cases: from the top level, it calls a regular top-level function, which calls a constructor, which calls another constructor, then a method (defined as an anonymous function), and so on, until the last function goes into an infinite loop. That way, once I started the program, I could use gcore(1M) to create a core file from the running process. The result was a memory snapshot, containing most of the frame types I cared about, that I could explore in the debugger at my leisure. Later, I would profile the same program to test the ustack helper.

Step 2: figure out how to manually produce a stack trace

This is where you’ll spend much of your time. The difficulty here depends a lot on the quality of the runtime’s source code and documentation, and on how complex the environment is. For some environments, like Perl, it may be impossible to write a ustack helper at all, at least without new DTrace features: ustack helpers assume at the most fundamental level that stacks are laid out just as they are in native code, using frame pointers and instruction pointers, and there’s nothing that says a runtime environment has to actually do it that way.

For V8, the basic process was simple, though it still took a while to work out. I started with the code that V8 itself uses when constructing a stack trace, as when you print an Error’s stack trace. I studied it for a while, took a lot of notes, and tried to replicate the algorithm by hand from the core file.

I strongly recommend building tools to help yourself. I used MDB, specifically because it makes it easy to write new commands in C. I quickly wrote a bunch of commands to tell me, for a given memory address, what I was looking at. This was critical: long before I was able to print a stack trace, I had learned that the first step was to print out the “Function” object that V8 stores on the stack, that that object refers to a SharedFunctionInfo object that includes the name, and that that in turn points to a Script object that includes the script name where the function was defined. The function and script names are stored as Strings, which may be AsciiStrings or ConsStrings. So the first thing I did was to write debugger commands that could identify what kind of object I was looking at. This became the “::v8type” MDB command:

> a7790941::v8type
0xa7790941: JSFunction

Then I wrote commands to print out the C++ objects so I could inspect them. This became the “::v8print” MDB command:

> 0xa7790941::v8print
a7790941 JSFunction {
    a7790941 JSObject {
        a7790941 JSReceiver {
            a7790941 HeapObject < Object  {
                a7790940 map = 9f009749 (Map)
            }
        }
        a7790944 properties = a26080a1 (FixedArray)
        a7790948 elements = a26080a1 (FixedArray)
    }
    a7790950 prototype_or_initial_map = ba9080a1 (Oddball: "hole")
    a7790954 shared = a777b6f5 (SharedFunctionInfo)
    a779095c literals_or_bindings = a7790965 (FixedArray)
    a7790960 next_function_link = ba9299a1 (JSFunction)
}

Then I wrote commands for decoding the string objects as strings. This became “::v8str”:

> 0xa7790f81::v8str
"WriteStream.write"

> 0xa7790f81::v8str -v
ConsString
    ptr1: a2615b7d
    ptr2: 892ef331
    SeqAsciiString
        length: 11 chars (11 bytes), will read 11 bytes from offset 0
    SeqAsciiString
        length: 6 chars (6 bytes), will read 6 bytes from offset 0
"WriteStream.write"

It sounds like a lot of work up front, but it paid off big when I could poke around much more easily: I could start with a pointer from the stack that I thought should be a Function object, and explore what information it pointed to. For example, suppose I had this frame from the native stack (which you can get with $C in MDB):

Frame ptr   Instruction ptr
0804792c    0x7560e19a

I discovered from the source that it looked like a JSFunction object was pushed two words below the frame pointer, so I checked that out:

> 0804792c-0x8/p
0x8047924:      0xa7737d65      

> 0xa7737d65::v8type
0xa7737d65: JSFunction

> 0xa7737d65::v8print
a7737d65 JSFunction {
    a7737d65 JSObject {
        a7737d65 JSReceiver {
            a7737d65 HeapObject < Object  {
                a7737d64 map = 9f009749 (Map)
            }
        }
        a7737d68 properties = a26080a1 (FixedArray)
        a7737d6c elements = a26080a1 (FixedArray)
    }
    a7737d74 prototype_or_initial_map = ba9080a1 (Oddball: "hole")
    a7737d78 shared = ba941471 (SharedFunctionInfo)
    a7737d80 literals_or_bindings = a26080a1 (FixedArray)
}

and so on.

Besides making exploration easier, this meant that with not much more work, I could write a few commands to print the V8 representations of objects, arrays, and so on as their JavaScript values — which gave me postmortem debugging for JavaScript as well. This became “::jsprint”:

> 1f712ffd5601::jsprint
{
    protocol: "http:",
    slashes: true,
    auth: null,
    host: "www.snpp.com",
    port: null,
    hostname: "www.snpp.com",
    hash: null,
    search: null,
    query: null,
    pathname: "/episodes/3F02.html",
    path: "/episodes/3F02.html",
    href: "http://www.snpp.com/episodes/3F02.html",
}

This brings up a related point: writing a ustack helper is grueling, and I found it never paid to take shortcuts. I’d rather make only a little progress each day, knowing what each step was doing, than try to keep it all in my head and be confused about why things didn’t work as expected.

Back to the task at hand. Since you wrote the example program, you know what the stack trace in the core file should look like. (You can even have the program print the stack trace using the runtime’s built-in mechanism — in JavaScript, this would be console.log(new Error().stack).) The first milestone will be when you can construct that stack trace by hand from the core file. That is, when you can look at the frame pointer in %ebp, follow it to the other frames, and, for each one, find the right pointers to follow to get to the name of the function at that frame.

Step 3: automate printing a stack trace

The next step is to automate that process of printing out the stack trace. As with the rest of this project, I’d strongly recommend building this incrementally. First build commands that can print out a frame pointer with a useful description of the function it references:

> 0804792c::jsframe -v
8047940 0x756266db _tickFromSpinner (a7737dad)
    file: node.js
    posn: position 13051

and then implement something that walks the stack and labels each frame. You could skip the first step, but you need to automate this procedure to build the ustack helper anyway, and it will help significantly to iron out the details in a traditional language like C, which is easier to write and has rich facilities for debugging.

Once this works reliably, create some more example programs and test it on those. Create core files from production processes and test it on those, too. You may find that there were some cases you missed in your example programs.

Step 4: build the ustack helper

Once you have an automated command that reliably prints a stack trace for an arbitrary core file, you’ve got to implement that same logic in D.

The basic idea is that you define a D program with pseudo-probes called “dtrace:helper:ustack:”. When DTrace needs to resolve a user stack for the process, it runs this program in the kernel, in probe context, once for each frame, and the string the program produces is used as the annotation for that frame.
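
In outline, a helper looks something like this. This is a sketch of the shape rather than a working helper; in particular, my recollection from the existing helpers is that arg0 is the frame’s instruction pointer and arg1 its frame pointer, and that the string the program ultimately produces becomes the frame’s annotation:

dtrace:helper:ustack:
{
    /*
     * Invoked once for each frame of the stack being resolved.  As I
     * understand the interface, arg0 is the frame's instruction pointer
     * and arg1 its frame pointer.
     */
    this->pc = arg0;
    this->fp = arg1;
    this->done = 0;
}

dtrace:helper:ustack:
/!this->done/
{
    /*
     * Runtime-specific clauses go here: check whether this->pc falls in
     * code the VM generated, copyin() the VM data structures reachable
     * from this->fp, and build up a string naming the function.  Frames
     * the helper doesn't recognize fall back to ordinary symbol
     * resolution.
     */
    this->done = 1;
}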

There are some patterns that have emerged in the existing ustack helpers:

The V8 ustack helper built up several more complicated layers of APPEND macros for the various kinds of strings in V8.
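
Since D has no loops, no functions, and no string concatenation, the helpers build their output a character at a time into a scratch buffer using preprocessor macros (the helper source is run through the C preprocessor). The real macros in v8ustack.d are more elaborate, but the core pattern looks something like this, where this->buf and this->off are scratch state the helper maintains itself:

#define APPEND_CHR(c)   (this->buf[this->off++] = (c))   /* sketch: append one character */

/* ... later, inside a clause: spell out " at " before the script name */
APPEND_CHR(' ');
APPEND_CHR('a');
APPEND_CHR('t');
APPEND_CHR(' ');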

The rest is entirely runtime-specific, so all I can offer are some general tips:

You can also consider writing a plain old D script to iterate on most of the logic for the helper. The downside is that once you get it working perfectly, if there’s an error after you translate the script into a helper, it’ll be hard to track down where the error was. I usually found it easier to develop the helper itself.
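
For example, a profiling script along these lines (a sketch, assuming a 32-bit x86 process, where uregs[R_EIP] and uregs[R_EBP] hold the user-level instruction and frame pointers) gives you a place to prototype the frame-decoding logic against a live process before recasting it as helper clauses:

profile-97
/pid == $target/
{
    this->pc = uregs[R_EIP];
    this->fp = uregs[R_EBP];
    /*
     * From here, copyin() the same structures your debugger commands
     * follow and printf() what you find, mimicking what the helper will
     * eventually do for each frame.
     */
}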

Profit

There’s no doubt this is all a lot of work, but the resulting observability has proven tremendously valuable for our work with Node.js. We use the ustack helper primarily to profile Node programs, but also to debug them (i.e., to find out what part of a program is responsible for some other system event), and we use it both in development and in production.
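
For profiling, that usually means something like the following (a sketch): sample at 97Hz and aggregate on jstack(), the ustack() variant that leaves room for the helper’s translated frame names. The arg1 test keeps only samples taken while the process was running in userland.

# dtrace -n 'profile-97 /execname == "node" && arg1/ { @[jstack(80, 8192)] = count(); }'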

Besides that, the result of this project was not just the ustack helper, but a complete postmortem debugging environment for Node programs. We configure most of our services to dump core automatically when an uncaught exception is thrown, and we’ve root-caused dozens of nasty bugs (including several in Node core) from the core files.

Caveats

If you’ve used ustack helpers at all before, you’ve probably already discovered that they don’t work on OS X. I’m told that the way to get this changed is to contact your Apple developer liaison (if you’re lucky enough to have one) or file a bug report at bugreport.apple.com. I’d suggest referencing existing bugs 5273057 and 11206497. I’m told that more bugs filed (even if closed as dups) show more interest and make it more likely Apple will choose to fix this.


That’s all I’ve got. If you’ve got questions, your best bet is the dtrace-discuss list. I follow that, as do many others working with DTrace. You can also comment below or tweet me @dapsays.

Many thanks to Bryan, who implemented support for ustack helpers, encouraged me to take on the V8 ustack helper project, and helped enormously along the way.


Footnotes

1 This section looks specific to C/C++, but the details are largely the same in dynamic environments like Java, Node.js, and Python.

2 I say “typically” because it’s possible to disable this behavior at compile-time with gcc. This is ostensibly done for the performance improvement of having an additional register available and to avoid pushing and popping the frame pointer, though reports of performance improvements never appear to be based on hard data, and it’s highly doubtful that the effect would be measurable on most programs. On the other hand, the resulting breakage prevents traditional debuggers and DTrace from grabbing stack traces on such programs. (Even if one could show an actual performance benefit, it’s important to consider that a few percent performance improvement likely does not justify giving up the ability to observe the performance of an application in production, since that precludes many other future improvements.)

3 This leads to the slightly annoying behavior that if you try to trace call stacks from a short-lived process, the process may be gone by the time the userland dtrace process goes to resolve the function names, and you wind up with only the memory addresses. You can work around this by using “-c” to start the process or “-p” to attach to an existing one. This causes DTrace to attach to the process, so that even after it exits, the process sticks around until DTrace has gotten what it needs from it. While slightly annoying, it’s a small price to pay for instrumenting arbitrary contexts in production.


Tracing Node.js add-on latency

Node.js has some built-in DTrace probes for looking at HTTP request start/done and GC start/done, and I wrote nhttpsnoop to report latency based on these probes. There’s also a future project that will make it easy to trace any asynchronous operations. But in the meantime, there’s one category of latency that’s missing, which is latency of asynchronous operations kicked off by add-ons.

As an example, I’ll use node-sqlite3. Below, I have sqlite-demo.js, which is a version of the demo program that’s been modified to run in a loop where it opens the database, inserts some records, selects some records, and closes the database. As this program runs, the add-on kicks off async operations to open the database, run queries, and close the database. We’re going to trace those operations to see how long each type of operation takes.

Below that, I have trace-addon-latency.d, a D script that traces the libuv uv_queue_work function as well as all functions in all add-ons in the target Node process. This is a little heavy-handed for what we’re trying to do, and can have a noticeable impact on program performance, depending on the add-on. For many of the small add-ons I use, the impact is negligible.

By running the program and the D script for a few minutes, then stopping the D script, you get output that looks like this:

  node_sqlite3.node`node_sqlite3::Database::Work_BeginExec(node_sqlite3::Database::Baton*)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@                        6018
            8192 |@@@@@                                    1590
           16384 |@                                        302
           32768 |@@@@@@@@@@@@@@@@@                        5722
           65536 |                                         156
          131072 |                                         0

There’s a histogram like this for each “kind” of asynchronous operation that each add-on kicks off. By “kind”, I’m referring to the C++ function that kicks off the operation. In this case, Work_BeginExec kicked off 6018 operations that took between 4us and 8us.

This is a quick way to get a look at what async operations your Node program is doing (from add-ons) and how long they’re taking. This works on OS X and SmartOS, and should work on BSD and any other system with DTrace.

Full steps to reproduce

$ mkdir demo
$ cd demo
$ npm install sqlite3
$ curl -O https://gist.github.com/davepacheco/104e29e6aa2a5a3c1dca/raw/bbcd859c7469ed20b7126c56f3a5c45dc43986e1/sqlite-demo.js
$ curl -O https://gist.github.com/davepacheco/104e29e6aa2a5a3c1dca/raw/45c7e0da1b5aeb0cced2163e525e7caffb1a6ebe/trace-addon-latency.d
$ node sqlite-demo.js &
$ dtrace -s trace-addon-latency.d -p $!     # you may need "sudo" here, particularly on OS X

Let that run for a little while, then kill either the “node” or the “dtrace” process. (Killing the “node” one causes “dtrace” to exit, since it was attached to it.) When “dtrace” exits, it prints the results.

sqlite-demo.js

var sqlite3 = require('sqlite3').verbose();
go();

function go()
{
    var db = new sqlite3.Database(':memory:');
    db.serialize(function () {
            db.run("CREATE TABLE lorem (info TEXT)");
            var stmt = db.prepare("INSERT INTO lorem VALUES (?)");
            for (var i = 0; i < 10; i++) {
                    stmt.run("Ipsum " + i);
            }
            stmt.finalize();
            db.exec("SELECT rowid AS id, info FROM lorem", function () {
                    db.close();
                    go();
            });
    });
}

trace-addon-latency.d

#!/usr/sbin/dtrace -s

/*
 * trace-addon-latency.d: Run this script as:
 *
 *     dtrace -s trace-addon-latency.d -p PID
 *
 * to trace the latency of asynchronous operations kicked off by all add-ons in
 * the node process PID.  NOTE: this will instrument all native functions in all
 * add-ons, which can have a noticeable performance impact, depending on the
 * add-on.
 */

pid$target::uv_queue_work:entry
{
    /*
     * uv_queue_work(loop, req, work_cb, after_work_cb): key everything on
     * the uv_work_t pointer (arg1), and remember which add-on function
     * queued the work, when, and its completion callback (arg3).
     */
    callers[arg1] = ucaller;
    donefunc[arg1] = arg3;
    starttime[arg1] = timestamp;
}

pid$target:*.node::entry
/donefunc[arg0]/
{
    /*
     * We've entered an add-on function whose first argument is a uv_work_t
     * we saw queued above, meaning one of its callbacks is now running.
     * Record the elapsed time, keyed by the add-on function that originally
     * called uv_queue_work, then clear the per-request state.
     */
    @[ufunc(callers[arg0])] = quantize(timestamp - starttime[arg0]);
    callers[arg0] = 0;
    starttime[arg0] = 0;
    donefunc[arg0] = 0;
}

trace-addon-latency.out

  node_sqlite3.node`node_sqlite3::Database::Work_BeginClose(node_sqlite3::Database::Baton*)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         167
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@                8529
           16384 |@@@@@@@                                  2326
           32768 |@@@@@@@@                                 2740
           65536 |                                         26
          131072 |                                         0

  node_sqlite3.node`node_sqlite3::Database::Work_BeginExec(node_sqlite3::Database::Baton*)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@                        6018
            8192 |@@@@@                                    1590
           16384 |@                                        302
           32768 |@@@@@@@@@@@@@@@@@                        5722
           65536 |                                         156
          131072 |                                         0

  node_sqlite3.node`node_sqlite3::Database::Work_BeginOpen(node_sqlite3::Database::Baton*)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        286
            8192 |@@@@@@@@@@@@@@@@@@                       6064
           16384 |@@                                       642
           32768 |@@@@@@@@@@@@@@@@@@@                      6701
           65536 |                                         94
          131072 |                                         1
          262144 |                                         0

  node_sqlite3.node`node_sqlite3::Statement::Work_BeginPrepare(node_sqlite3::Database::Baton*)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@                                    3604
            8192 |@@@@@@@@@@@@                             8376
           16384 |@@                                       1341
           32768 |@@@@@@@@@@@@@@@@                         10831
           65536 |@@@@@                                    3423
          131072 |                                         2
          262144 |                                         0

  node_sqlite3.node`node_sqlite3::Statement::Work_BeginRun(node_sqlite3::Statement::Baton*)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@                  87884
            8192 |@@@@@                                    19478
           16384 |@@                                       8692
           32768 |@@@@@@@@                                 32160
           65536 |@                                        3455
          131072 |                                         0

Stopping a broken program in its tracks

Last week I debugged a Node issue where under some conditions, fork() failure could result in some file descriptors being closed, leading shortly to bedlam. In the best cases, the program would immediately crash, but often it would just do the wrong thing. This happens a lot, of course, and in many ways it’s worse than a crash because you have no core dump nor even a stack trace to go on.

In my write-up of the bug, I glossed over a key step to debugging the problem:

"Using DTrace and reproducing on SmartOS, I stopped the
process at the moment close() returned EBADF, and caught
this stack trace ..."

I want to explain this technique in more detail because it’s very useful for getting a stack trace or core dump when a program starts going south, even if the program isn’t actually going to crash. (This technique isn’t specific to Node.js or SmartOS, either, though it’s easier with Node.js and on SmartOS.)

In this case, I was debugging a production problem that I reproduced in development, which made things much easier. I was free to kill the process to create a core dump without worrying about disrupting any users. Since the Node program was blowing up on a “close, EBADF” error, I used this DTrace one-liner to make the program dump core precisely when that happened (instead of barreling on with the wrong file descriptor):

# dtrace -w -n 'syscall::close:return
  /execname == "node" && arg1 == -1 && errno == EBADF/
  { trace(pid); raise(SIGABRT); exit(0); }'

Taking apart the script:

  - “-w” enables destructive actions, which raise() requires.
  - The probe syscall::close:return fires whenever any close() system call returns.
  - The predicate narrows that to “node” processes where close() returned -1 with errno set to EBADF.
  - trace(pid) records the pid of the offending process, raise(SIGABRT) sends SIGABRT to that process, and exit(0) tells dtrace to stop tracing.

The default behavior on receiving SIGABRT is to dump core immediately (see signal.h). It’s possible to alter this behavior, but very few programs do, so this technique almost always works.

Once I had a core file, I fired up MDB to get a JavaScript stack trace, and from there it was easy to figure out what had gone wrong. For details, see the Node issue.

What about production?

I mentioned that in this case, I’d reproduced the problem in development, so I was free to kill the process. What if (as so often happens) the first time we saw this problem was in production, and we wanted to get a core dump at exactly the right time, but without causing the process to crash? That’s not actually much harder, at least on systems with a program like gcore(1):

# dtrace -w -n 'syscall::close:return
  /execname == "node" && arg1 == -1 && errno == EBADF/
  { trace(pid); stop(); system("gcore %d; prun %d", pid, pid); exit(0); }'

This one is very similar to the previous script, except that instead of raising SIGABRT, we use stop() to stop the process in its tracks, and asynchronously run gcore PID (to save the core dump) followed by prun PID (to start the process running again). This is still technically destructive, because we’re stopping the process for however long it takes to save the core dump, but in most cases this isn’t too long and the disruption is insignificant.

To recap: you can use DTrace for more than just data collection, including debugging non-fatal buggy program behavior. DTrace can instrument very specific events on the system (like “the close syscall in a Node process returned -1 with EBADF” or any application-level event), and you can use that to take action like stopping the process to take a core dump and then resuming it again. Putting these pieces together makes it possible to gather all kinds of debugging information from a program that’s behaving badly.


Node.js in production: runtime log snooping

This post also appeared on the Joyeur blog.


This post is one of several about how we operate Node.js in production at Joyent. Most of my experience comes from working on our Manta Storage Service, which currently comprises a few thousand Node processes across three datacenters.

Logging is one of the most primitive but most valuable forms of postmortem debugging. Logs let you figure out why a program crashed (or did the wrong thing) after it’s already done so. In this post I’ll describe a critical feature of the Bunyan logging library that we use daily to help debug failures in production: runtime log snooping.

The problem

In many production systems, there’s a nasty tradeoff: the more verbose your logs, the more information you have to debug issues, but the greater the overhead of constructing, recording, transmitting, storing, and analyzing them. Log frameworks typically manage this with log levels: you use “debug” in development, which spews all kinds of information about what the program is doing, but you use “info” or some higher level in production, which keeps only the “important” messages.

But it’s often in production that you really need the debug messages that were too expensive to emit in the first place: when you’ve been paged in the middle of the night because some service’s error count has exceeded some threshold. If you’re lucky, the failure is simple enough that the program can tell you exactly what’s happened, but more often, particularly in distributed systems where failure modes aren’t so crisp, the error will be somewhat vague. You’ll need more information about what the program was doing in order to figure out what went wrong.

In many of these cases, restarting the service will bring it back to life, but it will also eliminate any hope of root causing the problem. You can already see yourself getting paged again a few nights later for the same issue…

Bunyan

Trent Mick at Joyent wrote a Node module called Bunyan that we use in all of our services to log everything from free-form text intended for humans to programmatically consumable events that trip alarms or feed into our billing system. Bunyan logs are just newline-separated JSON records. A few fields have special meaning, like “time” and “msg”, but programs can log whatever other JSON structures they want for interpretation by humans or other programs. (In Manta, we use Bunyan log records for billing. These log entries include structured payloads that describe the billable operation, the user, the result, and so on.1) The “bunyan” command-line tool facilitates easy filtering, colorizing output, and so on, while programmatic consumers can parse each line as a JSON object.

There are lots of logging libraries out there, but Bunyan’s killer feature is the ability to snoop debug-level logs in production, without restarting (or otherwise impacting) the service. When an application calls log.debug("something went wrong"), and the log level is “info” (i.e., higher than “debug”), the bunyan library drops the message, as you’d expect. When you want to get at these debug messages, you fire up “bunyan -p *”, which is just sugar for:

$ dtrace -Z -x strsize=4k -x switchrate=10hz \
    -q -n 'bunyan*:::log-*{ printf("%s\n", copyinstr(arg0)); }' | bunyan

When you run this, you’re enabling DTrace probes built into Bunyan. Enabling these probes actually modifies your Node program’s behavior to fire the probe just before dropping the log message. Firing the probe traps into the kernel, where the log message is buffered and eventually funneled up to the “dtrace” process and printed out to your terminal.

This is a subtle point, but it’s very important: enabling these probes doesn’t cause the debug messages to appear in the regular log, but rather it causes these messages to be collected, buffered by the kernel, and then made available to your “dtrace” process. That means your service is still running at near full speed, not going through the code paths that write entries to the log. Importantly, if the volume of messages gets too high, DTrace will drop them rather than slow down the process. (You can tune the “switchrate” and buffer sizes to control that.)
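
For example, if you find records being dropped, you might bump those knobs when running the underlying one-liner by hand (the specific values here are just illustrative):

$ dtrace -Z -x strsize=4k -x switchrate=20hz -x bufsize=16m \
    -q -n 'bunyan*:::log-*{ printf("%s\n", copyinstr(arg0)); }' | bunyan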

When you don’t need the extra logging any more, CTRL-C the “bunyan” (or “dtrace”) process. The program modifications are undone, and your program goes on dropping debug log entries.

Voilà! By using Bunyan as a logging library, you get the ability to walk up to a production service and enable debug logging without restarting it. This has proven critical for a number of cases where it’s too expensive to log debug messages all the time, but we had a live service behaving badly, and restarting the service would likely have cleared the issue without giving us a chance to root-cause it.

Try it yourself

I ran these examples on OS X because it’s convenient, but all of our production services run on SmartOS (an illumos distribution). This facility works on any system that supports DTrace USDT.

If you’ve got Node installed already, it’s ridiculously easy to try this out. First, create a temporary directory and install bunyan locally:

dap@sharptooth $ npm install bunyan

then start a Node REPL, load up bunyan, and create a logger at level “info”:

dap@sharptooth $ node
> var r = require('bunyan')
undefined
> var l = new r({ 'name': 'test', 'level': 'info' })
undefined

At this point, you can log “info” messages, but “debug” messages will be hidden:

> l.info('test info message')
{"name":"test","hostname":"sharptooth.local","pid":1076,"level":30,"msg":"test info message","time":"2013-09-27T18:13:00.820Z","v":0}
undefined
> l.debug('something went wrong')
undefined

Now in another shell, use “bunyan -p”:

# In another shell:
dap@sharptooth $ sudo bunyan -p '*'   # sudo required for DTrace

That will block waiting for something to happen. Back in the first shell, log that debug message again:

> l.debug('something went wrong')
undefined

and over in the second shell, you should see something like this:

[2013-09-27T18:16:19.679Z] DEBUG: test/1076 on sharptooth.local: something went wrong

Success! Because we used bunyan -p '*', we’ll see log messages from all Node processes on the system. You can also specify a specific pid instead of ‘*’ to see only messages from that process.

Dynamic probes for your application

Bunyan’s dynamic probes are implemented using Chris Andrews’s node-dtrace-provider module, which makes it very easy to add custom probes to any application. (The whole Bunyan integration with DTrace is about 20 lines of code.)

Another great example that we also use heavily is node-restify, which creates DTrace probes for the beginning and end of each HTTP request, as well as each of the phases of request processing (auth, parsing, logging, and so on). You can enable these probes in production to get a fine-grained look at request latency, including what parts of each request are taking a long time.

 

In this post, I described Bunyan, the Node.js logging library we use at Joyent. I explained how we use DTrace probes built into Bunyan to extract more debug information from a production service than we can normally extract without disrupting the service itself. We focus on tools for production because the hardest issues often show up only in production deployments, and the production environment is far more constrained than development. By building tools designed for production use, we’re able to debug failures both in the field and in development, in many cases root-causing them the first time we see them.


1 Manta service logs are a good example of both the value of structured logs and Manta’s ability to process these logs. Manta service logs include both debug messages and more structured messages used to generate users’ usage reports and bills. These logs are uploaded periodically to Manta. To generate each user’s report, we run a map-reduce Manta job over these logs that parses each record and tallies up each user’s usage. Similarly, when we want to see how often some particular error message has been seen, we run a Manta job to grep over the logs (literally just “grep PATTERN”). For more on using Manta for log analysis, check out Wanelo’s post on business data analytics using Manta.


Kartlytics: Applying Big Data Analytics to Mario Kart 64

This post also appears on the Joyeur blog.

If you missed it, Joyent recently launched Manta, a web-facing object store with compute as a first-class operation. Manta makes it easy to crunch on Big Data in the cloud, and we’ve seen it used both by ourselves and by others to solve real business problems involving Big Data. But it’s not just for user behavior and crash dump analysis: Manta has profoundly changed the way we operate here at Joyent’s SF office.

Mario Kart 64

On a typical Friday afternoon at Joyent’s San Francisco office, it’s been a long week and the engineers are getting restless. I glance over at Bill, who mimes working a video game controller with both hands and nods towards the projector. The war dance has begun, and the trash talking will soon follow. Factions form, and soon everyone’s agreed on only one thing: it’s time to play Kart.

[Photo: playing Kart (photo: Joshua Clulow)]

Sound familiar? I know we’re not the only office that brings the same intensity to office video games that we bring to our work. Kart is the perfect game for serious office competition because it’s dead simple to learn but takes a long time to master. It’s a standard arcade-style racing game with cartoonish weapons, encouraging the best of friends to abuse and taunt each other mercilessly.

[Kart clip]

But it’s not just about the competition. In our years playing Kart, we’ve seen a lot of quirky game behavior, and we’ve always wondered: is it true that the game handicaps successful players by giving them less powerful weapons? Given that, how much does the first lap really matter? And which characters are more likely to win: heavyweights or lightweights?

We’ve also had lots of time to discuss strategy: how much does power sliding matter? How much time do you lose for each banana peel slip, shell hit, bomb impact, or fall off the track?

As serious intellectuals often do, we spent hours discussing these questions, what data we would want to collect to answer them, and even how we might go about collecting it. It sounded like a fun project, so I wrote a program that takes video captures of our Mario Kart 64 sessions and picks out when each race starts, which character is in each box on the screen, the rank of each player as the race progresses, and finally when the race finishes. Then I built a web client that lets us upload videos, record who played which character in each race, and browse the aggregated stats. The result is called Kartlytics, which now contains videos of over 230 races from the last year and change.

This worked pretty well, but as we collected more videos, the time to process them grew prohibitive. Often I’d want to tweak the program to fix detection of, say, Yoshi Valley, but I couldn’t be sure that the change wouldn’t cause it to do the wrong thing on some other track. To test that, I’d have to rerun the process on all the videos, but that took nearly a full day on my laptop.

Enter Manta

This changed significantly about a month ago when Joyent launched Manta. We like to say that we built Manta to solve all kinds of problems from log analysis to video transcoding, but the reality is now clear: Manta exists to compute analytics on Mario Kart 64 sessions. Immediately after we stood up our production Manta service, I loaded all of the videos we’ve recorded to date. If you’ve set up the toolkit, you can list these videos with:

$ mls -l /dap/public/kartlytics/videos | grep -v json
-rwxr-xr-x 1 dap      71246499 Jun 3 18:35 2012-06-19-00.mov
-rwxr-xr-x 1 dap      66418215 Jun 3 17:25 2012-06-19-02.mov
-rwxr-xr-x 1 dap      67641347 Jun 3 18:37 2012-06-19-03.mov
...

In recent weeks, I’ve converted kartlytics to use Manta for all of the video processing. Here’s the job to process each video, producing a JSON file for each one describing what happened in all races in the video:

$ export KARTLYTICS_OUTDIR=/dap/public/kartlytics/generated
$ mfind -t o -n '.*.mov$' /dap/public/kartlytics/videos |
    mjob create -w \
    -s /dap/public/kartlytics/kartlytics.tgz \
    -s /dap/public/kartlytics/bin/video-transcribe \
    --init "cd /var/tmp && tar xzf /assets/dap/public/kartlytics/kartlytics.tgz" \
    -m "/assets/dap/public/kartlytics/bin/video-transcribe /var/tmp/kartlytics $KARTLYTICS_OUTDIR "'$MANTA_INPUT_FILE'

This job uses a common pattern for Manta jobs, which is to throw an existing program (kartlytics) into a tarball in Manta, specify it as an asset for a job, and run the bundled program inside the job. In this case, the kartlytics.tgz tarball contains a built copy of the kartlytics github repo, which contains the C program and support files for doing the video processing. There’s a separate job for creating this tarball, which is just:

$ export KARTLYTICS_TARBALL=/dap/public/kartlytics/kartlytics.tgz
$ mjob create -w \
    -s /dap/public/kartlytics/bin/make-tarball \
    -r "/assets/dap/public/kartlytics/bin/make-tarball $KARTLYTICS_TARBALL" < /dev/null

The “make-tarball” script just clones the repo, builds it, and saves the tarball back to Manta.

Processing all these videos used to take nearly a day on my laptop. It takes just 5 minutes on Manta, and I don’t have to think about spinning up compute instances and then spinning them down when it’s done. This makes a serious difference because I can get feedback on algorithm tweaks in minutes rather than overnight.

There are two other Manta jobs involved: one that takes the race transcripts produced by the first job and aggregates them with the human-entered data about who played which characters in each race, producing a single aggregated JSON file; and one that takes the same race transcripts and produces smaller web-quality videos for each race that you can play directly on the kartlytics.com website. For details, see the run_all script in the repo.

Try it: Since these videos are public, you can run kartlytics yourself on our collection of videos. You can run these two jobs exactly as written, replacing KARTLYTICS_OUTDIR and KARTLYTICS_TARBALL with paths in your own storage areas (e.g., /$MANTA_USER/public/kartlytics). The above “run_all” script is also parametrized so that anyone can run it on our videos.

Highlights

The front page of kartlytics.com shows some summary information, including the tracks we play a lot:

[Screenshot: summary]

We start every session with Luigi Raceway, which is why that track has the most plays. After the first race, the winner chooses the next track, so the distribution diverges after that.

Next on the page we show races from the latest session, then the “wildest races”. This is a proxy for “slugfests”, which are races with just the right circumstances (wide open track, a tight pack, lucky weapons, etc.) for a weapons-based free-for-all. The “wildest” ones listed here are those with the most player position changes per minute.

[Screenshot: wildest races]

Below that, we have an example of a hotly-debated question: how often does a player go from 1st to 4th in less than 5 seconds? We call that a “Keithing”, and the front page shows all of them. You can sort by person to see how many times it’s happened to each player (but keep in mind that not all players have played the same number of races).

You can click on the date/time of any race to get to the race details page, which summarizes the results, shows a playable video of just that race, and then the all important race transcript, showing not only everything that happened in the race, but with screenshots to prove it:

[Screenshot: event summary]

That’s really important, both for debugging and for silencing criticism about kartlytics “making stuff up”.

You can also click any player to see that player’s races, how many times they’ve played each track, and which characters (and character classes — lightweight, middleweight, or heavyweight) they tend to play:

[Screenshot: player details]

Future

We’ve really just scratched the surface. We’ve only answered a few of the questions I mentioned above. We’ve discussed tracking lots more data, like power slides (to see how much they matter), weapons (to see how the distribution changes with rank), and various mistakes (to prove [or disprove] claims like “I keep running into banana peels today”).

We’d also love to add videos from you! It’s surprisingly easy. To record video, I purchased an iGrabber and a small power amplifier (so we could split the video signal without losing too much quality — you may not need this if you’re playing on a TV instead of a projector). The iGrabber comes with software to record the video, and kartlytics does the rest. As I showed above, you can run the software on your own jobs.

The algorithm behind Kartlytics is really pretty simplistic (see the README in the repo for details), but it can probably be applied to other cult classics that have relatively simple on-screen game status. I know our Seattle office is big on Street Fighter II; surely Street Fighter-lytics can’t be far away?
