Node.js/V8 postmortem debugging

Update 2014-01-27: The commands and output described in this post have changed since it was first posted. For updated details, see the related post on the Joyent Dev Center.


Photo by Auntie P (flickr)

I recently wrote an article for ACM Queue discussing postmortem debugging for dynamic environments. I argued that while native environments for years have provided rich tools for understanding software failures postmortem, most popular dynamic environments have not developed similar facilities, and this gap is becoming more important as these environments grow popular for building complex core components of large distributed systems.

In building Cloud Analytics, we knew we needed a robust facility for postmortem debugging. The CA service comprises several components: a few global services that manage state, an agent on each server in the datacenter that collects performance data, and a cluster of data aggregators. Debugging distributed systems is always hard, and it’s even harder when the software is running on other people’s production systems.

We built a module called node-panic that dumps predefined program state in the event of a crash. It also has a tool for manually triggering a crash in the event of a hard-hang (e.g., infinite loop). These tools have been extremely valuable in helping us root-cause issues in both production and development environments. But they’re really just a first step, and as I explain in the article, this approach has several important limitations, mostly owing to the fact that node-panic is pure JavaScript and so isn’t baked into the VM itself and doesn’t have access to a lot of important program state.

A next step

More recently I’ve been playing around with building an MDB module for inspecting V8 state. (Recall that V8 is the blazingly fast JavaScript virtual machine in both Node.js and Chrome, and MDB is the illumos modular debugger, an extensible tool for analyzing both live programs and process core dumps.) With this approach, you can attach MDB to either a running Node program, a core file manually created using gcore(1), or a core file generated as a result of a Node crash (e.g., segfault).  Importantly, you don’t need a debug build of node for this so you can use it in production.

Here’s a very simple example program:

dap@devel ~ $ cat -n node-stack/examples/loop-anon.js
     1	/*
     2	 * This script demonstrates a simple stack trace involving an anonymous
     3	 * function, then goes into an infinite loop so you can catch it with the
     4	 * debugger.  One caveat: we grab a stack trace at the start to force V8 to
     5	 * compute line number information for this script.  If we didn't have this, the
     6	 * debugger would only print out position-in-file information.
     7	 */
     8	new Error().stack;
     9
    10	function main() { func1(); }
    11	function func1() { func2(); }
    12
    13	function func2()
    14	{
    15		(function () {
    16			for (;;)
    17				;
    18		})();
    19	}
    20
    21	main();

Note the comment at the top of the script. V8 computes line number information for a given script lazily. If it has never done this for a particular script, the debugger module won’t be able to print out the line number.  This example grabs a stacktrace at the start of the script to make sure V8 has computed line numbers for it.  (There’s an example of the other case below.)

Now I’ll run this script with a stock node build and attach MDB:

dap@devel ~ $ node node-stack/examples/loop-anon.js &
[1] 400524

dap@devel ~ $ mdb -p 400524
Loading modules: [ ld.so.1 libc.so.1 ]
> ::status
debugging PID 400524 (32-bit)
file: /usr/bin/node
threading model: native threads
status: stopped by debugger

Now I’ll grab a JavaScript stack trace:

> ::load mdb_v8.so
> ::jsstack
804738c <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon.js line 15
80473a4 func2 at /home/dap/node-stack/examples/loop-anon.js line 13
80473bc func1 at /home/dap/node-stack/examples/loop-anon.js line 1180473d4 main at /home/dap/node-stack/examples/loop-anon.js line 10
80473f0 <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon.js line 1
8047424 <InternalFrame>
804745c <anonymous> (as Module._compile) at module.js line 340
804747c <anonymous> (as Module._extensions) at module.js line 406
80474a0 <anonymous> (as Module.load) at module.js line 324
80474cc <anonymous> (as Module._load) at module.js line 257
80474ec <anonymous> (as Module.runMain) at module.js line 419
8047518 <anonymous> (as startup.processNextTick.process._tickCallback) at node.js line 120
8047534 <InternalFrame>
8047568 <EntryFrame>
fe84e240 <None>

If V8 hadn’t computed the line number information for our script, the stack trace would look more like this:

> ::jsstack
804738c <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon-noinfo.js
position 814
80473a4 func2 at /home/dap/node-stack/examples/loop-anon-noinfo.js position 772
80473bc func1 at /home/dap/node-stack/examples/loop-anon-noinfo.js position 710
80473d4 main at /home/dap/node-stack/examples/loop-anon-noinfo.js position 652
80473f0 <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon-noinfo.js
position 2
8047424 <InternalFrame>
804745c <anonymous> (as Module._compile) at module.js position 19476
804747c <anonymous> (as Module._extensions) at module.js position 23260
80474a0 <anonymous> (as Module.load) at module.js position 18526
80474cc <anonymous> (as Module._load) at module.js position 15200
80474ec <anonymous> (as Module.runMain) at module.js position 23916
8047518 <anonymous> (as startup.processNextTick.process._tickCallback) at node.js
position 7224
8047534 <InternalFrame>
8047568 <EntryFrame>
fe84e240 <None>

Besides getting a stack trace, you can examine individual frames in more detail using the frame pointer:

> 80473a4::jsframe -v
frame pointer: 80473a4
    function: fd862b8d (JSFunction)
        shared function info: fc615229 (SharedFunctionInfo)
            function name: fc6941a9 (ascii SeqString: "func2")
            function token position: SMI: value = 515
            script: fc614f4d (Script)
                script name: fd85ec05 (ascii ConsString: "
                /home/dap/node-stack/examples/loop-anon.js")
                line endings: fd863605 (FixedArray)
            function token at: line 13

and you can inspect individual heap objects as well. Given a pointer to any JavaScript heap object, “::jsprint” will figure out what type it is and dump a representation of the corresponding C++ class inside the V8 VM:

> fd862b8d::jstype
0xfd862b8d: JSFunction

> fd862b8d::jsprint
fd862b8d JSFunction {
    fd862b8d JSObject {
        fd862b8d HeapObject < Object < MaybeObject  {
            fd862b8c map = fe82df41 (Map)
        }
        fd862b90 properties = fe86e021 (FixedArray)
        fd862b94 elements = fe86e021 (FixedArray)
    }
    fd862b9c prototype_or_initial_map = fe86e0cd (Oddball)
    fd862ba0 shared = fc615229 (SharedFunctionInfo)
    fd862ba8 literals = fe86e021 (FixedArray)
    fd862bac next_function_link = fe86e05d (Oddball "undefined")
}

This is still very much a prototype, but we think this could be very powerful. In the meantime, there are some issues to be solved with the existing implementation and several more problems to address.

Maintainability

The heap inspector requires pretty deep knowledge of the internal memory layout of the V8 heap. How can this be maintained in a reasonable way as V8 itself evolves?

For C programs shipped with illumos, we use CTF at build-time to encode type definitions in the binary itself. CTF is compact enough to be shipped in production binaries, so MDB understands all in-memory structures for these programs on all illumos systems. Application-specific debugger commands can be implemented in terms of CTF so that they don’t have to be built specifically for a particular application binary.

This approach is tougher for C++ programs, which have additional information like class hierarchies that can’t be expressed with CTF. But while V8 uses C++ classes (and an inheritance hierarchy) to represent different heap object types, it doesn’t use standard C++ member fields. Instead, each class defines static constants describing the offsets of its member fields, and macros are used to define accessors in terms of these constants. This actually makes it easier to build and maintain a C debugger module because it’s much easier to export these constants as part of the build process than to reverse-engineer how a C++ structure gets laid out in memory. If this proves to be the right approach, then we could save these constants into the binary using CTF and then build the debugger module independent of the V8 binary.

Completeness

Today, this module can dump a simple stack trace and provide basic inspection of heap objects. But there’s a lot more that would be useful:

  • global variables
  • arguments and local variables for each function on the stack
  • list of events pending on the event queue
  • list of function contexts to be invoked in the future (e.g., when pending I/O completes)

I’ve also only tested it on pretty simple examples so far. My examples have included JavaScript frames for named and anonymous functions, constructor frames, argument adaptor frames, and simple entry frames. I haven’t yet tried optimized frames, exit frames, or an exit/entry sequence.

This module has already helped us understand enough to write a basic DTrace ustack helper for Node programs, which allows similar stack traces to be gathered via DTrace while the program is running. In future blog posts I’ll talk more about the DTrace helper as well as what I’ve learned about V8, which I hope will be useful for others hacking around with it.

Surge 2011

Surge Conference 2011, Sept. 28th - 30th, Baltimore, MDWe had a great time last week attending Surge in Baltimore. Highlights for me included Baron Schwartz‘s visualizations of MySQL execution time (not entirely unlike Brendan‘s, but with the addition of modeling), Geir Magnusson‘s discussion of scalability at Gilt, and Raymond Blum‘s discussion of backup/restore at Google (including recovery after the GMail outage from several months ago).

In his keynote, Ben Fried introduced what turned out to be an important theme in several talks: the importance of hiring generalists. Generalists are people who dive into problems and chase them through all levels of the stack. Ben’s point resonated strongly with me, as it was a very important principle at Fishworks and continues to be important here at Joyent. Bryan posed an interesting follow-up question: are generalists born or made? In my (limited) experience, being a generalist is more an attitude and set of traits (like tenacity and persistence) than any particular set of skills, and while most people either have that attitude or they don’t, those that do can often go either way depending on the environment.

I remember the turning point for me early in my internship with the Solaris kernel group. I was editing source files on the build machine over NFS, typed “:w”, and my vim process hung. Being pretty new to Solaris and this dev environment, I bugged my mentor Dan Price about it. He introduced me to the kernel debugger, showed me how to find the kernel stack for the vim process’s main thread, and then had me pull up the NFS client source where the thread was hung. This seems obvious now, but at the time it hadn’t even occurred to me that I could figure out why my desktop was hung by just pulling up the kernel source and looking at what it was doing. That experience taught me not to see component boundaries as barriers to debugging. Since then, while I spend the vast majority of my time in familiar source bases, I’ve also found myself in code from Apache (to figure out why httpd was hung) to rabbitmq (to understand the problem discussed in our talk). (Not that I know Erlang, but I could make sense of enough to follow the rabbitmq source.) Besides helping to solve an immediate problem, I’ve found that diving into new codebases both broadens your experience and deepens your understanding of the whole system. This investment usually pays off huge.

So one important takeaway from Ben’s and several other talks was that it’s critical to build teams with generalists because understanding failures in complex systems by passing the problem around to different specialized groups is often just untenable.

On Thursday, Bryan and I spoke about the technical choices we made in designing Cloud Analytics and how those have turned out. We also shared some of our recent challenges in deploying Cloud Analytics in production. The slides are available, the talk was recorded, and the videos should be available early next year.

Thanks to OmniTI for putting on a great conference and to everyone who attended. I hope to see you all next year!