Node.js/V8 postmortem debugging

2016 Update: The commands and output in this post are pretty dated now, but you can find an up-to-date docs (including a tutorial) in the mdb_v8 user guide.


Photo by Auntie P (flickr)

I recently wrote an article for ACM Queue discussing postmortem debugging for dynamic environments. I argued that while native environments for years have provided rich tools for understanding software failures postmortem, most popular dynamic environments have not developed similar facilities, and this gap is becoming more important as these environments grow popular for building complex core components of large distributed systems.

In building Cloud Analytics, we knew we needed a robust facility for postmortem debugging. The CA service comprises several components: a few global services that manage state, an agent on each server in the datacenter that collects performance data, and a cluster of data aggregators. Debugging distributed systems is always hard, and it’s even harder when the software is running on other people’s production systems.

We built a module called node-panic that dumps predefined program state in the event of a crash. It also has a tool for manually triggering a crash in the event of a hard-hang (e.g., infinite loop). These tools have been extremely valuable in helping us root-cause issues in both production and development environments. But they’re really just a first step, and as I explain in the article, this approach has several important limitations, mostly owing to the fact that node-panic is pure JavaScript and so isn’t baked into the VM itself and doesn’t have access to a lot of important program state.

A next step

More recently I’ve been playing around with building an MDB module for inspecting V8 state. (Recall that V8 is the blazingly fast JavaScript virtual machine in both Node.js and Chrome, and MDB is the illumos modular debugger, an extensible tool for analyzing both live programs and process core dumps.) With this approach, you can attach MDB to either a running Node program, a core file manually created using gcore(1), or a core file generated as a result of a Node crash (e.g., segfault).  Importantly, you don’t need a debug build of node for this so you can use it in production.

Here’s a very simple example program:

dap@devel ~ $ cat -n node-stack/examples/loop-anon.js
     1	/*
     2	 * This script demonstrates a simple stack trace involving an anonymous
     3	 * function, then goes into an infinite loop so you can catch it with the
     4	 * debugger.  One caveat: we grab a stack trace at the start to force V8 to
     5	 * compute line number information for this script.  If we didn't have this, the
     6	 * debugger would only print out position-in-file information.
     7	 */
     8	new Error().stack;
     9
    10	function main() { func1(); }
    11	function func1() { func2(); }
    12
    13	function func2()
    14	{
    15		(function () {
    16			for (;;)
    17				;
    18		})();
    19	}
    20
    21	main();

Note the comment at the top of the script. V8 computes line number information for a given script lazily. If it has never done this for a particular script, the debugger module won’t be able to print out the line number.  This example grabs a stacktrace at the start of the script to make sure V8 has computed line numbers for it.  (There’s an example of the other case below.)

Now I’ll run this script with a stock node build and attach MDB:

dap@devel ~ $ node node-stack/examples/loop-anon.js &
[1] 400524

dap@devel ~ $ mdb -p 400524
Loading modules: [ ld.so.1 libc.so.1 ]
> ::status
debugging PID 400524 (32-bit)
file: /usr/bin/node
threading model: native threads
status: stopped by debugger

Now I’ll grab a JavaScript stack trace:

> ::load mdb_v8.so
> ::jsstack
804738c <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon.js line 15
80473a4 func2 at /home/dap/node-stack/examples/loop-anon.js line 13
80473bc func1 at /home/dap/node-stack/examples/loop-anon.js line 1180473d4 main at /home/dap/node-stack/examples/loop-anon.js line 10
80473f0 <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon.js line 1
8047424 <InternalFrame>
804745c <anonymous> (as Module._compile) at module.js line 340
804747c <anonymous> (as Module._extensions) at module.js line 406
80474a0 <anonymous> (as Module.load) at module.js line 324
80474cc <anonymous> (as Module._load) at module.js line 257
80474ec <anonymous> (as Module.runMain) at module.js line 419
8047518 <anonymous> (as startup.processNextTick.process._tickCallback) at node.js line 120
8047534 <InternalFrame>
8047568 <EntryFrame>
fe84e240 <None>

If V8 hadn’t computed the line number information for our script, the stack trace would look more like this:

> ::jsstack
804738c <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon-noinfo.js
position 814
80473a4 func2 at /home/dap/node-stack/examples/loop-anon-noinfo.js position 772
80473bc func1 at /home/dap/node-stack/examples/loop-anon-noinfo.js position 710
80473d4 main at /home/dap/node-stack/examples/loop-anon-noinfo.js position 652
80473f0 <anonymous> (as <anon>) at /home/dap/node-stack/examples/loop-anon-noinfo.js
position 2
8047424 <InternalFrame>
804745c <anonymous> (as Module._compile) at module.js position 19476
804747c <anonymous> (as Module._extensions) at module.js position 23260
80474a0 <anonymous> (as Module.load) at module.js position 18526
80474cc <anonymous> (as Module._load) at module.js position 15200
80474ec <anonymous> (as Module.runMain) at module.js position 23916
8047518 <anonymous> (as startup.processNextTick.process._tickCallback) at node.js
position 7224
8047534 <InternalFrame>
8047568 <EntryFrame>
fe84e240 <None>

Besides getting a stack trace, you can examine individual frames in more detail using the frame pointer:

> 80473a4::jsframe -v
frame pointer: 80473a4
    function: fd862b8d (JSFunction)
        shared function info: fc615229 (SharedFunctionInfo)
            function name: fc6941a9 (ascii SeqString: "func2")
            function token position: SMI: value = 515
            script: fc614f4d (Script)
                script name: fd85ec05 (ascii ConsString: "
                /home/dap/node-stack/examples/loop-anon.js")
                line endings: fd863605 (FixedArray)
            function token at: line 13

and you can inspect individual heap objects as well. Given a pointer to any JavaScript heap object, “::jsprint” will figure out what type it is and dump a representation of the corresponding C++ class inside the V8 VM:

> fd862b8d::jstype
0xfd862b8d: JSFunction

> fd862b8d::jsprint
fd862b8d JSFunction {
    fd862b8d JSObject {
        fd862b8d HeapObject < Object < MaybeObject  {
            fd862b8c map = fe82df41 (Map)
        }
        fd862b90 properties = fe86e021 (FixedArray)
        fd862b94 elements = fe86e021 (FixedArray)
    }
    fd862b9c prototype_or_initial_map = fe86e0cd (Oddball)
    fd862ba0 shared = fc615229 (SharedFunctionInfo)
    fd862ba8 literals = fe86e021 (FixedArray)
    fd862bac next_function_link = fe86e05d (Oddball "undefined")
}

This is still very much a prototype, but we think this could be very powerful. In the meantime, there are some issues to be solved with the existing implementation and several more problems to address.

Maintainability

The heap inspector requires pretty deep knowledge of the internal memory layout of the V8 heap. How can this be maintained in a reasonable way as V8 itself evolves?

For C programs shipped with illumos, we use CTF at build-time to encode type definitions in the binary itself. CTF is compact enough to be shipped in production binaries, so MDB understands all in-memory structures for these programs on all illumos systems. Application-specific debugger commands can be implemented in terms of CTF so that they don’t have to be built specifically for a particular application binary.

This approach is tougher for C++ programs, which have additional information like class hierarchies that can’t be expressed with CTF. But while V8 uses C++ classes (and an inheritance hierarchy) to represent different heap object types, it doesn’t use standard C++ member fields. Instead, each class defines static constants describing the offsets of its member fields, and macros are used to define accessors in terms of these constants. This actually makes it easier to build and maintain a C debugger module because it’s much easier to export these constants as part of the build process than to reverse-engineer how a C++ structure gets laid out in memory. If this proves to be the right approach, then we could save these constants into the binary using CTF and then build the debugger module independent of the V8 binary.

Completeness

Today, this module can dump a simple stack trace and provide basic inspection of heap objects. But there’s a lot more that would be useful:

  • global variables
  • arguments and local variables for each function on the stack
  • list of events pending on the event queue
  • list of function contexts to be invoked in the future (e.g., when pending I/O completes)

I’ve also only tested it on pretty simple examples so far. My examples have included JavaScript frames for named and anonymous functions, constructor frames, argument adaptor frames, and simple entry frames. I haven’t yet tried optimized frames, exit frames, or an exit/entry sequence.

This module has already helped us understand enough to write a basic DTrace ustack helper for Node programs, which allows similar stack traces to be gathered via DTrace while the program is running. In future blog posts I’ll talk more about the DTrace helper as well as what I’ve learned about V8, which I hope will be useful for others hacking around with it.

JavaScript Lint on SmartOS


Photo by Sam Fraser-Smith

Back at Fishworks, we used a tool called JavaScript Lint (JSL) for static code analysis. You may know that lint was originally written to identify potential semantic problems in C code, like use of uninitialized variables or blocks of unreachable code. Lint warnings are usually static checks that could reasonably have been compiler warnings, but for whatever reasons those checks didn’t make it into the compiler.

JSL helps us catch similar errors in JavaScript: undeclared variables, variables hiding other variables in the same scope, etc. There exist several JavaScript linters out there, including Crockford’s JSLint and Google’s Closure Linter. But there are two relatively unique properties about JSL:

  • It does not conflate style with lint. Style refers to arbitrary code formatting rules (like leading whitespace rules). Lint refers to actual program correctness issues (like missing “break” statements inside a switch). The line is certainly fuzzy, as in the case of JavaScript semicolon style, but that’s why:
  • It’s highly configurable. Each individual warning can be turned on or off, and warnings can be overridden for individual lines of code. This is essential for cases where potentially dangerous behavior is being deliberately used (carefully, of course).

We’ve been using JSL on Cloud Analytics since day 1, but until recently we were using a hacked-up build I created back in November just to make forward progress. As we started using it more, it became clear that we needed to be able to build JSL reliably, which was not trivial on SmartOS because the old version of SpiderMonkey that JSL bundles doesn’t build on Solaris 10 or later out of the box. I worked out how to build it (see below), but in doing so I decided it wasn’t worth maintaining the complexity of the existing build system.  JSL hasn’t been changed much in the last many months, so I created a github fork of JSL where I removed everything that clearly wasn’t necessary for JSL and replaced the whole build system with a couple of Makefiles. The result is much less portable, but it builds on SmartOS and I expect it can be made to build on MacOS (note: see update below) and Linux with few modifications. If you want to build the existing JavaScript Lint subversion tree on Illumos, here’s what you have to do:

  • Since Sun Studio is no longer available, you’ll want to build with gcc.
  • Remove the file spidermonkey/src/lock_SunOS.s.  According to the comments, this file is only needed with Sun Studio, and it won’t build with the GNU assembler.  If you don’t remove the file, it will be picked up by a Makefile wildcard (not the line in Makefile.ref that appears to refer to it!).
  • Copy the file below into spidermonkey/src/config/SunOS5.11_i86pc.mk.
  • In the root, run “python setup.py build”.
  • This should work except for the very last step in the build.  The build system runs something like this:
    gcc -shared build/temp.solaris-2.11-i86pc-2.4/javascriptlint/pyspidermonkey/pyspidermonkey.o build/temp.solaris-2.11-i86pc-2.4/javascriptlint/pyspidermonkey/nodepos.o -Lbuild/spidermonkey -ljs -o build/lib.solaris-2.11-i86pc-2.4/javascriptlint/pyspidermonkey.so

    On my system, this produces hundreds of linker errors because gcc is trying to use the GNU linker instead of the OS linker. If you replace the “gcc” in that line with “ld” and run it by hand, it should work.

Here’s the contents of SunOS5.11_i86pc.mk:

#
# Config stuff for SunOS5.11
#

AS = as
CC = gcc
CCC = g++
CFLAGS +=  -Wall -Wno-format
RANLIB = echo
OS_CFLAGS = -DXP_UNIX -DSVR4 -DSYSV -DSOLARIS -DHAVE_LOCALTIME_R
OS_LIBS = -lsocket -lnsl -ldl
HAVE_PURIFY = 1
MKSHLIB = $(LD) -G
# Use the editline library to provide line-editing support.
JS_EDITLINE = 1

Of course, these instructions are very specific to the build environment, so YMMV. It took me a while to figure out the right settings in SunOS5.11_i86pc.mk, so I wanted to make that available in case anyone else is trying to build JSL on SmartOS, Illumos or other Solaris-based systems.  That said, if you don’t care about remaining close to the original source, you may as well just use my fork on github. Even if it doesn’t build out of the box in your environment, the Makefiles should be far easier to understand and modify.

Update: The github fork now builds on MacOSX, too, though you need to install Python first because the one shipped with OSX doesn’t include headers.

Distributed Web Architures @ SF Node.js Meetup

At the Node Meetup here at Joyent‘s offices a few weeks ago I gave a brief talk about Cloud Analytics as a distributed web architectureMatt Ranney of Voxer and Curtis Chambers of Uber also spoke about their companies’ web architectures. Thanks to Jackson for putting the videos together. All around it was a great event!

Presenting at OSCON Data 2011

Brendan‘s and my proposal for OSCON Data 2011 has been accepted! The talk is called “Design and Implementation of a Real-Time Cloud Analytics Platform“. We’ll dive into the implementation of Cloud Analytics with particular focus on the design considerations that allow it to work in real-time on a large production cloud.

We’ll also cover some real-world case studies of people using CA to understand application behavior. If you’ve used no.de‘s Analytics to look at your application, we’d love to hear from you! Leave a comment, tweet @dapsays, or email dap at joyent.