Profiling Node.js

(For returning readers, this is basically a “tl;dr” version of my previous post on Node.js performance. The post below also appears on the Node.js blog.)

It’s incredibly easy to visualize where your Node program spends its time using DTrace and node-stackvis (a Node port of Brendan Gregg’s FlameGraph tool):

  1. Run your Node.js program as usual.
  2. In another terminal, run:
    $ dtrace -n 'profile-97/execname == "node" && arg1/{
        @[jstack(100, 8000)] = count(); } tick-60s { exit(0); }' > stacks.out

    This will sample about 100 times per second for 60 seconds and emit results to stacks.out. Note that this will sample all running programs called “node”. If you want a specific process, replace execname == "node" with pid == 12345 (the process id).

  3. Use the “stackvis” tool to transform this directly into a flame graph. First, install it:
    $ npm install -g stackvis

    then use stackvis to convert the DTrace output to a flamegraph:

    $ stackvis dtrace flamegraph-svg < stacks.out > stacks.svg
  4. Open stacks.svg in your favorite browser.

You’ll be looking at something like this:

This is a visualization of all of the profiled call stacks. This example comes from the “hello world” HTTP server on the Node.js home page under load. Start at the bottom, where you have “main”, which is present in most Node stacks because Node spends most of its on-CPU time in the main thread. Each frame sits below the functions it calls, so as you move up you start to see actual JavaScript function names. The boxes in each row are not in chronological order, but their width indicates how much time was spent in that frame, and hovering over a box shows exactly what percentage of time was spent in that function. This lets you see at a glance where your program spends its time.
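
If you'd like something to try this on, here's a minimal sketch of that kind of server (the port number is arbitrary); run it, put it under load with your favorite HTTP load generator, and run the dtrace invocation above in another terminal:

    /* A minimal sketch of the "hello world" HTTP server being profiled. */
    var http = require('http');

    http.createServer(function (req, res) {
    	res.writeHead(200, { 'Content-Type': 'text/plain' });
    	res.end('Hello World\n');
    }).listen(8124);

    console.log('Server running at http://127.0.0.1:8124/');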

That’s the summary. There are a few prerequisites:

  • You must gather data on a system that supports DTrace with the Node.js ustack helper. For now, this pretty much means illumos-based systems like SmartOS, including the Joyent Cloud. Mac users: OS X supports DTrace, but not ustack helpers. The way to get this changed is to contact your Apple developer liaison (if you're lucky enough to have one) or file a bug report at bugreport.apple.com. I'd suggest referencing existing bugs 5273057 and 11206497. More bugs filed (even if closed as duplicates) show more interest and make it more likely that Apple will choose to fix this.
  • You must be on 32-bit Node.js 0.6.7 or later, built --with-dtrace (a build sketch follows this list). The helper doesn't work with 64-bit Node yet. On illumos (including SmartOS), development releases (the 0.7.x train) include DTrace support by default.
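
For reference, building Node with ustack helper support looks roughly like this (a sketch; the --with-dtrace flag is covered in more detail in the longer write-up below):

    $ ./configure --with-dtrace
    $ make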

There are a few other notes:

  • You can absolutely profile apps in production, not just development, since compiling with DTrace support has very minimal overhead. You can start and stop profiling without restarting your program.
  • You may want to run the stacks.out output through c++filt to demangle C++ symbols. Be sure to use the c++filt that came with the compiler you used to build Node. For example:
    $ c++filt < stacks.out > demangled.out

    then you can use demangled.out to create the flamegraph.

  • If you want, you can filter stacks containing a particular function. The best way to do this is to first collapse the original DTrace output, then grep out what you want:
    $ stackvis dtrace collapsed < stacks.out | grep SomeFunction > collapsed.out
    $ stackvis collapsed flamegraph-svg < collapsed.out > stacks.svg
  • If you've used Brendan's FlameGraph tools, you'll notice the coloring is a little different in the above flamegraph. I ported his tools to Node so that I could incorporate them more easily into other Node programs, but I've also been playing with different coloring options. The current default uses hue to denote stack depth and saturation to indicate time spent. (These are also indicated by position and size.) Other ideas include coloring by module (so V8, JavaScript, libc, etc. show up as different colors).

For more on the underlying pieces, see my previous post on Node.js profiling and Brendan’s post on Flame Graphs.


Where does your Node program spend its time?



Performance analysis is one of the most difficult challenges in building production software. If a slow application isn’t spending much time on CPU, it could be waiting on filesystem (disk) I/O, network traffic, garbage collection, or many other things. We built the Cloud Analytics tool to help administrators and developers quickly identify these sources of latency in production software. When the app is actually spending its time on CPU, the next step is figuring out what it’s doing: what functions it’s executing and who’s calling them.

One approach is to sample the program at, say, 1000 times per second and examine the 5-10 most frequently seen stacktraces. DTrace lets you do this with the profile provider, whose output might include something like this for a C program:

    libc.so.1`mutex_trylock_adaptive+0x137              ← stacktrace
    libc.so.1`mutex_lock_impl+0xdc
    libc.so.1`mutex_lock+0x10
    mysqld`_Z10open_tableP3THDP10TABLE_LISTP11st_mem_rootPbj+0x4b1
    mysqld`_Z11open_tablesP3THDPP10TABLE_LISTPjj+0x205
    mysqld`_Z20open_and_lock_tablesP3THDP10TABLE_LIST+0x1b
    mysqld`_Z21mysql_execute_commandP3THD+0x416
    mysqld`_Z11mysql_parseP3THDPKcjPS2_+0xf7
    mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x6c9
    mysqld`_Z10do_commandP3THD+0xca
    mysqld`handle_one_connection+0x343
    libc.so.1`_thr_setup+0x70
    libc.so.1`_lwp_start
                7                                       ← count

... (more stacktraces and associated counts)

Brendan has a great example of using this to find a 2000x win in GNU grep. Knowing that many samples shared a particular stacktrace like this one often points you right to the performance problem. But for dynamic environments like Python or Node.js, this information is very hard to obtain. Although the V8 VM supports profiling, it has to be enabled when you start the program, and this adds overhead to the entire execution of the program. That’s a non-starter for most production deployments. And although you can use DTrace to profile such programs, by default it sees only the native stack frames. In environments like Python, this corresponds to interpreter frames, which are fairly useless:

    libpython2.6.so.1.0`PyObject_Free+0xe
    libpython2.6.so.1.0`PyObject_Realloc+0xcb
    libpython2.6.so.1.0`_PyString_Resize+0x7f
    libpython2.6.so.1.0`get_line+0x333
    libpython2.6.so.1.0`file_readline+0xa8
    libpython2.6.so.1.0`PyCFunction_Call+0x63
    libpython2.6.so.1.0`PyEval_EvalFrameEx+0x4099
    libpython2.6.so.1.0`PyEval_EvalCodeEx+0x7d8
    libpython2.6.so.1.0`PyEval_EvalCode+0x2f
    libpython2.6.so.1.0`run_mod+0x41
    libpython2.6.so.1.0`PyRun_FileExFlags+0x7e
    libpython2.6.so.1.0`PyRun_SimpleFileExFlags+0x32c
    libpython2.6.so.1.0`PyRun_AnyFileExFlags+0x69
    libpython2.6.so.1.0`Py_Main+0xa70
    python2.6`main+0x28
    python2.6`_start+0x80

You can’t tell what Python-level functions are being called here. In Node.js things are even worse because the VM compiles JavaScript functions and invokes them directly on the heap, so you often don’t see any symbols at all:

    0xfc618bc0
    0xfc61bd62
    0xfe870841
    0xfc61c1f3
    0xfc617685
    0xfe870841
    0xfc6154d7
    0xfe870e1a
    0xfe8700a6
    node-dtrace`_ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPPPS4_Pb+0x1fc
    node-dtrace`0x85e6d50

However, through a mechanism called a ustack helper, DTrace can be taught to understand V8 stack frames, giving you a stacktrace that looks more like this (colors and labels added for clarity):

    libc.so.1`gettimeofday+0x7                                  ←  libc gettimeofday call
    Date at  position                                           ←  JavaScript "Date" function
    << adaptor >>
    << constructor >>                                           ← (invoked as a constructor)
    (anon) as exports.active at timers.js position 7590
    (anon) as Socket._write at net.js position 21336
    (anon) as Socket.write at net.js position 19714
    << adaptor >>
    (anon) as OutgoingMessage._writeRaw at http.js position 21526
    (anon) as OutgoingMessage._send at http.js position 20434
    << adaptor >>
    (anon) as OutgoingMessage.end at http.js position 35062     ← re-entering node library
    << adaptor >>
    (anon) as a at /home/dap/work-server.js line 25
    handle at /home/dap/work-server.js line 13                  ← invoking our request handler
    (anon) as EventEmitter.emit at events.js position 3532
    << adaptor >>
    (anon) as parser.onIncoming at http.js position 80542
    (anon) as parser.onHeadersComplete at http.js position 4904 ← parsing HTTP headers
    << internal >>
    << entry >>                                                 ← entering JavaScript
    node-dtrace`_ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPPPS4_Pb+0x1fc
    node-dtrace`0x85e6d50

The helper is built into the “node” binary and translates each stack frame that it understands into a text description that typically includes the function name, the inferred name (for anonymous functions), and the file and position where the function is defined. Unrecognized stack frames are left alone, so you can see native C/C++ frames mixed with JavaScript frames. If line numbers are available, the position includes the line number where the function was defined; if not, only the character position in the file is printed. V8 only has line number information available for files that have previously been used in an exception stack trace.

Reading from the bottom up, we see the initial entry into JavaScript from node, then some frames in the node library for processing HTTP parser events, then the call into our program to handle the HTTP request, a call back to the node library to finish sending the response, then native code invoked by the Date constructor invoked as part of sending the response.

There are a few frames that you might be surprised to see if you’re not familiar with V8 internals:

  • entry: frame transitioning from C++ to JavaScript
  • exit: frame transitioning from JavaScript to C++
  • constructor: the above frame's function is being invoked as a constructor
  • adaptor: adaptor used when a function is invoked with fewer arguments than it expects
  • internal: internal V8 frame (does not correspond to a JavaScript frame)

Now that we can have DTrace output JavaScript-friendly stacktraces, we can use the profile provider as described above to identify common stacktraces and thus find out where any Node program is spending its time.

Instructions

The DTrace ustack helper landed in the upcoming Node v0.6.7 and the tools for generating a flame graph are available on github. With this, you’ve got all the pieces to try this out on your own programs, though there are a few caveats to be aware of for now.

  1. First, you must be running on an illumos- (or Solaris-)based system. ustack helpers don't work at all on Mac OS and appear to be broken on FreeBSD. If you don't have such a system and want to play around, try a SmartOS live CD (it's really easy).
  2. You must also have access to modify kernel tunables, as the default limits are too low to support the V8 helper. (We’ll be rolling these tunable changes across the Joyent cloud so that SmartMachine customers can skip this step, but that hasn’t been completed yet.) Use MDB or /etc/system to set dtrace_helper_actions_max from 32 (the default) to at least 128:
    # echo "dtrace_helper_actions_max/W 0t128" | mdb -kw
    dtrace_helper_actions_max:      0x20            =       0x80
    

    (Update: modern versions of SmartOS need no tuning, so you can skip this step.)

  3. Build Node with ustack helper support. Run ./configure --with-dtrace and you’ll get support for the ustack helper as well as additional DTrace probes.
  4. Run your program as normal. I’ll use this sample program:
    /*
     * This script demonstrates a simple stack trace involving an anonymous
     * function, then goes into an infinite loop so you can more easily catch it
     * with DTrace.  One caveat: we grab a stack trace at the start to force V8 to
     * compute line number information for this script.  If we didn't have this,
     * DTrace would print out position-in-file information instead of line numbers.
     */
    new Error().stack;
    
    function main() { func1(); }
    function func1() { func2(); }
    
    function func2()
    {
    	(function () {
    		for (;;)
    			;
    	})();
    }
    
    main();
    

    which I run like this:

    $ node loop-anon.js &
    [1] 1791
    
  5. Now we’ll use this DTrace invocation to profile our program for 60 seconds and aggregate on stacktrace:
    # dtrace -n 'profile-997/execname == "node" && arg1/ {
        @[jstack(40, 2000)] = count(); } tick-60s { exit(0); }' > out.stacks
    

    The script automatically stops after 60 seconds. At this point, the output file is already pretty useful, showing JavaScript stacktraces sorted by frequency count. We can also use this file to generate flame graphs (below).

    If you’re not familiar with DTrace, this invocation has a few important pieces:

    • profile-997: Samples on all CPUs at 997Hz. We use 997 instead of the more obvious 1000 to avoid oversampling periodic activity.
    • execname == "node": the probe's body only fires when the currently running program is "node", to avoid sampling other processes on the system. You could omit this and sample everything, or test against a specific pid instead to profile just one process (see the sketch just after this list).
    • arg1: for profile probes, arg1 is the current user-level instruction pointer, so testing that it's non-zero ensures that we only fire the probe body for samples that interrupted userland code.
    • @[jstack(40, 2000)] = count();: the body of the probe populates an aggregation that counts the number of times each userland stacktrace has been seen. We use jstack to collect the userland stack trace. We gather 40 frames and give the ustack helper 2000 bytes of buffer for translating the stack frames.
    • tick-60s { exit(0); }: causes the DTrace invocation to exit after 60 seconds.
    • > out.stacks: saves the output to out.stacks rather than stdout.
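
    As noted in the execname bullet above, here's what the pid-based variant would look like for the sample program, which came up as pid 1791 (a sketch; substitute your own process id):

    # dtrace -n 'profile-997/pid == 1791 && arg1/ {
        @[jstack(40, 2000)] = count(); } tick-60s { exit(0); }' > out.stacks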

If you run through these steps with the above example program, you should see all samples with the same or nearly identical stacktraces, since the program quickly enters a tight loop. For a more interesting example, I ran this on one of our Cloud Analytics services. From the output you can see that not all of the stacktraces are complete; see below for details on that.

Scaling up

We believe this DTrace-based profiling support for Node.js will be a huge leap for understanding the performance of individual Node programs in production. We’re also exploring ways of scaling this up to many instances of the same program, or cases where there’s no single hot stacktrace but rather one long-running function that calls many different functions during its execution. For such cases, we’ve been looking at flame graphs to visualize overlapping stacktraces. Here’s a flame graph based on the above output that shows where the Cloud Analytics aggregator (itself a Node.js program) spends its time:

Try clicking that graph for an interactive SVG where you can inspect individual frames.

From here we can see that most time is spent generating heatmaps (not surprisingly, if you saw our OSCON or Surge presentations that talk about our architecture). We can also tell how much time is spent bucketizing the data in preparation for rendering the heatmap vs. generating the PNG image itself.

You can easily generate flame graphs for your own programs. Collect data as shown above, then use the stackcollapse and flamegraph tools to generate the graph.
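
For example, assuming a local checkout of Brendan's FlameGraph repository (tool names as of this writing), something like this should produce the SVG:

    $ ./stackcollapse.pl out.stacks > out.folded
    $ ./flamegraph.pl out.folded > graph.svg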

Conclusions

With the V8 ustack helper, you can profile a running Node program to see what functions it’s spending time in. Being DTrace-based, you can switch this on and off at will and do it safely in production with literally zero overhead when not enabled and minimal overhead when enabled. (There are also ustack helpers for Python and Java.)

There are still some cases that the ustack helper doesn’t handle well that lead to some artifacts in the output. For example, in the above aggregator flame graph, three of the frames called from caAggrValueHeatmapImage are actually the same function, but because the helper doesn’t resolve them, they appear as distinct hex addresses and so show up as three silos instead of one.

Another artifact appears to result from the fact that internal V8 stack frames don’t use frame pointers. This is a common optimization that destroys observability because you can’t collect a stacktrace without frame pointers. More investigation is needed to determine whether this is actually the problem and whether it would be a reasonable tradeoff to disable this optimization.

Nevertheless, for us this tool has been a floodlight in an otherwise pitch-black room. It’s unfortunate that ustack helpers don’t work on Mac OS or FreeBSD, but perhaps with the increased attention on performance analysis in dynamic environments like Node, those systems will address the underlying issues. At Joyent, we’ll be updating both the kernel and the node binaries that we ship to fully support DTrace-based profiling out of the box, and we’re exploring ways of incorporating this data into Cloud Analytics.

USDT Providers Redux

In this post I’m going to review DTrace USDT providers and show a complete working example that I hope will be a useful reference for people interested in building providers for their own applications.

First, the prerequisites:

  • DTrace is the comprehensive dynamic tracing framework available on illumos-based systems, BSD, and Mac OS X. If you've never used DTrace, check out dtrace.org; this post assumes you've at least played around with it.
  • USDT (Userland Statically Defined Tracing) is the mechanism by which application developers embed DTrace probes directly into an application. This allows users to trace semantically meaningful operations like “request-start”, rather than having to know which function implements the operation. More importantly, since USDT probes are part of the source code, scripts that use them continue working even as the underlying software evolves and the implementing functions are renamed and deleted.[1]


Back in 2009 I posted about how we put together an Apache USDT provider for the Sun Storage appliances. Although I wasn't able to include any of the actual source, this post became the de facto documentation on implementing a USDT provider with translated arguments. Two years later, as part of our work on Cloud Analytics, we've written (from scratch) a new Apache provider called mod_usdt, and this time the complete source is available on github.

I hope this provider can be a useful reference for others building USDT providers. There are plenty of other providers out there already, but many of them don’t use translated arguments (see below), and most of them are (naturally) embedded inside a complex application and build system, so it’s hard to pick out what’s actually relevant. Since this module is built separately from Apache, nearly 100% of the code in mod_usdt relates directly to the provider.

The main files are these:

  • src/httpd.d: defines the structure translators. Delivered into /usr/lib/dtrace and loaded by dtrace(1M) at run-time.
  • src/httpd_provider.d: defines the provider probes and arguments. Linked into the module binary at link-time.
  • src/httpd_provider_impl.h: defines the structure passed from Apache to DTrace at run-time.
  • src/usdt.c: implements an Apache module that uses hooks at the beginning and end of each request to fire the corresponding DTrace probes.

There are also some example DTrace scripts in the repo for tracing request latency; see the README for details.  Between these various pieces, you have everything you need to define and use the provider.

At this point, if you want to understand the pieces of the implementation, your best bet is to go read all the code in mod_usdt. It’s not actually that much code, and it’s well documented. The rest of this post describes how the surrounding mechanism works.

USDT probes

The provider implementation uses macros in the source to define probe points:

    HTTPD_REQUEST_START(request->method, request->uri);

This example (not actually from mod_usdt) includes two probe arguments, which D scripts can access to filter or aggregate on the HTTP method and URI. Passing simple arguments like this works well for simple probes, but it starts getting unwieldy as you add more arguments. So DTrace also allows you to define C-style structs that are passed as arguments to the probes. To do this, you must define translators that take the actual arguments from the application and populate a struct for use in the probe. (Don't worry if you didn't catch all that. This will become clearer below.)
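
With simple arguments like those, a D one-liner can already filter and aggregate. Here's a sketch using the illustrative two-argument probe above (again, not mod_usdt's actual probe signature):

    # dtrace -n 'httpd*:::request-start
        { @[copyinstr(arg0), copyinstr(arg1)] = count(); }'

When you interrupt it with CTRL-C, it prints how many requests were seen for each method/URI pair.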

Building the shared object

We start with the provider definition file, httpd_provider.d, which defines probes like this:

probe request__start(dthttpd_t *p) : (conninfo_t *p, httpd_rqinfo_t *p);

This defines a probe called “request-start”. The application will pass a pointer to dthttpd_t, and the actual D probes will get a conninfo_t and a httpd_rqinfo_t.

When you run "make", the first step is to take this provider definition and generate a header file defining the macros that the application uses to fire the probes:

dtrace -xnolibs -h -o build/httpd_provider.h -s src/httpd_provider.d

The header file has definitions like this:

#define HTTPD_REQUEST_START(arg0) \
        __dtrace_httpd___request__start(arg0)

Conceptually, the application uses this macro to fire the probe. (As we’ll see, there’s some magic involved.) When you compile the file:

gcc -Wall -Werror -fPIC  -Ibuild -I/opt/local/include/apr-1 -I/opt/local/include -I/opt/local/include/db4 -D_REENTRANT -I/usr/include -DLDAP_DEPRECATED -I/opt/local/include/httpd -c -o build/usdt.o src/usdt.c

The resulting object file indeed includes a call to the special __dtrace function, but there’s another important step next:

dtrace -xnolibs -G -o build/httpd_provider.o -s src/httpd_provider.d build/usdt.o

The "dtrace -G" pass iterates over each of the objects (just usdt.o in this case), identifies these function calls, and for each one records which probe is being fired and the location in the program text[2]. It then replaces the function calls with "nop" instructions, which are ignored by the CPU. A new object file (httpd_provider.o) is generated that includes the probe and location information in a special SUNW_dof section. This object also contains an "_init" function, which we'll get to shortly. To wrap up the build, we link these objects into the final shared library:

gcc -shared -fPIC  -o build/mod_usdt.so build/usdt.o build/httpd_provider.o

Loading the module

When the library is actually loaded, the _init function transmits the information in the SUNW_dof section down to the DTrace kernel module. See the source for details. With this, DTrace knows which probes are available for this process and their locations in memory.

Enabling the probes

When a user runs dtrace(1M) to instrument a USDT probe, DTrace replaces the "nop" instructions at all of that probe's call sites with "int 3" (0xcc) instructions, which cause a fast trap into the kernel. The DTrace kernel module can tell (from the location of the trapping instruction) that this corresponds to an enabled probe, so it fires the probe and returns to userland.

When the probes are disabled again (i.e., when you CTRL-C the dtrace(1M) process), the "int 3" instructions are changed back to nops.

The upshot of all this is that when DTrace is not enabled, the application runs essentially the same as it did before USDT support was added at all. There’s no overhead to just having DTrace support.

Translated arguments

The files in /usr/lib/dtrace define structures and translators that take arguments passed into probes from the application and convert them to semantically meaningful structures. For details, see src/httpd.d in mod_usdt.

As described above, you can get away without translated arguments by passing primitives (ints and pointers, with which you can access strings too) directly to the probes. That way, you don’t have to deliver files into /usr/lib/dtrace.
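
To give a sense of the payoff, here's a sketch of a latency script in the spirit of the examples shipped in the repo. The request-done probe and the rq_uri member name are assumptions for illustration; consult httpd_provider.d, src/httpd.d, and the repo's example scripts for the actual probe and field names:

    #!/usr/sbin/dtrace -s

    httpd*:::request-start
    {
            /* args[1] is the translated httpd_rqinfo_t for this request */
            self->uri = args[1]->rq_uri;
            self->start = timestamp;
    }

    httpd*:::request-done
    /self->start/
    {
            @latency[self->uri] = quantize(timestamp - self->start);
            self->uri = 0;
            self->start = 0;
    }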

And a few more things

Believe it or not, all of the above is a gross simplification of what actually goes on. To see all the details of implementing a provider, you’ll have to read all of mod_usdt. For a taste of how complex the mechanism for firing userland probes really is, check out Bryan’s post on what happens when magic collides.

Also, I explained above that there's no overhead for disabled probes. That's almost true: it holds as long as there's no code required to set up the arguments for the DTrace probe. But oftentimes you do want to include more complex arguments. To minimize the overhead of setting them up when DTrace is not enabled, you can use the ISENABLED macros to tell at run-time whether the probe is enabled. mod_usdt uses these to avoid setting up the argument structure if the probes are not enabled.
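
For example, "dtrace -h" generates an is-enabled macro alongside each probe macro, so the check looks something like this sketch. (The structure fields here are illustrative; the real ones are declared in src/httpd_provider_impl.h, and "r" stands in for Apache's request_rec.)

    if (HTTPD_REQUEST_START_ENABLED()) {
            dthttpd_t dt;

            /* Illustrative fields; populate from the request_rec "r". */
            dt.rq_method = r->method;
            dt.rq_uri = r->uri;

            HTTPD_REQUEST_START(&dt);
    }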

Conclusion

USDT is extremely powerful for tracing semantically meaningful events in userland applications. While implementing a provider doesn’t require a lot of work per se, the steps required are not very obvious. I hope that mod_usdt will serve as a useful reference for those looking to add USDT to their own application, besides being useful in its own right for those using Apache.


[1] This contrasts with the pid provider, which allows users to trace almost any instruction, function entry, and function exit in an application. This flexibility makes the pid provider extremely valuable for ad-hoc investigation, but the resulting dependence on application implementation details makes it unsuitable for stable tools.

[2] The “dtrace -G” pass emits relocations, since the final locations of these instructions won’t be known until the run-time linker runs.

Distributed Web Architectures @ SF Node.js Meetup

At the Node Meetup here at Joyent‘s offices a few weeks ago I gave a brief talk about Cloud Analytics as a distributed web architectureMatt Ranney of Voxer and Curtis Chambers of Uber also spoke about their companies’ web architectures. Thanks to Jackson for putting the videos together. All around it was a great event!