TCP puzzlers

This post was cross-posted to the Joyent blog.

It’s been said that we don’t really understand a system until we understand how it fails. Despite having written a (toy) TCP implementation in college and then working for several years in industry, I’m continuing to learn more deeply how TCP works — and how it fails. What’s been most surprising is how basic some of these failures are. They’re not at all obscure. I’m presenting them here as puzzlers, in the fashion of Car Talk and the old Java puzzlers. Like the best of those puzzlers, these are questions that are very simple to articulate, but the solutions are often surprising. And rather than focusing on arcane details, they hopefully elucidate some deep principles about how TCP works.


These puzzlers assume some basic knowledge about working with TCP on Unix-like systems, but you don’t have to have mastered any of this before diving in. As a refresher:

  • TCP states, the three-way handshake used to establish a connection, and the way connections are terminated are described pretty concisely on the TCP Wikipedia page.
  • Programs typically interact with sockets using read, write, connect, bind, listen, and accept. There’s also send and recv, but for our purposes, these work the same way as read and write.
  • I’ll be talking about programs that use poll. Although most systems use something more efficient like kqueue, event ports, or epoll, these are all equivalent for our purposes. As for applications that use blocking operations instead of any of these mechanisms: once you understand how TCP failure modes affect poll, it’s pretty easy to understand how it affects blocking operations as well.

You can try all of these examples yourself. I used two virtual machines running under VMware Fusion. The results match my experiences in our production systems. I’m testing using the nc(1) tool on SmartOS, and I don’t believe any of the behavior shown here is OS-specific. I’m using the illumos-specific truss(1) tool to trace system calls and to get some coarse timing information. You may be able to get similar information using dtruss(1m) on OS X or strace(1) on GNU/Linux.

nc(1) is a pretty simple tool. We’ll use it in two modes:

  • As a server. In this mode, nc will set up a listening socket, call accept, and block until a connection is received.
  • As a client. In this mode, nc will create a socket and establish a connection to a remote server.

In both modes, once connected, each side uses poll to wait for either stdin or the connected socket to have data ready to be read. Incoming data is printed to the terminal. Data you type into the terminal is sent over the socket. Upon CTRL-C, the socket is closed and the process exits.
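
That per-connection loop can be sketched in Python using the socket module and select.poll. This is a rough illustration of the behavior described above, not nc’s actual source; the relay function and its file-descriptor arguments are invented for illustration:

```python
import os
import select

def relay(sock, infd, outfd):
    # Hypothetical sketch of nc's loop once connected: wait for either
    # "stdin" (infd) or the socket to be readable, and shuttle bytes across.
    # Returns when the socket reads 0 bytes (end-of-stream).
    p = select.poll()
    p.register(sock.fileno(), select.POLLIN)
    p.register(infd, select.POLLIN)
    while True:
        for fd, _events in p.poll():
            if fd == sock.fileno():
                data = sock.recv(1024)
                if data == b"":            # peer sent FIN: end-of-stream
                    return
                os.write(outfd, data)      # print incoming data
            else:
                data = os.read(infd, 1024)
                if data == b"":            # local EOF: stop watching input
                    p.unregister(infd)
                else:
                    sock.sendall(data)     # send typed data over the socket
```

(The real nc also closes the socket and exits on CTRL-C; error handling is omitted here.)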

In these examples, my client is called kang and my server is called kodos.

Warmup: Normal TCP teardown

This one demonstrates a very basic case just to get the ball rolling. Suppose we set up a server on kodos:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464310423.7650  [ Fri May 27 00:53:43 UTC 2016 ]
 0.0027 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0028 listen(3, 1, SOV_DEFAULT)                       = 0
accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) (sleeping...)

(Remember, in these examples, I’m using truss to print out the system calls that nc makes. The -d flag prints a relative timestamp and the -t flag selects which system calls we want to see.)

Now on kang, I establish a connection:


[root@kang ~]# truss -d -t connect,pollsys,read,write,close nc kodos 8080
Base time stamp:  1464310447.6295  [ Fri May 27 00:54:07 UTC 2016 ]
 0.0062 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Over on kodos, we see:


23.8934 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)

At this point, the TCP connections are in state ESTABLISHED and both processes are in poll. We can see this using the netstat tool on each system:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                          1049792      0 1049800      0 ESTABLISHED

[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                            32806      0 1049800      0 ESTABLISHED

The question is, when we shut down one of these processes, what happens in the other process? Does it find out? How? Try to predict the specific syscall behavior and explain why each system call does what it does.

Let’s try it. I’ll CTRL-C the server on kodos:


pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)
^C127.6307          Received signal #2, SIGINT, in pollsys() [default]

Here’s what we see on kang:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
126.1771        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
126.1774        read(3, 0x08043670, 1024)                       = 0
126.1776        close(3)                                        = 0
[root@kang ~]#

What happened here? Let’s be specific:

  1. We sent SIGINT to the server, causing the process to exit. Upon exit, file descriptors are closed.
  2. When the last file descriptor for an ESTABLISHED socket is closed, the TCP stack on kodos sends a FIN over the connection and enters the FIN_WAIT_1 state.
  3. The TCP stack on kang receives the FIN packet, transitions its own connection to CLOSE_WAIT, and sends an ACK back. Since the nc client is blocked on this socket being ready to read, the kernel wakes this thread with the event POLLIN.
  4. The nc client sees POLLIN for the socket and calls read, which returns 0 immediately. This indicates end-of-stream. nc presumes that we’re done with this socket and closes it.
  5. Meanwhile, the TCP stack on kodos receives the ACK and enters FIN_WAIT_2.
  6. Since the nc client on kang closed its socket, the TCP stack on kang sent a FIN to kodos. The connection on kang enters LAST_ACK.
  7. The TCP stack on kodos receives the FIN, the connection enters TIME_WAIT, and the stack on kodos acknowledges the FIN.
  8. The TCP stack on kang receives the ACK for the FIN and removes the connection entirely.
  9. Two minutes later, the TCP connection on kodos is closed and the stack removes the connection entirely.

(It’s possible for steps to be slightly reordered and for kodos to transition through CLOSING instead of FIN_WAIT_2.)
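
Steps 2 through 4 (the FIN arriving, poll waking with POLLIN, and read returning 0) can be reproduced with plain sockets. This is a Python sketch over loopback standing in for the two machines, not nc itself:

```python
import select
import socket

# Loopback stand-ins: srv/conn play kodos's "nc -l", client plays kang's nc.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
client = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()

conn.close()                          # like CTRL-C on kodos: a FIN goes out

p = select.poll()
p.register(client.fileno(), select.POLLIN)
events = p.poll(5000)                 # the kernel wakes us with POLLIN
data = client.recv(1024)              # read returns 0 bytes: end-of-stream
print(bool(events[0][1] & select.POLLIN), data)  # -> True b''
client.close()                        # our FIN; the other side hits TIME_WAIT
```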

Here’s the final state according to netstat:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                          1049792      0 1049800      0 TIME_WAIT

There’s no output for this connection at all on kang.

The intermediate states transition very quickly, but you could see them using the DTrace TCP provider. You can see the packet flow using snoop(1m) or tcpdump(1).

Conclusions: We’ve seen the normal path of system calls for connection establishment and teardown. Note that kang immediately found out when kodos’s connection was closed — it was woken up out of poll and read returned 0 to indicate end-of-stream. At that point, kang elected to close its socket, which cleaned up the connection state on kodos. We’ll revisit this later to see how the results can differ if kang doesn’t close its socket here.

Puzzler 1: Power cycling

What happens to an established, idle TCP connection if one system is power cycled?

Since most processes go through normal exit as part of a graceful reboot (e.g., if you use the “reboot” command), you get basically the same result if you type “reboot” on kodos instead of killing the server with CTRL-C. But what would happen if we’d power-cycled kodos in the previous example? Surely kang will eventually find out, right?

Let’s try it. Set up the connection:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464312528.4308  [ Fri May 27 01:28:48 UTC 2016 ]
 0.0036 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0036 listen(3, 1, SOV_DEFAULT)                       = 0
 0.2518 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)

[root@kang ~]# truss -d -t open,connect,pollsys,read,write,close nc kodos 8080
Base time stamp:  1464312535.7634  [ Fri May 27 01:28:55 UTC 2016 ]
 0.0055 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Now I’ll use VMware’s “reboot” function to power-cycle the system. It’s important for this test that this be an actual power cycle (or OS panic) — anything that causes a graceful shut down will look more like the first case above.

20 minutes later, kang is still sitting in the same place:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

We tend to believe that TCP’s job is to maintain a consistent abstraction (namely, the TCP connection) across multiple systems, so it’s surprising to discover cases where the abstraction is broken like this. And lest you think this is some nc(1) issue, it’s not. “netstat” on kodos shows no connection to kang, but kang shows a perfectly healthy connection to kodos:


[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                            32806      0 1049800      0 ESTABLISHED

We can leave this forever without touching it and kang will never figure out that kodos rebooted.

Now, suppose at this point, kang tries to send data to kodos. What happens?

Let’s try it:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
kodos, are you there?
3872.6918       pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
3872.6920       read(0, " k o d o s ,   a r e   y".., 1024)     = 22
3872.6924       write(3, " k o d o s ,   a r e   y".., 22)      = 22
3872.6932       pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
3872.6932       read(3, 0x08043670, 1024)                       Err#131 ECONNRESET
3872.6933       close(3)                                        = 0
[root@kang ~]#

When I type the message and hit enter, kang gets woken up, reads the message from stdin, and sends it over the socket. The write call completes successfully! nc goes back to poll to wait for another event, eventually finds out that the socket can be read without blocking, and then calls read. This time, read fails with ECONNRESET. What does this mean? POSIX’s definition of read(2) says this error indicates:

A read was attempted on a socket and the connection was forcibly closed by its peer.

The illumos read(2) man page provides a little more context:

              The filedes argument refers to a connection oriented
              socket and the connection was forcibly closed by the peer
              and is no longer valid.  I/O can no longer be performed to

This error doesn’t mean that something specific to the read call went wrong, but rather that the socket itself is disconnected. Most other socket operations would fail the same way at that point.

So what happened here? At the point when the nc on kang tried to send data, the TCP stack still didn’t know the connection was dead. kang sent a data packet over to kodos, which responded with an RST because it didn’t know anything about the connection. kang saw the RST and tore down its connection. It cannot close the socket file descriptor — that’s not how file descriptors work — but subsequent operations fail with ECONNRESET until nc does close the fd.
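
You can provoke the same RST-then-ECONNRESET sequence without power-cycling anything: an abortive close (SO_LINGER with a zero timeout) makes the kernel send a RST instead of the normal FIN handshake. This Python sketch over loopback is a simulation of that last step, not the actual reboot scenario:

```python
import errno
import socket
import struct

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
client = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()

# Abortive close: SO_LINGER with l_onoff=1, l_linger=0 makes close() send
# RST, much as kodos's freshly-booted TCP stack responded with RST above.
conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
conn.close()

got = None
try:
    client.recv(1024)           # blocks until the RST arrives, then fails
except ConnectionResetError as exc:
    got = exc.errno

print(got == errno.ECONNRESET)  # -> True
```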

Conclusions:

  1. A hard power-cycle is very different than a graceful shutdown. When testing distributed systems, this case needs to be specifically tested. You can’t just kill a process and expect it to test the same thing.
  2. There are situations when one side believes a TCP connection is established, but the other side does not, and where this situation will never be automatically resolved. It’s possible to manage this problem using TCP or application-level keep-alive.
  3. The only reason that kang eventually found out that the remote side had disappeared was that it took action: it sent data and received a response indicating the connection was gone.
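
The keep-alive mentioned in point 2 can be enabled per socket, so the TCP stack probes an idle connection on the application’s behalf. A sketch; the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT names are Linux spellings, and illumos spells these differently (e.g., TCP_KEEPALIVE_THRESHOLD):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific knobs: start probing after 60 idle seconds, probe every
# 10 seconds, and give up (ETIMEDOUT) after 5 unanswered probes.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
```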

That raises the question: what if kodos had not responded to the data message for some reason?

Puzzler 2: Power off

What happens if one endpoint of a TCP connection is powered off for a while? Does the other endpoint ever discover this? If so, how? And when?

Once again, I’ll set up the connection with nc:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464385399.1661  [ Fri May 27 21:43:19 UTC 2016 ]
 0.0030 bind(3, 0x08065790, 32, SOV_SOCKBSD)        = 0
 0.0031 listen(3, 1, SOV_DEFAULT)           = 0
accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) (sleeping...)
 6.5491 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)

[root@kang ~]# truss -d -t open,connect,pollsys,read,write,close nc kodos 8080
Base time stamp:  1464330881.0984  [ Fri May 27 06:34:41 UTC 2016 ]
 0.0057 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Now, I’ll cut power to kodos abruptly and attempt to send data from kang:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

114.4971        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
114.4974        read(0, "\n", 1024)                             = 1
114.4975        write(3, "\n", 1)                               = 1
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

The write completes normally, and I don’t see anything for quite a while. Just over 5 minutes later, I see this:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
425.5664        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
425.5665        read(3, 0x08043670, 1024)                       Err#145 ETIMEDOUT
425.5666        close(3)                                        = 0

This looks similar to the case where the system was power-cycled instead of powered off, except for two things: it took 5 minutes for the system to notice, and the error reported was ETIMEDOUT. Note that again, it’s not that the read timed out per se. We would have seen the same error from other socket operations, and subsequent socket operations would likely fail immediately with the same ETIMEDOUT error. That’s because it’s the socket that has entered a state where the underlying connection has timed out. The specific reason for this is that the remote side failed to acknowledge a data packet for too long — 5 minutes, as configured on this system.
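
That 5-minute window is a system tunable, not a TCP constant. On Linux, for example, the TCP_USER_TIMEOUT socket option bounds how long transmitted data may remain unacknowledged before the stack gives up and fails the connection with ETIMEDOUT. A sketch (the option name and millisecond granularity are Linux-specific; illumos tunes the equivalent behavior elsewhere):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Cap the unacknowledged-data window at 30 seconds instead of the
# multi-minute system default.
if hasattr(socket, "TCP_USER_TIMEOUT"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 30000)
```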

Conclusions:

  1. When the remote system is powered off instead of power cycled, again, the first system only finds out if it’s trying to send data. If it doesn’t send packets, it will never find out about the terminated connections.
  2. When the system has been attempting to send data for long enough without receiving acknowledgments from the remote side, the TCP connection is terminated, and subsequent socket operations fail with ETIMEDOUT.

Puzzler 3: Broken connection with no failure

This time, instead of giving you a specific case and asking what happens, I’ll flip it around: here’s an observation, and see if you can figure out how it could happen. We’ve discussed several cases where kang can believe it has a connection to kodos, but kodos doesn’t know about it. Is it possible for kang to have a connection to kodos without kodos knowing about it — indefinitely (i.e., where this will not resolve itself), even if there’s been no power off, no power cycle, no other failure of the operating system on kodos, and no failure of networking equipment?

Here’s a hint: consider the case above where a connection is stuck in ESTABLISHED. In that case, it’s fair to say that the application has the responsibility to deal with the problem, since by definition it still has the socket open and could find out by sending data, at which point the connection will eventually be terminated. But what if the application didn’t have the socket open any more?

In the Warmup, we looked at the case where kodos’s nc closed its socket, and we said that kang’s nc read 0 (indicating end-of-stream) and then closed its socket. What if it didn’t close the socket, but kept it open? Obviously, it couldn’t read from it. But there’s nothing about TCP that says you can’t send more data to the other side that has sent you a FIN. FIN only means end-of-data-stream in the direction that the FIN was sent.
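
You can see this half-open behavior directly with shutdown(2), which sends a FIN for one direction while leaving the socket open. A Python sketch over loopback, with the variables named after the two machines (this is a stand-in, not nc or dnc):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
kang = socket.create_connection(srv.getsockname())
kodos, _ = srv.accept()

kodos.shutdown(socket.SHUT_WR)  # FIN: "no more data from kodos"

data = kang.recv(1024)          # b'': end-of-stream in that direction...
kang.sendall(b"still here")     # ...but kang -> kodos still works
reply = kodos.recv(1024)
print(reply)                    # -> b'still here'
```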

In order to demonstrate this, we can’t use nc on kang because it automatically closes its socket when it reads 0. So I’ve written a demo version of nc called dnc, which simply skips this behavior. It also prints out explicitly which system calls it’s making. This will give us a chance to observe the TCP states.

First, we’ll set up the connections as usual:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464392924.7841  [ Fri May 27 23:48:44 UTC 2016 ]
 0.0028 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0028 listen(3, 1, SOV_DEFAULT)                       = 0
accept(3, 0x08047B2C, 0x08047C2C, SOV_DEFAULT, 0) (sleeping...)
 1.9356 accept(3, 0x08047B2C, 0x08047C2C, SOV_DEFAULT, 0) = 4
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

[root@kang ~]# dnc 8080
2016-05-27T08:40:02Z: establishing connection
2016-05-27T08:40:02Z: connected
2016-05-27T08:40:02Z: entering poll()

Now let’s verify the connections are in the ESTABLISHED state we expect on both sides:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                          1049792      0 1049800      0 ESTABLISHED

[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                            32806      0 1049800      0 ESTABLISHED

Now, let’s CTRL-C the nc process on kodos:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
^C[root@kodos ~]#

We immediately see this on kang:


2016-05-27T08:40:12Z: poll returned events 0x0/0x1
2016-05-27T08:40:12Z: reading from socket
2016-05-27T08:40:12Z: read end-of-stream from socket
2016-05-27T08:40:12Z: read 0 bytes from socket
2016-05-27T08:40:12Z: entering poll()

Now let’s look at the TCP connection states:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                          1049792      0 1049800      0 FIN_WAIT_2

[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                          1049792      0 1049800      0 CLOSE_WAIT

This makes sense: kodos sent a FIN to kang. FIN_WAIT_2 indicates that kodos received an ACK from kang for the FIN that it sent, and CLOSE_WAIT indicates that kang received the FIN but has not sent a FIN back. This is a perfectly valid state for a TCP connection to remain in for an indefinite period of time. Imagine kodos had sent a request to kang and was not planning to send any more requests; kang could happily send data in response for hours, and this would work. In our case, though, kodos has actually closed the socket.

Let’s wait a minute and check the TCP connection states again. A minute later, the connection is completely missing from kodos, but it’s still present on kang:


[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
                                          1049792      0 1049800      0 CLOSE_WAIT

What happened here? We hit a lesser-known special case in the TCP stack: when an application has closed a socket, the TCP stack has sent a FIN, and the remote TCP stack has acknowledged the FIN, the local TCP stack waits a fixed period of time and then closes the connection. The reason? In case the remote side has rebooted. This case is actually an analog to the case above where one side had an ESTABLISHED connection and the other doesn’t know about it. The difference is that in this specific case, the application has closed the socket, so there’s no other component that could deal with the problem. As a result, the TCP stack waits a fixed period of time and then tears down the connection (without sending anything to the other side).

Follow-up question: what happens if kang sends data to kodos at this point? Remember, kang still thinks the connection is open, but it’s been torn down on kodos.


2016-05-27T08:40:12Z: entering poll()
kodos, are you there?
2016-05-27T08:41:34Z: poll returned events 0x1/0x0
2016-05-27T08:41:34Z: reading from stdin
2016-05-27T08:41:34Z: writing 22 bytes read from stdin to socket
2016-05-27T08:41:34Z: entering poll()
2016-05-27T08:41:34Z: poll returned events 0x0/0x10
2016-05-27T08:41:34Z: reading from socket
dnc: read: Connection reset by peer

This is the same thing we saw in Puzzler 1: the write() actually succeeds, since the TCP stack doesn’t yet know that the connection is closed. But kang does get a RST, which wakes up the thread in poll(), and the subsequent read() fails with ECONNRESET.

Conclusions:

  • It’s possible for two sides to disagree about the state of the connection even when there’s been no OS, network, or hardware failure at all.
  • In a case like the one above, it’s not possible for kang to distinguish between the case where kodos is attentively waiting to receive data and the case where kodos has closed the socket and isn’t listening (at least, not without sending a packet). For this reason, it may not be a great idea to design a system that uses sockets in these half-open states for an extended period of time under normal operation.


Conclusions

TCP is typically presented as a protocol for maintaining a consistent abstraction — the “TCP connection” — across two systems. We know that in the face of certain types of network and software problems, connections will fail, but it’s not always obvious that there are cases where the abstraction fails, in that the two systems disagree about the state of the connection. Specifically:

  • It’s possible for one system to think it has a working, established TCP connection to a remote system, while the remote system knows nothing about that connection.
  • This is possible even when there has been no failure of the network, hardware, or operating system.

These behaviors do not reflect deficiencies in TCP. Quite the contrary, in all of these cases, the TCP implementation appears to behave as reasonably as it could given the situation. If we were tempted to implement our own communication mechanism instead of using TCP, the presence of these cases might well remind us how complex the underlying problems are. These are intrinsic problems with distributed systems, and a TCP connection is fundamentally a distributed system.

That said, the single most important lesson in all of this is that the notion of a TCP connection that spans multiple systems is a convenient fiction. When it goes wrong, it’s critical to think about it as two separate state machines that operate simultaneously to try to maintain a consistent view of the connection. It’s the responsibility of the application to handle the cases where these differ (often using a keep-alive mechanism).

Furthermore, there’s a disconnect between the application’s file descriptor and the underlying TCP connection. TCP connections exist (in various states related to closing) even after an application has closed the file descriptor, and a file descriptor can be open when the underlying TCP connection has been closed as a result of a failure.

Other lessons to keep in mind:

  • Ungraceful reboot of a system (as happens when the OS crashes) is not the same as a userland process exiting or closing its socket. It’s important to test this case specifically. Reboot, when the remote system comes back online before the connection times out, is also different than power-off.
  • There’s no proactive notification from the kernel when a TCP socket is torn down. You only find this out when you call read(), write(), or some other socket operation on the socket file descriptor. If your program doesn’t do this for some reason, you’ll never find out about the connection failure.

Some related notes that I’ve found aren’t so commonly known:

  • ECONNRESET is a socket error you can see from read(), write(), and other operations that indicates that the remote peer has sent a RST.
  • ETIMEDOUT is a socket error you can see from read(), write(), and other operations that indicates that some timeout associated with the connection has elapsed. The cases I’ve seen most are when the remote side did not acknowledge some packet for too long. These are usually either data packets, a FIN packet, or a KeepAlive probe.

Importantly, neither of these errors means that anything was wrong with the read or write operation that you tried to do — just that the underlying connection itself has been terminated.

Programming language debuggability

A few weeks ago I had the fortune to attend a talk by Douglas Crockford on various ideas related to JavaScript past and future and programming language design. The talk was thoughtful and well-received (and thanks to Crockford and Bigcommerce for making it happen!). The main theme, not new to those familiar with Crockford’s work, was that as programmers, we have a responsibility to create bug-free programs; and that therefore when faced with choices about which language features to use, we should always choose the safest ones — that is, we should always avoid the features that can sometimes be used unsafely when a safer feature is available.

On the surface, this seems like a no-brainer. This is the principle behind linters and other static analyzers, including Crockford’s own JSLint. (My favorite is still the less popular javascriptlint.) I try to lint just about all of the JavaScript I write, not just before committing it but each time I run it, because I so often find that even while writing small chunks of new code, I can wind up with subtle bugs that take a while to debug by hand but that lint catches quickly.

But while static analysis is enormously valuable, it’s not the only consideration for software developers. There are basically two paths for writing bug-free software:

  1. Write bug-free code the first time.
  2. Write good code, then test it, then find the bugs, and then fix the bugs.

This also sounds like a no-brainer, but it actually suggests something different from Crockford’s advice. Linters and static analysis are part of path 1: they attempt to find and fix well-known bugs before the program ever runs. That’s terrific, but it’s incomplete. Lint cannot verify that a chunk of reasonable code correctly implements what the programmer intends. There are always problems you don’t discover until you at least run the code (and in most cases, there will be problems you don’t discover until long after you’ve started running the code in production).

Path 2 accepts that many bugs won’t be found until you run the code, and for that reason making choices about code debuggability should be at least as important as choices about code safety. Some constructs are intrinsically more debuggable than others. JavaScript closures are great, but they do have a debuggability problem: even if you had a debugging oracle that you could use to show you any variable in your program while the program was running (which we basically do), how does the developer even express to the debugger what variable they’re interested in? When variables are referenced by closures and not by any globals, there’s no unique global name for it. It’s important to realize that this is an intrinsic challenge with closures in JavaScript, not a problem with some particular debugger.1 (That doesn’t mean you should never use closures, but this consideration can inform their use. For example, if there’s any state I want to be able to see in the debugger, I make sure it’s not just referenced by the closure, but is also hanging off some object I can find from the debugger.)

Another example of this came up later in Crockford’s talk, when he revealed that he no longer uses new or even Object.create, but instead defines classes using a normal function (not really a constructor) that keeps instance state in local variables and uses closures to reference that state. Here’s an example (this is my interpretation, not his code):

function mypoint(args)
{
    var x = args.x;
    var y = args.y;

    return ({
        'getX': function () { return (x); },
        'getY': function () { return (y); },
        'setX': function (newx) { return (x = newx); },
        'setY': function (newy) { return (y = newy); }
    });
}
There are advantages to this style. You can avoid all the nastiness associated with this, and you can implement strong data hiding. But it has important downsides for debuggability. First, we have the closure problem we talked about earlier: given that you’ve got your program open in the debugger, how do you find a particular Point object? Supposing you do find one, how do you see what “x” and “y” actually are?

But there’s a deeper, subtler problem. Many JavaScript programs run into issues with memory usage (or leaks), and the first piece of information you typically want to debug an issue like that is a report of how many of each kind of object you have. But assuming you had a debugger that could do that, without a constructor name and without any non-function properties, how would you group objects in a way that’s useful for the developer?

Similar to the way there are language features to prefer in the name of safety, there are patterns to prefer to improve debuggability. It helps to name all functions, for example. Even inline functions. This makes stack traces more readable, and it also enables the debugger to tell you that you’ve created 800,000 closures of function “authenticateRequest” instead of “(anon) from file server.js line 37”.

These are just basic examples. I’d love to see the issue of programming language debuggability play a more significant role in discussions about programming language design and in selecting the good parts of a language. With a programming language built for debuggability and first-class tools to take advantage of it, there’d be even more to gain from developers investing time to build, learn, and use tools to help find bugs in code while it’s running or after it’s been run. And that’s how we fulfill our obligation to build bug-free software: not by writing perfect code, but by finding and fixing the bugs.

1 We addressed this problem in MDB by letting you search for objects by property names, but it’s still far easier if you can find relevant state through a singleton global object than if you have to wade through all objects with a common property like “hostname”.

Debugging enhancements in Node 0.12

(also on the Joyent blog)


Node.js provides the richest postmortem debugging facilities of just about any dynamic language. Postmortem debugging refers to debugging programs from a core file, heap dump, or some similar dump of memory. With a core file of a Node program (which is essentially a snapshot of your program’s state), you can:

  • Find all JavaScript objects and classify them by their properties. This is very useful for identifying large memory leaks.
  • Find JavaScript objects by their constructors or properties and print them out.
  • Enhanced in 0.12: Print a stack trace of where the program was when the core file was created. (If you’re generating a core file on crash, this is the stack where the program crashed.) Unlike the stack you get when a Node program normally crashes, this stack includes function arguments. In 0.12, by default the printed stack trace is much friendlier, and it also contains the “this” argument for each function.
  • New in 0.12: Find all of the JavaScript closures that have been created. Closures can be a major source of memory leaks, so identifying which functions have created lots of closures helps find memory leaks.
  • New in 0.12: Find the chunk of native memory that’s being used for a given Node Buffer. When Buffers make up a large percentage of memory used in your program, this can help identify what they’re storing.
  • New in 0.12: Find the C pointer for a given V8 “internal” field. This is mostly interesting for add-on developers.

The postmortem approach has several advantages over other debugging techniques:

  • It doesn’t require any runtime overhead.
  • It’s comprehensive. A core file is a complete snapshot of your program. If there’s something missing, you can add it by just storing it in your program.
  • In production, when you find a program behaving badly, you can save a core file and then immediately restart the program. This often fixes the problem, restores service, and still enables you to debug the problem after the fact.
  • Because all the debugging is offline (i.e., the program is not suspended), debuggers can use sophisticated (but expensive) analysis techniques like the ones mentioned above.

You can save core files of running programs using gcore(1) on most systems, and you can configure Node to generate a core file when it crashes using the --abort-on-uncaught-exception flag. Although MDB only runs on illumos-based systems (like SmartOS), it can read core files from GNU/Linux systems as well.

Changes in v0.12

If this is the first you’ve heard of debugging Node from core files, you’ll want to check out the Node.js MDB intro. The rest of this post is just a summary of changes.

These changes are part of the debugger module that ships with Node 0.12, but you can use this module on core files from versions of Node as far back as 0.4. So you can use these features today before even upgrading your programs to use Node 0.12.

Streamlined ::jsstack

In previous versions, ::jsstack’s default output includes a lot of information that’s not relevant for JavaScript-only problems and is frequently overwhelming for new users. After these changes, it hides internal frames, frame pointers, and instruction pointers by default. It also chops long C++ symbol names to fit lines at 80 columns.

Here’s the old default output:

> ::jsstack
8043190 0xb4e1a5ee MyObject (ba780b15)
80431b4 0x94e0fba9 <ConstructFrame>
80431c8 0xb4e1096c <anonymous> (as <anon>) (ba7b3d59)
8043200 0x94e645bf <anonymous> (as EventEmitter.emit) (93f24ad9)
804321c 0x94e0f7a1 <ArgumentsAdaptorFrame>
8043240 0xb4e0f9c0 <anonymous> (as parser.onIncoming) (92609b45)
804327c 0xb4e0c9f3 parserOnHeadersComplete (ba7917e9)
804329c 0x94e0fe19 <InternalFrame>
80432d8 0x94e0a0aa <EntryFrame>
8043348 _ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPS5_Pb+0xef
8047c98 uv_run+0xea
8047d08 node::Start+0x17b
8047d28 main+0x1b
8047d40 _start+0x83

And the new default output:

> ::jsstack
js:     MyObject
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
js:     <anonymous> (as EventEmitter.emit)
        (1 internal frame elided)
js:     <anonymous> (as parser.onIncoming)
js:     parserOnHeadersComplete
        (1 internal frame elided)
        (1 internal frame elided)
native: _ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0...
native: uv_run+0xea
native: node::Start+0x17b
native: main+0x1b
native: _start+0x83

You can get the full native details with “-a” (for “all frames and information”).

this pointer and constructor names in ::jsstack -v output

::jsstack -v now includes the this pointer. Also, when it prints out object pointers (for this as well as function arguments), they include the constructor name.

The old output:

> ::jsstack -vn0
8043190 0xb4e1a5ee MyObject (ba780b15)
    file: /root/example.js
    posn: position 62
80431b4 0x94e0fba9 <ConstructFrame>
80431c8 0xb4e1096c <anonymous> (as <anon>) (ba7b3d59)
    file: /root/example.js
    arg1: ba7ceb01 (JSObject)
    arg2: ba7cf781 (JSObject)
    posn: position 595

The new output:

> ::jsstack -vn0
js:     MyObject
          file: /root/example.js
          posn: position 62
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
          file: /root/example.js
          posn: position 595
          this: ba7b3d7d (JSObject: Server)
          arg1: ba7ceb01 (JSObject: IncomingMessage)
          arg2: ba7cf781 (JSObject: ServerResponse)

New dcmd: ::jsconstructor: print constructor of a JavaScript object

Given an object, you can print the constructor that was used to create it:

> ba7ce109::jsconstructor

> ba7ce781::jsconstructor

> ba7cdb75::jsconstructor

With “-v”, the command prints out a pointer to the constructor function:

> ba7b3d7d::jsconstructor -v
Server (JSFunction: ba791bbd)

which is useful for passing to ::jssource (see next).

New dcmd: ::jssource: print source for a JavaScript function

Given a function, you can print the source code for it:

> 88a21d0d::jssource -n1
file: http.js

 1964   socket.addListener('error', function(e) {
 1965     self.emit('clientError', e, this);
 1966   });

This closure is an instance of the anonymous function defined inline.

New dcmd: ::nodebuffer: fetch native pointer for a buffer

When debugging memory leaks using ::findjsobjects, you may find that you have a ton of Buffer objects. These represent chunks of memory from the native heap. You often want to know what that memory contains. With the ::nodebuffer command, you can take one of these Buffers:

> b1a9ec5d::jsconstructor

> b1a9ec5d::jsprint
{
    "length": 1423,
    "parent": {
        "length": 8192,
        "used": 7272,
        ...
    },
    "offset": 0,
    ...
}

and get a pointer to the underlying C memory:

> b1a9ec5d::nodebuffer
8db8c08

Then you can use ::dump (for example) to see what that memory actually contains:

> 8db8c08,0t64::dump
           0 1 2 3  4 5 6 7 \/ 9 a b  c d e f  01234567v9abcdef
8db8c00:  08200000 f89f103a 00000007 6f707469  . .....:....opti
8db8c10:  6f6e7300 00000468 65616400 00000470  ons....head....p
8db8c20:  6f737400 00000370 75740000 00066465  ost....put....de
8db8c30:  6c657465 00000005 74726163 65000000  lete....trace...
8db8c40:  06616363 65707400 00000e61 63636570  .accept....accep

This often points you to a particular HTTP request handler or other code path in your program.

New dcmd: ::jsfunctions: list closures

::findjsobjects is very helpful for identifying classes of objects with many instances. Once you’ve found which objects are being leaked, you can use ::findjsobjects -r to find out which objects reference the leaked ones. Sometimes memory is not referenced by other objects, but rather by closures: JavaScript functions that are pending and may be invoked some time in the future. The new ::jsfunctions command lists all functions in your JavaScript program and how many different closures have been created for each one:

> ::jsfunctions
    FUNC   #FUNCS NAME                                     FROM
8803e63d     1667 <anonymous> (as parser.onIncoming)       http.js position 58040
880389c5      543 <anonymous> (as <anon>)                  _stream_readable.js position 26340
88038921      543 <anonymous> (as <anon>)                  _stream_writable.js position 8134
880388d5      543 <anonymous> (as cb)                      _stream_writable.js position 5923
8800be75      587 <anonymous> (as <anon>)                  /home/dap/demo/restify-ex.js position 616
8800be35     2114 _ifError                                 /home/dap/demo/node_modules/restify/lib/server.js position 2846
8800b229      691 <anonymous> (as <anon>)                  net.js position 12228
880099a1     1755 <anonymous> (as <anon>)                  http.js position 55840
8800997d     1755 <anonymous> (as <anon>)                  http.js position 54999
88009939     1696 socketOnDrain                            http.js position 57774
88009915     2234 serverSocketCloseListener                http.js position 54505
880098f1     2234 abortIncoming                            http.js position 54314
88009885     1754 g                                        events.js position 5742

You can take one of these functions and print the source. Here’s abortIncoming():

> 880098f1::jssource
file: http.js

 1907 function connectionListener(socket) {
 1908   var self = this;
 1909   var outgoing = [];
 1910   var incoming = [];
 1912   function abortIncoming() {
 1913     while (incoming.length) {
 1914       var req = incoming.shift();
 1915       req.emit('aborted');
 1916       req.emit('close');
 1917     }
 1918     // abort socket._httpMessage ?
 1919   }
 1921   function serverSocketCloseListener() {
 1922     debug('server socket close');
 1923     // mark this parser as reusable
 1924     if (this.parser)
 1925       freeParser(this.parser);

Note that it’s defined inside the connectionListener function. This explains why we have 2234 closures for it. Each time connectionListener is called, a new closure of abortIncoming is created, closing over socket, self, outgoing, and incoming.

As with ::findjsobjects, the exact counts in the ::jsfunctions list may be misleading, but large values often reflect memory leaks.

New advanced dcmd: ::v8internal: fetch internal fields

Native code that uses the V8 API (whether part of Node.js itself or part of an add-on) often uses internal fields, which are essentially slots inside an object for storing internal data. For example, Node classes like TcpWrap typically store the underlying libuv handle as internal field 0. So if you have a TCP object, then you can grab internal field 0 to get the underlying libuv handle.

Bug fixes

  • ::jsprint now quotes property names. (The current approach is simplistic, and the results are not to be confused with valid JSON, but it’s much clearer now than it was, particularly for properties with non-alphanumeric characters.)
  • Several bugs with 64-bit core files were fixed, particularly around printing out source code using “::jsstack -v” and printing two-byte strings. Both 32-bit and 64-bit core files of Node 0.10 and 0.12 are fully supported. Core files from 0.4 and 0.6 should work as well, but have not been recently tested.
  • Two-byte strings are no longer erroneously truncated.

Stay tuned

We’re working on a few more impressive tools using ::jsfunctions. Stay tuned for details!

For help

For help, file a Node issue or reach out to @dapsays, @tjfontaine, or @misterdjules on Twitter. Joyent also offers Node.js support that builds heavily on these tools.

Understanding DTrace ustack helpers

or, everything you ever wanted to know about stack traces

I promised this post over a year ago, and now that someone’s actually working on a new ustack helper, I thought it was finally time to write about what ustack helpers are, how they work, and how I went about building one for Node.js. Only a handful of ustack helpers have ever been written: Node, Java, Python, and PHP (the last of which is believed lost to the sands of time), so this post is mainly for a narrow audience of developers, plus anyone who’s interested in how this all works.

This post covers a lot of the background you need to understand the details. For more information, check out my ACM article on Postmortem Debugging in Dynamic Environments. Though this post covers dynamic tracing, the challenges are similar to those for postmortem debugging, since both involve operating on a snapshot of program state without the aid of the VM itself.

The value of stack traces

Quick review: DTrace is a facility for dynamically instrumenting all kinds of operations on a system — particularly systems in production. It’s available on OS X, illumos distributions (including SmartOS and OmniOS), Solaris 10 and later, and BSD.

The hundreds of thousands of probes on a typical system can be combined with various DTrace actions to gather incredibly specific data. Some examples:

  • When a program opens a specific file, grab a stack trace (to figure out where in the program is opening the file).
  • When a program writes a particular string to stderr, take a core dump (to debug why it hit some particular error case).
  • When any program opens a specific file, print the program name and pid (to figure out who’s accessing a file).
  • At a frequency of 97Hz, if a given process is on-CPU, grab a stack trace (to profile it, to see where it’s spending CPU time).
  • When a given syscall returns a given errno (e.g., close(2) returns EBADF), save a core file of the current process (to debug why that happened — see my previous post).
  • When malloc() returns NULL for a process, grab a stack trace (to see who’s failing to allocate memory).

This is just a sampling, of course. Among probes that work “out of the box” today are:

  • Process events: processes created, exec’d, exited; signals sent and delivered
  • System calls: every syscall entry and exit, which makes it easy to trace files opened, filesystem reads and writes, and other specific events in a process
  • Disk I/O: I/Os started, completed
  • Network: IP and TCP events (packets received and dropped, state transitions)
  • Virtual memory: pageout, pagein events
  • Nearly any function entry and exit in the kernel
  • Nearly any native function entry and function exit in any userland process
  • Nearly any instruction in any userland process
  • Apache: all kinds of server events
  • Node.js: HTTP request received, response sent, garbage collection start/done, and so on
  • Postgres: all kinds of server events
  • Java, Perl, Python, and Erlang: various runtime operations (often including function entry/exit)

With a tiny amount of work, you can also add your own probes to Node.js, Lua, Ruby, and Perl.

With the ability to grab a stack trace when any of these events happens, you can analyze performance (e.g., profiling on-CPU time) or debug a particular problem (e.g., “why does my Node process keep calling gettimeofday?”). But while DTrace users just call ustack() to get a stack trace, under the hood the process of recording a stack trace at an arbitrary point in a process running in production is deceptively tricky, and that’s what this post is about.

Aside: x86 stacks

What exactly is a stack trace? And how are debugging tools like DTrace able to print them out?1

Within a process, each thread has a stack, which keeps track of functions called, their arguments, and their local variables. On Intel systems (including x86 and amd64), there are two important pieces of state related to the stack:

  • The stack pointer (register %esp (32-bit) or %rsp (64-bit)) points to the next byte of free memory on the stack.
  • The frame pointer (or base pointer, register %ebp (32-bit) or %rbp (64-bit)) points to the first address in the current stack frame. This value in turn usually points to the top of the previous frame, so the frame pointer is essentially the head of a linked list of stack frames.2

There’s also an instruction pointer (register %eip (32-bit) or %rip (64-bit)) that points to the currently executing instruction in memory.

When one function calls another function, here’s what happens with the stack:

  • The call instruction in the parent function pushes the current value of the instruction pointer (register %eip/%rip) onto the stack, then jumps to the first instruction in the called function.
  • The first step inside the called function is typically to push the current value of the frame pointer onto the stack, and to copy the current stack pointer into the frame pointer register. The called function then executes. It may use more stack space for variables and to call other functions, but when it’s ready to return to the caller, the stack is in the same state as when the function started.
  • When ready to return, the function pops the top value of the stack into the frame pointer register. The ret instruction pops the new top of the stack into the instruction pointer, causing control to jump back to the calling function.

This is what the whole thing looks like:

Stack diagram

If this is new to you, it’s worth noting how deep this is: this is how control flows between functions in native programs. There’s no magic: the notion of functions in C essentially falls out of a stack and a few basic instructions.

You can see this pattern by disassembling any native function. Here’s an example, looking at the code for fork in libc (the userland function, which calls forkx() to do most of the work and then invokes the fork system call):

$ mdb -p $$
Loading modules: [ ]
> fork::dis
libc.so.1`fork:                 pushl  %ebp
libc.so.1`fork+1:               movl   %esp,%ebp
...
libc.so.1`fork+0x19:            call   -0x246   <libc.so.1`forkx>
...
libc.so.1`fork+0x22:            popl   %ebp
libc.so.1`fork+0x23:            ret

The result of these stack manipulations is that at any given time, the frame pointer register points to the head of a linked list that has one entry for every function on the call stack, up to the top of the stack. The frame pointers pushed onto the stack in each function represent the “next” pointers of the linked list, and the return address pointers pushed by the call instructions denote the address that called the next function.

If you had a snapshot of a process memory state (i.e., a core dump), you can imagine a simple algorithm for constructing a stack trace:

callstack = [ current_%eip ];
frameptr = current_%ebp;

while (frameptr != NULL) {
    callstack.push(value adjacent to frameptr);
    frameptr = *frameptr;
}
In other words, the top frame is denoted by the current instruction pointer. Then we start with the current frame pointer register, and follow the linked list of frame pointers until we get to the top of the stack. Along the way, we record the instruction pointer that was saved on the stack.

There’s one critical step left. This algorithm gives us a bunch of instruction addresses — but we want human-readable function names, not memory addresses. In a traditional debugger, a memory address is pretty easy to translate to a function name, because the symbol table that’s in the process (or core file) necessarily includes the address of each function and the size of each function. In MDB, you can see this with the “::nm” dcmd:

> ::nm ! grep -C2 -w fork
0xfee02581|0x00000094|FUNC |GLOB |0x3  |15      |execlp
0xfedece61|0x0000015b|FUNC |GLOB |0x3  |15      |_D_cplx_div_ix
0xfee76395|0x00000024|FUNC |GLOB |0x3  |15      |fork
0xfee1b6f3|0x00000057|FUNC |GLOB |0x3  |15      |nss_endent
0xfedef150|0x00000019|FUNC |GLOB |0x3  |15      |atomic_or_ushort_nv

In this case, the fork function is stored at address 0xfee76395 and is 0x24 bytes long. From this, the debugger knows that when the instruction pointer is 0xfee763b3, that’s inside the “fork” function, and is more conveniently printed as fork+0x1e (instruction offset “0x1e” inside the “fork” function).

As I mentioned above, this all sounds highly specific to native code, but as we’ll see, dynamic environments like Node.js do basically the same thing, with some twists that we’ll get to later.

Native stack traces in DTrace

It’s well-understood how to grab stack traces from a debugger, but saving a stack trace in the context of a DTrace probe is a bit more complicated. Recall that DTrace records and buffers data in the kernel for later consumption by the “dtrace” command. This buffering decouples the data source (events of interest) from the consumer (the “dtrace” command), which allows the system to perform much better than traditional tools like “strace” or “truss”, which actually stop the target process for long enough for the tool itself to step in and record data. This decoupling also allows DTrace to instrument extremely delicate contexts, including the kernel’s pagefault handler and other critical interrupt handlers. It wouldn’t be possible to instrument these contexts if the instrumentation itself had to wait for the “dtrace” process to step in and execute.

It might seem like these contexts are uninteresting to application developers, but they’re actually quite relevant for a number of use cases:

  • Profiling an application involves sampling stacks based on some timer — an interrupt context.
  • To see what processes are suffering the most from having been paged out, a useful trick is to instrument the VM’s pagein probe, since these events represent synchronous delays of potentially many milliseconds.
  • For Node especially, it’s often interesting to know when your process comes off-CPU, and what it’s doing that caused the kernel to take it off-CPU. That’s a one-liner — but requires instrumenting the scheduler.

In order to support instrumenting these contexts, the actions that you’re allowed to take in a probe are pretty constrained. You can’t do anything that requires servicing a pagefault, for example, or enter an infinite loop. And since DTrace can’t know what loops will be infinite, it doesn’t allow explicit loops at all.

These constraints make saving a userland stack trace from DTrace tricky. Walking the stack as we described above almost always works, because the stack memory is rarely paged out, but that whole bit about accessing the process’s symbol table to convert memory addresses to their human-readable names? That can easily involve a bunch of operations that are at the very least expensive, and at worst impossible in probe context (because the pages may not be available).

To work around this problem, DTrace defers the resolution of symbol names until the data is consumed. That is, when a probe fires and you’ve asked DTrace to print out a stack trace, all that’s done immediately is to record the process id and the list of memory addresses that make up the call stack. When your dtrace process in userland consumes the data, it uses system debugger interfaces to look up the addresses in the process that was traced and translate them to the appropriate function name as described above.3

What about dynamic environments?

All this works well for native code, but what about dynamic environments? These can vary wildly. I’m going to focus on Node.js, since that’s the one I’m very familiar with.

Node.js is based on V8, the JavaScript VM that also powers Google’s Chrome browser. With V8, the basic process for walking the stack is the same as for a native program (walking frame pointers, recording the adjacent instruction pointers), but there’s a huge problem when it comes to resolving instruction pointer addresses to human-readable names: the compiled code for JavaScript functions doesn’t correspond to symbols in the process’s symbol table!

For a native process, we knew 0xfee76395 was inside the fork function because the process has a symbol table (built when the program was compiled, linked, and loaded) that says that fork starts at 0xfee76395 and is 0x24 bytes long. But in a JavaScript program, we may have a function at address 0x8004039, and that address doesn’t correspond to anything in the process’s symbol table. That’s because that function didn’t exist when the program started: it was dynamically created when some JavaScript code used the function keyword to define a new function, and V8 compiled that to native code and stored those instructions in the heap. There’s no way for a native code debugger to “know” that this corresponds to, say, the fs.readFile JavaScript function.

Suppose you had some way to ask V8: what’s the name of the JavaScript function at 0x8004039? DTrace could do the same thing it does for native stack traces, which is to record just the addresses and resolve these names later, right? Unfortunately, that doesn’t work for dynamic environments because functions themselves are stored on the runtime heap and can actually move around during execution as a result of garbage collection or reoptimization. So the function at 0x8004039 may no longer be at 0x8004039 when the “dtrace” command gets around to resolving the name. We have to resolve the name when we actually record the stack trace.

Enter ustack helpers

So we have these constraints:

  • We must record the stack trace inside the kernel, using only operations that are safe to execute in a DTrace probe.
  • We must resolve the symbol names when we record the stack trace — again, in the kernel, using only safe operations.
  • The process of resolving symbol names is totally VM-dependent, and like good software engineers, we don’t want to encode VM-internal details in some other component (like the OS kernel).

These constraints essentially define the solution: VM implementors write a chunk of code in D that knows how to translate a (frame pointer, instruction pointer) pair into a human-readable function name. The code is safe by virtue of being in D, which can only express operations that are safe in DTrace probes. That code (the helper) gets glued onto the binary during the build process and loaded into the kernel when the VM process starts up. When a user needs to get a stack trace from that process, the kernel executes the helper to resolve the function names.

The helper translates a frame pointer and instruction pointer into a human-readable function name. In the above example, it translates 0x8004039 to “fs.readFile”. (Since JavaScript function names are not unique, the Node ustack helper actually translates it to something more complete, like “(anon) as fs.readFile at fs.js line 123”.)

The guts of a ustack helper are totally dependent on the dynamic environment it’s targeted at. The V8 helper uses the fact that when V8 generates the instructions to call a JavaScript function that’s been compiled to native code, it pushes onto the stack a pointer to a C++ object that it uses internally to keep track of the function being called. From that object, we can follow pointers to get the name of the function, the location where it was defined, and so on.

So you want to build a ustack helper

As I mentioned above, a few ustack helpers have been written: for Java, for Python, and for Node.js. I know of two more that people have expressed interest in developing: Erlang and Ruby. In general, I’d break the process down into a few steps.

  1. Collect some example core files from known programs, where you know what the correct stack trace should look like.
  2. Study the source code and the core files until you can construct the correct stack trace by hand from the core file. That is, it’s not necessarily an automated procedure, but you can at least identify the right frames and for each one, a set of manual steps that will get you to the function names.
  3. Automate printing of the stack trace, based on the manual algorithm you came up with in step 2.
  4. Implement (3) in a D script: that will become the ustack helper.

Step 1: collect some example cores

The process of figuring out how a dynamic environment lays out its stack can be grueling and time-consuming. To explore at your own pace, it’s crucial to have a core file from a known program, where you know the correct stack trace. When I started, I spent enough time reading the V8 source to discover that there were a few different types of frames in V8, including regular JavaScript frames, constructor frames, argument adaptor frames, and a few others. Then I wrote a fairly simple Node program that exercised a bunch of these cases: from the top-level, it calls a regular top-level function, which calls a constructor, which calls another constructor, a method (defined as an anonymous function), and so on, until the last function goes into an infinite loop. That way, once I started the program, I could use gcore(1M) to create a core file from the running program. The result is that I had a memory snapshot I could play with in the debugger that had most types of frames I would care about. I could play around with this at my leisure. Later I would profile the same program to test the ustack helper.

Step 2: figure out how to manually produce a stack trace

This is where you’ll spend much of your time. The difficulty here depends a lot on the quality of the runtime’s source code and its documentation, and on how complex the environment is. For some environments (like Perl), it may be impossible to write a ustack helper, at least without new DTrace features: ustack helpers assume at the most fundamental level that stacks are laid out just as they are in native code, using frame pointers and instruction pointers, and there’s nothing that says a runtime environment has to actually do it that way.

For V8, the basic process was simple, though it still took a while to work out. I started with the code that V8 itself uses when constructing a stack trace, as when you print an Error’s stack trace. I studied it for a while, took a lot of notes, and tried to replicate the algorithm by hand from the core file.

I strongly recommend building tools to help yourself. I used MDB, specifically because it makes it easy to write new commands in C. I quickly wrote a bunch of commands to tell me, for a given memory address, what I was looking at. This was critical: long before I was able to print a stack trace, I had learned that the first step was to print out the “Function” object that V8 stores on the stack; that that object refers to a SharedFunctionInfo object that includes the name; and that that points to a Script object that includes the script name where the function was defined. The function and script names are stored as Strings, which may be AsciiStrings or ConsStrings. So the first thing I did was to write debugger commands that could identify what kind of object I was looking at. This became the “::v8type” MDB command:

> a7790941::v8type
0xa7790941: JSFunction

Then I wrote commands to print out the C++ objects so I could inspect them. This became the “::v8print” MDB command:

> 0xa7790941::v8print
a7790941 JSFunction {
    a7790941 JSObject {
        a7790941 JSReceiver {
            a7790941 HeapObject < Object  {
                a7790940 map = 9f009749 (Map)
            }
        }
        a7790944 properties = a26080a1 (FixedArray)
        a7790948 elements = a26080a1 (FixedArray)
    }
    a7790950 prototype_or_initial_map = ba9080a1 (Oddball: "hole")
    a7790954 shared = a777b6f5 (SharedFunctionInfo)
    a779095c literals_or_bindings = a7790965 (FixedArray)
    a7790960 next_function_link = ba9299a1 (JSFunction)
}

Then I wrote commands for decoding the string objects as strings. This became “::v8str”:

> 0xa7790f81::v8str

> 0xa7790f81::v8str -v
    ptr1: a2615b7d
    ptr2: 892ef331
        length: 11 chars (11 bytes), will read 11 bytes from offset 0
        length: 6 chars (6 bytes), will read 6 bytes from offset 0

It sounds like a lot of work up front, but it paid off big when I could poke around much more easily: I could start with a pointer from the stack that I thought should be a Function object, and explore what information it pointed to. For example, if I have this frame from the native stack (which you can get with $C in MDB):

Frame ptr   Instruction ptr
0804792c    0x7560e19a

I discovered from the source that it looked like a JSFunction object was pushed two words below the frame pointer, so I checked that out:

> 0804792c-0x8/p
0x8047924:      0xa7737d65      

> 0xa7737d65::v8type
0xa7737d65: JSFunction

> 0xa7737d65::v8print
a7737d65 JSFunction {
    a7737d65 JSObject {
        a7737d65 JSReceiver {
            a7737d65 HeapObject < Object  {
                a7737d64 map = 9f009749 (Map)
            }
        }
        a7737d68 properties = a26080a1 (FixedArray)
        a7737d6c elements = a26080a1 (FixedArray)
    }
    a7737d74 prototype_or_initial_map = ba9080a1 (Oddball: "hole")
    a7737d78 shared = ba941471 (SharedFunctionInfo)
    a7737d80 literals_or_bindings = a26080a1 (FixedArray)

and so on.

Besides the ability to explore more easily, with not much more work, I wrote a few commands to print the V8 representations of objects, arrays, and so on as their JavaScript values — which gave me postmortem debugging for JavaScript as well. This became “::jsprint”:

> 1f712ffd5601::jsprint
{
    protocol: "http:",
    slashes: true,
    auth: null,
    host: "",
    port: null,
    hostname: "",
    hash: null,
    search: null,
    query: null,
    pathname: "/episodes/3F02.html",
    path: "/episodes/3F02.html",
    href: "",
}

This brings up a related point: writing a ustack helper is grueling, and I found that it never paid to take shortcuts. I’d rather make only a little progress each day, knowing what each step was doing, than try to keep it all in my head and wind up confused about why things didn’t work as expected.

Back to the task at hand. Since you wrote the example program, you know what the stack trace in the core file should look like. (You can even have the program print the stack trace using the runtime’s built-in mechanism — in JavaScript, this would be console.log(new Error().stack).) The first milestone will be when you can construct that stack trace by hand from the core file. That is, when you can look at the frame pointer in %ebp, follow that to the other frames, and for each one, find the right pointers to follow to get to the name of the function at that frame.

Step 3: automate printing a stack trace

The next step is to automate that process of printing out the stack trace. As with the rest of this project, I’d strongly recommend building this incrementally. First build commands that can print out a frame pointer with a useful description of the function it references:

> 0804792c::jsframe -v
8047940 0x756266db _tickFromSpinner (a7737dad)
    file: node.js
    posn: position 13051

and then implement something that walks the stack and labels each frame. You could skip the first step, but you need to automate this procedure to build the ustack helper anyway, and it will help significantly to iron out the details in a traditional language like C, which is easier to write and has rich facilities for debugging.

Once this works reliably, create some more example programs and test it on those. Create core files from production processes and test it on those, too. You may find that there were some cases you missed in your example programs.

Step 4: build the ustack helper

Once you have an automated command that reliably prints a stack trace for an arbitrary core file, you’ve got to implement that same logic in D.

The basic idea is that you define a D program with pseudo-probes called “dtrace:helper:ustack:”. The interface is:

  • The entire script is executed for each frame in the stack. You’re not responsible for walking the stack; you’ve just got to translate the current frame to a string.
  • arg0 and arg1 are the current frame’s instruction pointer and frame pointer, respectively. This is all the context you have.
  • The last value in the last clause should be an ASCII string describing the frame. In practice, this is usually something you’ve allocated with the DTrace alloca() subroutine and then filled in yourself.
  • If a frame’s label starts with “@”, the string is considered an annotation, rather than a replacement for the name DTrace would have used. For JIT’d environments like V8, this isn’t useful, because the name DTrace would have used is just a memory address that’s likely not meaningful to anybody. For environments like Python, though, the original name might be “py_val”, which might be useful to a VM developer.

There are some patterns that have emerged in the existing ustack helpers:

  • Use “this” variables to store state. These are clause-local: they’ll persist through the execution of the script on a single frame. These aren’t initialized by default, so you’ll want to clear these at the start to avoid inadvertently picking up values from previous invocations.
  • At the beginning, allocate a fixed-size string to store your result. I kept track of this as this->buf. You’ll also want to keep track of your current offset in the string, as this->off.
  • It’s helpful to have macros like APPEND_CHR(some_character), which is usually just:
    #define APPEND_CHR(c) (this->buf[this->off++] = (c))
    Then build up macros like APPEND_NUM (for decimal numbers) and APPEND_PTR (for pointer values). See the V8 ustack helper for details.
  • When done, set "this->done = 1". All of your clauses after the first should predicate /!this->done/.

The V8 ustack helper built up several more complicated layers of APPEND macros for the various kinds of strings in V8.

The rest is entirely runtime-specific, so all I can offer are some general tips:

  • Start with a trivial ustack helper and integrate that into your runtime’s build system. Here’s the simplest example I came up with for the simplest possible program. It prints out “hiworld” for each frame.
  • If there’s an error in your D script, DTrace will completely ignore the translation for that frame. This is very hard to debug, so I strongly recommend an incremental approach. If I got really lost, I would set this->done in some early clause, check whether I got output, and then move that assignment later and later until I found the clause that was failing.
  • As you build up the helper, use the output string to debug. DTrace does have a helper tracing mechanism that’s beyond the scope of this blog post, but it’s rather low-level. I found it easier to use printf-style debugging: inserting debug statements directly into the output string, so they’d show up in the DTrace output. So I’d first print out the address of some pointer, then I’d try decoding it as a string, and so on. Since you’ve already integrated the helper into the program’s build system, you can iterate pretty quickly.

You can also consider writing a plain old D script to iterate on most of the logic for the helper. The downside is that once you get it working perfectly, if there’s an error after you translate the script into a helper, it’ll be hard to track down where the error was. I usually found it easier to develop the helper itself.


There’s no doubt this is all a lot of work, but the resulting observability has proven tremendously valuable for our work with Node.js. We use the ustack helper primarily to profile Node programs, but also to debug them (i.e., to find out what part of a program is responsible for some other system event), and we use it both in development and in production.

Besides that, the result of this project was not just the ustack helper, but a complete postmortem debugging environment for Node programs. We configure most of our services to dump core automatically when an uncaught exception is thrown, and we’ve root-caused dozens of nasty bugs (including several in Node core) from the core files.


If you’ve used ustack helpers at all before, you’ve probably already discovered that they don’t work on OS X. I’m told that the way to get this changed is to contact your Apple developer liaison (if you’re lucky enough to have one) or file a bug report with Apple. I’d suggest referencing existing bugs 5273057 and 11206497. I’m told that more bugs filed (even if closed as dups) show more interest and make it more likely Apple will choose to fix this.

That’s all I’ve got. If you’ve got questions, your best bet is the dtrace-discuss list. I follow that, as do many others working with DTrace. You can also comment below or tweet me @dapsays.

Many thanks to Bryan, who implemented support for ustack helpers, encouraged me to take on the V8 ustack helper project, and helped enormously along the way.


1 This section looks specific to C/C++, but the details are largely the same in dynamic environments like Java, Node.js, and Python.

2 I say “typically” because it’s possible to disable this behavior at compile-time with gcc. This is ostensibly done for the performance improvement of having an additional register available and to avoid pushing and popping the frame pointer, though reports of performance improvements never appear to be based on hard data, and it’s highly doubtful that the effect would be measurable on most programs. On the other hand, the resulting breakage prevents traditional debuggers and DTrace from grabbing stack traces on such programs. (Even if one could show an actual performance benefit, it’s important to consider that a few percent performance improvement likely does not justify giving up the ability to observe the performance of an application in production, since that precludes many other future improvements.)

3 This leads to the slightly annoying behavior that if you try to trace callstacks from a short-lived process, the process may be gone by the time the userland dtrace process goes to resolve the function names, and you wind up with only the memory addresses. You can work around this by using “-c” to start the process or “-p” to attach to an existing one. Either causes DTrace to hold the process after it exits, so that it sticks around until DTrace gets what it needs from it. While slightly annoying, it’s a small price to pay for instrumenting arbitrary contexts in production.

Tracing Node.js add-on latency

Node.js has some built-in DTrace probes for looking at HTTP request start/done and GC start/done, and I wrote nhttpsnoop to report latency based on these probes. There’s also a future project that will make it easy to trace any asynchronous operations. But in the meantime, there’s one category of latency that’s missing, which is latency of asynchronous operations kicked off by add-ons.

As an example, I’ll use node-sqlite3. Below, I have sqlite-demo.js, which is a version of the demo program that’s been modified to run in a loop where it opens the database, inserts some records, selects some records, and closes the database. As this program runs, the add-on kicks off async operations to open the database, run queries, and close the database. We’re going to trace those operations to see how long each type of operation takes.

Below that, I have trace-addon-latency.d, a D script that traces the libuv uv_queue_work function as well as all functions in all add-ons in the target Node process. This is a little heavy-handed for what we’re trying to do, and can have a noticeable impact on program performance, depending on the add-on. For many of the small add-ons I use, the impact is negligible.

By running the program and the D script for a few minutes, then stopping the D script, you get output that looks like this:

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@                        6018
            8192 |@@@@@                                    1590
           16384 |@                                        302
           32768 |@@@@@@@@@@@@@@@@@                        5722
           65536 |                                         156
          131072 |                                         0

There’s a histogram like this for each “kind” of asynchronous operation that each add-on kicks off. By “kind”, I’m referring to the C++ function that kicks off the operation. In this case, Work_BeginExec kicked off 6018 operations that took between 4us and 8us.

This is a quick way to get a look at what async operations your Node program is doing (from add-ons) and how long they’re taking. This works on OS X and SmartOS, and should work on BSD and any other system with DTrace.

Full steps to reproduce

$ mkdir demo
$ cd demo
$ npm install sqlite3
$ curl -O
$ curl -O
$ node sqlite-demo.js &
$ dtrace -s trace-addon-latency.d -p $!     # you may need "sudo" here, particularly on OS X

Let that run for a little while, then kill either one of the “node” or “dtrace” processes. (Killing the “node” one causes “dtrace” to exit, since it was attached to it.) That will cause “dtrace” to exit and print the results.


sqlite-demo.js:

var sqlite3 = require('sqlite3').verbose();

function go() {
    var db = new sqlite3.Database(':memory:');
    db.serialize(function () {
        db.run("CREATE TABLE lorem (info TEXT)");
        var stmt = db.prepare("INSERT INTO lorem VALUES (?)");
        for (var i = 0; i < 10; i++)
            stmt.run("Ipsum " + i);
        stmt.finalize();
        db.exec("SELECT rowid AS id, info FROM lorem", function () {
            db.close(go);   // close the database, then start over
        });
    });
}

go();

trace-addon-latency.d:

#!/usr/sbin/dtrace -s

/*
 * trace-addon-latency.d: Run this script as:
 *
 *     dtrace -s trace-addon-latency.d -p PID
 *
 * to trace the latency of asynchronous operations kicked off by all add-ons in
 * the node process PID.  NOTE: this will instrument all native functions in all
 * add-ons, which can have a noticeable performance impact, depending on the
 * add-on.
 */

pid$target::uv_queue_work:entry
{
    callers[arg1] = ucaller;
    donefunc[arg1] = arg3;
    starttime[arg1] = timestamp;
}

/* Fires when the saved completion callback is entered with our request. */
pid$target:*.node::entry
/donefunc[arg0] == (uint64_t)uregs[R_PC]/
{
    @[ufunc(callers[arg0])] = quantize(timestamp - starttime[arg0]);
    callers[arg0] = 0;
    starttime[arg0] = 0;
    donefunc[arg0] = 0;
}


Full output from one run:

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         167
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@                8529
           16384 |@@@@@@@                                  2326
           32768 |@@@@@@@@                                 2740
           65536 |                                         26
          131072 |                                         0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@                        6018
            8192 |@@@@@                                    1590
           16384 |@                                        302
           32768 |@@@@@@@@@@@@@@@@@                        5722
           65536 |                                         156
          131072 |                                         0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        286
            8192 |@@@@@@@@@@@@@@@@@@                       6064
           16384 |@@                                       642
           32768 |@@@@@@@@@@@@@@@@@@@                      6701
           65536 |                                         94
          131072 |                                         1
          262144 |                                         0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@                                    3604
            8192 |@@@@@@@@@@@@                             8376
           16384 |@@                                       1341
           32768 |@@@@@@@@@@@@@@@@                         10831
           65536 |@@@@@                                    3423
          131072 |                                         2
          262144 |                                         0

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@                  87884
            8192 |@@@@@                                    19478
           16384 |@@                                       8692
           32768 |@@@@@@@@                                 32160
           65536 |@                                        3455
          131072 |                                         0