Stopping a broken program in its tracks

Last week I debugged a Node issue where under some conditions, fork() failure could result in some file descriptors being closed, leading shortly to bedlam. In the best cases, the program would immediately crash, but often it would just do the wrong thing. This happens a lot, of course, and in many ways it’s worse than a crash because you have no core dump nor even a stack trace to go on.

In my write-up of the bug, I glossed over a key step to debugging the problem:

"Using DTrace and reproducing on SmartOS, I stopped the
process at the moment close() returned EBADF, and caught
this stack trace ..."

I want to explain this technique in more detail because it’s very useful for getting a stack trace or core dump when a program starts going south, even if the program isn’t actually going to crash. (This technique isn’t specific to Node.js or SmartOS, either, though it’s easier with Node.js and on SmartOS.)

In this case, I was debugging a production problem that I reproduced in development, which made things much easier. I was free to kill the process to create a core dump without worrying about disrupting any users. Since the Node program was blowing up on a “close, EBADF” error, I used this DTrace one-liner to make the program dump core precisely when that happened (instead of barreling on with the wrong file descriptor):

# dtrace -w -n 'syscall::close:return
  /execname == "node" && arg1 == -1 && errno == EBADF/
  { trace(pid); raise(SIGABRT); exit(0); }'

Taking apart the script:

- The -w flag enables "destructive" actions — actions that change the traced processes rather than just observing them. raise() is one of these.
- syscall::close:return fires every time a close(2) call returns, in any process.
- The predicate /execname == "node" && arg1 == -1 && errno == EBADF/ narrows that to Node processes where close() just failed with EBADF. (In a return probe, arg1 is the system call's return value.)
- The action traces the pid (so we know which process we caught), sends that process a SIGABRT with raise(), and then exits the dtrace command itself with exit(0).

The default behavior on receiving SIGABRT is to dump core immediately (see signal.h). It's possible to alter this behavior, but very few programs do, so this technique almost always works.

Once I had a core file, I fired up MDB to get a JavaScript stack trace, and from there it was easy to figure out what had gone wrong. For details, see the Node issue.
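For the curious, that MDB step looks roughly like this (a sketch — the core file name varies, and ::jsstack comes from the mdb_v8 debugger module that ships with SmartOS):

    # mdb core.12345
    > ::load v8
    > ::jsstack
    ...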

What about production?

I mentioned that in this case, I’d reproduced the problem in development, so I was free to kill the process. What if (as so often happens) the first time we saw this problem was in production, and we wanted to get a core dump at exactly the right time, but without causing the process to crash? That’s not actually much harder, at least on systems with a program like gcore(1):

# dtrace -w -n 'syscall::close:return
  /execname == "node" && arg1 == -1 && errno == EBADF/
  { trace(pid); stop(); system("gcore %d; prun %d", pid, pid); exit(0); }'

This one is very similar to the previous script, except that instead of raising SIGABRT, we use stop() to stop the process in its tracks, and asynchronously run gcore PID (to save the core dump) followed by prun PID (to start the process running again). This is still technically destructive, because we’re stopping the process for however long it takes to save the core dump, but in most cases this isn’t too long and the disruption is insignificant.
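Outside of DTrace, the same stop/save/resume pattern can be approximated with plain job-control signals. Here's a rough sketch for Linux (assumptions: /proc is mounted, and gdb's gcore(1) is installed if you want the core image; a sleep process stands in for the real service):

```shell
#!/bin/sh
# Approximate DTrace's stop()/gcore/prun sequence with signals.
sleep 60 &
pid=$!

kill -STOP "$pid"                 # stop the process in its tracks
sleep 1                           # give the kernel a moment
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "state: $state"              # 'T' means stopped

# Save a core image of the stopped process, if gdb's gcore is available.
gcore -o /tmp/core "$pid" 2>/dev/null || true

kill -CONT "$pid"                 # resume it, as prun(1) would
kill "$pid" 2>/dev/null; wait "$pid" 2>/dev/null || true
```

The window of disruption is the same as in the DTrace version: the process is stopped only for as long as it takes to write the core image.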

To recap: you can use DTrace for more than just data collection, including debugging non-fatal buggy program behavior. DTrace can instrument very specific events on the system (like “the close syscall in a Node process returned -1 with EBADF” or any application-level event), and you can use that to take action like stopping the process to take a core dump and then resuming it again. Putting these pieces together makes it possible to gather all kinds of debugging information from a program that’s behaving badly.

Posted on October 14, 2013 at 7:58 am by dap · Permalink
In: DTrace, SmartOS