When magic collides

We had an interesting issue come up the other day, one that ended up being a pretty nasty bug in DTrace. It merits a detailed explanation, but first, a cautionary note: we’re headed into rough country; in the immortal words of Sixth Edition, you are not expected to understand this.

The problem we noticed was this: when running node and instrumenting certain hot functions (memcpy:entry was noted in particular), after some period of activity, the node process would die on a strange seg fault. And looking at the core dumps, we were (usually) lost in some code path in memcpy() with registers or a stack that were internally inconsistent. In debugging this, it was noted (with DTrace, natch) that the node process was under somewhat intense signal processing activity (we’ll get to why in a bit), and that (in particular) the instrumented function activity and the signal activity seemed to be interacting “shortly” before the fatal failure.

Before wading any deeper, it’s worth taking an aside to explain how we deal with signals in DTrace — a discussion that in turn requires some background on the way loss-less user-level tracing works in DTrace. One of the important constraints that we placed upon ourselves (albeit one that we didn’t talk about too much) is that DTrace is loss-less: if multiple threads on multiple CPUs hit the same point of instrumentation, no thread will see an uninstrumented code path. This may seem obvious, but it’s not the way traditional debuggers work: normally, a debugger stops a process (and all of its threads) when it hits a breakpoint — which it induces by modifying program text with a synchronous trap instruction (e.g., an “int 3” in x86). Then, the program text is modified to contain the original instruction, the thread that hit the breakpoint (and only that thread) is single-stepped (often using hardware support) over the original instruction, and the original instruction is again replaced with a breakpoint instruction. We felt this to be unacceptable for DTrace, and imposed a constraint that tracing always be loss-less.
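As a toy illustration of that lossy dance, here is a sketch in C over a modeled text segment (the names and structure are mine, not any real debugger's):

```c
#include <stddef.h>
#include <stdint.h>

#define INT3 0xcc  /* x86 breakpoint opcode */

typedef struct {
    uint8_t *text;  /* modeled program text */
    size_t pc;      /* breakpoint location */
    uint8_t saved;  /* original byte under the breakpoint */
} bp_t;

/* Arm the breakpoint: save the original byte, patch in the trap. */
static void
bp_arm(bp_t *bp, uint8_t *text, size_t pc)
{
    bp->text = text;
    bp->pc = pc;
    bp->saved = text[pc];
    text[pc] = INT3;
}

/* On a hit: restore, single-step, re-arm.  Between the two stores,
 * any other thread running through pc executes the original,
 * uninstrumented instruction -- exactly the lossy window that the
 * DTrace constraint disallows. */
static void
bp_hit(bp_t *bp)
{
    bp->text[bp->pc] = bp->saved;   /* window opens */
    /* ... hardware single-step over the original instruction ... */
    bp->text[bp->pc] = INT3;        /* window closes */
}
```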

For user-level instrumentation (that is, the pid provider), magic is required: we need to execute the instrumented instruction out-of-line. For an overview of the mechanism for this, see Adam’s block comment in fasttrap_isa.c. The summary is that we have a per-thread region of memory at user-level to which we copy both the instrumented instruction and a control transfer instruction that gets us back past the point of instrumentation. Then, when we hit the probe (and after we’ve done in-kernel probe processing), we point the instruction pointer into the per-thread region and return to user-level; back in user-land, we execute the instruction and transfer control back to the rest of the program.
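To make the geometry concrete, here is a sketch of that scratch space (illustrative names and hypothetical helpers; the authoritative description is the block comment in fasttrap_isa.c):

```c
#include <stdint.h>

/*
 * Per-thread scratch space for the normal (signal-free) path, where
 * isz is the size of the instrumented instruction and jmpsz the size
 * of the control-transfer instruction:
 *
 *   scrpc            scrpc + isz
 *   +----------------+---------------------------+
 *   | original       | jmp back past the point   |
 *   | instruction    | of instrumentation        |
 *   +----------------+---------------------------+
 */
static uint64_t
tramp_jmp_pc(uint64_t scrpc, uint64_t isz)
{
    return (scrpc + isz);           /* where the control transfer begins */
}

static uint64_t
tramp_end(uint64_t scrpc, uint64_t isz, uint64_t jmpsz)
{
    return (scrpc + isz + jmpsz);   /* first byte past the trampoline */
}
```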

Okay, so easy, right? Well, not so fast — enter signals: if we take a signal while we’re executing this little user-level trampoline, we could corrupt program state. At first it may not be obvious how: if we take a signal before we execute the instruction in the trampoline, we will simply run the signal handler (whatever it may be), return to the trampoline, finish it off, and then get back to the business of executing the program. Where’s the problem? The problem is the signal handler itself executing something that is instrumented with DTrace: if it does, we’ll hit another probe, and that probe will copy its instruction to the same trampoline. This is bad: it means that upon return from the signal, we will return to the trampoline and execute the wrong instruction. Only one wrong instruction, of course. Which, if you’re lucky, will induce an immediate crash. But more likely, the program will execute tens, hundreds or even thousands of instructions before dying on some “impossible” problem. Sound nasty to debug? It is — and next time you see Adam, show some respect.

So how do we solve this problem? For this, Adam made an important observation: the kernel is in charge of signal delivery, and it knows about all of this insanity, so can’t it prevent this from happening? Adam solved this by extending the trampoline after the control transfer instruction to consist of the original instruction (again) followed by an explicit trap into the kernel. If the kernel detects that we’re about to deliver a signal while in this window (that is, with the program counter on the original instruction in the trampoline), it moves the program counter to the second half of the trampoline and returns to user-level with the signal otherwise unhandled. This leaves the signal pending, but only allows the program to execute the one instruction before bringing it back downtown on the explicit trap in the second half of the trampoline. There, we will see that we are in this situation, update the trapped program counter to be the instruction after the instrumented instruction (telling the program a small but necessary lie about where signal delivery actually occurred), and bounce back to user-level; the kernel will see the bits set on the thread structure indicating that a signal is pending, and the Right Thing will happen.
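A minimal model of that deferral decision (a sketch with illustrative names; the real logic lives in dtrace_safe_defer_signal() and friends):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t scrpc;  /* first half: instruction, then jmp back to npc */
    uint64_t astpc;  /* second half: instruction, then explicit trap  */
} tramp_t;

/*
 * Called when the kernel is about to deliver a signal.  Returns true
 * if delivery must be deferred: the thread is parked on the second-
 * half copy of the instruction, so it executes exactly one
 * instruction before trapping back into the kernel, where the
 * pending signal is finally taken.
 */
static bool
defer_signal(const tramp_t *t, uint64_t *pc)
{
    if (*pc == t->scrpc) {
        *pc = t->astpc;   /* run the instruction, then trap */
        return (true);    /* leave the signal pending */
    }
    return (false);       /* not in the window: deliver normally */
}
```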

And it all works! Well, almost — and that brings us to the bug. After some intensive DTrace/mdb sessions and a long Jabber conversation with Adam, it became clear that while signals were involved, the basic mechanism was actually functioning as designed. More experimenting (and more DTrace, of course) revealed that the problem was even nastier than we had been thinking: yes, we were taking a signal in the trampoline — but that logic was working as designed, with the program counter being moved correctly into the second half of the trampoline. The problem was that we were taking a second asynchronous signal, this time in the second half of the trampoline, and this brought us to this logic in dtrace_safe_defer_signal:

        /*
         * If we've executed the original instruction, but haven't performed
         * the jmp back to t->t_dtrace_npc or the clean up of any registers
         * used to emulate %rip-relative instructions in 64-bit mode, do that
         * here and take the signal right away. We detect this condition by
         * seeing if the program counter is in the range [scrpc + isz, astpc).
         */
        if (t->t_dtrace_astpc - rp->r_pc <
            t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
                ...
                rp->r_pc = t->t_dtrace_npc;
                t->t_dtrace_ft = 0;
                return (0);
        }

Now, the reason for this clause (the meat of which I’ve elided) is the subject of its own discussion, but it is not correct as written: it works (due to some implicit use of unsigned math) when the program counter is in the first half of the trampoline, but it fails if the program counter is on the instrumented instruction in the second half of the trampoline (denoted with t_dtrace_astpc). If we hit this case — first one signal on the first instruction of the trampoline, followed by another signal on the first instruction of the second half of the trampoline — we will redirect control to the instruction after the instrumentation (t_dtrace_npc) without having executed the instruction. This, if it needs to be said, is bad — and leads to exactly the kind of mysterious death that we’re seeing.
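The failure mode is easy to demonstrate with concrete addresses. In this sketch, buggy_match is the predicate as written above, and fixed_match is an explicit range test capturing the idea behind the fix (not the literal patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* The predicate as written: relies on unsigned wraparound to reject
 * program counters below scrpc + isz ... */
static bool
buggy_match(uint64_t pc, uint64_t scrpc, uint64_t astpc, uint64_t isz)
{
    return (astpc - pc < astpc - scrpc - isz);
}

/* ... but an explicit test for [scrpc + isz, astpc) excludes astpc. */
static bool
fixed_match(uint64_t pc, uint64_t scrpc, uint64_t astpc, uint64_t isz)
{
    return (pc >= scrpc + isz && pc < astpc);
}
```

With scrpc = 0x1000, isz = 3 and astpc = 0x1008, both predicates agree for a program counter like 0x1005 (inside the first half, in the jmp) — but for pc == astpc the buggy version computes 0 < 5 and matches, so the instrumented instruction is skipped; the range test correctly does not match.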

Fortunately, the fix is simple — here’s the patch. All of this leaves just one more mystery: why are we seeing all of these signals on node, anyway? They are all SIGPROF signals, and when I pinged Ryan about this yesterday, he was as surprised as I was that node seemed to be generating them when otherwise idle. A little digging on both of our parts revealed that the signals we were seeing are an artifact of some V8 magic: the use of SIGPROF is part of the feedback-driven optimization in the new Crankshaft compilation infrastructure for V8. So the signals are goodness — and that DTrace can now correctly handle them with respect to arbitrary instrumentation that much more so!

Posted on March 9, 2011 at 11:27 pm by admin

9 Responses


  2. Written by Jason Williams
    on March 9, 2011 at 11:50 pm
    Permalink

    Is this going to make it back into Illumos? :)

  3. Written by Erik Corry
    on March 10, 2011 at 12:01 am
    Permalink

    Hairy.

    I’m a little surprised that you are getting two signals within two instructions. The Crankshaft signals are supposed to be around 1ms apart AFAIR. So I would expect it to be very rare that they hit so close to each other. Or is handling the first one taking more than 1ms?

  4. Written by Brendan Gregg
    on March 10, 2011 at 1:11 am
    Permalink

    Thank you, a great blog post. While extremely rare and avoidable (don’t trace memcpy(), for a start, which I usually never want to do anyway), it is alarming to see a DTrace bug, given DTrace’s extraordinary track record. I can’t recall a single other time I saw DTrace core dump a process. Thanks for the quick fix.

  5. Written by admin
    on March 10, 2011 at 1:20 am
    Permalink

    Hey Erik,

    For whatever it’s worth, it looks like Crankshaft is doing a setitimer() of 10 ms. (We could actually make this of arbitrary resolution on our system by having it use ITIMER_REALPROF instead of ITIMER_PROF; I’ll follow up with Ryan about exploring this.) But that makes your observation that much more important: how could it take that long to execute a single instruction? I don’t know the complete answer, but do remember that this includes the time to execute the signal handler itself — which is to say, whatever work Crankshaft itself is going to do as part of its feedback-driven optimization. If you don’t already, Crankshaft may want to be aware of this effect: if it takes longer than 10 ms to do its work, it will see a “phantom” signal. Anyway, I can investigate this more fully tomorrow if that would be of any use to you…

  6. Written by Vitaly Repeshko
    on March 10, 2011 at 3:54 am
    Permalink

    In Crankshaft we don’t rely on signal handling to determine what to optimize. On Linux, Mac, and Windows there’s a profiler thread that wakes up every 1ms and tells the VM thread to sample itself using our own stack guard mechanism (no signals involved). We do also have statistical profiler support as a tool for developers; it uses SIGPROF on Linux and thread suspend calls on Mac and Windows. On Linux we don’t use setitimer because of its poor resolution; the signals are sent manually by the same profiler thread. I keep mentioning these three platforms because they are the ones we care about most and track on our buildbots. What platform are you using? It may well be in bad shape after the Crankshaft merge to trunk.

  7. Written by Vyacheslav Egorov
    on March 10, 2011 at 4:11 am
    Permalink

    It seems that V8’s platform-solaris.cc is not really up to date with all the recent changes in profiling.

    There are two different profilers in V8: the sampling CPU profiler, activated for example when you run V8 with --prof, and the runtime profiler, which collects data for Crankshaft.

    The sampling CPU profiler is driven by SIGPROFs and does stack sampling in a signal handler. But Crankshaft’s runtime profiler does not rely on SIGPROF — instead it relies on V8’s built-in interrupt system: signals are unreliable because they can arrive at arbitrary moments, so it is in general unsafe to deeply examine VM state in a signal handler.

    So for Crankshaft we did some refactoring: on Linux we start an additional thread (the “sampler”) that at regular intervals

    1) sends SIGPROFs to main VM thread if sampling CPU profiler is enabled
    2) requests Crankshaft runtime profiler interrupts if Crankshaft is enabled.

    (Note: on Mac OS and Windows the sampler thread examines the VM thread context directly without sending any signals, but it still requests runtime profiler interrupts.)

    I looked at platform-solaris.cc and noticed that this refactoring was not done for it. More importantly, I did not notice any calls to RuntimeProfiler::NotifyTick() — this basically means that on Solaris Crankshaft will not function properly, because its runtime profiler will not collect any information about hot functions.

  8. Written by Dan McDonald
    on March 10, 2011 at 7:59 am
    Permalink

    Not the first time improvements in one portion find bugs in another. Exhibit A, your honor:

    http://blogs.sun.com/danmcd/entry/how_to_tell_when_a
