Happy 5th Birthday, DTrace!

It’s hard to believe, but DTrace is five years old today: it was on September 3, 2003 that DTrace integrated into Solaris. DTrace was a project that extended all three of us to our absolute limit as software engineers — and the 24 hours before integration was then (and remains now) the most harrowing of my career. As it will hopefully remain my most stressful experience as an engineer, the story of that final day merits a retelling…

Our project had been running for nearly two years, but it was not until mid-morning on September 2nd — the day before we were slated to integrate — that it was discovered that the DTrace prototype failed to boot on some very old hardware (the UltraSPARC-I, the oldest hardware still supported at that time). Now, “failed to boot” can meet a bunch of different things, but this was about as awful as it gets: a hard hang after the banner message. That is, booting mysteriously stopped making progress soon after control transferred to the kernel — and one could not break in with the kernel debugger. This is an awful failure mode because with no debugger and no fatal error, one has no place to start other than to start adding print statements — or start ripping out the code that is the difference between the working system and the busted one. This was a terrifying position to be in less than 24 hours before integration! Strangely, it was only the non-DEBUG variant that failed to boot: the DEBUG version laden with assertions worked fine. Our only lucky break was that we were able to find two machines that exhibited the problem, enabling us to bifurcate our efforts: I starting ripping out DTrace-specific code in one workspace, while Mike started frenetically adding print statements in another…

Meanwhile, while we were scrambling to save our project, Eric was having his first day at Sun. My office door was closed, and with our integration pending and me making frequent (and rapid) trips back and forth to the lab, the message to my coworkers was clear: stay the hell back. Eric was blissfully unaware of these implicit signals, however, and he cheerfully poked his head in my office to say hello (Eric had worked the previous summer in our group as an intern). I can’t remember exactly what I said to Eric when he opened my office door, but suffice it to say that the implicit signals were replaced with a very explicit one — and I remain grateful to this day that Eric didn’t quit on the spot…

Back on our problem, Mike — through process of elimination — had made the key breakthrough: it wasn’t actually an instruction set architecture (ISA) issue, but rather it seemed to be a host bus adapter (HBA) issue. This was an incredibly important discovery: while we had a bevy of architectural changes that could conceivably be invalid on an ancient CPU, we had no such HBA-specific changes — this was more likely to be something marring the surface of our work rather than cracking its foundation. Mike further observed that running a DEBUG variant of these ancient HBA drivers (esp and fas) would boot on an otherwise non-DEBUG kernel. At that, I remembered that we actually did have some cosmetic changes to these drivers, and on carefully reviewing the diffs, we found a deadly problem: in folding some old tracing code under a DEBUG-only #define, a critical line (the one that actually initiates the I/O) became compiled in only when DEBUG was defined. We hadn’t seen this until now because these drivers were only used on ancient machines — machines on which we had never tested non-DEBUG. We fixed the problem, and all of our machines booted DEBUG and non-DEBUG — and we felt like we were breathing again for the first time in the more than six hours that we had been working on the problem. (Here is the mail that I sent out explaining the problem.)

To celebrate DTrace’s birthday beyond just recounting the terror of its integration, I wanted to make a couple of documents public that we have not previously shared:

We said it then, and it’s even truer today: it’s been quite a ride. Happy 5th Birthday, DTrace — and thanks again to everyone in the DTrace community for making it what it has become!

Posted on September 3, 2008 at 10:57 am by bmc · Permalink
In: Solaris

17 Responses

Subscribe to comments via RSS

  1. Written by Chad Mynhier
    on September 3, 2008 at 11:23 am

    Hey, Bryan,
    I’ll raise a glass to you three this evening in honor of the occasion.

  2. Written by Anantha
    on September 3, 2008 at 12:28 pm

    Raise a pint my friend, ya’ll deserved it. Of all the Solaris 10 features DTrace & ZFS are the absolute winners.

  3. Written by Jonathan Adams
    on September 3, 2008 at 1:38 pm


  4. Written by Jim Mauro
    on September 3, 2008 at 3:03 pm

    I’ve been spoiled for years. I can’t even think about looking at a system that does not have DTrace!
    Congratulations on showing the world how to do it right.

  5. Written by fletch
    on September 3, 2008 at 4:27 pm

    here’s to applying Moore’s Law to the next 5 years!
    It was truly inspiring being introduced to DTrace at your Lisa in 2006 presentation – I was reminded of the Java presentation I saw James Gosling give in Calgary a loooong time ago
    may you never run out of interesting problems to solve!

  6. Written by Jason Hoffman
    on September 3, 2008 at 5:57 pm

    Congratulations Bryan, Mike and Adam! Can’t imagine a world without DTrace. Thank you thank you.

  7. Written by Jason J. W. Williams
    on September 4, 2008 at 12:50 am

    Congratulations guys. Thank you for bringing light to the darkness. :-) Can’t imagine life before it again.

  8. Written by Björn Herzig
    on September 4, 2008 at 5:40 am

    Congrats and many thanks for this tool.
    Your dtrace presentation on goggle tech talks really blew me away. It was actually one of the reasons why I immediately switched from Linux to Solaris after a decade. I now use it almost every day when debugging problems.
    Dtrace makes working with software so much more fun and a lot more productive.

  9. Written by Jim Grisanzio
    on September 5, 2008 at 2:56 am

    Congrats, guys! :)

  10. Written by Magnus
    on September 5, 2008 at 9:36 am

    Happy Birthday to a really great tool!
    So, are there any anniversary T-Shirts? :-)

  11. Written by Magnus
    on September 5, 2008 at 9:41 am

    And by the way, yes, please share the presentation video.

  12. Written by Ceri Davies
    on September 5, 2008 at 12:11 pm

    Congratulations! Any idea if the PSARC case could be opened for us all to see?

  13. Written by just...gord
    on September 30, 2008 at 11:04 pm

    Hilarious Google techtalk, at first I thought you’d lost me with the rambling intro… but your point was well made. Great talk, a lot of fun.
    I think the value of Dtrace will be to make OS’s much much better by having a superb textual visualization tool – Honestly I’m an avid fan of FreeBSD, Linux and Solaris… it’ll be great to see all three of these improve so much as your tool gets used day to day by other kernel hackers.
    I wonder whether the Linux community will eventually make an (API-, D-language-, DtraceVM- ) compatible version from your specs, in order to have a fully GPL’d Dtrace impl in mainline.
    After seeing Dtrace running…like your analogy, its hard to go back to normal TV, when you’ve tried TiVo :]

  14. Written by CarlOS
    on October 18, 2008 at 2:44 pm

    how dtrace its thread safe? i mean changing the nops for traps instructions in the probe.

  15. Written by Jonathan Adams
    on October 18, 2008 at 3:04 pm

    CarlOS: What’s your concern, exactly?

  16. Written by CarlOS
    on October 18, 2008 at 4:34 pm

    according to this http://blogs.sun.com/basant/entry/performance_impact_of_dtrace_probes normally the probes are just nops, but when dtrace use the probe the nops are changed for a trap instruction. How does this works on a multithreaded app?
    What happend if multiple threads step in the same probe and DTrace at the same time change the nops of the probe for the trap instruction?

  17. Written by Jonathan Adams
    on October 19, 2008 at 12:07 am

    The first thing dtrace does when processing a probe firing is to disable interrupts; at that point, the processing is locked to a CPU. Tracing data goes into per-CPU buffers.
    The provider (fbt or sdt) doesn’t have to worry about any of this; it just has to turn the ‘nop’ into something which calls dtrace_probe() with the appropriate arguments.

Subscribe to comments via RSS