DTrace and customer service

Today, I thought I’d share a real-world experience that might portray DTrace in a slightly different light than you’re used to. The other week, I was helping a customer with the following question:

Why is automountd constantly taking up 1.2% of CPU time?

The first thought that came to mind was a broken automountd. But if that were the case, you’d be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send truss -u a.out:: output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong – some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:

#!/usr/sbin/dtrace -s

/* Count automounter lookup requests by program name and autofs path. */
auto_lookup_request:entry
{
	@lookups[execname, stringof(args[0]->fi_path)] = count();
}

The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.
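
Since the culprit here kept restarting, each new instance would have had a fresh pid, so the parent is the more stable thing to key on. The following is a minimal sketch along those lines, not what was sent to the customer. It assumes the same auto_lookup_request probe and fi_path member used above, adds the built-in ppid variable to the aggregation key, and prints a summary every 10 seconds (an interval picked arbitrarily for illustration).

```d
#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Sketch only: same probe and fi_path member as the script above.
 * ppid is a built-in D variable holding the parent process ID.
 */
auto_lookup_request:entry
{
	@lookups[execname, ppid, stringof(args[0]->fi_path)] = count();
}

/* Every 10 seconds (arbitrary), print the counts and start over. */
tick-10s
{
	printa("%-16s ppid=%-6d %-32s %@d\n", @lookups);
	trunc(@lookups);
}
```

The periodic output makes it easier to watch the request rate over time instead of waiting for a single summary at Ctrl-C.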

Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don’t expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just two emails. I was reminded of the infamous gtik2_applet2 fiasco described in the DTrace USENIX paper: automountd was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on automountd debug output, but you’d still only see the request itself, not where it came from. To top it off, the offending processes were so short-lived that they never showed up in prstat(1) output, hiding from traditional system-wide tools.
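
Short-lived processes that slip past prstat(1) can still be caught at the moment they start. As a hedged illustration (this was not part of the original exchange), the sketch below uses the proc provider's exec-success probe to count how often each program is executed and by which parent. The aggregation name @execs is made up for this example.

```d
#!/usr/sbin/dtrace -s

/*
 * Sketch: count successful exec calls by program name and parent pid.
 * When exec-success fires, execname already names the new program image.
 */
proc:::exec-success
{
	@execs[execname, ppid] = count();
}
```

A program that is being restarted in a loop shows up here with a large count against a single parent pid, even if no individual instance lives long enough to be sampled.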

After a little thought, I imagined a few Solaris 9 scenarios where I’d either set a kernel breakpoint via kadb, or set a user breakpoint in automountd and use mdb -k to see which threads were waiting for a response. But these (and all other solutions I came up with) were:

It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can’t EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support…

Posted on February 1, 2005 at 10:57 am by eschrock
In: OpenSolaris

6 Responses

  1. Written by Rayson Ho
    on February 2, 2005 at 6:09 am
  2. Written by Eric Schrock
    on February 2, 2005 at 9:28 am

    Yep, fixed now. Thanks for catching that.

  3. Written by Alan Hargreaves
    on February 3, 2005 at 8:33 pm

    In fact you could have done this purely from the command line …

    $ dtrace -n 'auto_lookup_request:entry{@lookups[execname, stringof(args[0]->fi_path)] = count();}'
    dtrace: description 'auto_lookup_request:entry' matched 1 probe
    ^C
    ls /clones 1
    csh /clones 5
    dtwm /home 5

    I edited the above for clarity (removing some spaces).
    Alan.

  4. Written by Eric Schrock
    on February 3, 2005 at 10:52 pm

    You got me – that’s what I actually sent to the customer. But the script looks nicer for illustrative purposes ;-)

  6. Written by iwan rahabok
    on March 22, 2005 at 6:04 am

    You might want to consider giving more examples. I’m doing a Solaris 10 training for 40+ IT folks of my key customers in April, and an example like this is certainly useful to drive the point home. Thank you. Iwan.