The Observation Deck

Month: August 2004

Software is like nothing else in the history of human endeavor:1 unlike everything else we have ever built, software costs nothing to manufacture, and it never wears out. Yet these magical properties are arguably overshadowed by the ugly truth that software remains incredibly expensive to build. This gives rise to some strange economic properties: software’s fixed costs are high (very high — too high), but its variable costs are zero. As strange as they are, these economic properties aren’t actually unique to software; they are also true (to varying degree) of the products that we have traditionally called “intellectual property.” But unlike books or paintings or movies, software is predominantly an industrial good — it is almost always used as a component in a larger, engineered system. When you take these together — software’s role as an industrial good, coupled with its high fixed costs and zero variable costs — you get all sorts of strange economic phenomena. For example, doesn’t it strike you as odd that your operating system is essentially free, but your database is still costing you forty grand per CPU? Is a database infinitely more difficult to write than an operating system? (Answer: no.) If not, why the enormous pricing discrepancy?

I want to ultimately address the paradox of the software price discrepancy, but first a quick review of the laws of supply and demand in a normal market: at high prices, suppliers tend to want to supply more, while consumers tend to demand less; at low prices, consumers tend to demand more, while suppliers tend to want to supply less. We can show price versus quantity demanded/supplied with the classic supply and demand curves:

The point of intersection of the curves is the equilibrium price, and the laws of supply and demand tend to keep the market in equilibrium: as prices rise slightly out of equilibrium, suppliers will supply a little more, consumers will demand a little less, inventories will rise a little bit, and prices will fall back into equilibrium. Likewise, if prices fall slightly, consumers will demand a little more, inventories will become depleted, and prices will rise back into equilibrium.
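For concreteness, here is a minimal sketch of that equilibrium with hypothetical linear curves (the letters are illustrative placeholders, not data from any real market):

$$Q_d(p) = a - bp, \qquad Q_s(p) = c + dp$$

$$Q_d(p^*) = Q_s(p^*) \;\Longrightarrow\; p^* = \frac{a - c}{b + d}$$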

The degree to which suppliers and consumers can react to prices — the slope of their respective curve — is known as price elasticity. In a price inelastic market, suppliers or consumers cannot react quickly to prices. Cigarettes, for example, are the canonical case of inelastic demand: if prices increase, few smokers will quit. (That said, demand for a particular brand of cigarettes is roughly normal: if Marlboros suddenly cost ten bucks a pack, cheap Russian imports may begin to look a lot more attractive.)
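In the standard textbook notation (nothing software-specific here), the price elasticity of demand is just the percentage change in quantity demanded per percentage change in price:

$$E_d = \frac{\Delta Q / Q}{\Delta p / p}$$

A small $|E_d|$ is the cigarette case: a big price hike produces only a small drop in quantity demanded. A large $|E_d|$ is the normal, brand-switching case.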

So that’s the market for smokes, but what of software? Let’s start by looking at the supply side, because it’s pretty simple: the zero variable cost means that suppliers can supply an arbitrary quantity at a given price. That is, this is the supply curve for software:

The height of the “curve” will be dictated by various factors: competitive environment, fixed costs, etc.; we’ll talk about how the height of this curve is set (and shifted) in a bit.

And what of the demand side? Software demand is normal to the degree that consumers have the freedom to choose software components. The problem is that for all of the rhetoric about software becoming a “commodity”, most software is still very much not a commodity: one software product is rarely completely interchangeable with another. The lack of interchangeability isn’t as much of an issue for a project that is still being specified (one can design around the specific intricacies of a specific piece of software), but it’s very much an issue after a project has deployed: deployed systems are rife with implicit dependencies among the different software components. These dependencies — and thus the cost to replace a given software component — tend to increase over time. That is, your demand becomes more and more price inelastic as time goes on, until you reach a point of complete price inelasticity. Perhaps this is the point when you have so much layered on top of the decision that a change is economically impossible. Or perhaps it’s the point when the technical talent that would retool your infrastructure around a different product has gone on to do something else — or perhaps they’re no longer with the company. Whatever the reason, it’s the point after which the software has become so baked into your infrastructure that the decision cannot be revisited.

So instead of looking like the nice supply and demand curves above, software supply and demand curves tend to look like this:

And of course, your friendly software vendor knows that your demand tends towards inelasticity — which is why they so frequently raise the rent while offering so little in return. We’ve always known about this demand inelasticity, we’ve just called it something else: vendor lock-in.

If software suppliers have such unbelievable pricing power, why don’t companies end up forking over every last cent for software? Because the demand for software isn’t completely price inelastic. It’s only inelastic as long as the price is below the cost of switching software. In the spirit of the FYO billboard on the 101, I dub this switching point the “FYO point”: it is the point at which you get so pissed off at your vendor that you completely reevaluate your software decision — you put everything back on the table. So here’s the completed picture:
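In symbols (a deliberate simplification, with $C_{\mathrm{switch}}$ standing in for the total cost of ripping out and replacing the deployed software):

$$p_{\mathrm{vendor}} < C_{\mathrm{switch}} \;\Longrightarrow\; \text{sigh, groan, sign the check}$$

$$p_{\mathrm{vendor}} \ge C_{\mathrm{switch}} \;\Longrightarrow\; \text{FYO point: everything goes back on the table}$$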

What happens at the FYO point? In extreme cases, you may decide to rewrite it yourself. Or maybe you’ll decide that it’s worth the pain to switch to a different (and less rapacious) vendor — or at least to scare your existing vendor into easing up a bit on their pricing. Or maybe you’ll ramp up a new project to replace the existing one, using all new components — thus normalizing your demand curve. And increasingly often, you decide that you’re not using half of the features of this thing anyway — and you start looking for a “good enough” open source option to get out of this ugly mess once and for all. (More on this later.)

Now, your software vendor doesn’t actually want you to hit the FYO point; they want to keep you far enough below it that you just sigh (or groan) and sign the check. (Which most of them are pretty good at, by the way — but of course, you already know that from all of your sighing and groaning.) There are essentially two ways for a software company to grow revenue on an established software product:

  1. Take away business from competitors

  2. Extract more dough out of existing customers

In terms of the FYO point, taking away competitors’ business amounts to lowering the FYO point of competitors’ customers. This can be done through technology (for example, by establishing open standards or developing migration tools) or it can be done through pricing (for example, by lowering the price they charge their competitors’ customers for their software — the “competitive upgrade”). This is a trend that generally benefits customers. If only there were no other option…

Sadly, there is another option, and most software companies opt for it: extracting more money from their existing customers. In terms of the FYO point, this amounts to raising the FYO point of their own customers. That is, software vendors act as a natural monopolist does: focusing their efforts not on competition, but rather on raising barriers to entry. They have all sorts of insidious ways of doing this: proprietary data formats, complicated interdependencies, deliberate incompatibilities, etc. Personally, I find these behaviors abhorrent, and I have been amazed at how brazen some software vendors are about maintaining their inalienable right to screw their own customers. To wit: I have now had not one but two software vendors tell me that I must add a way to disable DTrace for their app to prevent their own customers from observing their software. They’re not even worried about their competitors — they’re too busy screwing their own customers! (Needless to say, their requests for such a feature were, um, declined.)

So how does open source fit into this? Open source is a natural consequence of the economics of software, on both the demand-side and the supply-side. The demand-side has been discussed ad nauseam (and frequently, ad hominem): the demand for open source comes from customers who are sick of their vendors’ unsavory tactics to raise their FYO point — and they are sick more generally of the whole notion of vendor lock-in. The demand-side is generally responsible for customers writing their own software and making it freely available, or participating in similar projects in the community at large. To date, the demand-side has propelled much of open source software, including web servers (Apache) and scripting languages (Perl, Python). With some exception, the demand-side consists largely of individuals participating out of their interest more than their self-interest. As a result, it generally cannot sustain full-time, professional software developers.

But there’s also a supply-side to open source: if software has no variable cost, software companies’ attempts to lower their competitors’ customers’ FYO point will ultimately manifest themselves in free software. And the best (if not the only) way to make software convincingly free is to make the source code freely available — to make it open source.2 The tendency towards open source is especially strong when companies profit not directly from the right-to-use of the software, but rather from some complementary good: support, services, other software, or even hardware. (In the specific case of Solaris and Sun, it’s actually all of the above.) And what if customers never consume any of these products? Well, the software costs nothing to manufacture, so there isn’t a loss — and there is often an indirect gain. To take the specific example of Solaris: if you run Solaris and never give a nickel to Sun, that’s fine by us; it didn’t even cost us a nickel to make your copy, and your use will increase the market for Solaris applications and solutions — driving platform adoption and ultimately driving revenue for Sun. To put this in retail terms, open source software has all of the properties of a loss-leader — minus the loss, of course.

While the demand-side has propelled much of open source to date, the supply-side is (in my opinion) ultimately a more powerful force in the long run: the software created by supply-side forces has generally been developed by people who do it full-time for a living — there is naturally a greater attention to detail. For a good example of supply-side forces, look at operating systems, where Linux — the traditionally dominant open source operating system — has enjoyed profound benefits from the supply-side. These include contributions from operating systems such as AIX (JFS, scalability work), IRIX (XFS, observability tools and the occasional whoopsie), DYNIX/ptx (RCU locks), and even OS/2 (DProbes). And an even larger supply-side contribution looms: the open sourcing of Solaris. This will certainly be the most important supply-side contribution to date, and a recognition of the economics both of the operating systems market and of the software market more generally. And unlike much prior supply-side open source activity, the open sourcing of Solaris is not open source as capitulation — it is open source as counter-strike.

To come back to our initial question: why is the OS basically free while the database is costing you forty grand per CPU? The short answer is that the changes that have swept through the enterprise OS market are still ongoing in the database market. Yes, there have been traditional demand-side efforts like MySQL and research efforts like PostgreSQL, but neither of these “good enough” efforts has actually been good enough to compete with Informix, Oracle, DB/2 or Sybase in the enterprise market. In the last few years, however, we have seen serious supply-side movement, with MaxDB from SAP and Ingres from CA both becoming open source. Will either of these be able to start taking serious business away from Oracle and IBM? That is, will they be enough to lower the FYO point such that more customers say “FY, O”? The economics of software tells us that, in the long run, this is likely the case: either the demand-side will ultimately force sufficient improvements to the existing open source databases, or the supply-side will force the open sourcing of one of the viable competitors. And that software does not wear out and costs nothing to manufacture assures us that the open source databases will survive to stalk their competitors into the long run. Will this happen anytime soon? As Keynes famously pointed out, “in the long run, we are all dead” — so don’t count on any less sighing or groaning or check writing in the immediate future…


1 As an aside, I generally hate this rhetorical technique of saying that “[noun] is the [superlative] [same noun] [verb] by humankind.” It makes it sound like the chimps did this years ago, but we humans have only recently caught up. I regret using this technique myself, so let me clarify: with the notable exception of gtik2_applet2, the chimps have not yet discovered how to write software.

2 Just to cut off any rabid comments about definitions: by “open source” I only mean that the source code is sufficiently widely and publicly available that customers don’t question that its right-to-use is (and will always be) free. This may or may not mean OSI-approved, and it may or may not mean GPL. And of course, many customers have discovered that open source alone doesn’t solve the problem. You need someone to support it — and the company offering support begins to look, act and smell a lot like a traditional, rapacious software company. (Indeed, FYO point may ultimately be renamed the “FYRH point.”) You still need open standards, open APIs, portable languages and so on…

So DTrace was recently mentioned on the linux-kernel mailing list. The question in the subject was “Is DTrace-like analysis possible with future Linux kernels?” The responses have been interesting. Karim Yaghmour rattled in with his usual blather about the existence of DTrace proving that LTT should have been accepted into the Linux kernel long ago. I find this argument incredibly tedious, and have addressed it at some length. Then there was this inane post:

> http://www.theregister.co.uk/2004/07/08/dtrace_user_take/:
> "Sun sees DTrace as a big advantage for Solaris over other versions of Unix
> and Linux."
That article is way too hypey.
It sounds like one of those strange american commercials you see
sometimes at night, where two overenthusiastic persons are telling you
how much that strange fruit juice machine has changed their lives,
with making them loose 200 pounds in 6 days and improving their
performance at beach volleyball a lot due to subneutronic antigravity
manipulation. You usually can't watch those commercials for longer
than 5 minutes.
The same applies to that article, I couldn't even read it completely,
it was just too much.
And is it just me or did that article really take that long to
mentioning what dtrace actually IS?
Come on, it's profiling. As presented by that article, it is even more
micro optimization than one would think. What with tweaking the disk
I/O improvements and all... If my harddisk accesses were a microsecond
more immediate or my filesystem giving a quantum more transfer rate,
it would be nice, but I certainly wouldn't get enthusiastic and I bet
nobody would even notice.
Maybe, without that article, I would recognize it as a fine thing (and
by "fine" I don't mean "the best thing since sliced bread"), but that
piece of text was just too ridiculous to take anything serious.
I sure hope that article is meant sarcastically. By the way, did I
miss something or is profiling suddenly a new thing again?
Regards,
Julien

Yes, you missed something, Julien: you forgot to type “dtrace” into google. (If there were a super-nerd equivalent of the Daily Show, we might expect Lewis Black to say that — probably punctuated with his usual “you moron!”) If you had done this, you would have been taken to the DTrace BigAdmin site, which contains links to the DTrace USENIX paper, the DTrace documentation, and a boatload of other material that supports the claims in The Register story. In fact, if you had just scrolled to the bottom of that story you would have read its “Bootnotes” section — which provides plenty of low-level supporting detail. (Indeed, I’m not sure that I’ve ever seen The Register publish such user-supplied detail to support a story.)

Sometimes the bigotry surrounding Linux surprises even me: in the time he took to record his misconceptions, Julien could have (easily!) figured out that he was completely wrong. But I guess that even this is too much work for someone who is looking to confirm preconceived notions rather than understand new technology…

Fortunately, one of the responses did call Julien on this, if only slightly:

* Julien Oster:
> Miles Lane  writes:
>
>> http://www.theregister.co.uk/2004/07/08/dtrace_user_take/:
>> "Sun sees DTrace as a big advantage for Solaris over other versions of Unix
>> and Linux."
>
> That article is way too hypey.
Maybe, but DTrace seems to solve one really pressing problem: tracking
disk I/O to the processes causing it.  Unexplained high I/O
utilization is a *very* common problem, and there aren't any tools to
diagnose it.
Most other system resources can be tracked quite easily: disk space,
CPU time, committed address space, even network I/O (with tcpdump and
netstat -p).  But there's no such thing for disk I/O.

Of course, the responder misses the larger point about DTrace — that one can instrument one’s system arbitrarily and safely with DTrace — but at least he correctly identifies one capacity in which DTrace clearly leads the pack. And I suppose that this is the best that a rival technology can expect to do, so close to the epicenter of Linux development.

With my explanation of a demo gone wrong, several people have asked me the more general question: how does one demo DTrace? This question doesn’t have a single answer, especially given that DTrace is best demonstrated with ad hoc queries of the system. Indeed, the best demos are when someone in the audience shouts out their own question that they want to see answered: answering such a question instantly with DTrace nearly always blows away the questioner — who has presumably suffered in the past trying to answer similar questions. Of course, saying “why, there are many ways to demo DTrace!” is a useless answer to the question of how one demos DTrace; while there are many ways that one can demo DTrace, it’s useful to get a flavor of how a typical demo might go. So with the substantial caveat that this is not the way to demo DTrace, but merely a way, let me walk you through an example demo.

It all starts by running dtrace(1M) without any arguments:

# dtrace
Usage: dtrace [-32|-64] [-aACeEFGHlqSvVwZ] [-b bufsz] [-c cmd] [-D name[=def]]
        [-I path] [-o output] [-p pid] [-s script] [-U name]
        [-x opt[=val]] [-X a|c|s|t]

        [-P provider [[ predicate ] action ]]
        [-m [ provider: ] module [[ predicate ] action ]]
        [-f [[ provider: ] module: ] func [[ predicate ] action ]]
        [-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]]
        [-i probe-id [[ predicate ] action ]] [ args ... ]

        predicate -> '/' D-expression '/'
           action -> '{' D-statements '}'

        -32 generate 32-bit D programs and ELF files
        -64 generate 64-bit D programs and ELF files

        -a  claim anonymous tracing state
        -A  generate driver.conf(4) directives for anonymous tracing
        -b  set trace buffer size
        -c  run specified command and exit upon its completion
        -C  run cpp(1) preprocessor on script files
        -D  define symbol when invoking preprocessor
        -e  exit after compiling request but prior to enabling probes
        -E  exit after enabling probes but prior to tracing data
        -f  enable or list probes matching the specified function name
        -F  coalesce trace output by function
        -G  generate an ELF file containing embedded dtrace program
        -H  print included files when invoking preprocessor
        -i  enable or list probes matching the specified probe id
        -I  add include directory to preprocessor search path
        -l  list probes matching specified criteria
        -m  enable or list probes matching the specified module name
        -n  enable or list probes matching the specified probe name
        -o  set output file
        -p  grab specified process-ID and cache its symbol tables
        -P  enable or list probes matching the specified provider name
        -q  set quiet mode (only output explicitly traced data)
        -s  enable or list probes according to the specified D script
        -S  print D compiler intermediate code
        -U  undefine symbol when invoking preprocessor
        -v  set verbose mode (report program stability attributes)
        -V  report DTrace API version
        -w  permit destructive actions
        -x  enable or modify compiler and tracing options
        -X  specify ISO C conformance settings for preprocessor
        -Z  permit probe descriptions that match zero probes

I usually point out that this just gives your typical Unix-ish help message — implicitly making the point that dipping into DTrace doesn’t require much more than typing its name on the command line. From the output of the usage message, I highlight the “-l” option, noting that it lists all the probes in the system. I then run with that option — being sure to pipe the output to more(1):

# dtrace -l | more
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
    4   lockstat           genunix                       mutex_enter adaptive-acquire
    5   lockstat           genunix                       mutex_enter adaptive-block
    6   lockstat           genunix                       mutex_enter adaptive-spin
    7   lockstat           genunix                        mutex_exit adaptive-release
    8   lockstat           genunix                     mutex_destroy adaptive-release
    9   lockstat           genunix                    mutex_tryenter adaptive-acquire
   10   lockstat           genunix                          lock_set spin-acquire
   11   lockstat           genunix                          lock_set spin-spin
   12   lockstat           genunix                      lock_set_spl spin-acquire
   13   lockstat           genunix                      lock_set_spl spin-spin
   14   lockstat           genunix                          lock_try spin-acquire
   15   lockstat           genunix                        lock_clear spin-release
   16   lockstat           genunix                   lock_clear_splx spin-release
   17   lockstat           genunix                      CLOCK_UNLOCK spin-release
   18   lockstat           genunix                          rw_enter rw-acquire
   19   lockstat           genunix                          rw_enter rw-block
   20   lockstat           genunix                           rw_exit rw-release
   21   lockstat           genunix                       rw_tryenter rw-acquire
   22   lockstat           genunix                     rw_tryupgrade rw-upgrade
   23   lockstat           genunix                      rw_downgrade rw-downgrade
   24   lockstat           genunix                       thread_lock thread-spin
   25   lockstat           genunix                  thread_lock_high thread-spin
   26   fasttrap                                            fasttrap fasttrap
   27    syscall                                               nosys entry
   28    syscall                                               nosys return
   29    syscall                                               rexit entry
   30    syscall                                               rexit return
--More--

I point out that each of these rows is a point of instrumentation. I then explain each of the columns, noting that the PROVIDER column denotes the DTrace notion of instrumentation providers, which allows for disjoint instrumentation techniques. If the audience is a particularly savvy one, I may point out that the lockstat provider is a recasting of the instrumentation technique used by lockstat(1M), and that lockstat has been made to be a DTrace consumer. I mention that the MODULE column contains the name of the kernel module (in the case of a kernel probe) or shared object name (in the case of a user-level probe), that the FUNCTION column contains the function that the probe is in, and that the NAME column contains the name of the probe. I explain that each probe is uniquely identified by the probe tuple: provider, module, function and name.
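For example (any row from the listing above would work equally well), you can list a single probe by spelling out its full tuple; on the system shown above, that would yield something like:

# dtrace -l -n lockstat:genunix:mutex_enter:adaptive-acquire
   ID   PROVIDER            MODULE                          FUNCTION NAME
    4   lockstat           genunix                       mutex_enter adaptive-acquire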

Scrolling down through the output, I point out the probes from the syscall provider, describing its ability to instrument every system call entry and return. (For some audiences, I may pause here to remind them of the definition of a system call.)

I then continue to scroll down my probe listing until I get to output that looks something like this:

 ...
 1268        fbt              unix                      dvma_release entry
 1269        fbt              unix                      dvma_release return
 1270        fbt              unix                        softlevel1 entry
 1271        fbt              unix                        softlevel1 return
 1272        fbt              unix                        freq_notsc entry
 1273        fbt              unix                        freq_notsc return
 1274        fbt              unix                         psm_unmap entry
 1275        fbt              unix                         psm_unmap return
 1276        fbt              unix             cpu_visibility_online entry
 1277        fbt              unix             cpu_visibility_online return
 1278        fbt              unix                         setbackdq entry
 1279        fbt              unix                         setbackdq return
 1280        fbt              unix                       lgrp_choose entry
 1281        fbt              unix                       lgrp_choose return
 1282        fbt              unix              cpu_intr_swtch_enter entry
 1283        fbt              unix              cpu_intr_swtch_enter return
 1284        fbt              unix                      page_upgrade entry
 1285        fbt              unix                      page_upgrade return
 1286        fbt              unix                      page_lock_es entry
 1287        fbt              unix                      page_lock_es return
 1288        fbt              unix               lgrp_shm_policy_set entry
 1289        fbt              unix               lgrp_shm_policy_set return
 1290        fbt              unix                        segkmem_gc entry
 1291        fbt              unix                        segkmem_gc return
 1292        fbt              unix                      disp_anywork entry
 1293        fbt              unix                      disp_anywork return
 1294        fbt              unix                   prgetprxregsize entry
 1295        fbt              unix                   prgetprxregsize return
 1296        fbt              unix                       cpuid_pass1 entry
 1297        fbt              unix                       cpuid_pass1 return
 1298        fbt              unix                       cpuid_pass2 entry
 1299        fbt              unix                       cpuid_pass2 return
 1300        fbt              unix                       cpuid_pass3 entry
 1301        fbt              unix                       cpuid_pass3 return
 1302        fbt              unix                       cpuid_pass4 entry
 1303        fbt              unix                       cpuid_pass4 return
 1304        fbt              unix                 release_bootstrap entry
 1305        fbt              unix                 release_bootstrap return
 1306        fbt              unix          i_ddi_rnumber_to_regspec entry
 1307        fbt              unix          i_ddi_rnumber_to_regspec return
 1308        fbt              unix                    page_mem_avail entry
 1309        fbt              unix                    page_mem_avail return
 1310        fbt              unix                      page_pp_lock entry
 1311        fbt              unix                      page_pp_lock return
 ...

I pause here to explain that the fbt provider can instrument every function entry and return in the kernel. Making the observation that there are quite a few functions in the kernel, I then quit out of the output, and re-run the command — this time piping the output through wc. Usually, the output is something like this:

# dtrace -l | wc -l
   32521

Anyone who has dealt with traditional instrumentation frameworks will typically express some shock at this — that there are (for example) 32,520 points of instrumentation on an optimized, stock system. Occasionally, someone who has perhaps heard of DTrace may pipe up: “Is that where that 30,000 number comes from that I’ve seen in the press?” The quick answer is that this is indeed where that number comes from — but the longer answer is that this is really only the beginning, because with the pid provider, DTrace can instrument every instruction in every running app. For perhaps obvious reasons, we create these pid probes lazily — we don’t want to clog up the system with millions of unused probes.
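To make the lazy creation concrete (the process ID here is purely hypothetical), you could ask the pid provider to list the function-entry probes in libc.so.1 for some running process and count them; the probes are created on the spot, and only for that process:

# dtrace -l -n 'pid1234:libc.so.1::entry' | wc -l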

After having listed all probes and counted them, my next invocation of dtrace is to list the probes yet again — but this time listing only probes that match the probe tuple “syscall:::entry”. I explain that this means I want to match probes from the syscall provider named entry — and that I don’t care about the module and function. This output is simply a shorter version of the previous listing:

# dtrace -l -n syscall:::entry
   ID   PROVIDER            MODULE                          FUNCTION NAME
   27    syscall                                               nosys entry
   29    syscall                                               rexit entry
   31    syscall                                             forkall entry
   33    syscall                                                read entry
   35    syscall                                               write entry
   37    syscall                                                open entry
   39    syscall                                               close entry
   41    syscall                                                wait entry
...

I then explain that to enable these probes (instead of just listing them), I merely need to leave off the “-l”:

# dtrace -n syscall:::entry
dtrace: description 'syscall:::entry' matched 226 probes
CPU     ID                    FUNCTION:NAME
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    261                  sysconfig:entry
  0    261                  sysconfig:entry
  0    195                  sigaction:entry
  0    195                  sigaction:entry
  0    125                      ioctl:entry
  0    261                  sysconfig:entry
  0     61                        brk:entry
  0     61                        brk:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    403                    fstat64:entry
  0     61                        brk:entry
  0     61                        brk:entry
  0    403                    fstat64:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0     35                      write:entry
  0    125                      ioctl:entry
  0    403                    fstat64:entry
  0     35                      write:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    339                    pollsys:entry
  0     33                       read:entry
...

I let this run on the screen for a while, explaining that we’ve gone from an optimized, uninstrumented system call table to an instrumented one — and that we’re now seeing every system call as it happens on the box. I explain that (on the one hand) this is novel information: in Solaris, we haven’t had a way to get system call information for the entire system. (I often remind the audience that this may look similar to truss, but it’s really quite different: truss operates only on a single process, and has an enormous probe effect from constantly stopping and running the target process.)

But while this information may be novel (at least for Solaris), it’s actually not that valuable — merely knowing that we executed these system calls is unlikely to be terribly helpful. Rather, we may like to know which applications are inducing these system calls. To do this, we add a clause in the DTrace control language, D. The clause we need to add is quite simple; we’re going to use the trace action to record the current application name, which is stored in the execname variable. So I ^C the previous enabling, and run the following:

# dtrace -n syscall:::entry'{trace(execname)}'
dtrace: description 'syscall:::entry' matched 226 probes
...
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0     33                       read:entry   Xsun
  0    233                     writev:entry   Xsun
  0    125                      ioctl:entry   xterm
  0     33                       read:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    339                    pollsys:entry   Xsun
  0     35                      write:entry   dtrace
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0     33                       read:entry   xterm
  0    125                      ioctl:entry   xterm
  0     35                      write:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    125                      ioctl:entry   xterm
  0     33                       read:entry   xterm
  0     35                      write:entry   xterm
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    313                lwp_sigmask:entry   Xsun
...

This is clearly more useful: we can see which apps are performing system calls. But seeing this also leads to a very important observation, one that some heckler in a savvy audience may have already made by now: what we’re seeing in the above output is the system calls to write all of this gunk out to the terminal. That is, if we were to take this output and cut it, sort it, uniq it, and sort it again, we would come to the conclusion that dtrace, xterm and Xsun were the problem — regardless of what was being executed on the machine.

This is a serious problem, and the DTrace solution represents a substantial difference from virtually all earlier work in this domain: in DTrace, the aggregation of data is a first-class citizen. Instead of aggregating data postmortem with traditional Unix tools, DTrace allows data to be aggregated in situ. This aggregation can reduce the data flow by up to a factor of the number of data points — a tremendous savings of both space and time.

So to recast the example in terms of aggregations, we want to aggregate on the application name, with our aggregating action being to count:

# dtrace -n syscall:::entry'{@[execname] = count()}'
dtrace: description 'syscall:::entry' matched 226 probes

When we run the above, we don’t see any output — just that we matched some number of probes. I explain that behind the scenes, we’re just keeping a table of application names and the count of system calls. When we ^C the above, we get the answer:

^C

  utmpd                                                             4
  automountd                                                        8
  inetd                                                             9
  svc.configd                                                       9
  syslogd                                                          16
  FvwmAuto                                                         39
  in.routed                                                        48
  svc.startd                                                       97
  MozillaFirebird-                                                125
  sendmail                                                        183
  fvwm2                                                           621
  dhcpagent                                                      1192
  dtrace                                                         2195
  xclock                                                         2894
  vim                                                            7760
  xterm                                                         11768
  Xsun                                                          24916
#

I often point out that in DTrace — as in good cocktail conversation — the answer to one question typically provokes the next question. Looking at the above output, we may have all sorts of questions. For example, we may well ask: what the hell is dhcpagent doing? To answer this question, we can add a predicate to our clause:

# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[probefunc] = count()}'

The predicate is contained between the forward slashes; the expression in the curly braces is only evaluated if the predicate evaluates to a non-zero value. And note that we have changed the tuple on which we are aggregating: instead of aggregating on execname (which, thanks to the predicate, will always be “dhcpagent“), we’re aggregating on the probefunc. For the syscall provider, the probefunc is the name of the system call.

Running the above for ten seconds or so, and then ^C‘ing yields the following (at least on my laptop):

^C

  pollsys                                                           3
  close                                                             3
  open                                                              3
  lwp_sigmask                                                       6
  fstat64                                                           6
  read                                                             12
  ioctl                                                            12
  llseek                                                           15

Now, we may be interested in drilling down more into one of these system calls. Let’s say we want to know what dhcpagent is doing when it performs an open. We can effect this by narrowing our enabling a bit: instead of enabling all syscall probes named entry, we want to enable only those probes that are also from the function named open. We want to keep our predicate that the execname is dhcpagent, but this time we don’t want to aggregate on the probefunc (which, thanks to the narrowed enabling, will always be “open“); we want to aggregate on the name of the file that is being opened — which is the first argument to the open system call. To do this, we’ll aggregate on arg0. For slightly arcane reasons, we need to use the copyinstr action to effect this. The enabling:

# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[copyinstr(arg0)] = count()}'
dtrace: description 'syscall::open:entry' matched 1 probe

Again, running the above for about ten seconds:

^C

  /etc/default/dhcpagent                                            4

This tells us that there were four calls to the open system call from dhcpagent — and that all four were to the same file. Just on the face of it, this is suspicious: why is dhcpagent opening the file “/etc/default/dhcpagent” with such frequency? Hasn’t whoever wrote this code ever heard of stat?1 To answer the former question, we’re going to change our enabling to aggregate on dhcpagent‘s stack backtrace when it performs an open. This is done with the ustack action. The new enabling:

# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[ustack()] = count()}'
dtrace: description 'syscall::open:entry' matched 1 probe

Running the above for ten seconds or so and then ^C‘ing yields the following:

^C


              libc.so.1`__open+0xc
              libc.so.1`_endopen+0x92
              libc.so.1`fopen+0x26
              libcmd.so.1`defopen+0x40
              dhcpagent`df_get_string+0x27
              dhcpagent`df_get_int+0x1c
              dhcpagent`dhcp_requesting+0x248
              libinetutil.so.1`iu_expire_timers+0x8a
              libinetutil.so.1`iu_handle_events+0x95
              dhcpagent`main+0x300
              805314a
                4

This tells us that all four of the open calls were from the above stack trace.

From here we have many options, and our course of action would depend on how we wanted to investigate this. Up until this point, we have been instrumenting only the system call layer — the shim layer between applications and the kernel. We may wish to now instrument the application itself. To do this, we first need to know how many dhcpagent processes are contributing to the output we’re seeing above. There are lots of ways to do this; a simple way is to aggregate on pid:

# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[pid] = count()}'
dtrace: description 'syscall:::entry' matched 226 probes
^C

       43               80

This indicates that in the time between when we started the above and the time that we ^C’d, we saw 80 system calls from pid 43 — and that it was the only dhcpagent process that made any system calls. Now that we have the pid that we’re interested in, we can instrument the application. For starters, let’s just instrument the defopen function in libcmd.so.1. (We know that it’s being called because we saw it in the stack backtrace of the open system call.) To do this:

# dtrace -n pid43::defopen:entry
dtrace: description 'pid43::defopen:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  32622                    defopen:entry
  0  32622                    defopen:entry
  0  32622                    defopen:entry
  0  32622                    defopen:entry
^C

With this we have instrumented the running app. When we ^C‘d, the app went back to being its usual optimized self. (Or at least, as optimized as dhcpagent ever is.)

If we wanted to trace both entry to and return from the defopen function, we could add another enabling to the mix:

# dtrace -n pid43::defopen:entry,pid43::defopen:return
dtrace: description 'pid43::defopen:entry,pid43::defopen:return' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0  32622                    defopen:entry
  0  32623                   defopen:return
  0  32622                    defopen:entry
  0  32623                   defopen:return
  0  32622                    defopen:entry
  0  32623                   defopen:return
  0  32622                    defopen:entry
  0  32623                   defopen:return
^C

But we may want to know more than this: we may want to know everything called from defopen. Answering this is a bit more sophisticated — and it’s a little too much for the command line. To answer this question, we head to an editor and type in the following DTrace script:

#!/usr/sbin/dtrace -s

pid43::defopen:entry
{
        self->follow = 1;
}

pid43:::entry,
pid43:::return
/self->follow/
{}

pid43::defopen:return
/self->follow/
{
        self->follow = 0;
        exit(0);
}

So what is this script doing? We’re enabling defopen again, but this time we’re setting a thread-local variable — which we named “follow” — to 1. We then enable every function entry and return in the process, with the predicate that we’re only interested if our thread-local variable is non-zero. This has the effect of only tracing function entry and return if the function call was induced by defopen. We’re only interested in capturing one of these traces, which is why we explicitly exit in the pid43::defopen:return clause.

Running the above is going to instrument the hell out of dhcpagent. We’re going to see that we matched many, many probes. Then we’ll see a bunch of output as defopen is called. Finally, we’ll be dropped back onto the shell prompt as we hit the exit action. By the time we see the shell again, we have restored the dhcpagent process to its uninstrumented self. Running the above, assuming it’s in dhcp.d:

# chmod +x dhcp.d
# ./dhcp.d
dtrace: script './dhcp.d' matched 10909 probes
CPU     ID                    FUNCTION:NAME
  0  32622                    defopen:entry
  0  33328              _get_thr_data:entry
  0  37823            thr_getspecific:entry
  0  43272           thr_getspecific:return
  0  38784             _get_thr_data:return
  0  37176                      fopen:entry
  0  37161                   _findiop:entry
  0  37525      pthread_rwlock_wrlock:entry
  0  37524             rw_wrlock_impl:entry
  0  37513               rw_read_held:entry
  0  42963              rw_read_held:return
  0  37514              rw_write_held:entry
  0  42964             rw_write_held:return
  0  37518                rwlock_lock:entry
  0  37789                      sigon:entry
  0  43238                     sigon:return
  0  42968               rwlock_lock:return
  0  42974            rw_wrlock_impl:return
  0  42975     pthread_rwlock_wrlock:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  35744                     memset:entry
  0  41198                    memset:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37531      pthread_rwlock_unlock:entry
  0  37514              rw_write_held:entry
  0  37680                 mutex_held:entry
  0  43129                mutex_held:return
  0  42964             rw_write_held:return
  0  37789                      sigon:entry
  0  43238                     sigon:return
  0  42981     pthread_rwlock_unlock:return
  0  42612                  _findiop:return
  0  37123                   _endopen:entry
  0  37309                      _open:entry
  0  37951                     __open:entry
  0  43397                    __open:return
  0  42760                     _open:return
  0  37152                  _flockget:entry
  0  37670                 mutex_lock:entry
  0  37669            mutex_lock_impl:entry
  0  43118           mutex_lock_impl:return
  0  43119                mutex_lock:return
  0  42603                 _flockget:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42574                  _endopen:return
  0  42627                     fopen:return
  0  32623                   defopen:return
  0  32623                   defopen:return

#

This is interesting, but a little hard to sift through. Go back to dhcp.d, and add the following line:

#pragma D option flowindent

This sets a simple option that indents a code flow: probes named entry cause indentation to increase two spaces, and probes named return cause indentation to decrease two spaces. Running the new dhcp.d:

# ./dhcp.d
dtrace: script './dhcp.d' matched 10909 probes
CPU FUNCTION
  0  -> defopen
  0    -> _get_thr_data
  0      -> thr_getspecific
  0      <- thr_getspecific
  0    <- _get_thr_data
  0    -> fopen
  0      -> _findiop
  0        -> pthread_rwlock_wrlock
  0          -> rw_wrlock_impl
  0            -> rw_read_held
  0            <- rw_read_held
  0            -> rw_write_held
  0            <- rw_write_held
  0            -> rwlock_lock
  0              -> sigon
  0              <- sigon
  0            <- rwlock_lock
  0          <- rw_wrlock_impl
  0        <- pthread_rwlock_wrlock
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> memset
  0          <- memset
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> pthread_rwlock_unlock
  0          -> rw_write_held
  0            -> mutex_held
  0            <- mutex_held
  0          <- rw_write_held
  0          -> sigon
  0          <- sigon
  0        <- pthread_rwlock_unlock
  0      <- _findiop
  0      -> _endopen
  0        -> _open
  0          -> __open
  0          <- __open
  0        <- _open
  0        -> _flockget
  0          -> mutex_lock
  0            -> mutex_lock_impl
  0            <- mutex_lock_impl
  0          <- mutex_lock
  0        <- _flockget
  0        -> mutex_unlock
  0        <- mutex_unlock
  0      <- _endopen
  0    <- fopen
  0   | defopen:return
  0  <- defopen

#

Now it's much easier to see what exactly is going on. One thing I may be interested in doing is tracing a timestamp at each of these points -- to get an idea of how much time we're spending in each of these functions. To do this, copy dhcp.d to dhcptime.d, and modify it to look like this:

#!/usr/sbin/dtrace -s

#pragma D option flowindent

pid43::defopen:entry
{
        self->start = vtimestamp;
}

pid43:::entry,
pid43:::return
/self->start/
{
        trace(vtimestamp - self->start);
}

pid43::defopen:return
/self->start/
{
        self->start = 0;
        exit(0);
}

Running dhcptime.d:

# ./dhcptime.d
dtrace: script './dhcptime.d' matched 10909 probes
CPU FUNCTION
  0  -> defopen                                               0
  0    -> _get_thr_data                                    2276
  0      -> thr_getspecific                                6214
  0      <- thr_getspecific                                8445
  0    <- _get_thr_data                                   10512
  0    -> fopen                                           14296
  0      -> _findiop                                      34498
  0        -> pthread_rwlock_wrlock                       37822
  0          -> rw_wrlock_impl                            39265
  0            -> rw_read_held                            42514
  0            <- rw_read_held                            61148
  0            -> rw_write_held                           64573
  0            <- rw_write_held                           66630
  0            -> rwlock_lock                             69802
  0              -> sigon                                 89495
  0              <- sigon                                 92150
  0            <- rwlock_lock                             94463
  0          <- rw_wrlock_impl                            96615
  0        <- pthread_rwlock_wrlock                      114732
  0        -> getiop                                     117720
  0          -> mutex_trylock                            121825
  0          <- mutex_trylock                            141821
  0          -> mutex_unlock                             144477
  0          <- mutex_unlock                             146819
  0        <- getiop                                     149181
  0        -> getiop                                     150408
  0          -> mutex_trylock                            168143
  0          <- mutex_trylock                            169478
  0          -> mutex_unlock                             170679
  0          <- mutex_unlock                             171952
  0        <- getiop                                     173196
  0        -> getiop                                     174371
  0          -> mutex_trylock                            175493
  0          <- mutex_trylock                            176733
  0          -> mutex_unlock                             177897
  0          <- mutex_unlock                             179068
  0        <- getiop                                     180403
  0        -> getiop                                     181585
  0          -> mutex_trylock                            182721
  0          <- mutex_trylock                            184085
  0          -> memset                                   185393
  0          <- memset                                   186869
  0          -> mutex_unlock                             188002
  0          <- mutex_unlock                             189275
  0        <- getiop                                     190758
  0        -> pthread_rwlock_unlock                      211137
  0          -> rw_write_held                            212521
  0            -> mutex_held                             214849
  0            <- mutex_held                             217061
  0          <- rw_write_held                            218361
  0          -> sigon                                    237062
  0          <- sigon                                    238458
  0        <- pthread_rwlock_unlock                      241227
  0      <- _findiop                                     243451
  0      -> _endopen                                     263709
  0        -> _open                                      267861
  0          -> __open                                   271268
  0          <- __open                                   334013
  0        <- _open                                      336267
  0        -> _flockget                                  338543
  0          -> mutex_lock                               340668
  0            -> mutex_lock_impl                        342788
  0            <- mutex_lock_impl                        345111
  0          <- mutex_lock                               363767
  0        <- _flockget                                  365602
  0        -> mutex_unlock                               366861
  0        <- mutex_unlock                               368156
  0      <- _endopen                                     370564
  0    <- fopen                                          389523
  0   | defopen:return                                   391539
  0  <- defopen

#

The times in the above output are all in nanoseconds: the call to _get_thr_data came 2,276 nanoseconds after we entered defopen, the first call to thr_getspecific came 6,214 nanoseconds after, and so on. Note the big jump at __open: we spent over sixty microseconds (334,013 - 271,268 > 60,000) in that one function. What happened there? We went into the kernel, of course. (Remember, we found defopen by aggregating on the user stack from the open system call.) We may wish to know everything -- kernel and user -- that happens from defopen. To do this, go back to the original dhcp.d and change this:

pid43:::entry,
pid43:::return
/self->follow/
{}

To this:

pid43:::entry,
pid43:::return,
fbt:::
/self->follow/
{
        printf("  (%s)", probeprov);
}

The "fbt:::" line enables every probe from the fbt provider. Recall that this provider makes available a probe upon entry to and return from every function in the kernel. I've added an additional line that records the name of the current provider (the "probeprov" variable) to make clear if a probe is in the kernel or is in user-space. This script is really going to instrument the snot out of the system; running the new dhcp.d:

# ./dhcp.d
dtrace: script './dhcp.d' matched 42092 probes
CPU FUNCTION
  0  -> defopen                                 (pid43)
  0    -> _get_thr_data                         (pid43)
  0      -> thr_getspecific                     (pid43)
  0      <- thr_getspecific                     (pid43)
  0    <- _get_thr_data                         (pid43)
  0    -> fopen                                 (pid43)
  0      -> _findiop                            (pid43)
  0        -> pthread_rwlock_wrlock             (pid43)
  0          -> rw_wrlock_impl                  (pid43)
  0            -> rw_read_held                  (pid43)
  0            <- rw_read_held                  (pid43)
  0            -> rw_write_held                 (pid43)
  0            <- rw_write_held                 (pid43)
  0            -> rwlock_lock                   (pid43)
  0              -> sigon                       (pid43)
  0              <- sigon                       (pid43)
  0            <- rwlock_lock                   (pid43)
  0          <- rw_wrlock_impl                  (pid43)
  0        <- pthread_rwlock_wrlock             (pid43)
  0        -> getiop                            (pid43)
  0          -> mutex_trylock                   (pid43)
  0          <- mutex_trylock                   (pid43)
  0          -> mutex_unlock                    (pid43)
  0          <- mutex_unlock                    (pid43)
  0        <- getiop                            (pid43)
  0        -> getiop                            (pid43)
  0          -> mutex_trylock                   (pid43)
  0          <- mutex_trylock                   (pid43)
  0          -> mutex_unlock                    (pid43)
  0          <- mutex_unlock                    (pid43)
  0        <- getiop                            (pid43)
  0        -> getiop                            (pid43)
  0          -> mutex_trylock                   (pid43)
  0          <- mutex_trylock                   (pid43)
  0          -> mutex_unlock                    (pid43)
  0          <- mutex_unlock                    (pid43)
  0        <- getiop                            (pid43)
  0        -> getiop                            (pid43)
  0          -> mutex_trylock                   (pid43)
  0          <- mutex_trylock                   (pid43)
  0          -> memset                          (pid43)
  0          <- memset                          (pid43)
  0          -> mutex_unlock                    (pid43)
  0          <- mutex_unlock                    (pid43)
  0        <- getiop                            (pid43)
  0        -> pthread_rwlock_unlock             (pid43)
  0          -> rw_write_held                   (pid43)
  0            -> mutex_held                    (pid43)
  0            <- mutex_held                    (pid43)
  0          <- rw_write_held                   (pid43)
  0          -> sigon                           (pid43)
  0          <- sigon                           (pid43)
  0        <- pthread_rwlock_unlock             (pid43)
  0      <- _findiop                            (pid43)
  0      -> _endopen                            (pid43)
  0        -> _open                             (pid43)
  0          -> __open                          (pid43)
  0            -> syscall_mstate                (fbt)
  0              -> tsc_gethrtimeunscaled       (fbt)
  0              <- tsc_gethrtimeunscaled       (fbt)
  0            <- syscall_mstate                (fbt)
  0            -> open                          (fbt)
  0              -> copen                       (fbt)
  0                -> falloc                    (fbt)
  0                  -> ufalloc                 (fbt)
  0                    -> ufalloc_file          (fbt)
  0                      -> fd_find             (fbt)
  0                      <- fd_find             (fbt)
  0                      -> fd_reserve          (fbt)
  0                      <- fd_reserve          (fbt)
  0                    <- ufalloc_file          (fbt)
  0                  <- ufalloc                 (fbt)
  0                  -> kmem_cache_alloc        (fbt)
  0                  <- kmem_cache_alloc        (fbt)
  0                  -> crhold                  (fbt)
  0                  <- crhold                  (fbt)
  0                <- falloc                    (fbt)
  0                -> vn_openat                 (fbt)
  0                  -> lookupnameat            (fbt)
  0                    -> pn_get_buf            (fbt)
  0                    <- pn_get_buf            (fbt)
  0                    -> lookuppnat            (fbt)
  0                      -> lookuppnvp          (fbt)
  0                        -> pn_fixslash       (fbt)
  0                        <- pn_fixslash       (fbt)
  0                        -> pn_getcomponent   (fbt)
  0                        <- pn_getcomponent   (fbt)
  0                        -> fop_lookup        (fbt)
  0                          -> ufs_lookup      (fbt)
  0                            -> dnlc_lookup   (fbt)
  0                            <- dnlc_lookup   (fbt)
  0                            -> ufs_iaccess   (fbt)
  0                              -> crgetuid    (fbt)
  0                              <- crgetuid    (fbt)
  0                            <- ufs_iaccess   (fbt)
  0                          <- ufs_lookup      (fbt)
  0                        <- fop_lookup        (fbt)
  0                        -> vn_mountedvfs     (fbt)
  0                        <- vn_mountedvfs     (fbt)
  0                        -> vn_rele           (fbt)
  0                        <- vn_rele           (fbt)
  0                        -> pn_getcomponent   (fbt)
  0                        <- pn_getcomponent   (fbt)
  0                        -> fop_lookup        (fbt)
  0                          -> ufs_lookup      (fbt)
  0                            -> dnlc_lookup   (fbt)
  0                            <- dnlc_lookup   (fbt)
  0                            -> ufs_iaccess   (fbt)
  0                              -> crgetuid    (fbt)
  0                              <- crgetuid    (fbt)
  0                            <- ufs_iaccess   (fbt)
  0                          <- ufs_lookup      (fbt)
  0                        <- fop_lookup        (fbt)
  0                        -> vn_mountedvfs     (fbt)
  0                        <- vn_mountedvfs     (fbt)
  0                        -> vn_rele           (fbt)
  0                        <- vn_rele           (fbt)
  0                        -> pn_getcomponent   (fbt)
  0                        <- pn_getcomponent   (fbt)
  0                        -> fop_lookup        (fbt)
  0                          -> ufs_lookup      (fbt)
  0                            -> dnlc_lookup   (fbt)
  0                            <- dnlc_lookup   (fbt)
  0                            -> ufs_iaccess   (fbt)
  0                              -> crgetuid    (fbt)
  0                              <- crgetuid    (fbt)
  0                            <- ufs_iaccess   (fbt)
  0                          <- ufs_lookup      (fbt)
  0                        <- fop_lookup        (fbt)
  0                        -> vn_mountedvfs     (fbt)
  0                        <- vn_mountedvfs     (fbt)
  0                        -> vn_rele           (fbt)
  0                        <- vn_rele           (fbt)
  0                        -> pn_setlast        (fbt)
  0                        <- pn_setlast        (fbt)
  0                        -> vn_path           (fbt)
  0                        <- vn_path           (fbt)
  0                      <- lookuppnvp          (fbt)
  0                    <- lookuppnat            (fbt)
  0                  <- lookupnameat            (fbt)
  0                  -> fop_getattr             (fbt)
  0                    -> ufs_getattr           (fbt)
  0                    <- ufs_getattr           (fbt)
  0                  <- fop_getattr             (fbt)
  0                  -> fop_access              (fbt)
  0                    -> ufs_access            (fbt)
  0                      -> ufs_iaccess         (fbt)
  0                        -> crgetuid          (fbt)
  0                        <- crgetuid          (fbt)
  0                      <- ufs_iaccess         (fbt)
  0                    <- ufs_access            (fbt)
  0                  <- fop_access              (fbt)
  0                  -> fop_open                (fbt)
  0                    -> ufs_open              (fbt)
  0                    <- ufs_open              (fbt)
  0                    -> vn_rele               (fbt)
  0                    <- vn_rele               (fbt)
  0                  <- fop_open                (fbt)
  0                <- vn_openat                 (fbt)
  0                -> setf                      (fbt)
  0                  -> cv_broadcast            (fbt)
  0                  <- cv_broadcast            (fbt)
  0                <- setf                      (fbt)
  0              <- copen                       (fbt)
  0            <- open                          (fbt)
  0            -> syscall_mstate                (fbt)
  0              -> tsc_gethrtimeunscaled       (fbt)
  0              <- tsc_gethrtimeunscaled       (fbt)
  0            <- syscall_mstate                (fbt)
  0          <- __open                          (pid43)
  0        <- _open                             (pid43)
  0        -> _flockget                         (pid43)
  0          -> mutex_lock                      (pid43)
  0            -> mutex_lock_impl               (pid43)
  0            <- mutex_lock_impl               (pid43)
  0          <- mutex_lock                      (pid43)
  0        <- _flockget                         (pid43)
  0        -> mutex_unlock                      (pid43)
  0        <- mutex_unlock                      (pid43)
  0      <- _endopen                            (pid43)
  0    <- fopen                                 (pid43)
  0   | defopen:return                          (pid43)
  0  <- defopen

#

As the transition from pid43 to fbt in the above output indicates, this script shows us control flow across the user/kernel boundary. (At the risk of sounding too much like Ron Popeil, the above output is the clearest display of code flow across a protection boundary of any tool -- research or otherwise.)
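
If, beyond seeing the flow, you wanted to quantify how much of defopen's cost lies on the kernel side of that boundary, a small script -- again, not part of the demo, and with the pid (43) hard-coded from the example -- could time the open(2) system call itself using the syscall provider. A minimal sketch:

#pragma D option quiet

syscall::open:entry
/pid == 43/
{
        self->ts = vtimestamp;
}

syscall::open:return
/self->ts/
{
        /* total kernel-side time across all open(2) calls by pid 43, in nanoseconds */
        @["open(2) kernel time (ns)"] = sum(vtimestamp - self->ts);
        self->ts = 0;
}

Running something like this alongside dhcptime.d would show how much of the roughly sixty microseconds observed in __open is spent in the kernel itself.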

By this point in the demo, the audience is usually convinced that DTrace can instrument huge tracts of the system that were never before observable; the focus usually shifts from "can this do anything?" to "can this do anything for me?" The short answer is "yes" (of course), but a longer answer is merited. Unfortunately, this blog entry has already gone on long enough; I'll save the longer answer for another day...


1 For whatever it's worth, a bug has been filed on this issue. Once it's fixed, I will sadly have to find some other sub-optimal software with which to demonstrate DTrace. The good news is that such software isn't terribly hard to find.
