With my explanation of a demo gone wrong, several people have asked me the more general question: how does one demo DTrace? This question doesn’t have a single answer, especially given that DTrace is best demonstrated with ad hoc queries of the system. Indeed, the best demos are when someone in the audience shouts out their own question that they want to see answered: answering such a question instantly with DTrace nearly always blows away the questioner — who has presumably suffered in the past trying to answer similar questions. Of course, saying “why, there are many ways to demo DTrace!” is a useless answer to the question of how one demos DTrace; while there are many ways that one can demo DTrace, it’s useful to get a flavor of how a typical demo might go. So with the substantial caveat that this is not the way to demo DTrace, but merely a way, let me walk you through an example demo.
It all starts by running dtrace(1M) without any arguments:
# dtrace
Usage: dtrace [-32|-64] [-aACeEFGHlqSvVwZ] [-b bufsz] [-c cmd] [-D name[=def]]
        [-I path] [-o output] [-p pid] [-s script] [-U name]
        [-x opt[=val]] [-X a|c|s|t]
        [-P provider [[ predicate ] action ]]
        [-m [ provider: ] module [[ predicate ] action ]]
        [-f [[ provider: ] module: ] func [[ predicate ] action ]]
        [-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]]
        [-i probe-id [[ predicate ] action ]] [ args ... ]

        predicate -> '/' D-expression '/'
           action -> '{' D-statements '}'

        -32 generate 32-bit D programs and ELF files
        -64 generate 64-bit D programs and ELF files

        -a  claim anonymous tracing state
        -A  generate driver.conf(4) directives for anonymous tracing
        -b  set trace buffer size
        -c  run specified command and exit upon its completion
        -C  run cpp(1) preprocessor on script files
        -D  define symbol when invoking preprocessor
        -e  exit after compiling request but prior to enabling probes
        -E  exit after enabling probes but prior to tracing data
        -f  enable or list probes matching the specified function name
        -F  coalesce trace output by function
        -G  generate an ELF file containing embedded dtrace program
        -H  print included files when invoking preprocessor
        -i  enable or list probes matching the specified probe id
        -I  add include directory to preprocessor search path
        -l  list probes matching specified criteria
        -m  enable or list probes matching the specified module name
        -n  enable or list probes matching the specified probe name
        -o  set output file
        -p  grab specified process-ID and cache its symbol tables
        -P  enable or list probes matching the specified provider name
        -q  set quiet mode (only output explicitly traced data)
        -s  enable or list probes according to the specified D script
        -S  print D compiler intermediate code
        -U  undefine symbol when invoking preprocessor
        -v  set verbose mode (report program stability attributes)
        -V  report DTrace API version
        -w  permit destructive actions
        -x  enable or modify compiler and tracing options
        -X  specify ISO C conformance settings for preprocessor
        -Z  permit probe descriptions that match zero probes
I usually just point out that this gives you a typical Unix-ish help
message — implicitly making the point that dipping into DTrace doesn’t
require much more than typing its name on the command line.
From the output of the usage message,
I highlight the “-l” option, noting that it lists all the probes in the
system. I then run with that option — being sure to pipe the output to
more(1):
# dtrace -l | more
   ID   PROVIDER            MODULE            FUNCTION NAME
    1     dtrace                                       BEGIN
    2     dtrace                                       END
    3     dtrace                                       ERROR
    4   lockstat           genunix         mutex_enter adaptive-acquire
    5   lockstat           genunix         mutex_enter adaptive-block
    6   lockstat           genunix         mutex_enter adaptive-spin
    7   lockstat           genunix          mutex_exit adaptive-release
    8   lockstat           genunix       mutex_destroy adaptive-release
    9   lockstat           genunix      mutex_tryenter adaptive-acquire
   10   lockstat           genunix            lock_set spin-acquire
   11   lockstat           genunix            lock_set spin-spin
   12   lockstat           genunix        lock_set_spl spin-acquire
   13   lockstat           genunix        lock_set_spl spin-spin
   14   lockstat           genunix            lock_try spin-acquire
   15   lockstat           genunix          lock_clear spin-release
   16   lockstat           genunix     lock_clear_splx spin-release
   17   lockstat           genunix        CLOCK_UNLOCK spin-release
   18   lockstat           genunix            rw_enter rw-acquire
   19   lockstat           genunix            rw_enter rw-block
   20   lockstat           genunix             rw_exit rw-release
   21   lockstat           genunix         rw_tryenter rw-acquire
   22   lockstat           genunix       rw_tryupgrade rw-upgrade
   23   lockstat           genunix        rw_downgrade rw-downgrade
   24   lockstat           genunix         thread_lock thread-spin
   25   lockstat           genunix    thread_lock_high thread-spin
   26   fasttrap          fasttrap                     fasttrap
   27    syscall                                 nosys entry
   28    syscall                                 nosys return
   29    syscall                                 rexit entry
   30    syscall                                 rexit return
--More--
I point out that each of these rows is a point of instrumentation. I then explain each of the columns, explaining that the PROVIDER column denotes the DTrace notion of instrumentation providers, which allows for disjoint instrumentation techniques. If the audience is a particularly savvy one, I may point out that the lockstat provider is a recasting of the instrumentation technique used by lockstat(1M), and that lockstat has been made to be a DTrace consumer. I mention that the MODULE column contains the name of the kernel module (in the case of a kernel probe) or shared object name (in the case of a user-level probe), that the FUNCTION column contains the function that the probe is in, and that the NAME column contains the name of the probe. I explain that each probe is uniquely identified by the probe tuple: provider, module, function and name.
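(If the audience wants to see the tuple in action, the “-P” option from the usage message constrains the listing to a single provider. For instance, a quick way to list just the lockstat probes:

# dtrace -l -P lockstat

The “-m”, “-f” and “-n” options constrain the listing by the other elements of the tuple in the same way.)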
Scrolling down through the output, I point out the probes from the syscall provider, describing its ability to instrument every system call entry and return. (For some audiences, I may pause here to remind them of the definition of a system call.)
I then continue to scroll down my probe listing until I get to output that looks something like this:
...
 1268        fbt      unix               dvma_release entry
 1269        fbt      unix               dvma_release return
 1270        fbt      unix                 softlevel1 entry
 1271        fbt      unix                 softlevel1 return
 1272        fbt      unix                 freq_notsc entry
 1273        fbt      unix                 freq_notsc return
 1274        fbt      unix                  psm_unmap entry
 1275        fbt      unix                  psm_unmap return
 1276        fbt      unix      cpu_visibility_online entry
 1277        fbt      unix      cpu_visibility_online return
 1278        fbt      unix                  setbackdq entry
 1279        fbt      unix                  setbackdq return
 1280        fbt      unix                lgrp_choose entry
 1281        fbt      unix                lgrp_choose return
 1282        fbt      unix       cpu_intr_swtch_enter entry
 1283        fbt      unix       cpu_intr_swtch_enter return
 1284        fbt      unix               page_upgrade entry
 1285        fbt      unix               page_upgrade return
 1286        fbt      unix               page_lock_es entry
 1287        fbt      unix               page_lock_es return
 1288        fbt      unix        lgrp_shm_policy_set entry
 1289        fbt      unix        lgrp_shm_policy_set return
 1290        fbt      unix                 segkmem_gc entry
 1291        fbt      unix                 segkmem_gc return
 1292        fbt      unix               disp_anywork entry
 1293        fbt      unix               disp_anywork return
 1294        fbt      unix            prgetprxregsize entry
 1295        fbt      unix            prgetprxregsize return
 1296        fbt      unix                cpuid_pass1 entry
 1297        fbt      unix                cpuid_pass1 return
 1298        fbt      unix                cpuid_pass2 entry
 1299        fbt      unix                cpuid_pass2 return
 1300        fbt      unix                cpuid_pass3 entry
 1301        fbt      unix                cpuid_pass3 return
 1302        fbt      unix                cpuid_pass4 entry
 1303        fbt      unix                cpuid_pass4 return
 1304        fbt      unix          release_bootstrap entry
 1305        fbt      unix          release_bootstrap return
 1306        fbt      unix   i_ddi_rnumber_to_regspec entry
 1307        fbt      unix   i_ddi_rnumber_to_regspec return
 1308        fbt      unix             page_mem_avail entry
 1309        fbt      unix             page_mem_avail return
 1310        fbt      unix               page_pp_lock entry
 1311        fbt      unix               page_pp_lock return
...
I pause here to explain that the fbt provider can instrument every function entry and return in the kernel. Making the observation that there are quite a few functions in the kernel, I then quit out of the output, and re-run the command — this time piping the output through wc. Usually, the output is something like this:
# dtrace -l | wc -l
   32521
Anyone who has dealt with traditional instrumentation frameworks will typically express some shock at this — that there are (for example) 32,520 points of instrumentation on an optimized, stock system. Occasionally, someone who has perhaps heard of DTrace may pipe up: “Is that where that 30,000 number comes from that I’ve seen in the press?” The quick answer is that this is indeed where that number comes from — but the longer answer is that this is really only the beginning, because with the pid provider, DTrace can instrument every instruction in every running app. For perhaps obvious reasons, we create these pid probes lazily — we don’t want to clog up the system with millions of unused probes.
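(To make the pid provider concrete, one can ask DTrace to create and list the function-entry probes for a single process. A sketch, using a hypothetical process with pid 1234:

# dtrace -l -n 'pid$target:::entry' -p 1234 | wc -l

Even restricted to function entries in one process, the count will typically run well into the thousands.)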
After having listed all probes and counted them, my next invocation of dtrace is to list the probes yet again — but this time listing only probes that match the probe tuple “syscall:::entry”. I explain that this means I want to match probes from the syscall provider named entry — and that I don’t care about the module and function. This output is simply a shorter version of the previous listing:
# dtrace -l -n syscall:::entry
   ID   PROVIDER            MODULE            FUNCTION NAME
   27    syscall                                 nosys entry
   29    syscall                                 rexit entry
   31    syscall                               forkall entry
   33    syscall                                  read entry
   35    syscall                                 write entry
   37    syscall                                  open entry
   39    syscall                                 close entry
   41    syscall                                  wait entry
...
I then explain that to enable these probes (instead of just listing them), I merely need to omit the “-l”:
# dtrace -n syscall:::entry
dtrace: description 'syscall:::entry' matched 226 probes
CPU     ID                    FUNCTION:NAME
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    261                  sysconfig:entry
  0    261                  sysconfig:entry
  0    195                  sigaction:entry
  0    195                  sigaction:entry
  0    125                      ioctl:entry
  0    261                  sysconfig:entry
  0     61                        brk:entry
  0     61                        brk:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    403                    fstat64:entry
  0     61                        brk:entry
  0     61                        brk:entry
  0    403                    fstat64:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0     35                      write:entry
  0    125                      ioctl:entry
  0    403                    fstat64:entry
  0     35                      write:entry
  0    125                      ioctl:entry
  0    125                      ioctl:entry
  0    339                    pollsys:entry
  0     33                       read:entry
...
I let this run on the screen for a while, explaining that we’ve gone from an optimized, uninstrumented system call table to an instrumented one — and that we’re now seeing every system call as it happens on the box. I explain that (on the one hand) this is novel information: in Solaris, we haven’t had a way to get system call information for the entire system. (I often remind the audience that this may look similar to truss, but it’s really quite different: truss operates only on a single process, and has an enormous probe effect from constantly stopping and running the target process.)
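(For anyone who wants the truss-like view of a single process, DTrace can provide that too, without the stop-and-go probe effect. A sketch, again with a hypothetical pid:

# dtrace -n 'syscall:::entry/pid == $target/' -p 1234

The predicate restricts the enabling to the grabbed process, which continues to run at full speed except when a probe actually fires.)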
But while this information may be novel (at least for Solaris), it’s actually not that valuable — merely knowing that we executed these system calls is unlikely to be terribly helpful. Rather, we would like to know which applications are inducing these system calls. To do this, we add a clause in the DTrace control language, D. The clause we need to add is quite simple; we’re going to use the trace action to record the current application name, which is stored in the execname variable. So I ^C the previous enabling, and run the following:
# dtrace -n syscall:::entry'{trace(execname)}'
dtrace: description 'syscall:::entry' matched 226 probes
...
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0     33                       read:entry   Xsun
  0    233                     writev:entry   Xsun
  0    125                      ioctl:entry   xterm
  0     33                       read:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    313                lwp_sigmask:entry   Xsun
  0    339                    pollsys:entry   Xsun
  0     35                      write:entry   dtrace
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0     33                       read:entry   xterm
  0    125                      ioctl:entry   xterm
  0     35                      write:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    125                      ioctl:entry   xterm
  0     33                       read:entry   xterm
  0     35                      write:entry   xterm
  0    125                      ioctl:entry   xterm
  0    125                      ioctl:entry   xterm
  0    339                    pollsys:entry   xterm
  0    339                    pollsys:entry   xterm
  0    313                lwp_sigmask:entry   Xsun
...
This is clearly more useful: we can see which apps are performing system calls. But seeing this also leads to a very important observation, one that some heckler in a savvy audience may have already made by now: what we’re seeing in the above output is the system calls to write all of this gunk out to the terminal. That is, if we were to take this output and cut it, sort it, uniq it, and sort it again, we would come to the conclusion that dtrace, xterm and Xsun were the problem — regardless of what was being executed on the machine.
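(To make the traditional approach concrete, the post-processing might look something like the following sketch; the output file name is, of course, made up:

# dtrace -n syscall:::entry'{trace(execname)}' -o syscalls.out
^C
# awk '{print $NF}' syscalls.out | sort | uniq -c | sort -n

Every data point has to be formatted, written out and re-parsed just to be counted.)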
This is a serious problem, and the DTrace solution represents a substantial difference from virtually all earlier work in this domain: in DTrace, the aggregation of data is a first-class citizen. Instead of aggregating data postmortem with traditional Unix tools, DTrace allows data to be aggregated in situ. This aggregation can reduce the data flow by up to a factor of the number of data points — a tremendous savings of both space and time.
So to recast the example in terms of aggregations, we want to aggregate on the application name, with our aggregating action being to count:
# dtrace -n syscall:::entry'{@[execname] = count()}'
dtrace: description 'syscall:::entry' matched 226 probes
When we run the above, we don’t see any output — just that we matched some number of probes. I explain that behind the scenes, we’re just keeping a table of application name and system call count. When we ^C the above, we get the answer:
^C
  utmpd                                                             4
  automountd                                                        8
  inetd                                                             9
  svc.configd                                                       9
  syslogd                                                          16
  FvwmAuto                                                         39
  in.routed                                                        48
  svc.startd                                                       97
  MozillaFirebird-                                                125
  sendmail                                                        183
  fvwm2                                                           621
  dhcpagent                                                      1192
  dtrace                                                         2195
  xclock                                                         2894
  vim                                                            7760
  xterm                                                         11768
  Xsun                                                          24916
#
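(As an aside, one needn’t wait for the ^C to see aggregated data. A sketch of a variant that uses the profile provider’s tick-1sec probe to dump and reset the table every second:

# dtrace -n syscall:::entry'{@[execname] = count()}' -n tick-1sec'{printa(@); clear(@)}'

The printa action formats the aggregation; clear zeroes the counts, so each second’s output stands alone.)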
I often point out that in DTrace — as in good cocktail conversation — the answer to one question typically provokes the next question. Looking at the above output, we may have all sorts of questions. For example, we may well ask: what the hell is dhcpagent doing? To answer this question, we can add a predicate to our clause:
# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[probefunc] = count()}'
The predicate is contained between the forward slashes; the statements in the curly braces are only executed if the predicate evaluates to true (that is, to a non-zero value). And note that we have changed the tuple on which we are aggregating: instead of aggregating on execname (which, thanks to the predicate, will always be “dhcpagent”), we’re aggregating on the probefunc. For the syscall provider, the probefunc is the name of the system call.
Running the above for ten seconds or so, and then ^C'ing, yields the following (at least on my laptop):
^C
  pollsys                                                           3
  close                                                             3
  open                                                              3
  lwp_sigmask                                                       6
  fstat64                                                           6
  read                                                             12
  ioctl                                                            12
  llseek                                                           15
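(Aggregation keys needn’t be a single value, by the way. To keep both questions in play at once, one could aggregate on the tuple of application and system call; a minimal sketch:

# dtrace -n syscall:::entry'{@[execname, probefunc] = count()}'

Each distinct pair of application name and system call gets its own row in the resulting table.)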
Now, we may be interested in drilling down more into one of these system calls. Let’s say we want to know what dhcpagent is doing when it performs an open. We can effect this by narrowing our enabling a bit: instead of enabling all syscall probes named entry, we want to enable only those probes that are also from the function named open. We want to keep our predicate that the execname is dhcpagent, but this time we don’t want to aggregate on the probefunc (which, thanks to the narrowed enabling, will always be “open”); we want to aggregate on the name of the file that is being opened — which is the first argument to the open system call. To do this, we’ll aggregate on arg0. For slightly arcane reasons (arg0 is a pointer into the user process’s address space, and its contents must be copied into the kernel before DTrace can interpret them as a string), we need to use the copyinstr action to effect this. The enabling:
# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[copyinstr(arg0)] = count()}'
dtrace: description 'syscall::open:entry' matched 1 probe
Again, running the above for about ten seconds:
^C
  /etc/default/dhcpagent                                            4
This tells us that there were four calls to the open system call from dhcpagent — and that all four were to the same file. Just on the face of it, this is suspicious: why is dhcpagent opening the file “/etc/default/dhcpagent” with such frequency? Hasn’t whoever wrote this code ever heard of stat?1 To answer the former question, we’re going to change our enabling to aggregate on dhcpagent's stack backtrace when it performs an open. This is done with the ustack action. The new enabling:
# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[ustack()] = count()}'
dtrace: description 'syscall::open:entry' matched 1 probe
Running the above for ten seconds or so and then ^C'ing yields the following:
^C

              libc.so.1`__open+0xc
              libc.so.1`_endopen+0x92
              libc.so.1`fopen+0x26
              libcmd.so.1`defopen+0x40
              dhcpagent`df_get_string+0x27
              dhcpagent`df_get_int+0x1c
              dhcpagent`dhcp_requesting+0x248
              libinetutil.so.1`iu_expire_timers+0x8a
              libinetutil.so.1`iu_handle_events+0x95
              dhcpagent`main+0x300
              805314a
                4
This tells us that all four of the open calls were from the above stack trace.
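(Had the stacks been deeper or more varied, we could have bounded the aggregation by handing ustack an explicit frame count. A sketch that aggregates on only the top five frames:

# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[ustack(5)] = count()}'

This coalesces stacks that agree in their top five frames but differ further down.)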
From here we have many options, and our course of action would depend on how we wanted to investigate this. Up until this point, we have been instrumenting only the system call layer — the shim layer between applications and the kernel. We may wish to now instrument the application itself. To do this, we first need to know how many dhcpagent processes are contributing to the output we’re seeing above. There are lots of ways to do this; a simple way is to aggregate on pid:
# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[pid] = count()}'
dtrace: description 'syscall:::entry' matched 226 probes
^C
       43               80
This indicates that in the time between when we started the above and the time that we ^C'd, we saw 80 system calls from pid 43 — and that it was the only dhcpagent process that made any system calls. Now that we have the pid that we’re interested in, we can instrument the application. For starters, let’s just instrument the defopen function in libcmd.so.1. (We know that it’s being called because we saw it in the stack backtrace of the open system call.) To do this:
# dtrace -n pid43::defopen:entry
dtrace: description 'pid43::defopen:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  32622                    defopen:entry
  0  32622                    defopen:entry
  0  32622                    defopen:entry
  0  32622                    defopen:entry
^C
With this we have instrumented the running app. When we ^C'd, the app went back to being its usual optimized self. (Or at least, as optimized as dhcpagent ever is.)
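(Incidentally, hard-coding the pid into the provider name is fine for interactive work, but the same enabling can be written in terms of the grabbed process with the $target macro variable. A sketch:

# dtrace -n 'pid$target::defopen:entry' -p 43

The -p option grabs pid 43 and caches its symbol tables, just as the usage message promises, and $target expands to that pid.)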
If we wanted to trace both entry to and return from the defopen function, we could add another enabling to the mix:
# dtrace -n pid43::defopen:entry,pid43::defopen:return
dtrace: description 'pid43::defopen:entry,pid43::defopen:return' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0  32622                    defopen:entry
  0  32623                   defopen:return
  0  32622                    defopen:entry
  0  32623                   defopen:return
  0  32622                    defopen:entry
  0  32623                   defopen:return
  0  32622                    defopen:entry
  0  32623                   defopen:return
^C
But we may want to know more than this: we may want to know everything called from defopen. Answering this is a bit more sophisticated — and it’s a little too much for the command line. To answer this question, we head to an editor and type in the following DTrace script:
#!/usr/sbin/dtrace -s

pid43::defopen:entry
{
        self->follow = 1;
}

pid43:::entry,
pid43:::return
/self->follow/
{}

pid43::defopen:return
/self->follow/
{
        self->follow = 0;
        exit(0);
}
So what is this script doing? We’re enabling defopen again, but this time we’re setting a thread-local variable — which we’ve named “follow” — to 1. We then enable every function entry and return in the process, with the predicate that we’re only interested if our thread-local variable is non-zero. (The empty clause, the {}, means that we simply record the default output for each probe that fires.) This has the effect of only tracing function entry and return if the function call was induced by defopen. We’re only interested in capturing one of these traces, which is why we explicitly exit in the pid43::defopen:return clause.
Running the above is going to instrument the hell out of dhcpagent. We’re going to see that we matched many, many probes. Then we’ll see a bunch of output as defopen is called. Finally, we’ll be dropped back onto the shell prompt as we hit the exit action. By the time we see the shell again, we have restored the dhcpagent process to its uninstrumented self. Running the above, assuming it’s in dhcp.d:
# chmod +x dhcp.d
# ./dhcp.d
dtrace: script './dhcp.d' matched 10909 probes
CPU     ID                    FUNCTION:NAME
  0  32622                    defopen:entry
  0  33328              _get_thr_data:entry
  0  37823            thr_getspecific:entry
  0  43272           thr_getspecific:return
  0  38784             _get_thr_data:return
  0  37176                      fopen:entry
  0  37161                   _findiop:entry
  0  37525      pthread_rwlock_wrlock:entry
  0  37524             rw_wrlock_impl:entry
  0  37513               rw_read_held:entry
  0  42963              rw_read_held:return
  0  37514              rw_write_held:entry
  0  42964             rw_write_held:return
  0  37518                rwlock_lock:entry
  0  37789                      sigon:entry
  0  43238                     sigon:return
  0  42968               rwlock_lock:return
  0  42974            rw_wrlock_impl:return
  0  42975     pthread_rwlock_wrlock:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37174                     getiop:entry
  0  37674              mutex_trylock:entry
  0  43123             mutex_trylock:return
  0  35744                     memset:entry
  0  41198                    memset:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42625                    getiop:return
  0  37531      pthread_rwlock_unlock:entry
  0  37514              rw_write_held:entry
  0  37680                 mutex_held:entry
  0  43129                mutex_held:return
  0  42964             rw_write_held:return
  0  37789                      sigon:entry
  0  43238                     sigon:return
  0  42981     pthread_rwlock_unlock:return
  0  42612                  _findiop:return
  0  37123                   _endopen:entry
  0  37309                      _open:entry
  0  37951                     __open:entry
  0  43397                    __open:return
  0  42760                     _open:return
  0  37152                  _flockget:entry
  0  37670                 mutex_lock:entry
  0  37669            mutex_lock_impl:entry
  0  43118           mutex_lock_impl:return
  0  43119                mutex_lock:return
  0  42603                 _flockget:return
  0  37676               mutex_unlock:entry
  0  43125              mutex_unlock:return
  0  42574                  _endopen:return
  0  42627                     fopen:return
  0  32623                   defopen:return
  0  32623                   defopen:return
#
This is interesting, but a little hard to sift through. Go back to dhcp.d, and add the following line:
#pragma D option flowindent
This sets the flowindent option, which causes dtrace(1M) to indent function entries and unindent function returns, making the flow of control much easier to follow. Re-running dhcp.d:
# ./dhcp.d
dtrace: script './dhcp.d' matched 10909 probes
CPU FUNCTION
  0  -> defopen
  0    -> _get_thr_data
  0      -> thr_getspecific
  0      <- thr_getspecific
  0    <- _get_thr_data
  0    -> fopen
  0      -> _findiop
  0        -> pthread_rwlock_wrlock
  0          -> rw_wrlock_impl
  0            -> rw_read_held
  0            <- rw_read_held
  0            -> rw_write_held
  0            <- rw_write_held
  0            -> rwlock_lock
  0              -> sigon
  0              <- sigon
  0            <- rwlock_lock
  0          <- rw_wrlock_impl
  0        <- pthread_rwlock_wrlock
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> getiop
  0          -> mutex_trylock
  0          <- mutex_trylock
  0          -> memset
  0          <- memset
  0          -> mutex_unlock
  0          <- mutex_unlock
  0        <- getiop
  0        -> pthread_rwlock_unlock
  0          -> rw_write_held
  0            -> mutex_held
  0            <- mutex_held
  0          <- rw_write_held
  0          -> sigon
  0          <- sigon
  0        <- pthread_rwlock_unlock
  0      <- _findiop
  0      -> _endopen
  0        -> _open
  0          -> __open
  0          <- __open
  0        <- _open
  0        -> _flockget
  0          -> mutex_lock
  0            -> mutex_lock_impl
  0            <- mutex_lock_impl
  0          <- mutex_lock
  0        <- _flockget
  0        -> mutex_unlock
  0        <- mutex_unlock
  0      <- _endopen
  0    <- fopen
  0    | defopen:return
  0  <- defopen
#
Now it's much easier to see what exactly is going on. One thing I may be interested in doing is tracing a timestamp at each of these points -- to get an idea of how much time we're spending in each of these functions. To do this, copy dhcp.d to dhcptime.d, and modify it to look like this:
#!/usr/sbin/dtrace -s

#pragma D option flowindent

pid43::defopen:entry
{
        self->start = vtimestamp;
}

pid43:::entry,
pid43:::return
/self->start/
{
        trace(vtimestamp - self->start);
}

pid43::defopen:return
/self->start/
{
        self->start = 0;
        exit(0);
}
Running dhcptime.d:
# ./dhcptime.d
dtrace: script './dhcptime.d' matched 10909 probes
CPU FUNCTION
  0  -> defopen                                   0
  0    -> _get_thr_data                        2276
  0      -> thr_getspecific                    6214
  0      <- thr_getspecific                    8445
  0    <- _get_thr_data                       10512
  0    -> fopen                               14296
  0      -> _findiop                          34498
  0        -> pthread_rwlock_wrlock           37822
  0          -> rw_wrlock_impl                39265
  0            -> rw_read_held                42514
  0            <- rw_read_held                61148
  0            -> rw_write_held               64573
  0            <- rw_write_held               66630
  0            -> rwlock_lock                 69802
  0              -> sigon                     89495
  0              <- sigon                     92150
  0            <- rwlock_lock                 94463
  0          <- rw_wrlock_impl                96615
  0        <- pthread_rwlock_wrlock          114732
  0        -> getiop                         117720
  0          -> mutex_trylock                121825
  0          <- mutex_trylock                141821
  0          -> mutex_unlock                 144477
  0          <- mutex_unlock                 146819
  0        <- getiop                         149181
  0        -> getiop                         150408
  0          -> mutex_trylock                168143
  0          <- mutex_trylock                169478
  0          -> mutex_unlock                 170679
  0          <- mutex_unlock                 171952
  0        <- getiop                         173196
  0        -> getiop                         174371
  0          -> mutex_trylock                175493
  0          <- mutex_trylock                176733
  0          -> mutex_unlock                 177897
  0          <- mutex_unlock                 179068
  0        <- getiop                         180403
  0        -> getiop                         181585
  0          -> mutex_trylock                182721
  0          <- mutex_trylock                184085
  0          -> memset                       185393
  0          <- memset                       186869
  0          -> mutex_unlock                 188002
  0          <- mutex_unlock                 189275
  0        <- getiop                         190758
  0        -> pthread_rwlock_unlock          211137
  0          -> rw_write_held                212521
  0            -> mutex_held                 214849
  0            <- mutex_held                 217061
  0          <- rw_write_held                218361
  0          -> sigon                        237062
  0          <- sigon                        238458
  0        <- pthread_rwlock_unlock          241227
  0      <- _findiop                         243451
  0      -> _endopen                         263709
  0        -> _open                          267861
  0          -> __open                       271268
  0          <- __open                       334013
  0        <- _open                          336267
  0        -> _flockget                      338543
  0          -> mutex_lock                   340668
  0            -> mutex_lock_impl            342788
  0            <- mutex_lock_impl            345111
  0          <- mutex_lock                   363767
  0        <- _flockget                      365602
  0        -> mutex_unlock                   366861
  0        <- mutex_unlock                   368156
  0      <- _endopen                         370564
  0    <- fopen                              389523
  0    | defopen:return                      391539
  0  <- defopen
#
The times in the above are all in nanoseconds. That is, the call to _get_thr_data was 2,276 nanoseconds after we called defopen, the first call to thr_getspecific was 6,214 nanoseconds, and so on. Note the big jump in __open above: we spent over sixty microseconds (334,013 - 271,268 > 60,000) in that one function. What happened there? We went into the kernel of course. (Remember, we found defopen by aggregating on the user stack from the open system call.) We may wish to know everything -- kernel and user -- that happens from defopen. To do this, go back to the original dhcp.d and change this:
pid43:::entry,
pid43:::return
/self->follow/
{}
To this:
pid43:::entry,
pid43:::return,
fbt:::
/self->follow/
{
        printf(" (%s)", probeprov);
}
The "fbt:::" line enables every probe from the fbt provider. Recall that this provider makes available a probe upon entry to and return from every function in the kernel. I've added an additional line that records the name of the current provider (the "probeprov" variable) to make clear if a probe is in the kernel or is in user-space. This script is really going to instrument the snot out of the system; running the new dhcp.d:
# ./dhcp.d
dtrace: script './dhcp.d' matched 42092 probes
CPU FUNCTION
  0  -> defopen (pid43)
  0    -> _get_thr_data (pid43)
  0      -> thr_getspecific (pid43)
  0      <- thr_getspecific (pid43)
  0    <- _get_thr_data (pid43)
  0    -> fopen (pid43)
  0      -> _findiop (pid43)
  0        -> pthread_rwlock_wrlock (pid43)
  0          -> rw_wrlock_impl (pid43)
  0            -> rw_read_held (pid43)
  0            <- rw_read_held (pid43)
  0            -> rw_write_held (pid43)
  0            <- rw_write_held (pid43)
  0            -> rwlock_lock (pid43)
  0              -> sigon (pid43)
  0              <- sigon (pid43)
  0            <- rwlock_lock (pid43)
  0          <- rw_wrlock_impl (pid43)
  0        <- pthread_rwlock_wrlock (pid43)
  0        -> getiop (pid43)
  0          -> mutex_trylock (pid43)
  0          <- mutex_trylock (pid43)
  0          -> mutex_unlock (pid43)
  0          <- mutex_unlock (pid43)
  0        <- getiop (pid43)
  0        -> getiop (pid43)
  0          -> mutex_trylock (pid43)
  0          <- mutex_trylock (pid43)
  0          -> mutex_unlock (pid43)
  0          <- mutex_unlock (pid43)
  0        <- getiop (pid43)
  0        -> getiop (pid43)
  0          -> mutex_trylock (pid43)
  0          <- mutex_trylock (pid43)
  0          -> mutex_unlock (pid43)
  0          <- mutex_unlock (pid43)
  0        <- getiop (pid43)
  0        -> getiop (pid43)
  0          -> mutex_trylock (pid43)
  0          <- mutex_trylock (pid43)
  0          -> memset (pid43)
  0          <- memset (pid43)
  0          -> mutex_unlock (pid43)
  0          <- mutex_unlock (pid43)
  0        <- getiop (pid43)
  0        -> pthread_rwlock_unlock (pid43)
  0          -> rw_write_held (pid43)
  0            -> mutex_held (pid43)
  0            <- mutex_held (pid43)
  0          <- rw_write_held (pid43)
  0          -> sigon (pid43)
  0          <- sigon (pid43)
  0        <- pthread_rwlock_unlock (pid43)
  0      <- _findiop (pid43)
  0      -> _endopen (pid43)
  0        -> _open (pid43)
  0          -> __open (pid43)
  0            -> syscall_mstate (fbt)
  0              -> tsc_gethrtimeunscaled (fbt)
  0              <- tsc_gethrtimeunscaled (fbt)
  0            <- syscall_mstate (fbt)
  0            -> open (fbt)
  0              -> copen (fbt)
  0                -> falloc (fbt)
  0                  -> ufalloc (fbt)
  0                    -> ufalloc_file (fbt)
  0                      -> fd_find (fbt)
  0                      <- fd_find (fbt)
  0                      -> fd_reserve (fbt)
  0                      <- fd_reserve (fbt)
  0                    <- ufalloc_file (fbt)
  0                  <- ufalloc (fbt)
  0                  -> kmem_cache_alloc (fbt)
  0                  <- kmem_cache_alloc (fbt)
  0                  -> crhold (fbt)
  0                  <- crhold (fbt)
  0                <- falloc (fbt)
  0                -> vn_openat (fbt)
  0                  -> lookupnameat (fbt)
  0                    -> pn_get_buf (fbt)
  0                    <- pn_get_buf (fbt)
  0                    -> lookuppnat (fbt)
  0                      -> lookuppnvp (fbt)
  0                        -> pn_fixslash (fbt)
  0                        <- pn_fixslash (fbt)
  0                        -> pn_getcomponent (fbt)
  0                        <- pn_getcomponent (fbt)
  0                        -> fop_lookup (fbt)
  0                          -> ufs_lookup (fbt)
  0                            -> dnlc_lookup (fbt)
  0                            <- dnlc_lookup (fbt)
  0                            -> ufs_iaccess (fbt)
  0                              -> crgetuid (fbt)
  0                              <- crgetuid (fbt)
  0                            <- ufs_iaccess (fbt)
  0                          <- ufs_lookup (fbt)
  0                        <- fop_lookup (fbt)
  0                        -> vn_mountedvfs (fbt)
  0                        <- vn_mountedvfs (fbt)
  0                        -> vn_rele (fbt)
  0                        <- vn_rele (fbt)
  0                        -> pn_getcomponent (fbt)
  0                        <- pn_getcomponent (fbt)
  0                        -> fop_lookup (fbt)
  0                          -> ufs_lookup (fbt)
  0                            -> dnlc_lookup (fbt)
  0                            <- dnlc_lookup (fbt)
  0                            -> ufs_iaccess (fbt)
  0                              -> crgetuid (fbt)
  0                              <- crgetuid (fbt)
  0                            <- ufs_iaccess (fbt)
  0                          <- ufs_lookup (fbt)
  0                        <- fop_lookup (fbt)
  0                        -> vn_mountedvfs (fbt)
  0                        <- vn_mountedvfs (fbt)
  0                        -> vn_rele (fbt)
  0                        <- vn_rele (fbt)
  0                        -> pn_getcomponent (fbt)
  0                        <- pn_getcomponent (fbt)
  0                        -> fop_lookup (fbt)
  0                          -> ufs_lookup (fbt)
  0                            -> dnlc_lookup (fbt)
  0                            <- dnlc_lookup (fbt)
  0                            -> ufs_iaccess (fbt)
  0                              -> crgetuid (fbt)
  0                              <- crgetuid (fbt)
  0                            <- ufs_iaccess (fbt)
  0                          <- ufs_lookup (fbt)
  0                        <- fop_lookup (fbt)
  0                        -> vn_mountedvfs (fbt)
  0                        <- vn_mountedvfs (fbt)
  0                        -> vn_rele (fbt)
  0                        <- vn_rele (fbt)
  0                        -> pn_setlast (fbt)
  0                        <- pn_setlast (fbt)
  0                        -> vn_path (fbt)
  0                        <- vn_path (fbt)
  0                      <- lookuppnvp (fbt)
  0                    <- lookuppnat (fbt)
  0                  <- lookupnameat (fbt)
  0                  -> fop_getattr (fbt)
  0                    -> ufs_getattr (fbt)
  0                    <- ufs_getattr (fbt)
  0                  <- fop_getattr (fbt)
  0                  -> fop_access (fbt)
  0                    -> ufs_access (fbt)
  0                      -> ufs_iaccess (fbt)
  0                        -> crgetuid (fbt)
  0                        <- crgetuid (fbt)
  0                      <- ufs_iaccess (fbt)
  0                    <- ufs_access (fbt)
  0                  <- fop_access (fbt)
  0                  -> fop_open (fbt)
  0                    -> ufs_open (fbt)
  0                    <- ufs_open (fbt)
  0                    -> vn_rele (fbt)
  0                    <- vn_rele (fbt)
  0                  <- fop_open (fbt)
  0                <- vn_openat (fbt)
  0                -> setf (fbt)
  0                  -> cv_broadcast (fbt)
  0                  <- cv_broadcast (fbt)
  0                <- setf (fbt)
  0              <- copen (fbt)
  0            <- open (fbt)
  0            -> syscall_mstate (fbt)
  0              -> tsc_gethrtimeunscaled (fbt)
  0              <- tsc_gethrtimeunscaled (fbt)
  0            <- syscall_mstate (fbt)
  0          <- __open (pid43)
  0        <- _open (pid43)
  0        -> _flockget (pid43)
  0          -> mutex_lock (pid43)
  0            -> mutex_lock_impl (pid43)
  0            <- mutex_lock_impl (pid43)
  0          <- mutex_lock (pid43)
  0        <- _flockget (pid43)
  0        -> mutex_unlock (pid43)
  0        <- mutex_unlock (pid43)
  0      <- _endopen (pid43)
  0    <- fopen (pid43)
  0    | defopen:return (pid43)
  0  <- defopen
#
As the transition from pid43 to fbt in the above indicates, this script shows us flow control across the user/kernel boundary. (At the risk of sounding too much like Ron Popeil, the above output is the clearest display of code flow across a protection boundary from any tool -- research or otherwise.)
By this point in the demo, the audience is usually convinced that DTrace can instrument huge tracts of the system that were never before observable; the focus usually shifts from "can this do anything?" to "can this do anything for me?" The short answer is "yes" (of course), but a longer answer is merited. Unfortunately, this blog entry has already gone on long enough; I'll save the longer answer for another day...
1 For whatever it's worth, a bug has been filed on this issue. Once it's fixed, I will sadly have to find some other sub-optimal software with which to demonstrate DTrace. The good news is that such software isn't terribly hard to find.
2 Responses
“Once it’s fixed, I will sadly have to find some other sub-optimal software with which to demonstrate DTrace”. Bryan puts the fear of DTrace into software developers – fix your code or you’ll appear in the next DTrace demo!
More seriously, with a few alterations, I think the above would make for a good “quick introduction” guide or some kind of “understand your system in 5 minutes with DTrace” article.
This is the best explanation I have read regarding DTrace, and it is the first time I have actually considered it worth running the latest Solaris.
I think the link for stat may be incorrect. Right now, it links to p_online(2). You probably want to link to fstat(2)