Tales from a Core File


Category: Miscellaneous

We at Joyent use DTrace for understanding and debugging userland applications just as often as we do for the kernel. That is part of the reason why we’ve worked on things like flamegraphs, the Node.js ustack helper, and the integration of libusdt in Node modules like restify and bunyan.

I’ve just put back some work that makes observing userland with DTrace a lot simpler and much more powerful. Before we meet the devil in the details, let’s start with an example:

$ cat startd.d
pid$target::start_instance:entry,
pid$target::stop_instance:entry
{
        printf("%s: %s\n", probefunc == "start_instance" ?
            "starting" : "stopping", stringof(args[1]->ri_i.i_fmri));
}
$ dtrace -qs startd.d -p $(pgrep -o startd)
...
stopping: svc:/system/intrd:default
stopping: svc:/system/cron:default
start_instance:entry starting: svc:/system/cron:default
...

If you’re familiar with DTrace you might realize that this doesn’t really look like what you’re expecting! Hey Robert, won’t that script blow up without the copyin? Also where did the args[] come from with the pid provider?!

The answer to this minor mystery is that DTrace can now leverage type information in userland programs. Not only does the compiler know the size and layout of types, it’ll also take care of adding calls to copyin for you so you can dereference members without fear. To explain how we’ve managed all of this, we need to go into a bit of back story.

The Compiler is the Enemy

Since the beginning of programming, we’ve needed to be able to debug the programs that we’ve written. A large chunk of time has been spent on tooling to be able to understand and root cause these bugs whether working on a live system or doing a post-mortem analysis on something like a core dump.

Unfortunately, the compiler is in many ways our enemy. Its goal is to take whatever well-commented and understandable code we might have and transform it not only into something that a processor can understand, but often, through optimization passes, into something that no longer resembles what we originally wrote.

This problem isn’t limited to languages like C and C++. In fact, many of the same problems apply when you use any compiler, be it your compile-to-JavaScript language of the day (emscripten and coffeescript) or something like lex and yacc.

Fortunately, both the compiler and the linker are just software. Shortly after we first hit this problem, they were modified to produce debugging information that could be encoded into the binaries they produced. Folks were even able to encode this kind of information in the original ‘a.out’ executable format that came around in First Edition UNIX.

One of the first popular formats was called stabs. It was used on many operating systems for many years and you can still convince compilers like gcc, clang, and suncc to generate it. Since then, DWARF has become the most popular and commonly used format. DWARF originated at Bell Labs, but it was initially rather unpopular because the debugging data that it created was just too large. Later versions of DWARF have become more standardized and more compact than the first. However, it remains a rather complicated format.

With all of these formats there is a trade-off between expressibility and size. If the debugging information takes too much space, then people stop including it. Most available OS and package distributions do not incorporate debugging information. If you’re lucky, they separate that information into different packages. This means that when you’re debugging a problem in production you very often don’t have the very information you need. Even more frustrating, when this information is in a separate file and you’re trying to do post-mortem analysis, then you need to track that down and make sure that you have the right version of the debug information that corresponds to what you were using in production.

This situation is unsatisfying, but – we have other options! Sun developed CTF in Solaris 9 with the intent of using it with mdb and eventually DTrace. In illumos, we put CTF data in every kernel module, library, and a majority of applications. We don’t store all the information that you might get in, say, DWARF, but we store what we’ve found over the years to be the most useful information.

CTF data includes the following pieces of information:

  • The definitions of all types and structures
  • The arguments and types of each function
  • The types of function return values
  • The types of global variables
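
To make that list concrete, here is a small, hypothetical C file annotated with what a CTF container built from it would describe. The names (widget_t, widget_count, widget_area) are invented for illustration, and the comments map each declaration to one of the items above rather than to any on-disk CTF encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Type and structure definitions: member names, types, and offsets. */
typedef struct widget {
	int w_width;
	int w_height;
	const char *w_name;
} widget_t;

/* Global variables: CTF would record that widget_count is an int. */
int widget_count = 0;

/*
 * Function arguments and return values: CTF would record that
 * widget_area() takes a const widget_t * and returns a long.
 */
long
widget_area(const widget_t *w)
{
	assert(w != NULL);
	widget_count++;
	return ((long)w->w_width * w->w_height);
}
```

None of the function bodies or comments end up in CTF; only the type information does, which is a large part of why it stays small.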

All of the CTF data for a given library, binary, or kernel module is found inside of what we call a CTF Container. The CTF container is found as its own section in an ELF object. A simple way to see if something in question has CTF is to run elfdump(1). Here’s an example:

$ elfdump /lib/libc.so | grep SUNW_ctf
Section Header[37]:  sh_name: .SUNW_ctf

If a library or program does not have CTF data then the section won’t show up in the list and the grep should turn up empty.

CTF and DTrace

If you’ve ever wondered how it is that DTrace knows types when you use various providers, like args[] in fbt, the answer to that is CTF. When you run DTrace, it loads relevant CTF containers for the kernel. In fact, even the basic types that D provides, such as an int or types that you define in a D script, end up in a CTF container. Consider the following dtrace invocation:

# dtrace -l -v -n 'fbt::ip_input:entry'
   ID   PROVIDER            MODULE                          FUNCTION NAME
37546        fbt                ip                          ip_input entry

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: ISA

        Argument Types
                args[0]: struct ill_s *
                args[1]: ill_rx_ring_t *
                args[2]: mblk_t *
                args[3]: struct mac_header_info_s *

The D compiler used its CTF data for the ip module to determine the arguments and their types. We can then run something like:

# dtrace -qn 'fbt::ip_input:entry{ print(*args[0]); exit(0) }'
...
struct ill_s {
    pfillinput_t ill_inputfn = ip`ill_input_short_v4
    ill_if_t *ill_ifptr = 0xffffff0148a27ab8
    queue_t *ill_rq = 0xffffff014b65ba60
    queue_t *ill_wq = 0xffffff014b65bb58
    int ill_error = 0
    ipif_t *ill_ipif = 0xffffff014b6fc460
    uint_t ill_ipif_up_count = 0x1
    uint_t ill_max_frag = 0x5dc
    uint_t ill_current_frag = 0x5dc
    uint_t ill_mtu = 0x5dc
    uint_t ill_mc_mtu = 0x5dc
    uint_t ill_metric = 0
    char *ill_name = 0xffffff014bc642c8
...
}

DTrace uses the CTF data for the struct ill_s to interpret all of the data and correlate it with the appropriate members.

Bringing it to Userland

While DTrace happily consumes all of the CTF data for the various kernel modules, up to now it simply ignored all of the CTF data in userland applications. With my changes, DTrace will now consume that CTF data for referenced processes. Let’s go back to the example that we opened this blog post with. If we list those probes verbosely we see:

# dtrace -l -v -n 'pid$target::stop_instance:entry,pid$target::start_instance:entry' -p $(pgrep -o startd)
   ID   PROVIDER            MODULE                          FUNCTION NAME
62420       pid8        svc.startd                     stop_instance entry

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Types
                args[0]: userland scf_handle_t *
                args[1]: userland restarter_inst_t *
                args[2]: userland stop_cause_t

62419       pid8        svc.startd                    start_instance entry

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Types
                args[0]: userland scf_handle_t *
                args[1]: userland restarter_inst_t *
                args[2]: userland int32_t

Before this change, the Argument Types section would be empty. Because svc.startd has CTF data, DTrace was able to figure out the types of startd’s functions. Without these changes, you’d have to manually define the types of a scf_handle_t and a restarter_inst_t and cast the raw arguments to the correct type. If you ended up with a structure that has a lot of nested structures, defining all of them in D can quickly become turtles all the way down.

Look Ma, no copyin!

DTrace runs in a special context in the kernel, and often times DTrace requires you to think about what’s going on. Just as the kernel can’t access arbitrary user memory without copying it in, neither can DTrace. Consider the following classic one liner:

dtrace -n 'syscall::open:entry{ trace(copyinstr(arg0)); }'

You’ll note that we have to use copyinstr. That tells DTrace that we need to copy in the string from userland into the kernel in order to do something with it, whether that be an aggregation, printing it out, or saving it for some later action. This copyin isn’t limited to just strings. If you wanted to dereference some member of a structure, you’d have to either copy in the full structure, or manually determine the offset location of the member you care about.

At the previous illumos hackathon, Adam Leventhal had the idea of introducing a keyword into D, the DTrace language, that would tell the D compiler that a type is from userland. The D compiler would then take care of copying in the data automatically. Together we built a working prototype, with the keyword userland.

While certainly useful on its own, the keyword really shines when combined with CTF data, as in the pid provider. The pid provider automatically applies the userland keyword to all of the types that are found in args[]. This allows us to skip the copyin of intermediate structures and write a simple expression. For example, in our initial example we are able to do something that looks like a normal dereference in C: args[1]->ri_i.i_fmri. Before this change, you would have had to do three copyins: one for args[1], one for ri_i, and a final one for the string i_fmri.

As an example of the kinds of scripts that motivated this, here’s a portion of a D script that I used to help debug part of an issue inside of QEMU and KVM:

$ cat event.d
...
/* After its mfence */
pid$target::vring_notify:entry
/self->trace/
{
    self->arg = arg1;
    this->data = (char *)(copyin(self->arg + 0x28, 8));
    self->sign = *(uint16_t *)(this->data+0x2);
}

...

pid$target::vring_notify:return
/self->trace && self->arg/
{
    this->data =  (char *)(copyin(self->arg + 0x28, 8));
    printf("%d notify signal index: 0x%04x notify? %d\n", timestamp,
        *(uint16_t *)(this->data + 0x2), arg1);
}
...

There are many parts of this script where I’ve had to manually encode structure offsets and structure sizes. In the larger script, I had to play games with looking at registers and the pid provider’s ability to instrument arbitrary assembly instructions. I for one am glad that I’ll have to write a lot fewer of these.
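
Constants like 0x28 and 0x2 in the script above are structure offsets worked out by hand. Here is a minimal C sketch of where such numbers come from. It assumes a hypothetical demo_vq_t layout (not QEMU’s real structure), and fake_copyin() is a stand-in for DTrace’s copyin that simply copies from an ordinary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, chosen only so the interesting bytes sit at 0x28. */
typedef struct demo_vq {
	uint64_t dv_pad[5];	/* 0x00 - 0x27: fields we do not care about */
	uint16_t dv_flags;	/* 0x28 */
	uint16_t dv_signal;	/* 0x2a, that is 0x28 + 0x2 */
	uint32_t dv_unused;	/* 0x2c */
} demo_vq_t;

/* With type information, the compiler computes the offsets for us. */
_Static_assert(offsetof(demo_vq_t, dv_flags) == 0x28, "flags offset");
_Static_assert(offsetof(demo_vq_t, dv_signal) == 0x28 + 0x2, "signal offset");

/* Stand-in for copyin(): copy len bytes starting at base + off. */
static void
fake_copyin(const void *base, size_t off, void *dst, size_t len)
{
	memcpy(dst, (const char *)base + off, len);
}

/* The manual approach: copy 8 bytes at 0x28, then read 2 bytes at +0x2. */
uint16_t
signal_by_offset(const void *vq)
{
	unsigned char data[8];
	uint16_t sig;

	assert(vq != NULL);
	fake_copyin(vq, 0x28, data, sizeof (data));
	memcpy(&sig, data + 0x2, sizeof (sig));
	return (sig);
}

/* The typed approach: name the member and let the layout do the work. */
uint16_t
signal_by_member(const demo_vq_t *vq)
{
	assert(vq != NULL);
	return (vq->dv_signal);
}
```

Both functions read the same two bytes. The difference is that the offset-based version silently breaks if the structure changes, while the member-based version follows the type definition.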

When you have no CTF

While no binary should be left behind, not everything has CTF data today. But the userland keyword can still be incredibly useful even without it. Whenever you’re making a cast, you can note that the type is a userland type with the userland keyword, and the D compiler will do all the heavy lifting from there.

Here’s an example from a program that has a traditional linked list, but doesn’t have any CTF data:

struct foo;

typedef struct foo {
        struct foo *foo_next;
        char *foo_value;
} foo_t;

pid$target::print_head:entry
{
        this->p = (userland foo_t *)arg0;
        trace(stringof(this->p->foo_next->foo_next->foo_value));
}

The nice thing with the userland keyword is that you don’t have to do any copyin or worry about figuring out structure sizes. The goal with all of this is to make it simpler and more intuitive to write D scripts.
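
For comparison, here is roughly what chasing that list by hand involves, simulated in plain C. This is a sketch under stated assumptions, not what the D compiler emits: user_mem is a pretend user address space, a "user address" is just an offset into it, and sim_copyin() and sim_copyinstr() are hypothetical stand-ins for copyin and copyinstr.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	USER_MEM_SIZE	256

/* Simulated user address space; address 0 plays the role of NULL. */
static unsigned char user_mem[USER_MEM_SIZE];

/* Layout of foo_t as seen in the simulated 64-bit process. */
typedef struct user_foo {
	uint64_t foo_next;	/* user address of the next node */
	uint64_t foo_value;	/* user address of a NUL-terminated string */
} user_foo_t;

/* Stand-in for copyin(): returns -1 if the range is out of bounds. */
static int
sim_copyin(uint64_t uaddr, void *dst, size_t len)
{
	if (uaddr > USER_MEM_SIZE || len > USER_MEM_SIZE - uaddr)
		return (-1);
	memcpy(dst, user_mem + uaddr, len);
	return (0);
}

/* Stand-in for copyinstr(): copy up to the NUL or until dst is full. */
static int
sim_copyinstr(uint64_t uaddr, char *dst, size_t dstlen)
{
	size_t i;

	assert(dstlen > 0);
	for (i = 0; i < dstlen - 1; i++) {
		if (sim_copyin(uaddr + i, &dst[i], 1) != 0)
			return (-1);
		if (dst[i] == '\0')
			return (0);
	}
	dst[i] = '\0';
	return (0);
}

/* Manual version of p->foo_next->foo_next->foo_value. */
int
third_value(uint64_t head, char *dst, size_t dstlen)
{
	user_foo_t node;
	uint64_t addr = head;
	int hop;

	/* Two hops along foo_next, one copyin per node visited ... */
	for (hop = 0; hop < 2; hop++) {
		if (addr == 0 || sim_copyin(addr, &node, sizeof (node)) != 0)
			return (-1);
		addr = node.foo_next;
	}
	/* ... then copy in the third node and finally its string. */
	if (addr == 0 || sim_copyin(addr, &node, sizeof (node)) != 0)
		return (-1);
	return (sim_copyinstr(node.foo_value, dst, dstlen));
}
```

Every arrow in the D expression corresponds to one of those explicit copies, plus the bounds and NULL checks that are easy to forget when writing them by hand.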

Referring to types

As a part of all this, you can refer to types in arbitrary processes that are running on the system, as well as the target. The syntax is flexible: you can fully qualify a type with the pid, link map, library, and type name, or you can specify just the pid and the type that you want. While you can’t refer to macros inside these definitions, you can use the shorthand pid` to refer to the value of $target.

For example, say you wanted to refer to a glob_t, which on illumos is defined via typedef struct glob_t { ... } glob_t. There are a lot of different ways that you can do that. The following are all equivalent:

dtrace -n 'BEGIN{ trace((pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((pid8`LM0`libc.so.1`glob_t *)0); }'

dtrace -n 'BEGIN{ trace((struct pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct pid8`LM0`libc.so.1`glob_t *)0); }'

All of these would also work with the userland keyword. The userland keyword interacts with structs a bit differently than one might expect, so let’s show all of our above examples with the userland keyword:

dtrace -n 'BEGIN{ trace((userland pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((userland pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((userland pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((userland pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((userland pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((userland pid8`LM0`libc.so.1`glob_t *)0); }'

dtrace -n 'BEGIN{ trace((struct userland pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct userland pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct userland pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct userland pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct userland pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct userland pid8`LM0`libc.so.1`glob_t *)0); }'

What’s next?

From here you can get started with userland CTF and the userland keyword in DTrace in current releases of SmartOS. They’ll be making their way to an illumos distribution near you some time soon.

Now that we have this, we’re starting to make plans for the future. One idea that Adam had is to make it easy to scope a structure definition to a target process’s data model.

Another big task that we have is to make it easy to get CTF into programs and ideally, make it invisible to anyone using any compiler tool chain on illumos!

With that, happy tracing!

Last October, the first illumos hack-a-thon took place. Out of that a lot of interesting things were done and have since been integrated into illumos. Two of the more interesting gems were Adam Leventhal and Matt Ahrens adding dtrace -x temporal and Eric Schrock adding the DTrace print() action. Already print() is in the ranks of things where once you have it you really miss it when you don’t. During the hack-a-thon I had the chance to work with Matt Amdur. Together we worked on another one of those nice to haves that has finally landed in illumos: tab completion for mdb.

md-what?

For those who have never used it, mdb is the Modular Debugger that comes as a part of illumos and was originally developed for Solaris 8. mdb is primarily used for post-mortem analysis of userland programs and the kernel, and as a kernel debugger. mdb isn’t a source level debugger, but it works quite well on core dumps from userland, inspects and modifies live kernel state without stopping the system, and provides facilities for traditional debugging where a program is stopped, stepped over, and inspected. mdb replaced adb which came out of AT&T. While mdb isn’t 100% compatible with adb, it does remind you that there’s ‘No algol 68 here’. For the full history, take a look at Mike Shapiro’s talk that he gave at the Brown CS 37th IPP Symposium.

One of the more useful pieces of mdb is its module API which allows other modules to deliver specifically tailored commands and walkers. This is used for everything from the V8 JavaScript engine to understanding cyclics. Between that, pipelines, and other niceties, there isn’t too much else you could want from your debugger.

What’s involved

The work that we’ve done falls into three parts:

  • A tab completion engine.
  • Changes to the module API and several new functions to allow dcmds
    to implement their own tab completion.
  • Tab completion support for several dcmds

Thanks to CTF data in the kernel, we can tab complete everything from walker names, to types and their members. We went and added tab completion to the following dcmds:

  • ::walk
  • ::sizeof
  • ::list
  • ::offsetof
  • ::print
  • The dcmds themselves

Seeing is believing: Tab completion in action

Completing dcmds

> ::pr[tab]
print
printf
project
prov_tab
prtconf
prtusb

Completing walkers

> ::walk ar[tab]
arc_buf_hdr_t
arc_buf_t
> ::walk arc_buf_
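
In the walker example, the input is extended to arc_buf_, the longest prefix that the two candidates share. A minimal sketch of that behavior in C follows. complete_prefix() is a hypothetical helper for illustration, not mdb’s actual completion engine or part of its module API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write into out the longest string shared by every candidate that
 * starts with prefix. Returns the number of matching candidates; out
 * is left empty when nothing matches.
 */
size_t
complete_prefix(const char *const *names, size_t nnames, const char *prefix,
    char *out, size_t outlen)
{
	size_t plen = strlen(prefix);
	size_t nmatch = 0, common = 0, i, j;
	const char *first = NULL;

	assert(outlen > 0);
	out[0] = '\0';

	for (i = 0; i < nnames; i++) {
		if (strncmp(names[i], prefix, plen) != 0)
			continue;
		if (nmatch++ == 0) {
			/* First match: everything is shared so far. */
			first = names[i];
			common = strlen(first);
			continue;
		}
		/* Shrink the shared length to where this name diverges. */
		for (j = 0; j < common && names[i][j] == first[j]; j++)
			;
		common = j;
	}

	if (nmatch > 0) {
		if (common > outlen - 1)
			common = outlen - 1;
		memcpy(out, first, common);
		out[common] = '\0';
	}
	return (nmatch);
}
```

With a single match the shared prefix is the whole name, so the same routine covers both the "finish the word" and "extend as far as possible" cases.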

Completing types

> ::sizeof struct dt[tab]
struct dtrace_actdesc
struct dtrace_action
struct dtrace_aggbuffer
struct dtrace_aggdesc
struct dtrace_aggkey
struct dtrace_aggregation
struct dtrace_anon
struct dtrace_argdesc
struct dtrace_attribute
struct dtrace_bufdesc
struct dtrace_buffer
struct dtrace_conf
struct dtrace_cred
struct dtrace_difo
struct dtrace_diftype
struct dtrace_difv
struct dtrace_dstate
struct dtrace_dstate_percpu
struct dtrace_dynhash
struct dtrace_dynvar
struct dtrace_ecb
struct dtrace_ecbdesc
struct dtrace_enabling
struct dtrace_eprobedesc
struct dtrace_fmtdesc
struct dtrace_hash
...

Completing members

> ::offsetof zio_t io_v[tab]
io_vd
io_vdev_tree
io_vsd
io_vsd_ops

Walking across types with ::print

> p0::print proc_t p_zone->zone_n[tab]
zone_name
zone_ncpus
zone_ncpus_online
zone_netstack
zone_nlwps
zone_nlwps_ctl
zone_nlwps_lock
zone_nodename
zone_nprocs
zone_nprocs_ctl
zone_nprocs_kstat
zone_ntasks

In addition, just as you can walk across structure (.) and array ([]) dereferences in ::print invocations, you can also do the same with tab completion.

What’s next?

Now that the mdb tab completion change is in illumos, there’s already some work underway to add completion backends to more dcmds, including:

  • ::printf
  • ::help
  • ::bp

What else would you like to see? Let us know in a comment or better yet, go ahead and implement it yourself!

One of the challenges when using any Operating System is answering the question ‘Is my hardware supported?’. To track this down, you often have to scour Internet sites, hoping someone else has already asked the question, or do other, more horrible machinations – or ask someone like me. If you’re running on an illumos-based system like SmartOS, OmniOS, or OpenIndiana, this just got a lot easier: I’ve created the list. Better yet, I’ve created a tool to automatically create the list.

The List

illumos now has a hardware compatibility list (HCL) available at http://illumos.org/hcl.

This list contains all the PCI and PCI Express devices that should work. If your PCI device isn’t listed there, don’t fret, it may still work. This list is a first strike at the problem of hardware compatibility, so things like specific motherboards aren’t listed there.

How it’s generated

The great thing about this list is that it’s automatically generated from the source code in illumos itself. Each driver on the system has a manifest that specifies what PCI IDs it supports. We parse each of these manifests and look up the names using the PCI ID Database, using a small library that I wrote. From there, we automatically generate the static web page that can be deployed. Thanks to K. Adam White for his invaluable help to stop me from fumbling around too much with front end web code and the others who have already come in and improved it.
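
As a rough illustration of the parsing step, the sketch below pulls a vendor and device ID out of a single alias string. It assumes the simple pciVVVV,DDDD form (for example pci8086,100e) with hexadecimal IDs. parse_pci_alias() is a hypothetical helper, not the code that generates the HCL, and any other alias form is rejected rather than handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse "pciVVVV,DDDD" into numeric vendor and device IDs.
 * Returns 0 on success and -1 if the string has any other shape.
 */
int
parse_pci_alias(const char *alias, unsigned int *vendor, unsigned int *device)
{
	int consumed = 0;

	assert(alias != NULL && vendor != NULL && device != NULL);

	/* %n records how much input was used so trailing text is caught. */
	if (sscanf(alias, "pci%4x,%4x%n", vendor, device, &consumed) != 2)
		return (-1);
	if (alias[consumed] != '\0')
		return (-1);
	return (0);
}
```

The resulting pair of numbers is what you would then look up in the PCI ID Database to get human-readable names.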

All the code is available on github. The goal is for all of this to eventually be a part of illumos-gate itself. If you have improvements or want to make the web page more visually appealing, we’d all welcome the contribution.

Last Monday was the illumos hack-a-thon. There, I worked with Matt Amdur on adding tab completion support to mdb, the illumos modular debugger. The hack-a-thon was wildly successful and a lot of fun. Over the course of the next few days, I hope to put together an entry on the hack-a-thon that gives an overview of the projects that were worked on. During the hack-a-thon, Matt and I created a working prototype that would complete the types and members using ::print, but there was still some good work for us to do. One of the challenges that we were facing was some unexpected behavior whenever the mdb pager was invoked. We were seeing different behavior depending on which action you took from the pager: continue, quit, or abort.

If you take a look at the source code, you’ll see that sometimes we’ll leave this function by calling longjmp(3c). There are a lot of different places where we call setjmp(3c) and sigsetjmp(3c) in mdb, so tracking down where we were going just based on looking at the source code would be tricky. So, we want to answer the question, where are we jumping to? There are a few different ways we can do this (and more that aren’t listed):

  1. Inspect the source code
  2. Use mdb to debug mdb
  3. Use the DTrace pid provider and trace a certain number of instructions before we assume we’ve gotten there
  4. Use the DTrace pid provider and look at the jmp_buf to get the address of where we were jumping

Ultimately, I decided to go with option four, knowing that I would have to solve this problem again at some point in the future. The first step is to look at the definition of jmp_buf. For the full definition take a look at setjmp_iso.h. Here’s the snippet that actually defines the type:

#if defined(__i386) || defined(__amd64) || \
	defined(__sparc) || defined(__sparcv9)
#if defined(_LP64) || defined(_I32LPx)
typedef long	jmp_buf[_JBLEN];
#else
typedef int	jmp_buf[_JBLEN];
#endif
#else
#error "ISA not supported"
#endif

Basically, the jmp_buf is just an array where we store some of the registers. Unfortunately this isn’t sufficient to figure out where to go. So instead, we need to take a look at the implementation. setjmp is implemented in assembly for the particular architecture. Here it is for x86 and amd64. Now that we have the implementation, let’s figure out what to do. As a heads up, if you’re looking at any of these .s files, the numbers are actually in base 10, which is different from what you get when you look at the mdb output which has them in hex. Let’s take a quick look at the longjmp source for a 32-bit system and dig into what’s going on and how we know what to do:

	ENTRY(longjmp)
	movl	4(%esp),%edx	/ first parameter after return addr
	movl	8(%esp),%eax	/ second parameter
	movl	0(%edx),%ebx	/ restore ebx
	movl	4(%edx),%esi	/ restore esi
	movl	8(%edx),%edi	/ restore edi
	movl	12(%edx),%ebp	/ restore caller's ebp
	movl	16(%edx),%esp	/ restore caller's esp

	movl	24(%edx), %ecx
	test	%ecx, %ecx	/ test flag word
	jz	1f
	xorl	%ecx, %ecx	/ if set, clear ul_siglink
	movl	%ecx, %gs:UL_SIGLINK
1:
	test	%eax,%eax	/ if val != 0
	jnz	1f		/ 	return val
	incl	%eax		/ else return 1
1:
	jmp	*20(%edx)	/ return to caller
	SET_SIZE(longjmp)

The function is pretty well commented, so we can follow along pretty easily. Basically we load the address of the jmp_buf that was passed in into %edx, and the final jmp goes to the address stored 20 bytes (0x14) into it. So now we know exactly where the address we’re returning to lives. With this in hand, we only have two tasks left: transforming this address into a function and offset, and doing this all with a simple DTrace script. Solving the first problem is actually pretty easy. We can just use the DTrace uaddr function, which will translate the address into a symbol and offset for us. The script itself is now an exercise in copyin and arithmetic. Here’s the main part of the script:

/*
 * Given a sigbuf translate that into where the longjmp is taking us.
 *
 * On i386 the address is 0x14 into the jmp_buf.
 * On amd64 the address is 0x38 into the jmp_buf.
 */

pid$1::longjmp:entry
{
        uaddr(curpsinfo->pr_dmodel == PR_MODEL_ILP32 ?
            *(uint32_t *)copyin(arg0 + 0x14, sizeof (uint32_t)) :
            *(uint64_t *)copyin(arg0 + 0x38, sizeof (uint64_t)));
}

Now, if we run this, here’s what we get:

[root@bh1-build2 (bh1) /var/tmp]# dtrace -s longjmp.d $(pgrep -z rm mdb)
dtrace: script 'longjmp.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
  8  69580                    longjmp:entry   mdb`mdb_run+0x38

Now we know exactly where we ended up after the longjmp(), and this method will work on both 32-bit and 64-bit x86 systems. If you’d like to download the script, you can just download it from here.
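
The lookup that the script performs can also be modeled in a few lines of C. This is a sketch that operates on a raw byte copy of a jmp_buf rather than the host’s own, since the offsets (0x14 and 0x38) are the illumos x86 values from the script above and other platforms lay the buffer out differently. jmpbuf_target() and the JB_PC_OFF_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets of the saved return address, as given in the script. */
#define	JB_PC_OFF_ILP32	0x14
#define	JB_PC_OFF_LP64	0x38

/*
 * Return the jump target stored in a raw copy of a jmp_buf. is_lp64
 * selects the 64-bit layout; a 32-bit value is widened to 64 bits.
 */
uint64_t
jmpbuf_target(const void *jb, int is_lp64)
{
	const unsigned char *p = jb;
	uint32_t pc32;
	uint64_t pc64;

	assert(jb != NULL);
	if (is_lp64) {
		memcpy(&pc64, p + JB_PC_OFF_LP64, sizeof (pc64));
		return (pc64);
	}
	memcpy(&pc32, p + JB_PC_OFF_ILP32, sizeof (pc32));
	return (pc32);
}
```

In terms of the jmp_buf typedef, 0x14 is array slot 5 when the elements are 4-byte ints, and 0x38 is slot 7 when they are 8-byte longs.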

Welcome to this humble blog. I’m currently an engineer on the Fishworks team. I’ll have some interesting things to talk about from the work I do. Before that I graduated from Brown’s CS Department. During my time there I did research with Prof. Tom Doeppner, I helped run the TA Program, and TAed several courses, most notably the Operating Systems Course and the Distributed Systems Course.

When I have spare cycles to spend on technical projects I like to look at silently layering filesystems via FUSE, experimenting with media control via mplayer and Wiimotes, and tinkering around with systems in whatever seems intriguing.

Many thanks to Bryan for giving me some space up here.
