Your MDB fell into my DTrace!

Yesterday, several of us from Delphix, Nexenta, Joyent, and elsewhere, convened before the OpenStorage summit as part of an illumos hackathon.  The idea was to get a bunch of illumos coders in a room, brainstorm a bunch of small project ideas, and then break off to go implement them over the course of the day.  That was the idea, at least – in reality we didn’t know what to expect or how it would turn out.  Suffice to say that the hackathon was an amazing success.  There were a lot of cool ideas, and a lot of great mentors in the room that could lead people through unfamiliar territory.

For my mini-project (suggested by ahl), I implemented MDB’s ::print functionality in DTrace via a new print() action. Today, we have the trace() action, but the result is somewhat less than useful when dealing with structs, as it degenerates into tracemem():

# dtrace -qn 'BEGIN{trace(`p0); exit(0)}'
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 00 00 00 00 00 00 00 00 60 02 c3 fb ff ff ff ff  ........`.......
        10: c8 c9 c6 fb ff ff ff ff 00 00 00 00 00 00 00 00  ................
        20: b0 ad 14 c6 00 ff ff ff 00 00 00 00 02 00 00 00  ................
        ...

The results aren’t pretty, and we end up throwing away all that useful proc_t type information. With a little tweaks to dtrace, and some cribbing from mdb_print.c, we can do much better:

# dtrace -qn 'BEGIN{print(`p0); exit(0)}'
proc_t {
    struct vnode *p_exec = 0
    struct as *p_as = 0xfffffffffbc30260
    struct plock *p_lockp = 0xfffffffffbc6c9c8
    kmutex_t p_crlock = {
        void *[1] _opaque = [ 0 ]
    }
    struct cred *p_cred = 0xffffff00c614adb0
    int p_swapcnt = 0
    char p_stat = '02'
    ....

Much better! Now, how did we get there from here? The answer was an interesting journey through libdtrace, the kernel dtrace implementation, CTF, and the horrors of bitfields.

To action or not to action?

The first question I set out to answer is what the user-visible interface should be. It seemed clear that this should be an operation on the same level as trace(), allowing arbitrary D expressions, but simply preserving the type of the result and pretty-printing it later. After briefly considering printt() (for “print type”), I decided upon just print(), since this seemed like a logical My first inclination was to create a new DTRACEACT_PRINT, but after some discussion with Adam, we decided this was extraneous – the behavior was identical to DTRACEACT_DIFEXPR (the internal name for trace), but just with type information.

Through the looking glass with types and formats

The real issue is that what we compile (dtrace statements) and what we consume (dtrace epids and records) are two very different things, and never the twain shall meet. At the time we go to generate the DIFEXPR statement in dt_cc.c, we have the CTF data in hand. We don’t want to change the DIF we generate, simply do post-processing on the other side, so we just need some way to get back to that type information in dt_consume_cpu(). We can’t simply hang it off our dtrace statement, as that would break anonymous tracing (and violate the rest of the DTrace architecture to boot).

Thankfully, this problem had already been solved for printf() (and related actions) because we need to preserve the original format string for the exact same reason. To do this, we take the action-specific integer argument, and use it to point into the DOF string table, where we stash the original format string. I simply had to hijack dtrace_dof_create() and have it do the same thing for the type information, right?

If only it could be so simple. There were two complications here: there is a lot of code that explicitly treats these as printf strings, and parses them into internal argv-style representations. Pretending our types were just format strings would cause all kinds of problems in this code. So I had to modify libdtrace to treat this more explicitly as raw ‘string data’ that is (optionally) used with the DIFEXPR action. Even with that in place, the formats I was sending down were not making it back out of the kernel. Because the argument is action-specific, the kernel needed to be modified to recognize this new argument in dtrace_ecb_action_add. With that change in place, I was able to get the format string back in userland when consuming the CPU buffers.

Bitfields, or why the D compiler cost me an hour of my life

With the trace data and type string in hand, I then proceeded to copy the mdb ::print code, first from apptrace (which turned out to be complete garbage) and then fixing it up bit by bit. Finally, after tweaking the code for an hour or two, I had it looking pretty much like identical ::print output. But when I fed it a klwp_t structure, I found that the user_desc_t structure bitfields weren’t being printed correctly:

# dtrace -n 'BEGIN{print(*((user_desc_t*)0xffffff00cb0a4d90)); exit(0)}'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN user_desc_t {
    unsigned long usd_lolimit = 0xcff3000000ffff
    unsigned long usd_lobase = 0xcff3000000
    unsigned long usd_midbase = 0xcff300
    unsigned long usd_type = 0xcff3
    unsigned long usd_dpl :64 = 0xcff3
    unsigned long usd_p :64 = 0xcff3
    unsigned long usd_hilimit = 0xcf
    unsigned long usd_avl :64 = 0xcf
    unsigned long usd_long :64 = 0xcf
    unsigned long usd_def32 :64 = 0xcf
    unsigned long usd_gran :64 = 0xcf
    unsigned long usd_hibase = 0
}

I spent an hour trying to debug this, only to find that the CTF IDs weren’t matching what I expected from the underlying object. I finally tracked it down to the fact that the D compiler, by virtue of processing the /usr/lib/dtrace files, pulls in its own version of klwp_t from the system header files. But it botches the bitfields, leaving the user with a subtly incorrect data. Switching the type to be genunix`user_desc_t fixed the problem.

What’s next

Given the usefulness of this feature, the next steps are to clean up the code, get it reviewed, and push to the illumos gate. It should hopefully be finding its way to an illumos distribution near you soon. Here’s a final print() invocation to leave you with:

# dtrace -n 'zio_done:entry{print(*args[0]); exit(0)}'
dtrace: description 'zio_done:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  42594                   zio_done:entry zio_t {
    zbookmark_t io_bookmark = {
        uint64_t zb_objset = 0
        uint64_t zb_object = 0
        int64_t zb_level = 0
        uint64_t zb_blkid = 0
    }
    zio_prop_t io_prop = {
        enum zio_checksum zp_checksum = ZIO_CHECKSUM_INHERIT
        enum zio_compress zp_compress = ZIO_COMPRESS_INHERIT
        dmu_object_type_t zp_type = DMU_OT_NONE
        uint8_t zp_level = 0
        uint8_t zp_copies = 0
        uint8_t zp_dedup = 0
        uint8_t zp_dedup_verify = 0
    }
    zio_type_t io_type = ZIO_TYPE_NULL
    enum zio_child io_child_type = ZIO_CHILD_VDEV
    int io_cmd = 0
    uint8_t io_priority = 0
    uint8_t io_reexecute = 0
    uint8_t [2] io_state = [ 0x1, 0 ]
    uint64_t io_txg = 0
    spa_t *io_spa = 0xffffff00c6806580
    blkptr_t *io_bp = 0
    blkptr_t *io_bp_override = 0
    blkptr_t io_bp_copy = {
        dva_t [3] blk_dva = [
            dva_t {
                uint64_t [2] dva_word = [ 0, 0 ]
            },
    ...
Posted on October 26, 2011 at 6:32 am by eschrock · Permalink
In: Uncategorized