From VP of Engineering to CTO

If you search for “cto vs. vp of engineering”, one of the top hits is a presentation that I gave with Jason Hoffman at Monki Gras 2012. Aside from some exceptionally apt clip art, the crux of our talk was that these two roles should not be thought of as caricatures (e.g. the CTO as a silver tongue with grand vision but lacking practical know-how and the VP of Engineering as a technocrat who makes the TPS reports run on time), but rather as a team that together leads a company’s technical endeavors. Yes, one is more outward- and future-looking and the other more team- and product-focused — but if the difference becomes too stark (that is, if the CTO and VP of Engineering can’t fill in for one another in a pinch) there may be a deeper cultural divide between vision and execution. As such, the CTO and the VP of Engineering must themselves represent the balance present in every successful engineer: they must be able to both together understand the world as it is — and envision the world as it could be.

This presentation has been on my mind recently because today my role at Joyent is changing: I am transitioning from VP of Engineering to CTO, and Mark Cavage is taking on the role of VP of Engineering. For me, this is an invigorating change in a couple of dimensions. First and foremost, I am excited to be working together with Mark in a formalized leadership capacity. The vitality of the CTO/VP of Engineering dynamic stems from the duo’s ability to function as a team, and I believe that Mark and I will be an effective one in this regard. (And Mark apparently forgives me for cussing him out when he conceived of what became Manta.)

Secondly, I am looking forward to talking to customers a bit more. Joyent is in a terrific position in that our vision for cloud computing is not mere rhetoric, but actual running service and shipping product. We are uniquely differentiated by the four technical pillars of our stack: SmartOS, node.js, SmartDataCenter and — as newly introduced last year — our revolutionary Manta storage service. These are each deep technologies in their own right, and especially at their intersections, they unlock capabilities that the market wants and needs — and our challenge now is as much communicating what we’ve done (and why we’ve done it) as it is continuing to execute. So while I have always engaged directly with customers, the new role will likely mean more time on planes and trains as I visit more customers (and prospective customers) to better understand how our technologies can help them solve their thorniest problems.

Finally, I am looking forward to the single most important role of the CTO: establishing the broader purpose of our technical endeavor. This purpose becomes the root of a company’s culture, as culture without purpose is mere costume. For Joyent and Joyeurs our purpose is simple: we’re here to change computing. As I mentioned in my Surge 2013 talk on technical leadership (video), superlative technologists are drawn to mission, team and problem — and in Joyent's case, the mission of changing computing (and the courage to tackle whatever problems that entails) has attracted an exceptionally strong team that I consider myself blessed to call my peers. I consider it a great honor to be Joyent's CTO, and I look forward to working with Mark and the team to continue to — in Steve Jobs' famous words — kick dents in the universe!

Posted on April 15, 2014 at 8:07 am by bmc

agghist, aggzoom and aggpack

As I have often remarked, DTrace is a workhorse, not a show horse: the features that we have added to DTrace over the years come not from theoretical notions but rather from actual needs in production environments. This is as true now as it was a decade ago, and even in core abstractions, extensive use of DTrace seems to give rise to new ways of thinking about them. In particular, Brendan recently had a couple of feature ideas for aggregation processing that have turned out to be really interesting…

agghist

When one performs a count() or sum() aggregating action, the result is a table that consists of values and numbers, e.g.:

# dtrace -n syscall:::entry'{@[execname] = count()}'
dtrace: description 'syscall:::entry' matched 233 probes
^C

  utmpd                                                             2
  metadata                                                          5
  pickup                                                           14
  ur-agent                                                         18
  epmd                                                             20
  fmd                                                              20
  jsontool.js                                                      33
  vmadmd                                                           33
  master                                                           41
  mdnsd                                                            41
  intrd                                                            45
  devfsadm                                                         52
  bridged                                                          81
  mkdir                                                            83
  ipmgmtd                                                          95
  cat                                                              96
  ls                                                              100
  truss                                                           101
  ipmon                                                           106
  sendmail                                                        111
  dirname                                                         115
  svc.startd                                                      125
  svc.configd                                                     153
  ksh93                                                           164
  zoneadmd                                                        165
  svcprop                                                         258
  sshd                                                            262
  bash                                                            291
  zpool                                                           436
  date                                                            448
  cron                                                            470
  lldpd                                                           584
  dump-minutely-sd                                                611
  haproxy                                                         760
  nscd                                                           1275
  zfs                                                            1966
  zoneadm                                                        2414
  java                                                           2916
  tail                                                           3750
  postgres                                                       6702
  redis-server                                                  21283
  dtrace                                                        28308
  node                                                          39836
  beam.smp                                                      55940

Brendan’s observation was that it would be neat to (optionally) use the histogram-style aggregation printing with these count()/sum() aggregations, making it easier to differentiate values at a glance. For that, we have the new “-x agghist” option:

# dtrace -n syscall:::entry'{@[execname] = count()}' -x agghist
dtrace: description 'syscall:::entry' matched 233 probes
^C

              key  ------------- Distribution ------------- count
            utmpd |                                         2
         metadata |                                         5
           pickup |                                         14
             epmd |                                         20
              fmd |                                         23
         ur-agent |                                         25
             init |                                         27
            mdnsd |                                         30
      jsontool.js |                                         33
           vmadmd |                                         36
           master |                                         41
            intrd |                                         45
         devfsadm |                                         52
          bridged |                                         78
            mkdir |                                         83
              cat |                                         96
          ipmgmtd |                                         97
               ls |                                         100
         sendmail |                                         101
            ipmon |                                         102
          dirname |                                         115
       svc.startd |                                         125
            truss |                                         140
      svc.configd |                                         153
            ksh93 |                                         164
             sshd |                                         248
          svcprop |                                         258
             bash |                                         290
         zoneadmd |                                         387
            zpool |                                         436
             date |                                         448
             cron |                                         470
            lldpd |                                         562
 dump-minutely-sd |                                         615
          haproxy |                                         622
             nscd |                                         808
              zfs |                                         1966
          zoneadm |@                                        2414
             java |@                                        3003
             tail |@                                        3607
         postgres |@                                        5728
     redis-server |@@@@@                                    20611
           dtrace |@@@@@@                                   26966
             node |@@@@@@@@@@                               39825
         beam.smp |@@@@@@@@@@@@@                            55942    

It’s obviously the same information, but presented with a quicker visual cue as to the distribution. Now, one may well note that this output has a lot of dead whitespace in it — read on.

aggzoom

The DTrace histogram-style output came directly from lockstat, which preceded it by several years. lockstat was important for DTrace in several ways: aside from showing the power of production instrumentation, lockstat pointed to the need for first-class aggregations, for disjoint instrumentation providers and for multiplexed consumers (for which it was repaid by having its guts ripped out, being reimplemented as both a DTrace provider and a DTrace consumer). Due to some combination of my laziness and my reverence for the lockstat-style histogram, I simply lifted the lockstat processing code — which means I inherited (or stole, depending on your perspective) the decisions Jeff had made with regard to histogram rendering. In particular, lockstat determines the height of a bar by taking the bucket’s share of the total and multiplying it by the full height of the histogram. That is, if you add up the heights of all of the bars, they equal the full height of the histogram. The benefit of this is that it quickly conveys relative distribution: if you see a full-height bar, you know that that bucket represents essentially all of the values. But if the number of buckets is large and/or the values of those buckets are relatively evenly distributed, the result is a bunch of very short bars (or worse, zero-height bars) and dead whitespace. This can become annoying enough to DTrace users that Brendan has been known to observe that it is the one improvement over DTrace that he can point to in SystemTap: instead of dividing the height of the histogram among all of the bars, SystemTap scales each bar to the fraction of the histogram height that its value represents of the largest bucket’s value. Of course, the shape of the distribution doesn’t change — one is simply scaling the highest bar to the full height of the histogram and adjusting all other bars accordingly.
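The two scaling approaches can be sketched as follows — a minimal illustration, not the actual lockstat/DTrace rendering code, with made-up bucket values borrowed from the output above:

```python
# Contrast the two bar-scaling strategies described above.
# WIDTH is the width of the "Distribution" column, as in dtrace output.
WIDTH = 40

def bar(count, divisor):
    """Render one bar: count's share of divisor, scaled to WIDTH chars."""
    return '@' * int(WIDTH * count / divisor)

# Hypothetical buckets (execname -> syscall count), for illustration only.
buckets = {'nscd': 1275, 'zfs': 1966, 'node': 39836, 'beam.smp': 55940}

# lockstat-style (the default): each bar is the bucket's share of the
# *total*, so all bar lengths sum to WIDTH -- small buckets vanish.
total = sum(buckets.values())
for key, count in buckets.items():
    print(f'{key:>16} |{bar(count, total):<{WIDTH}} {count}')

print()

# aggzoom-style: each bar is scaled against the *largest* bucket, so the
# biggest bar spans the full WIDTH and every other bar grows with it.
largest = max(buckets.values())
for key, count in buckets.items():
    print(f'{key:>16} |{bar(count, largest):<{WIDTH}} {count}')
```

Note that in the lockstat-style rendering, a bucket holding about 1% of the values (nscd above) rounds down to a zero-length bar — exactly the dead whitespace seen earlier — while the aggzoom-style rendering still gives it visible proportion relative to the largest bucket.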

To allow for this behavior, I added “-x aggzoom”; running the same example as above with this new option:

# dtrace -n syscall:::entry'{@[execname] = count()}' -x agghist -x aggzoom
dtrace: description 'syscall:::entry' matched 233 probes
^C

              key  ------------- Distribution ------------- count
            utmpd |                                         2
         metadata |                                         7
           pickup |                                         14
         ur-agent |                                         20
              fmd |                                         25
             epmd |                                         26
      jsontool.js |                                         33
            mdnsd |                                         39
           master |                                         41
           vmadmd |                                         44
         rsyslogd |                                         57
         devfsadm |                                         66
            mkdir |                                         83
            intrd |                                         90
              cat |                                         96
          bridged |                                         99
               ls |                                         100
          dirname |                                         115
            truss |                                         125
            ipmon |                                         130
         sendmail |                                         131
          ipmgmtd |                                         133
            ksh93 |                                         164
       svc.startd |                                         179
             init |                                         189
             sshd |                                         253
          svcprop |                                         258
             bash |                                         283
            zpool |                                         436
             date |                                         448
             cron |                                         470
 dump-minutely-sd |                                         611
            lldpd |                                         716
      svc.configd |@                                        939
          haproxy |@                                        1097
         zoneadmd |@                                        1455
              zfs |@                                        1966
          zoneadm |@@                                       2414
             nscd |@@                                       2867
             java |@@@                                      4277
             tail |@@@                                      4579
         postgres |@@@@@@                                   9389
     redis-server |@@@@@@@@@@@@@@@@@@                       26177
           dtrace |@@@@@@@@@@@@@@@@@@@@@@@                  34530
             node |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        48151
         beam.smp |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   55940

It’s zoomtastic!

aggpack

If you follow Brendan’s blog, you know that he’s always experimenting with ways of visually communicating a maximal amount of system data. One recent visualization that has been particularly interesting is his frequency trails — a kind of stacked sparkline. Thinking about density of presentation sparked Brendan to observe that he often needs to visually correlate multiple quantize()/lquantize() aggregation keys, e.g. this classic DTrace “one-liner” to show the distribution of system call time (in nanoseconds) by system call:

# dtrace -n syscall:::entry'{self->ts = timestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(timestamp - self->ts); self->ts = 0}'
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

  sigpending
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
            2048 |                                         0        

  llseek
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@                     1
            1024 |@@@@@@@@@@@@@@@@@@@@                     1
            2048 |                                         0        

  fstat
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
            4096 |                                         0        

  setcontext
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
            4096 |                                         0        

  fstat64
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@                     2
            1024 |@@@@@@@@@@@@@@@@@@@@                     2
            2048 |                                         0        

  sigaction
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@                     2
            1024 |@@@@@@@@@@@@@@@@@@@@                     2
            2048 |                                         0        

  getuid
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@@                     1
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     1
            8192 |                                         0        

  accept
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@                     1
            4096 |@@@@@@@@@@@@@@@@@@@@                     1
            8192 |                                         0        

  sysconfig
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@                            2
            1024 |@@@@@@@@@@@@@@@@@@@@                     3
            2048 |@@@@@@@                                  1
            4096 |                                         0        

  bind
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
            8192 |                                         0        

  lwp_sigmask
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       11
            1024 |@@@                                      1
            2048 |@@@                                      1
            4096 |                                         0        

  recvfrom
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5
            4096 |                                         0        

  setsockopt
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@                               1
            2048 |@@@@@@@@@@                               1
            4096 |@@@@@@@@@@@@@@@@@@@@                     2
            8192 |                                         0        

  fcntl
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@                                      1
             512 |@@@@@@@@@@@@@@@@@@@@                     8
            1024 |@@@@@@@@@@@@@@@                          6
            2048 |@@@                                      1
            4096 |                                         0        

  systeminfo
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@                          3
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@                5
            4096 |                                         0        

  schedctl
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
           32768 |                                         0        

  getsockopt
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          11
            2048 |@@@@@@@@@                                3
            4096 |                                         0        

  stat
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              4
            4096 |                                         0
            8192 |@@@@@@@@@@@@@                            2
           16384 |                                         0        

  recvmsg
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        15
            4096 |@@@@                                     2
            8192 |@@                                       1
           16384 |                                         0        

  connect
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              4
            8192 |                                         0
           16384 |@@@@@@@@@@@@@                            2
           32768 |                                         0        

  brk
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@                               2
            2048 |@@@@@@@@@@@@@@@                          3
            4096 |                                         0
            8192 |@@@@@@@@@@                               2
           16384 |                                         0
           32768 |@@@@@                                    1
           65536 |                                         0        

  so_socket
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@                   5
            4096 |                                         0
            8192 |@@@@@@@@@                                2
           16384 |@@@@@@@@@                                2
           32768 |                                         0        

  mmap
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
          131072 |                                         0        

  putmsg
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
          131072 |                                         0        

  p_online
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  250
             512 |@                                        5
            1024 |                                         0
            2048 |                                         1
            4096 |                                         0        

  getpid
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@                               9
            2048 |@@@@@@@@@@@@@@@@@@@@@@@                  21
            4096 |@@                                       2
            8192 |@@@@                                     4
           16384 |                                         0        

  close
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@                      13
            2048 |@@@@@@@                                  5
            4096 |@@@@@@@                                  5
            8192 |@@@                                      2
           16384 |@@@@                                     3
           32768 |                                         0        

  lseek
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@                           22
            2048 |@@@@@@@@@@@@@@@@@@@@@@@                  37
            4096 |@@                                       3
            8192 |@                                        2
           16384 |                                         0        

  open
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@                            6
            4096 |@@@@@@@                                  3
            8192 |@@@@@@@@@@@@@                            6
           16384 |@@                                       1
           32768 |@@@@                                     2
           65536 |                                         0        

  lwp_cond_signal
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@                                   2
            8192 |                                         0
           16384 |@@@@@@@@@@@@@@@@@@@@@@@                  8
           32768 |@@@@@@@@@@@                              4
           65536 |                                         0        

  sendmsg
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@@@@                                     1
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              6
           65536 |@@@@@@@@@                                2
          131072 |                                         0        

  gtime
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@@                     576
             512 |@@@@@@@@@@@@@@@@                         452
            1024 |@@                                       56
            2048 |@                                        34
            4096 |                                         3
            8192 |                                         8
           16384 |                                         1
           32768 |                                         1
           65536 |                                         0        

  read
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |                                         1
            1024 |@@@@@@                                   19
            2048 |@@@@@@@@@@                               34
            4096 |@@@@@                                    15
            8192 |@@@@@@@@@@@@@@@@@@                       61
           16384 |@                                        2
           32768 |                                         0        

  stat64
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        2
            4096 |@                                        1
            8192 |@@@                                      5
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          52
           32768 |@@@@                                     7
           65536 |                                         0        

  send
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@                                       2
            4096 |@                                        1
            8192 |@@@@@@@@@@@@@@@@@@                       22
           16384 |@@@@@@@@@@                               12
           32768 |@@@@@                                    6
           65536 |@@@                                      4
          131072 |@@                                       3
          262144 |                                         0        

  write
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@                                       5
            1024 |@@@@                                     8
            2048 |                                         0
            4096 |@@@                                      6
            8192 |@@@@@@@@@@@                              23
           16384 |@@@@@@@@                                 16
           32768 |@@@@@@@@@@                               22
           65536 |@@                                       4
          131072 |                                         0
          262144 |                                         1
          524288 |                                         0        

  doorfs
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     1
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |@@@@@@@@@@@@@@@@@@@@                     1
         4194304 |                                         0        

  yield
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@                    198
            2048 |@                                        5
            4096 |                                         1
            8192 |@@@@@@                                   61
           16384 |@@@@@@@@@@@@                             110
           32768 |                                         2
           65536 |                                         2
          131072 |                                         0
          262144 |                                         1
          524288 |                                         0        

  recv
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@                             17
            1024 |@@@@                                     6
            2048 |@@@@@@@@@@@@@                            19
            4096 |@@@@                                     6
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |@@                                       3
          262144 |@@@@                                     5
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         0
        16777216 |@                                        1
        33554432 |                                         0        

  nanosleep
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 201
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         2
      1073741824 |                                         0        

  ioctl
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@                                        50
             512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   2157
            1024 |                                         17
            2048 |                                         8
            4096 |                                         10
            8192 |                                         13
           16384 |                                         6
           32768 |                                         3
           65536 |                                         7
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         2
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         6
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         4
      1073741824 |                                         0        

  lwp_cond_wait
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          46
        67108864 |@@@@                                     6
       134217728 |                                         0
       268435456 |@                                        2
       536870912 |@@@@                                     6
      1073741824 |                                         0        

  pollsys
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@                                        2
            2048 |                                         1
            4096 |                                         1
            8192 |                                         1
           16384 |                                         0
           32768 |                                         0
           65536 |@                                        2
          131072 |                                         0
          262144 |@@                                       6
          524288 |@                                        2
         1048576 |@@                                       5
         2097152 |@                                        2
         4194304 |                                         0
         8388608 |@@@                                      8
        16777216 |                                         0
        33554432 |                                         1
        67108864 |@@@@@@@@@@@@@@@@@@                       56
       134217728 |@@@@@@@@@                                28
       268435456 |                                         0
       536870912 |@@@                                      8
      1073741824 |                                         1
      2147483648 |                                         0        

  lwp_park
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |                                         6
            4096 |@                                        21
            8192 |@                                        20
           16384 |@@@@@@@@@                                134
           32768 |@@@@@@@@@@                               145
           65536 |@@@@@@@@@@                               153
          131072 |@                                        11
          262144 |                                         3
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |@@@@@@                                   89
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         2
       268435456 |                                         4
       536870912 |@                                        18
      1073741824 |                                         2
      2147483648 |                                         0        

  portfs
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |                                         4
            1024 |@@@                                      41
            2048 |@@@@                                     44
            4096 |@                                        14
            8192 |                                         2
           16384 |                                         1
           32768 |                                         2
           65536 |                                         1
          131072 |                                         3
          262144 |@                                        10
          524288 |                                         3
         1048576 |                                         6
         2097152 |                                         3
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@                  272
        16777216 |                                         1
        33554432 |                                         1
        67108864 |                                         4
       134217728 |                                         2
       268435456 |                                         5
       536870912 |@@@@@                                    58
      1073741824 |                                         3
      2147483648 |                                         1
      4294967296 |                                         0        

For those keeping score, that would be 482 lines of output — and that was for just a couple of seconds. Brendan’s observation was that it would be great to tip these aggregations on their side (so their bars point up-and-down, not left-to-right) — with the output for each key appearing on one line. This is clearly a great idea, and “-x aggpack” was born:

# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

              key  min .------------------. max     | count
       sigpending    8 :     X            : 1048576 | 1
     lwp_continue    8 :      X           : 1048576 | 1
           uucopy    8 :      X           : 1048576 | 1
           getuid    8 :  x   x           : 1048576 | 2
         lwp_kill    8 :       X          : 1048576 | 1
        sigaction    8 :   xx x           : 1048576 | 4
           llseek    8 :     x x          : 1048576 | 2
          fstat64    8 :     xx           : 1048576 | 4
       setcontext    8 :      xx          : 1048576 | 2
      lwp_sigmask    8 :   xx _           : 1048576 | 10
       systeminfo    8 :     xx           : 1048576 | 4
        sysconfig    8 :   ____x          : 1048576 | 6
           accept    8 :       xx         : 1048576 | 2
             bind    8 :         X        : 1048576 | 2
            fstat    8 :     x    x       : 1048576 | 2
         recvfrom    8 :        X         : 1048576 | 5
           doorfs    8 :         xx       : 1048576 | 2
              brk    8 :     ___xx        : 1048576 | 8
       lwp_create    8 :           X      : 1048576 | 1
         schedctl    8 :       x   x      : 1048576 | 2
         p_online    8 :   X_  _          : 1048576 | 256
           getpid    8 :      xx_ _       : 1048576 | 28
          recvmsg    8 :       xx_        : 1048576 | 18
             open    8 :         xx x     : 1048576 | 4
            lseek    8 :       xx _ _     : 1048576 | 9
       getsockopt    8 :      _xx         : 1048576 | 43
             stat    8 :        x_x _     : 1048576 | 7
             mmap    8 :             X    : 1048576 | 1
           putmsg    8 :             X    : 1048576 | 1
             recv    8 :     _xxx__       : 1048576 | 65
            gtime    8 :   X_____         : 1048576 | 899
            fcntl    8 :   ___xx          : 1048576 | 208
       setsockopt    8 :      __x__       : 1048576 | 55
          sendmsg    8 :           xxx    : 1048576 | 9
  lwp_cond_signal    8 :       _x__x      : 1048576 | 33
            close    8 :       __xx_      : 1048576 | 65
    lwp_cond_wait    8 :        _xx_      : 1048576 | 69
           munmap    8 :                X : 1048576 | 1
             read    8 :    ___x_x_ _ _   : 1048576 | 137
           stat64    8 :        ___X      : 1048576 | 49
            yield    8 : _    X___        : 1048576 | 1314
        so_socket    8 :        ___x_     : 1048576 | 60
        nanosleep    8 :         _x__     : 1048576 | 77
             send    8 :      ____xx__    : 1048576 | 66
          pollsys    8 :       _ _xx_     : 1048576 | 86
            ioctl    8 :    X_________    : 1048576 | 3259
          connect    8 :         _ _xx_   : 1048576 | 57
           mmap64    8 :        __x _   x : 1048576 | 14
            write    8 :     __ __x_____  : 1048576 | 115
           portfs    8 :     ______x_     : 1048576 | 619
         lwp_park    8 :        _xx_x_    : 1048576 | 654

This communicates essentially the same amount of information in just a fraction of the output (55 lines versus 482) — and makes it (much) easier to quickly compare distributions across aggregation keys. That said, one of the challenges here is that ASCII can be very limiting when trying to come up with characters of clearly different height. When I demonstrated a prototype of this at the illumos BOF at Surge, a couple of folks volunteered that I should look into using the Unicode Block Elements for this output. Here’s the same enabling, but rendered using these elements:

# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

              key  min .-------------------. max     | count
         lwp_self   32 :   █               : 8388608 | 1
        sysconfig   32 : ▃▃▃               : 8388608 | 3
       sigpending   32 :    █              : 8388608 | 1
             pset   32 :  ▆ ▃              : 8388608 | 3
      getsockname   32 :     █             : 8388608 | 1
       systeminfo   32 :   ▅▄              : 8388608 | 5
            fstat   32 :      █            : 8388608 | 1
           llseek   32 :    ▆▃             : 8388608 | 3
          memcntl   32 :      █            : 8388608 | 1
      resolvepath   32 :      █            : 8388608 | 1
           semsys   32 :      █            : 8388608 | 1
       setcontext   32 :     █             : 8388608 | 2
          setpgrp   32 :      █            : 8388608 | 1
          fstat64   32 :   ▃▃▃             : 8388608 | 6
        sigaction   32 : ▃▃▂▃              : 8388608 | 17
           accept   32 :       █           : 8388608 | 1
           access   32 :       █           : 8388608 | 1
           munmap   32 :       █           : 8388608 | 1
        setitimer   32 :    ▅▃▃            : 8388608 | 4
      lwp_sigmask   32 : ▄▃▃▂              : 8388608 | 26
             pipe   32 :        █          : 8388608 | 1
          waitsys   32 :   ▅    ▅          : 8388608 | 2
             stat   32 :     ▃ ▆           : 8388608 | 3
             bind   32 :      ▅▄           : 8388608 | 5
          mmapobj   32 :         █         : 8388608 | 1
           open64   32 :         █         : 8388608 | 1
         p_online   32 : █▁ ▁              : 8388608 | 256
           getpid   32 :   ▁▆▂             : 8388608 | 45
         schedctl   32 :         █         : 8388608 | 2
       getsockopt   32 :    ▃▅▂            : 8388608 | 32
       setsockopt   32 :   ▁▁▃▆            : 8388608 | 31
            fcntl   32 : ▂▁▃▃▂             : 8388608 | 114
          recvmsg   32 :    ▁▄▃▂▁▁         : 8388608 | 17
           putmsg   32 :           █       : 8388608 | 1
             mmap   32 :       ▅   ▅       : 8388608 | 2
           writev   32 :         ▆ ▃       : 8388608 | 3
            lseek   32 :   ▁▃▃▄▁▁          : 8388608 | 76
              brk   32 :  ▃▂▃▃▂▁▁ ▁        : 8388608 | 124
            gtime   32 : ▇▁▁▁▁▁ ▁          : 8388608 | 1405
          sendmsg   32 :        ▂▅▄        : 8388608 | 10
             recv   32 :   ▂▃▃▂▂▁          : 8388608 | 159
            yield   32 :    █▁▁▁           : 8388608 | 409
             open   32 :      ▃▂▄▁▂        : 8388608 | 30
            close   32 :    ▂▃▂▃▂▁         : 8388608 | 60
        so_socket   32 :       ▁▃▆         : 8388608 | 29
  lwp_cond_signal   32 :    ▁▃▂▂▁▃▁        : 8388608 | 59
        nanosleep   32 :       ▂▆▂         : 8388608 | 74
    lwp_cond_wait   32 :      ▂▃▄▂         : 8388608 | 120
          connect   32 :         ▂▅▃       : 8388608 | 24
           stat64   32 :       ▁▂▇▁        : 8388608 | 77
          pollsys   32 :   ▁▁▁ ▁▅▄         : 8388608 | 148
             send   32 :    ▁▂▂▂▂▃▂▁       : 8388608 | 137
             read   32 :   ▁▂▃▂▃▂▁▁▁▁      : 8388608 | 283
            ioctl   32 :  ▇▁▁▁▁▁▁▁▁▁▁      : 8388608 | 3697
            write   32 :   ▁▁▁▂▂▃▂▁▁ ▁▁    : 8388608 | 207
          forksys   32 :                 █ : 8388608 | 1
           portfs   32 :   ▁▂▂▁▁▂▄▁ ▁      : 8388608 | 807
         lwp_park   32 :  ▁ ▁▁▁▃▃▂▃▁       : 8388608 | 919

Delicious! (Assuming, of course, that these look right for you — which they may or may not, depending on how your monospaced font renders the Unicode Block Elements.) After one look at the Unicode Block Elements output, it was clear that it had to be the default behavior — but if your terminal is rendered in a font that can’t display the UTF-8 encodings of these characters (less common), or if they render in a way that is not monospaced despite being in a putatively monospaced font (more common), I also added a “-x encoding” option that can be set to “ascii” to force the ASCII output.
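As a rough illustration of the idea — and emphatically not DTrace’s actual implementation — each power-of-two bucket can be scaled against the largest bucket and mapped to one of the eight Unicode Block Elements (U+2581 LOWER ONE EIGHTH BLOCK through U+2588 FULL BLOCK). Here `pack` is a hypothetical helper, sketched in Python:

```python
# Index 0 is an empty bucket; indices 1..8 are increasing eighth-blocks.
BLOCKS = " ▁▂▃▄▅▆▇█"

def pack(counts):
    """Render a list of bucket counts as a one-line sparkline."""
    peak = max(counts) if any(counts) else 1
    out = []
    for c in counts:
        if c == 0:
            out.append(" ")
        else:
            # Round, but give any nonzero count at least one eighth so
            # small buckets remain visible next to a dominant one.
            out.append(BLOCKS[max(1, round(8 * c / peak))])
    return "".join(out)

# The read(2) distribution from the first example, packed onto one line.
print(pack([0, 2, 1, 5, 52, 7, 0]))
```

The “at least one eighth” clamp mirrors what the real output above appears to do: even a single outlier in a bucket shows up as a sliver rather than vanishing.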

Returning to our example, if the above is too much dead space for you, you can combine it with aggzoom:

# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack -x aggzoom
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

              key  min .----------------. max     | count
     lwp_continue   32 :    █           : 1048576 | 1
       sigpending   32 :    █           : 1048576 | 1
           uucopy   32 :    █           : 1048576 | 1
         lwp_kill   32 :     █          : 1048576 | 1
        sigaction   32 : ▄▄ █           : 1048576 | 4
             pset   32 :  █  ▄          : 1048576 | 3
        sysconfig   32 : ███ █          : 1048576 | 4
       setcontext   32 :    ██          : 1048576 | 2
            fstat   32 :      █         : 1048576 | 1
         recvfrom   32 :      █         : 1048576 | 1
       systeminfo   32 :   ▅█           : 1048576 | 5
      lwp_sigmask   32 : ██▂▅           : 1048576 | 14
           accept   32 :     █ █        : 1048576 | 2
          recvmsg   32 :     █ █        : 1048576 | 2
             stat   32 :        █       : 1048576 | 2
         p_online   32 : █▁ ▁           : 1048576 | 256
         schedctl   32 :      █  █      : 1048576 | 2
              brk   32 :    ▄███▄       : 1048576 | 8
       getsockopt   32 :    ▅██         : 1048576 | 21
           getpid   32 :    █▇          : 1048576 | 39
       setsockopt   32 :     ▃█▂        : 1048576 | 16
       lwp_create   32 :          █     : 1048576 | 1
          sendmsg   32 :          █     : 1048576 | 1
            fcntl   32 : ▃▄▃▆█▁         : 1048576 | 60
           writev   32 :           █    : 1048576 | 1
            lseek   32 :    ▆▇█▂        : 1048576 | 65
             recv   32 :  ▁▆▃█▆▅        : 1048576 | 62
           putmsg   32 :           █    : 1048576 | 2
             open   32 :      █ ▅▂▃     : 1048576 | 16
            gtime   32 : █▁▁▁▁  ▁       : 1048576 | 1365
            close   32 :    ▂█▂█▄▃      : 1048576 | 36
  lwp_cond_signal   32 :     ▂▂█▅▅▂     : 1048576 | 18
        so_socket   32 :      ▁▂▁█      : 1048576 | 19
             mmap   32 :     ▆▃ █▄ ▆    : 1048576 | 13
        nanosleep   32 :       ▃█▃      : 1048576 | 34
             read   32 :   ▁▂▅▄█▅▂      : 1048576 | 144
    lwp_cond_wait   32 :       ▃█▂▁     : 1048576 | 74
            yield   32 :    █▁▁▁ ▁      : 1048576 | 1187
             send   32 :     ▁▂▅█▅▄▂    : 1048576 | 50
          connect   32 :      ▂▂ ▅██▃   : 1048576 | 20
           stat64   32 :     ▁▁▁▃█▁     : 1048576 | 80
          pollsys   32 :     ▁▁▁█▆      : 1048576 | 136
            write   32 :  ▂▄ ▃▂▄█▆▃▂ ▂  : 1048576 | 77
            ioctl   32 :  █▂▁▁▁▁▁▁ ▁    : 1048576 | 4587
           mmap64   32 :      ▂▄▄▂▅ ▂ █ : 1048576 | 20
           portfs   32 :  ▁▁▂▂▁▂▂█▂     : 1048576 | 601
         lwp_park   32 :    ▁▁▂▆▇▃█▁▁   : 1048576 | 787

Here’s another fun one:

# dtrace -n BEGIN'{start = timestamp}' \
    -n sched:::on-cpu'{@[execname] = \
    lquantize((timestamp - start) / 1000000000, 0, 30, 1)}' \
    -x aggpack -n tick-1sec'/++i >= 30/{exit(0)}' -q
              key     min .--------------------------------. max      | count
             sshd     < 0 : █                              : >= 30    | 1
            zpool     < 0 : █                              : >= 30    | 8
              zfs     < 0 : █                              : >= 30    | 19
          zoneadm     < 0 : █                              : >= 30    | 19
            utmpd     < 0 :     █                          : >= 30    | 1
            intrd     < 0 :                             █  : >= 30    | 2
             epmd     < 0 :   ▂    ▂    ▂    ▂    ▂    ▂   : >= 30    | 6
            mdnsd     < 0 :   ▂    ▂    ▂    ▂    ▂    ▂   : >= 30    | 6
         sendmail     < 0 :   ▂    ▂    ▂    ▂    ▂    ▂   : >= 30    | 6
         ur-agent     < 0 :   ▂▂         ▂         ▂   ▃▂  : >= 30    | 14
           vmadmd     < 0 :     ▃    ▂    ▂    ▂    ▂    ▃ : >= 30    | 22
         devfsadm     < 0 : ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁  : >= 30    | 30
          bridged     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 30
          fsflush     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 32
              fmd     < 0 :        ▁▁        ▁▇        ▁▁  : >= 30    | 41
            ipmon     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 51
          ipmgmtd     < 0 :         ▃         ▃         ▃  : >= 30    | 66
            lldpd     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 90
           dtrace     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 111
          haproxy     < 0 : ▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁ : >= 30    | 113
      svc.configd     < 0 :         ▂▁        ▄▁        ▂▁ : >= 30    | 126
       svc.startd     < 0 :         ▃▁        ▃▁        ▃▁ : >= 30    | 150
         beam.smp     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 209
             nscd     < 0 : ▁       ▃▂        ▃▂        ▃▂ : >= 30    | 192
         postgres     < 0 : ▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 598
     redis-server     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 598
      zpool-zones     < 0 :     ▂  ▁ ▂    ▃  ▁ ▂    ▂  ▁ ▃ : >= 30    | 623
             java     < 0 : ▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 1211
             tail     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 1178
             node     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁ : >= 30    | 11195
            sched     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 22101

This (quickly) shows you the second offset for CPU scheduling events from the start of the D script for various applications, and points to some obvious conclusions: there are clearly a bunch of programs running once every ten seconds, a couple more running once every five seconds, and so on.
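The lquantize() action used above bins linearly: values below the lower bound and at or above the upper bound get their own catch-all buckets, and everything in between falls into buckets of width step. A minimal Python sketch of that binning (the wakeup timestamps are made up for illustration):

```python
from collections import Counter

def lquantize_bucket(value, lower=0, upper=30, step=1):
    """Mimic DTrace lquantize() binning: "< lower", ">= upper", or a
    linear bucket of width `step` in between."""
    if value < lower:
        return f"< {lower}"
    if value >= upper:
        return f">= {upper}"
    return lower + ((value - lower) // step) * step

# Hypothetical on-cpu timestamps (whole seconds since the script started)
# for a daemon that wakes every ten seconds: the periodicity shows up as a
# few populated buckets, as in the vmadmd and ipmgmtd rows above.
wakeups = [10, 10, 20, 20, 29]
hist = Counter(lquantize_bucket(t) for t in wakeups)
print(hist)  # bucket -> count
```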

encoding

Seeing this, why should aggpack hog all that Unicode hawtness? While only aggpack will use the UTF-8 encodings by default, I also extended the encoding option to allow a “utf8” setting that forces the UTF-8 encoding of the Unicode Block Elements to be used for all aggregation display. For example, to return to our earlier agghist example:

# dtrace -n syscall:::entry'{@[execname] = count()}' \
    -x agghist -x aggzoom -x encoding=utf8
dtrace: description 'syscall:::entry' matched 233 probes
^C

              key  ------------- Distribution ------------- count
            utmpd |                                         2
             sshd |                                         10
           pickup |                                         14
              fmd |                                         20
         ur-agent |                                         21
             epmd |                                         22
           vmadmd |                                         31
      jsontool.js |                                         33
            mdnsd |                                         33
           master |                                         41
            intrd |                                         45
         devfsadm |                                         54
          ipmgmtd |                                         55
          bridged |                                         81
            mkdir |                                         83
              cat |                                         96
         sendmail |                                         101
            ipmon |                                         108
          dirname |                                         115
       svc.startd |                                         129
            ksh93 |                                         164
               ls |                                         182
          svcprop |▏                                        258
             bash |▏                                        290
            zpool |▎                                        436
             date |▎                                        448
             cron |▎                                        470
            lldpd |▍                                        595
 dump-minutely-sd |▍                                        609
          haproxy |▍                                        654
      svc.configd |▌                                        907
         zoneadmd |█                                        1554
              zfs |█▎                                       1979
          zoneadm |█▋                                       2414
             nscd |█▋                                       2522
             java |██▏                                      3205
             tail |██▌                                      3775
         postgres |████▏                                    6083
     redis-server |██████████████▋                          21710
           dtrace |███████████████████▎                     28509
             node |█████████████████████████▍               37427
         beam.smp |██████████████████████████████████████   55948

To paraphrase Kent Brockman, I for one welcome our new Unicode Block Element overlords — and I look forward to toiling in their underground sugar caves!
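The fractional bars in that agghist output appear to use the left-block elements (U+258F LEFT ONE EIGHTH BLOCK through U+2588 FULL BLOCK) to render eighths of a column. A hedged sketch of that kind of scaling — `hbar` is hypothetical, and DTrace’s exact rounding may well differ:

```python
# One-eighth through seven-eighths left blocks; a full cell uses U+2588.
EIGHTHS = ["", "▏", "▎", "▍", "▌", "▋", "▊", "▉"]

def hbar(count, peak, width=40):
    """Render `count` relative to `peak` as a bar of at most `width` cells,
    with the final partial cell rounded to the nearest eighth."""
    cells8 = round(width * 8 * count / peak)  # bar length in eighths of a cell
    full, frac = divmod(cells8, 8)
    return ("█" * full + EIGHTHS[frac]).ljust(width)

# Scale redis-server's count against beam.smp's (the peak in the output).
print(hbar(21710, 55948) + " 21710")
```

With eighths of a character cell available, a 40-column bar effectively has 320 distinct lengths — which is why counts that would round to identical ASCII bars remain distinguishable here.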

Availability

All of these new options have been integrated into SmartOS (and we aim to get them upstreamed into illumos). If you’re a Joyent public cloud customer, you either already have these improvements or they will be coming to you when your platform is next upgraded (depending on the compute node that you have landed on). And if you’re a DTrace user and, like Brendan, have some new way that you’d like to see data visualized (or have any other DTrace feature request), don’t hesitate to speak up — DTrace only improves when those of us who depend on it every day envision a way that it could be even better!

Posted on November 10, 2013 at 9:30 pm by bmc

Happy 10th Birthday, DTrace!

Ten years ago this morning — at 10:27a local time, to be exact — DTrace was integrated. On the occasion of DTrace’s fifth birthday a half-decade ago, I reflected on much of the drama of the final DTrace splashdown, but much has happened in the DTrace community in the last five years; some highlights:

And a bunch of other stuff that’s either more esoteric (e.g., llquantize) or that I now use so frequently that I’ve forgotten that there was a time that we didn’t have it (-x temporal FTW!). But if I could go back to my five-year-ago self, the most surprising development to my past-self may be how smoothly DTrace and its community survived the collapse of Sun and the craven re-closing of Solaris. It’s a testament to the power of open source and open communities that DTrace has — along with its operating system of record, illumos — emerged from these events empowered and invigorated. So I would probably show my past-self my LISA talk on the rise of illumos, and he and I would both get a good chuckle out of it all — and then no doubt erupt in fisticuffs over the technical impossibility of fds[] in the non-global zone.

And lest anyone think that we’re done, the future is bright, as the El Dorado of user-land types in DTrace may be gleaming on the horizon courtesy of Adam and Robert Mustacchi. I think I speak on behalf of all of us in the broader DTrace community when I say that I’m very much looking forward to continuing to use, improve and expand DTrace for the next decade and beyond!

Posted on September 3, 2013 at 7:00 am by bmc

Serving up disaster porn with Manta

Years ago, Ben Fried liberated me by giving me the words to describe myself: I am a disaster porn addict. The sordid specifics of my addiction would surely shock the unafflicted: there are well-thumbed NTSB final accident reports hidden under my mattress; I prowl the internet late at night for pictures of dermoid cysts; and I routinely binge on the vexed Soviet missions to Mars.

In terms of my software work, my predilection for (if not addiction to) systems failure has manifested itself in an acute interest in postmortem debugging and debuggability. You can see this through my career, be it postmortem memory leak detection in 1999, postmortem object type identification in 2002, postmortem DTrace in 2003 or postmortem debugging of JavaScript in 2012.

That said, postmortem debugging does have a dark side — or at least, a tedious side: the dumps themselves are annoyingly large artifacts to deal with. Small ones are only tens of megabytes, but a core dump from a modestly large VM — and certainly a crash dump of a running operating system — will easily be gigabytes or tens of gigabytes in size. This problem is not new, and my career has been pockmarked by centralized dump servers that were constantly running low on space[1]. To free up space in the virtual morgue, dumps from resolved issues are inevitably torched on a regular basis. This act strikes me as a kind of desecration; every dump — even one whose immediate failure is understood — is a sacred snapshot of a running system, and we can’t know what questions we might have in the future that may be answerable by the past. There have been many times in my career when debugging a new problem has led to my asking new questions of old dumps.

At Joyent, we have historically gotten by on the dump management problem with the traditional centralized dump servers managed with creaky shell scripts — but it wasn’t pretty, and I lived in fear of dumps slipping through the cracks. Ironically, it was the development of Manta that made our kludged-together solution entirely untenable: while in the Joyent public cloud we may have the luxury of (broadly) ignoring core dumps that may correspond to errant user code, in Manta, we care about every core dump from every service — we always want to understand why any component fails. But because it’s a distributed service, a single bug could cause many components to fail — and generate many, many dumps. In the earlier stages of Manta development, we quickly found ourselves buried under an avalanche of hundreds of dumps. Many of them were due to known issues, of course, but I was concerned that a hitherto unseen issue would remain hidden among the rubble — and that we would lose our one shot to debug a problem that would return to haunt us in production.

A sexy solution

Of course, the answer to the Manta dump problem was clear: Manta may have been presenting us with a big data problem, but it also provided us a big data solution. Indeed, Manta is perfectly suited for the dump problem: dumps are big, but they also require computation (if in modest measure) to make any real use of them, and Manta’s compute abstraction of the operating system is perfect for running the debugger. There were just two small wrinkles, the first being that Manta as designed only allowed computation access to an object via a (non-interactive) job, posing a problem for the fundamentally interactive act of debugging. This wasn’t a technological limitation per se (after all, the operating system naturally includes interactive elements like the shell and so on), but allowing an interactive shell to be easily created from a Manta job was going to require some new plumbing. Fortunately, some lunchtime moaning on my part managed to convince Joshua Clulow to write this (if only to quiet my bleating), and the unadulterated awesomeness that is mlogin was born.

The second wrinkle was even smaller: for historical (if not antiquated) reasons, kernel crash dumps consist not of one file but two: the crash dump itself and a “name list” that contains the symbol table. But for over a decade we have also had the symbol table in the dump itself, and it was clear that the vestigial appendix that was unix.0 needed to be removed and libkvm modified to dig the symbol table out of the dump — a straightforward fix.

Wrinkles smoothed, we had the necessary foundation in Manta to implement Thoth, a Manta-based system for dump management. Using the node-manta SDK to interact with Manta, Thoth allows for uploading dumps to Manta, tracking auxiliary information about each dump, debugging a dump (natch), querying dumps, and (most interestingly) automatically analyzing many dumps in parallel. (If you are in the SmartOS or broader illumos community and you want to use Thoth, it’s open source and available on GitHub.)

While the problem Thoth solves may have but limited appeal, some of the patterns it uses may be more broadly applicable to those building Manta-based systems. As you might imagine, Thoth uploads a dump by calculating a unique hash for the dump on the client[2]. Once the hash is calculated, a Manta directory is created that contains the dump itself. For the metadata associated with the dump (machine, application, datacenter and so on), I took a quick-and-dirty approach: Thoth stores the metadata about a dump in a JSON payload that lives beside the dump in the same directory. Here is what a JSON payload might look like:

% thoth info 6433c1ccfb41929d
{
  "name": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea",
  "dump": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea/core.svc.startd.70377",
  "pid": "70377",
  "cmd": "svc.startd",
  "psargs": "/lib/svc/bin/svc.startd",
  "platform": "joyent_20130625T221319Z",
  "node": "eb9ca020-77a6-41fd-aabb-8f95305f9aa6",
  "version": "1",
  "time": 1373566304,
  "stack": [
    "libc.so.1`_lwp_kill+0x15()",
    "libc.so.1`raise+0x2b()",
    "libc.so.1`abort+0x10e()",
    "utmpx_postfork+0x44()",
    "fork_common+0x186()",
    "fork_configd+0x8d()",
    "fork_configd_thread+0x2ca()",
    "libc.so.1`_thrp_setup+0x88()",
    "libc.so.1`_lwp_start()"
  ],
  "type": "core"
}
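To make the layout concrete, here is a minimal sketch (my own illustration, not Thoth’s actual implementation) of how a dump and its metadata might land side by side in Manta. The hashing scheme shown is an assumption: Thoth computes its unique hash differently, parsing the dump client-side via node-ctype. The mmkdir and mput commands are the node-manta CLI tools; they are echoed rather than executed so the sketch runs without a Manta account.

```shell
# Illustrative sketch only: stage a fake dump, derive a hash-named
# Manta directory for it, and show the node-manta commands (mmkdir,
# mput) that would store the dump and its metadata side by side.
dump=core.svc.startd.70377
printf 'fake dump contents' > "$dump"

# Hypothetical hashing scheme; Thoth computes its unique hash differently.
hash=$(sha256sum "$dump" | awk '{ print substr($1, 1, 32) }')
dir="/${MANTA_USER:-thoth}/stor/thoth/$hash"

# Echoed rather than executed, so the sketch runs without a Manta account.
echo "mmkdir -p $dir"
echo "mput -f $dump $dir/$dump"
echo "mput -f info.json $dir/info.json"

rm -f "$dump"
```

The key design point is that the metadata object lives beside the dump in the same hash-named directory, so a job can find both with a single listing.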

To actually debug a dump, one can use the “debug” command, which simply uses mlogin to allow interactive debugging of the dump via mdb:

% thoth debug 6433c1ccfb41929d
thoth: debugging 6433c1ccfb41929dedf3257a8d9160ea
 * created interactive job -- 8ea7ce47-a58d-4862-a070-a220ae23e7ce
 * waiting for session... - established
thoth: dump info in $THOTH_INFO
Loading modules: [ svc.startd libumem.so.1 libc.so.1 libuutil.so.1 libnvpair.so.1 ld.so.1 ]
>

Importantly, this allows you to share a dump with others without moving the dump and without needing to grant ssh access or punch holes in firewalls — they need only have access to the object. For an open source system like ours, this alone is huge: if we see a panic in (say) ZFS, I would like to enlist (say) Matt or George to help debug it — even if only to verify a hypothesis that we have developed. Transporting tens of gigs around becomes so painful that we simply don’t do it — and the quality of software suffers.

To query dumps, Thoth runs simple Manta compute jobs on the (small) metadata objects. For example, if you wanted to find all of the dumps from a particular machine, you would run:

% thoth ls node=eb9ca020-77a6-41fd-aabb-8f95305f9aa6
thoth: creating job to list
thoth: created job 2e8af859-ffbf-4757-8b54-b7865307d4d9
thoth: waiting for completion of job 2e8af859-ffbf-4757-8b54-b7865307d4d9
DUMP             TYPE  TIME                NODE/CMD         TICKET
0ccf27af56fe18b9 core  2013-07-11T18:11:43 svc.startd       OS-2359
6433c1ccfb41929d core  2013-07-11T18:11:44 svc.startd       OS-2359

This kicks off a job that looks at all of the metadata objects in parallel in a map phase, pulls out those that match the specified machine, and passes them to a reduce phase that simply coalesces the results into a larger JSON object. We can use the mjob command along with Trent Mick’s excellent json tool to see the first phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[0].exec
mls /thoth/stor/thoth | awk '{ printf("/thoth/stor/thoth/%s/info.json\n", $1) }' | xargs mcat

So this, as it turns out, is a tad fancy: in the context of a job, it runs mls on the thoth directory, formats that as an absolute path name to the JSON object, and then uses the venerable xargs to send that output as arguments to mcat. mcat interprets its arguments as Manta objects, and runs the next phase of a job in parallel on those objects. We could — if we wanted — do the mls on the client and then pass in the results, but it would require an unnecessary roundtrip between server and client for each object; the power of Manta is that you can get the server to do work for you! Now let’s take a look at the next phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[1].exec
json -c 'node=="eb9ca020-77a6-41fd-aabb-8f95305f9aa6"'

This is very simple: it just uses json’s -c option to pass through the entire JSON blob if the “node” property matches the specified value. Finally, the last phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[2].exec
json -g

This uses json (again) to aggregate all of the JSON blobs into a single array — and this array is what is returned as the output of the job. Now, as one might imagine, all of this is a slight abuse of Manta, as it’s using Manta to implement a database (albeit crudely) — but Manta performs well enough that this works fine, and it’s much simpler than spinning up infrastructure dedicated to Thoth (which would require sizing a database, making it available, backing it up, etc.). Manta loves it quick and dirty!
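That first-phase plumbing can be tried locally without Manta at all; here is the job’s awk stage verbatim, fed a sample hash-named directory entry of the sort mls would produce (the mls and mcat pieces, of course, only make sense inside a job):

```shell
# The awk stage from the map phase, run against sample mls-style output.
echo '6433c1ccfb41929dedf3257a8d9160ea' |
    awk '{ printf("/thoth/stor/thoth/%s/info.json\n", $1) }'
# -> /thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea/info.json
```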

Getting kinky

Leveraging Manta’s unique strengths, Thoth also supports a notion of analyzers, simple shell scripts that run in the context of a Manta job. For example, here is an analyzer that determines if a dump is a duplicate of a known svc.startd issue:

#
# This analyzer only applies to core files
#
if [[ "$THOTH_TYPE" != "core" ]]; then
	exit 0;
fi

#
# This is only relevant for svc.startd
#
if [[ `cat $THOTH_INFO | json cmd` != "svc.startd" ]]; then
	exit 0;
fi

#
# This is only OS-2359 if we have utmpx_postfork in our stack
#
if ( ! echo ::stack | mdb $THOTH_DUMP | grep utmpx_postfork > /dev/null ); then
	exit 0;
fi

#
# We have a winner! Set the ticket.
#
thoth_ticket OS-2359
echo $THOTH_NAME: successfully diagnosed as OS-2359

Thoth’s “analyze” command can then be used to run this against one dump, any dumps that match a particular specification, or all dumps.

The money shot

Here’s a more involved analyzer that pulls an SMF FMRI out of the dump and — looking at the time of the dump — finds the log (in Manta!) that corresponds to that service at that hour and looks for a string (“PANIC”) in that log that corresponds to service failure. This log snippet is then attached to the dump via a custom property (“paniclog”) that can then be queried by other analyzers:

if [[ "$THOTH_TYPE" != "core" ]]; then
	exit 0
fi

if ( ! pargs -e $THOTH_DUMP | grep -w SMF_FMRI > /dev/null ); then
	exit 0
fi

fmri=`pargs -e $THOTH_DUMP | grep -w SMF_FMRI | cut -d= -f2-`
time=`cat $THOTH_INFO | json time`
node=`cat $THOTH_INFO | json node`

path=`basename $fmri | cut -d: -f1`/`date -d @$time +%Y/%m/%d/%H`/${node}.log

echo === $THOTH_NAME ===

echo log: $path

if ( ! mget /$MANTA_USER/stor/logs/$path > log.out ); then
	echo "  (not found)"
	exit 0
fi

grep -B 10 -A 10 PANIC log.out > panic.out

echo paniclog:
cat panic.out

thoth_set paniclog < panic.out
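The log-path derivation in the middle of that analyzer can be checked locally. The FMRI below is hypothetical, while the time and node values come from the info.json shown earlier; GNU date’s -d @epoch converts the dump’s timestamp into the hourly log directory (with -u added here so the result is timezone-stable):

```shell
# Hypothetical FMRI; time/node values taken from the earlier info.json.
fmri='svc:/system/svc/restarter:default'
time=1373566304
node=eb9ca020-77a6-41fd-aabb-8f95305f9aa6

# The same derivation the analyzer performs to locate the hourly log.
path=$(basename "$fmri" | cut -d: -f1)/$(date -u -d @$time +%Y/%m/%d/%H)/${node}.log
echo "$path"
# -> restarter/2013/07/11/18/eb9ca020-77a6-41fd-aabb-8f95305f9aa6.log
```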

In a distributed system, logs are a critical component of system state — being able to quickly (and automatically) couple those logs with core dumps has been huge for us in terms of quickly diagnosing issues.

Pillow talk

Thoth is definitely a quick-and-dirty system — I was really designing it to satisfy our own immediate needs (and yes, addictions). If the usage patterns changed (if, for example, one were going to have millions of dumps instead of thousands), one would clearly want a proper database fronting Manta. Of course, it must also be said that one would then need to deal with annoying infrastructure issues like sizing that database, making it available, backing it up, etc. And if (when?) this needs to be implemented, my first approach would be to broadly keep the structure as it is, but add a Manta job that iterates over all metadata and adds the necessary rows in the appropriate tables of a centralized reporting database. This allows Manta to store the canonical data, leveraging the in situ compute to populate caching tiers and/or reporting databases.

As an engineer, it’s always gratifying to build something that you yourself want to use. Strange as it may sound to some, Thoth epitomizes that for me: it is the dump management and analysis framework that I never dreamed possible — and thanks to Manta, it was amazingly easy (and fun!) to build. Most importantly, in just the short while that we’ve had Thoth fully wired up, we have already caught a few issues that would likely have escaped notice before. (Thanks to a dump harvested by Thoth, we caught a pretty nasty OS bug that had managed to escape detection for many years.) Even though Thoth’s purpose is admittedly narrow, it’s an example of the kinds of things that you can build on top of Manta. We have many more such ideas in the works — and we’re not the only ones; Manta is inspiring many with their own big data problems — from e-commerce companies to scientific computing and much in between!


[1] Pouring one out for my dead homie, cores2.central

[2] Shout-out to Robert’s excellent node-ctype, which allows binary dumps to be easily parsed client-side

Posted on July 23, 2013 at 5:05 pm by bmc

Manta: From revelation to product

If you haven’t seen it, you should read Mark’s blog entry on Manta, the revolutionary new object storage service we announced today that features in situ compute. The idea for this is beautifully simple: couple a ZFS-based distributed object storage system with the OS-level virtualization in SmartOS to deliver a system that allows arbitrary compute to be spun up where objects live. That is, not only can you store and retrieve objects (as you can with any internet-facing object store), you can also specify compute jobs to be operated upon those objects without requiring data motion. (If you still need to be convinced that this represents a new paradigm of object storage, check out this screencast.)

Sometimes simple ideas can seem obvious in hindsight — especially as the nuance of historical context is lost to time — so for the record let me be clear on this point: this idea wasn’t obvious. I say this unequivocally because I myself was trying to think about how we could use the technological differentiators in SmartOS to yield a better object store — and it was very hard not to seek that differentiator in ZFS. As myopic as it may seem in retrospect, I simply couldn’t look beyond ZFS — it just had to hold the key to a next-generation object store!

But while ZFS is essential to Manta, it is ultimately an implementation detail; the technology that serves as the essential enabler is the OS-level virtualization provided by Zones, which allow us to easily, quickly and securely spin up on-the-metal, multi-tenant compute on storage nodes without fear of compromising the integrity of the system. Zones hit the sweet spot: hardware virtualization (e.g. KVM) is at too low a level of abstraction to allow this efficiently, and higher levels of virtualization (e.g. PaaS offerings) sacrifice expressive power and introduce significant multi-tenancy complexity and risk.

Of course, all of this was obvious once Mark had the initial insight to build on Zones; what we needed to build was instantly self-evident. This flash of total clarity is rare in a career; I have only felt it a handful of times and it’s such an intense moment that it becomes locked in memory. I remember exactly where I was when Bonwick described to me the first ideas for what became ZFS (my dimly lit office in MPK17, circa 2000) — and I remember exactly where I was when I described to Bonwick my first ideas for what became DTrace (in Bart’s old blue van on Willow Road crossing the 101, February 1996). Given this, there was one thing about Manta that troubled me: I couldn’t remember where I was when Mark described the idea to me. I knew that it came from Mark, and I knew that it was sometime in the fall of 2011, but I couldn’t remember the details of the conversation. In talking to Mark about this, he couldn’t remember either — so I decided to go through old IM logs to determine when we first started talking about it to help us both date it.

And in going through my logs, it became clear why I couldn’t remember that initial conversation — because there wasn’t one, at least not in the traditional sense: it happened over IM. This (accidentally) captured for posterity a moment of which one has so few: having one’s brain blasted by enlightenment. (I know that it will disappoint my mother that I dealt with this essentially by swearing, so let me pause to explain that this isn’t her fault; as she herself points out from time to time, I wasn’t raised this way.) So here is my initial conversation with Mark, with some sensitive details redacted:

Of course, a flash of insight is a long way from a product — and that conversation over a year and a half ago was just the beginning of a long journey. As Mark mentioned, shortly after this conversation, he was joined by Dave who led the charge on Manta computation. To this duo, we added Yunong, Nate and Fred. Beyond that core Manta team, Keith developed the hardware architecture for Manta, Jerry developed the first actual Manta code with hyprlofs, and Bill and Matt wrote a deployment management system for Manta — and Matt further developed the DNS service that is at the heart of the system.

As Mark mentioned, Manta is built on top of SmartDataCenter, and as such the engineering work behind it was crucial to Manta: Orlando developed the compute node service that is involved in the provisioning of every Manta service, Pedro built the workflow engine that actually implements provisioning, and Kevin developed the operator console that you’ll just have to trust me is awesome. John developed the auth mechanism that many first-time users will use to create their Joyent accounts today, and Andrés developed the name service replication that assures that those new users will be able to store to Manta.

In terms of SDKs, Marsell developed the ruby-manta SDK, Trent developed both the Python SDK (including mantash!) and Bunyan — the node.js logging service that has been proven indispensable time and time again when debugging Manta issues. Speaking of node.js, no one will be surprised to learn that Manta is largely implemented in node — and that TJ and Isaac were both clutch in helping us debug some nasty issues down the stretch, reinforcing our conviction that the entire stack should be under one (virtual) roof!

Josh Wilsdon developed the vmadm that lies at the heart of SmartOS provisioning — and deserves special mention for the particularly heavy hand he applied to Manta in some of our final load testing; any system that can survive Josh is production-ready! Robert and Rob both jumped in on countless networking issues — they were both critical for implementing some necessary but complicated changes to Manta’s physical networking topology. Brendan provided the performance analysis that is his hallmark, joining up with Robert to form Team Tunable — from whom no TCP tunable is safe!

Jonathan developed the multi-architecture package support that became necessary for the Manta implementation, and Filip made sure all sorts of arcane software ran on SmartOS in the Manta compute zone. (When you find that what you’re looking for is already installed in a Manta compute zone, you have Filip to thank!) Finally, Josh Clulow developed mlogin — which really must be tried to be believed. If you’re trying to understand the Manta compute model or if you just want to play around, give mlogin a whirl!

From Mark’s initial flash of insight (and my barrage of swearing) to finished product, it has been a long road, but we are proud to deliver it to you today; welcome to Manta — and dare we say to the next generation of object storage!

Posted on June 25, 2013 at 6:16 am by bmc

A systems software double-header: Surge and GOTO

I recently returned from a systems software double-header, presenting at two of the industry’s best conferences: Surge in Baltimore and GOTO in Aarhus, Denmark. These conferences have much in common: they are both expertly run; they both seek out top technical content; they both attract top technologists; and (it must be said) they both take excellent care of their speakers!

At Surge, I presented with Brendan Gregg on DIRT in production. This was a dense presentation, but it was fun for us to recount a sampling of the kinds of issues that we have seen in the new breed of latency-sensitive, data-intensive applications. I think Brendan and I both agreed that this presentation could have easily been two or three times the length; we have plenty of scars from finding latency outliers in DIRTy systems! Beyond presenting, the lightning talks were a highlight for me at this year’s Surge. I am an inveterate disaster porn addict, so it should come as no surprise that I particularly enjoyed Chris Burroughs’ retelling of how the day he had set aside to prepare his lightning talk was beset by disaster (including a PDU failure at the moment it seemed things couldn’t get any worse), and then Scott Sanders on the surprisingly nasty failure modes that one can see in a mobile app when the root cause is a botched datacenter upgrade at your mobile provider. For my own lightning talk, I dug into the perverse history of what may well be the most obscure two-letter Unix command: the “old and rarely used” ta; hopefully others enjoyed learning about the history of this odd little wart as much as I did!

Update: This lightning talk is now online and can be viewed here.


Dave and me with Lars Bak, Kasper Lund and the Dart (né V8) team

After Surge, Dave and I headed east, to Denmark. There, we presented on dynamic languages in production. Our purposes for going to Aarhus were more than just GOTO, however: we also went to meet with Lars Bak, Kasper Lund, Erik Corry, the inimitable Mr. Aleph and the other engineers behind V8. As the birthplace of V8, Aarhus has always had a Dagobah-like appeal to me: I feel that when on the freezing ice planet that is mdb_v8.c, Dave and I might as well have seen Obi-Wan Kenobi as an apparition, telling us we must go to Aarhus if we wanted to truly understand objects-inl.h. Dave and I immensely enjoyed our time with Lars and his team; it is rare to be in the presence of a team that has single-handedly changed the trajectory of software — and it is rarer still to be able to bring one’s own technology to such an august table. We were thrilled to be able to demonstrate the confluence of our technology (specifically, MDB and DTrace) with theirs — and to brainstorm ways that we might better solve the abstract problem of production debuggability of dynamic environments. This is a problem that we’ve been after in one way or another for nearly a decade, but this was the first time that we were having the conversation early enough in the VM implementation phase (specifically, of Dart) that one can hope to get ahead of the problem instead of chasing it. In particular, Lars’s team has a very interesting idea for debuggability of Dart that — if successful — may allow it to achieve our nirvana of total postmortem debuggability in a dynamic environment. That, coupled with its enlightened (C-based!) native interface, left Dave and me optimistic about the long-term prospects of Dart on the server side. It’s obviously still early days for Dart, but given the pedigree and talent of the team and their disposition towards making it both highly performant and highly debuggable, it’s clearly something to pay attention to.

Presentation and meeting accomplished, I still had a burning question: what was all of this terrific VM talent doing in the second city of a small country with an excruciatingly difficult language to pronounce? (Aside: ask a Dane to say “dead red red-eyed rotten smoked trout” in Danish.) In talking to Lars, Kasper and Erik about how they ended up in Aarhus a common theme emerged: all had — in one way or another — been drawn to work with Professor Ole Lehrmann Madsen on the VM for the BETA language. (No, I hadn’t heard of BETA either, and yes, it’s in all caps — the name is apparently meant to be screamed.) Everyone I spoke with had such fond memories of BETA (“you know, it was actually a very interesting system…”) that I have taken to calling this school of VM engineers the BETA School. The work and influence of the engineers of the BETA School extend beyond Google, with its members having done foundational VM work at VMware, Sun and a slew of startups.

I suppose I shouldn’t be surprised that this cluster of engineers has a professor and a university at its root; Brown’s Tom Doeppner has served a similar role for us, with his CS169 giving rise to an entire generation of OS innovation. The two schools of engineers share much in common: both are highly technical and close-knit — but with a history of being open to newcomers and an emphasis on developing their junior members. (Of course, the difference is that we of the CS169 School don’t all live in Providence.) And the intersection of the two schools is particularly interesting: VMware’s seminal ASPLOS paper is a joint product of Keith Adams of the CS169 School and Ole Agesen of the BETA School.

Curiosity about Aarhus and its origins sated, I arrived home from my systems software double-header exhausted but also inspired: there seems to be no limit to the hard, interesting problems in our domain — and many profoundly talented engineers to tackle them!

Posted on October 8, 2012 at 12:00 am by bmc

Post-revolutionary open source

I spent last week in Porto Alegre, Brazil at FISL (the Fórum Internacional Software Livre — which Americans will pronounce “fizzle” but which the Brazilians pronounce more like “fees-lay”). I was excited to go to FISL based on the past experiences of former Sun Microsystems folks like Eric Schrock and Deirdré Straughan — and I thought that it would provide an excellent opportunity to present what I’ve learned over the past decade or so about how companies screw up open source. My presentation, corporate open source anti-patterns, was well received — surprisingly so, in fact. In particular, I had expected controversy around my claim that anti-collaborative strong-copyleft licenses — which is to say, the GPLv2 — are an open source anti-pattern. As I noted in my presentation, I believe that the GPLv2 serves more to erect walls between open source projects than it does to liberate proprietary ones. To support this assertion, I pointed to the fact that, despite significant demand, Linux does not have ports of DTrace and ZFS generally available — even though they have been open source for the better part of a decade and have been ported to many other operating systems. But as I made the case for moving away from the GPLv2 (and towards collaboration-friendly licenses like MPLv2), I was surprised to see heads bobbing in agreement. And when it came time for questions, the dissent (such as it was) was not regarding my assertion about GPLv2, but rather about copyright assignment: I claimed (and claim) that one must be very careful about having copyright assignment in an open source project because it relies on the good behavior of a single actor. But one of the questioners pointed out that lack of copyright assignment has the very important drawback of making changing the license essentially impossible. The questioner’s background is relevant here: it was Jean-Baptiste Kempf, one of the core contributors to VLC. 

JB’s question came from his own experience: he had just led the VLC team through relicensing libVLC, an exercise made excruciating by their lack of copyright assignment. Why were they relicensing their core software? Ironically, because they had come to the same conclusion with respect to anti-collaborative licensing: they relicensed libVLC from GPLv2 to LGPLv2 to allow it to be more readily shared with other projects. And of course, VLC’s not alone — meteor.js is another high-profile project that came out as GPLv2 but has since relicensed to MIT.

So what’s going on here? I think the short answer is that open source is firmly in its post-revolutionary phase: it is so thoroughly mainstream that it is being entirely de-radicalized. While some extremists may resist this trend, it actually reflects the depth of victory for open source: the triumph is so complete that the young software engineers of today simply don’t see proprietary software as a threat. Like an aspiring populace after a long period of conflict, the software engineers behind today’s top open source projects are looking much more to a future of collaboration than to a past of internecine rivalries. While there are many indicators of this new era — from the license breakdown of GitHub’s top projects to the rise of GitHub itself — I think the MPLv2 particularly captures this zeitgeist: it was explicitly designed to be compatible with both the GPL (GPLv2 and GPLv3) and Apache. So the open source revolution may be over, but that’s because the war has been won — may we not squander its spoils!

Posted on August 1, 2012 at 9:49 pm by bmc

DTrace in the zone

When the zones facility was originally developed many years ago, our focus on integrating it with respect to DTrace was firmly on the global zone: we wanted to be sure that we could use DTrace from the global zone to instrument the entire system, across all zones. This was (and remains) unique functionality — no other implementation of OS-based virtualization has the whole-system dynamic instrumentation to go with it. But as that focus implies, the ability to use DTrace in anything but the global zone was not a consideration. Fortunately, in 2006, Dan Price on the zones team took it upon himself to allow DTrace to be used in the non-global zone. This was a tricky body of work that had to carefully leverage the DTrace privilege model developed by Adam, and it allowed root users in the non-global zone to have access to the syscall provider and any user-level providers in their zone (both USDT-provided and pid-provided).

As great as this initial work was, when I started at Joyent nearly two years ago, DTrace support in the zone hadn’t really advanced beyond it — and there were still significant shortcomings. In particular, out of (reasonable) paranoia, nothing in the kernel could be instrumented or recorded by an enabling in the non-global zone. The problem is that many commonly used D variables — most glaringly, cpu, curlwpsinfo, curpsinfo, and fds[] — rely on access to kernel structures. Worse, because of our success in abstracting away the kernel with these variables, users didn’t even know that they were trying to record illegal kernel data — which meant that the error message when it failed was entirely unintuitive:

  [my-non-global-zone ~]# dtrace -n BEGIN'{trace(curpsinfo->pr_psargs)}'
  dtrace: description 'BEGIN' matched 1 probe
  dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44

The upshot of this was, as one Joyent customer put it to me bluntly, “whenever I type in a script from Brendan’s book, it doesn’t work.” And just to make sure I got the message, the customer sharpened the point considerably: “I’m deployed on SmartOS because I want to be able to use DTrace, but you won’t let me have it!” At that point, I was pretty much weeping and begging for forgiveness — we clearly had to do better.

As I considered this problem, though, a somewhat obvious solution to at least a portion of the problem became clear. The most painful missing variables — curpsinfo and curlwpsinfo — are abstractions on the kernel’s proc_t and kthread_t structures, respectively. There are two important observations to be made about these structures. First, looking at these structures, there is nothing about one’s own proc_t or kthread_t structures that one can use for privilege escalation: it’s either stuff you already can figure out about yourself via /proc, or kernel addresses that you can’t do anything with anyway. Second (and essentially), we know that these structures cannot disappear or otherwise change identity while in probe context: these structures are only freed after the process or thread that they represent has exited — and we know by merits of our execution in probe context that we have not, in fact, exited. So we could solve a good portion of the problem just by allowing access to these structures for processes for which one has the dtrace_proc privilege. This change was straightforward (and the highlight of it for me personally was that it reminded me of some clever code that Adam had written that dynamically instruments DIF to restrict loads as needed). So this took care of curpsinfo, curlwpsinfo, and cpu — but it also brought into stark relief an even tougher problem: fds[].

The use of fds[] in the non-global zone has long been a problem, and it’s a brutal one, as fds[] requires grovelling around in kernel data structures. For context, here is the definition of fds[] for illumos, as defined in io.d:

inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
    fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
    curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

Clearly, one cannot simply allow such arbitrary kernel pointer chasing from the non-global zone without risk of privilege escalation. At dtrace.conf, I implied (mistakenly, as it turns out) that this problem depends on (or would be otherwise solved by) dynamic translators. As I thought about this problem after the conference, I realized that I was wrong: even if the logic to walk those structures were in-kernel (as it would be, for example, with the addition of a new subroutine), it did not change the fundamental problem that the structures that needed to be referenced could themselves disappear (and worse, change identity) during the execution of an enabling in probe context. As I considered the problem, I realized that I had been too hung up on making this arbitrary — and not focussed enough on the specifics of getting fds[] to work.

With the focus sharpening to fds[], I realized that some details of the kernel’s implementation could actually make this reasonable to implement. In particular, Jeff’s implementation of allocating slots in fi_list means that an fi_list never goes away (it is doubled but the old fi_list is not freed) and that fi_nfiles is never greater than the memory referred to by fi_list. (This technique — memory retiring — can be used to implement a highly concurrent table; the interested reader is pointed to both the ACM Queue article that Jeff and I wrote that describes it in detail and to the implementation itself.) So as long as one is only able to get the file_t for one’s own file descriptor, we could know that the array of file_t pointers would not itself be freed over the course of probe context. That brings us to the file_t itself: unlike the fi_list, the file_t can be freed while one is in probe context because one thread could be closing a file while another is in probe context referencing fds[]. Solving this problem required modifying the kernel, if only slightly: I added a hook to the closef() path that DTrace can optionally use to issue a dtrace_sync(). (A dtrace_sync() issues a synchronous cross-call to all CPUs; because DTrace disables interrupts in probe context, this can be used as a barrier with respect to probe context.)

Adding a hook of this nature requires an understanding of the degree to which the underlying code path is performance-critical. That is, to contemplate adding this hook, we needed to ask: how hot is closef(), anyway? Historically in my career as a software engineer, this kind of question would be answered with a combination of ass scratching and hand waving. Post-DTrace, of course, we can answer those questions directly — but only if we have a machine that has a representative workload (which we only had to a very limited degree at Sun). But one of my favorite things about working at Joyent is that we don’t have to guess when we have a question like this: we can just fire up DTrace on every node in the public cloud for a couple of seconds and get the actual data from (many) actual production machines. And in this case, the data was at least a little bit surprising: we had (many) nodes that were calling closef() thousands of times per second. (The hottest node saw nearly 5,000 closef()’s per second over a ten second period — and there were many that saw on the order of 2,000 per second.) This is way too hot to always be executing a dtrace_sync() when DTrace is loaded (as it effectively always is on our systems), so we had to be sure that we only executed the dtrace_sync() in the presence of an enabling that actually used fds[] with reduced privileges (e.g., from a non-global zone).

This meant that fds[] needed to be implemented in terms of a subroutine such that we could know when one of these is enabled. That is, we needed a D subroutine to translate from a file descriptor to a file_t pointer within the current context. Given this, the name of the subroutine was obvious: it had to be getf() — the Unix routine that does exactly this in the kernel, and has since Fourth Edition Unix, circa 1973. (Aside: one of the reasons I love illumos is because of this heritage: compare the block comment above the Seventh Edition implementation of getf() to that above getf() currently in illumos — yes, we still have comments written by Ken!) So the change to allow fds[] in the non-global zone ended up being more involved, but in the end, it wasn’t too bad — especially given the functionality that it unlocked.

With these two changes made, Brendan and I brainstormed about what remained out of reach for the non-global zone — and the only other thing that we could come up with as commonly wanted in the non-global zone was access to the sched and proc providers. Allowing use of these in-kernel providers would require allowing enabling of the probes but not firing them when outside the context of one’s own zone (similar to the syscall provider) and further explicitly forbidding getting anything about privileged context (specifically, register state). Fortunately, I had had to do exactly this to fix a somewhat nasty zones-related issue with the profile provider, and it seemed that it wouldn’t be too bad to extend this with these slightly new semantics. This proved to be the case, and the change to allow sched, proc, vminfo and sysinfo providers in the non-global zone ended up being very modest.

So with these three changes, I am relieved to report that DTrace is now completely usable in the non-global zone — and all without sacrificing the security model of zones! If you are a Joyent cloud customer, we will be rolling out a platform with this modification across the cloud (it necessitates a reboot, so don’t expect it before your next scheduled maintenance window); if you are a SmartOS user, look for this in our next SmartOS release; and if you are using another illumos-based distro (e.g., OpenIndiana or OmniOS) look for it in an upcoming release — we will be integrating these changes into illumos, so you can expect them to be in downstream distros soon. And here’s to DTrace in the zone!

Posted on June 7, 2012 at 12:18 am by bmc

Debugging node.js memory leaks

Part of the value of dynamic and interpreted environments is that they handle the complexities of dynamic memory allocation. In particular, one needn’t explicitly free memory that is no longer in use: objects that are no longer referenceable are found automatically and destroyed via garbage collection. While garbage collection simplifies the program’s relationship with memory, it does not mean the end of all memory-based pathologies: if an application retains a reference to an object that is ultimately rooted in a global scope, that object won’t be considered garbage and the memory associated with it will not be returned to the system. If enough such objects build up, allocation will ultimately fail (memory is, after all, finite) and the program will (usually) fail along with it. While this is not strictly — from the native code perspective, anyway — a memory leak (the application has not leaked memory so much as neglected to unreference a semantically unused object), the effect is nonetheless the same and the same nomenclature is used.
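To make this concrete, here is a minimal node.js sketch (all names are hypothetical, for illustration only) of an object that is semantically dead but still rooted in module scope — and therefore never collectable:

```javascript
// A process-lifetime root: anything reachable from here is never garbage.
const seen = {};

// Hypothetical handler intended to de-duplicate visits; in practice it
// retains every payload ever recorded, because `seen` is always reachable.
function recordVisit(userId, i, payload) {
  seen[userId + ':' + i] = payload;
}

for (let i = 0; i < 1000; i++) {
  recordVisit('user' + (i % 10), i, { data: 'x'.repeat(1000) });
}

// None of these entries will ever be read again, but the collector cannot
// know that: it measures reachability, not semantic liveness.
console.log(Object.keys(seen).length + ' payloads retained');
```

In a long-running service, the loop above is a request handler and the growth is unbounded — which is exactly the shape of leak that only shows up after days in production.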

While all garbage collected environments carry the potential for such leaks, they can be particularly easy to create in JavaScript: closures create implicit references to variables within their scopes — references that might not be immediately obvious to the programmer. And node.js adds a new dimension of peril with its strictly asynchronous interface with the system: if backpressure from slow upstream services (I/O, networking, database services, etc.) isn’t carefully passed to downstream consumers, memory will begin to fill with the intermediate state. (That is, what one gains in concurrency of operations one may pay for in memory.) And of course, node.js is on the server — where the long-running nature of services means that the effect of a memory leak is much more likely to be felt and to affect service. Take all of these together, and you can easily see why virtually anyone who has stood up node.js in production will identify memory leaks as their most significant open issue.
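The closure hazard in particular can be subtle. The following sketch (hypothetical names; the retention behavior depends on V8 allocating one shared Context per enclosing scope) shows how a small closure can pin memory that it never touches:

```javascript
const handlers = []; // long-lived, e.g. registered callbacks

function onRequest(id) {
  const scratch = Buffer.alloc(1024 * 1024); // 1MB of per-request state

  // This helper is never stored anywhere, but because it references
  // `scratch`, V8 must make `scratch` part of the scope's shared Context...
  function debugDump() {
    return scratch.length;
  }

  // ...and this closure, which only needs `id`, holds that same Context --
  // so retaining it can retain the megabyte of scratch space as well.
  handlers.push(function describe() {
    return 'request ' + id;
  });
}

for (let i = 0; i < 20; i++) {
  onRequest(i);
}

console.log(handlers.length + ' handlers retained');
```

Nothing in the code for describe() suggests that it is keeping twenty megabytes alive — which is precisely why these leaks are so hard to find by reading source.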

The state of the art for node.js memory leak detection — as concisely described by Felix in his node memory leak tutorial — is to use the v8-profiler and Danny Coates’ node-inspector. While this method is a lot better than nothing, it’s very much oriented to the developer in development. This is a debilitating shortcoming because memory leaks are often observed only after days (hours, if you’re unlucky) of running in production. At Joyent, we have long wanted to tackle the problem of providing the necessary tooling to identify node.js memory leaks in production. When we are so lucky as to have these in native code (that is, node.js add-ons), we can use libumem and ::findleaks — a technology that we have used to find thousands of production memory leaks over the years. But this is (unfortunately) not the common case: the common case is a leak in JavaScript — be it node.js or application code — for which our existing production techniques are useless.

So for node.js leaks in production, one has been left with essentially a Gedanken experiment: consider your load and peruse source code looking for a leak. Given how diabolical these leaks can be, it’s amazing to me that anyone ever finds these leaks. (And it’s less surprising that they take an excruciating amount of effort over a long period of time.) That’s been the (sad) state of things for quite some time, but recently, this issue boiled over for us: Voxer, a Joyent customer and long in the vanguard of node.js development, was running into a nasty memory leak: a leak sufficiently acute that the service in question was having to be restarted on a regular basis, but not so acute that it was reproducible in development. With the urgency high on their side, we again kicked around this seemingly hopeless problem, focussing on the concrete data that we had: core dumps exhibiting the high memory usage, obtained at our request via gcore(1) before service restart. Could we do anything with those?

As an aside, a few months ago, Joyent’s Dave Pacheco added some unbelievable postmortem debugging support for node. (If postmortem debugging is new to you, you might be interested in checking out my presentation on debugging production systems from QCon San Francisco. I had the privilege of demonstrating Dave’s work in that presentation — and if you listen very closely at 43:26, you can hear the audience gasp when I demonstrate the amazing ::jsprint.) But Dave hadn’t built the infrastructure for walking the data structures of the V8 heap from an MDB dmod — and it was clear that doing so would be brittle and error-prone.

As Dave, Brendan and I were kicking around increasingly desperate ideas (including running strings(1) on the dump — an idea not totally without merit — and some wild visualization ideas), a much simpler idea collectively occurred to us: given that we understood via Dave’s MDB V8 support how a given object is laid out, and given that an object needed quite a bit of self-consistency with referenced but otherwise orthogonal structures, what about just iterating over all of the anonymous memory in the core and looking for objects? That is, iterate over every single address, and see if that address could be interpreted as an object. On the one hand, it was terribly brute force — but given the level of consistency required of an object in V8, it seemed that it wouldn’t pick up too many false positives. The more we discussed it, the more plausible it became, but with Dave fully occupied (on another saucy project we have cooking at Joyent — more on that later), I roped up and headed into his MDB support for V8…

The result — ::findjsobjects — takes only a few minutes to run on large dumps, but provides some important new light on the problem. The output of ::findjsobjects consists of representative objects, the number of instances of that object and the number of properties on the object — followed by the constructor and first few properties of the objects. For example, here is the dump on a gcore of a (non-pathological) Joyent node-based facility:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 20
findjsobjects:                   heap objects => 6079488
findjsobjects:             JavaScript objects => 4097
findjsobjects:              processed objects => 1734
findjsobjects:                 unique objects => 161
OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
fc4671fd        1      1 Object: flags
fe68f981        1      1 Object: showVersion
fe8f64d9        1      1 Object: EventEmitter
fe690429        1      1 Object: Credentials
fc465fa1        1      1 Object: lib
fc46300d        1      1 Object: free
fc4efbb9        1      1 Object: close
fc46c2f9        1      1 Object: push
fc46bb21        1      1 Object: uncaughtException
fe8ea871        1      1 Object: _idleTimeout
fe8f3ed1        1      1 Object: _makeLong
fc4e7c95        1      1 Object: types
fc46bae9        1      1 Object: allowHalfOpen
...
fc45e249       12      4 Object: type, min, max, default
fc4f2889       12      4 Object: index, fields, methods, name
fd2b8ded       13      4 Object: enumerable, writable, configurable, value
fc0f68a5       14      1 SlowBuffer: length
fe7bac79       18      3 Object: path, fn, keys
fc0e9d21       20      5 Object: _onTimeout, _idleTimeout, _idlePrev, ...
fc45facd       21      4 NativeModule: loaded, id, exports, filename
fc45f571       23      8 Module: paths, loaded, id, parent, exports, ...
fc4607f9       35      1 Object: constructor
fc0f86c9       56      3 Buffer: length, offset, parent
fc0fc92d       57      2 Arguments: length, callee
fe696f59       91      3 Object: index, fields, name
fc4f3785       91      4 Object: fields, name, methodIndex, classIndex
fe697289      246      2 Object: domain, name
fc0f87d9      308      1 Buffer:

Now, any one of those objects can be printed with ::jsprint. For example, let’s take fc45e249 from the above output:

> fc45e249::jsprint
{
    type: number,
    min: 10,
    max: 1000,
    default: 300,
}

Note that that’s only a representative object — there are (in the above case) 12 objects that have that same property signature. ::findjsobjects can get you all of them when you specify the address of the reference object:

> fc45e249::findjsobjects
fc45e249
fc46fd31
fc467ae5
fc45ecb5
fc45ec99
fc45ec11
fc45ebb5
fc45eb41
fc45eb25
fc45e3d1
fc45e3b5
fc45e399

And because MDB is the debugger Unix was meant to have, that output can be piped to ::jsprint:

> fc45e249::findjsobjects | ::jsprint
{
    type: number,
    min: 10,
    max: 1000,
    default: 300,
}
{
    type: number,
    min: 0,
    max: 5000,
    default: 5000,
}
{
    type: number,
    min: 0,
    max: 5000,
    default: undefined,
}
...

Okay, fine — but where are these objects referenced? ::findjsobjects has an option for that:

> fc45e249::findjsobjects -r
fc45e249 referred to by fc45e1e9.height

This tells us (or tries to) who is referencing that first (representative) object. Printing that out (with the “-a” option to show the addresses of the objects):

> fc45e1e9::jsprint -a
fc45e1e9: {
    ymax: fe78e061: undefined,
    hues: fe78e061: undefined,
    height: fc45e249: {
        type: fe78e361: number,
        min: 14: 10,
        max: 7d0: 1000,
        default: 258: 300,
    },
    selected: fc45e3fd: {
        type: fe7a2465: array,
        default: fc45e439: [...],
    },
    ...

So if we want to learn where all of these objects are referenced, we can again use a pipeline within MDB:

> fc45e249::findjsobjects | ::findjsobjects -r
fc45e249 referred to by fc45e1e9.height
fc46fd31 referred to by fc46b159.timeout
fc467ae5 is not referred to by a known object.
fc45ecb5 referred to by fc45eadd.ymax
fc45ec99 is not referred to by a known object.
fc45ec11 referred to by fc45eadd.nbuckets
fc45ebb5 referred to by fc45eadd.height
fc45eb41 referred to by fc45eadd.ymin
fc45eb25 referred to by fc45eadd.width
fc45e3d1 referred to by fc45e1e9.nbuckets
fc45e3b5 referred to by fc45e1e9.ymin
fc45e399 referred to by fc45e1e9.width

Of course, the proof of a debugger is in the debugging; would ::findjsobjects actually be of use on the Voxer dumps that served as its motivation? Here is the (elided) output from running it on a big Voxer dump:

> ::findjsobjects -v
findjsobjects:         elapsed time (seconds) => 292
findjsobjects:                   heap objects => 8624128
findjsobjects:             JavaScript objects => 112501
findjsobjects:              processed objects => 100424
findjsobjects:                 unique objects => 241
OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
fe806139        1      1 Object: Queue
fc424131        1      1 Object: Credentials
fc424091        1      1 Object: version
fc4e3281        1      1 Object: message
fc404f6d        1      1 Object: uncaughtException
...
fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd     1037      3 Object: aborted, data, end
 8045475     1060      1 Object:
fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...
fafc4245     1673      1 Object: slab
fafc44d5     1702      5 Object: search, query, path, href, pathname
fafc440d     1784     14 Client: buffered_writes, name, timing, ...
fafc41c5     1796      3 Object: error, timeout, close
fafc4469     1811      3 Object: address, family, port
fafc42a1     2197      2 Object: connection, host
fbe10625     2389      2 Arguments: callee, length
fafc4251     2759     15 IncomingMessage: statusCode, httpVersionMajor, ...
fafc42ad     3652      0 Object:
fafc6785    11746      1 Object: oncomplete
fb7abc29    15155      1 Object: buffer
fb7a6081    15155      3 Object: , oncomplete, cb
fb121439    15378      3 Buffer: offset, parent, length

This immediately confirmed a hunch that Matt had had that this was a buffer leak. And for Isaac — who had been working this issue from the Gedanken side and was already zeroing in on certain subsystems — this data was surprising in as much as it was so confirming: he was already on the right path. In short order, he nailed it, and the fix is in node 0.6.17.

The fix was low risk, so Voxer redeployed with it immediately — and for the first time in quite some time, memory utilization was flat. This was a huge win — and was the reason for Matt’s tantalizing tweet. The advantages of this approach are that it requires absolutely no modification to one’s node.js programs — no special flags and no different options. And it operates purely postmortem. Thanks to help from gcore(1), core dumps can be taken over time for a single process, and those dumps can then be analyzed off-line.

Even with ::findjsobjects, debugging node.js memory leaks is still tough. And there are certainly lots of improvements to be made here — there are currently some objects that we do not know how to correctly interpret, and we know that we can improve our algorithm for finding object references — but this shines a bright new light into what had previously been a black hole!

If you want to play around with this, you’ll need SmartOS or your favorite illumos distro (which, it must be said, you can get simply by provisioning on the Joyent cloud). You’ll need an updated v8.so — which you can either build yourself from illumos-joyent or you can download a binary. From there, follow Dave’s instructions. Don’t hesitate to ping me if you get stuck or have requests for enhancement — and here’s to flat memory usage on your production node.js service!

Posted on May 5, 2012 at 1:07 pm by bmc

Standing up SmartDataCenter

Over the past year, we at Joyent have been weaving some threads that may have at times seemed disconnected: last fall, we developed a prototype of DTrace-based cloud instrumentation as part of hosting the first Node Knockout competition; in the spring, we launched our no.de service with full-fledged real-time cloud analytics; and several weeks ago, we announced the public availability of SmartOS, the first operating system to combine ZFS, DTrace, OS-based virtualization and hardware virtualization in the same operating system kernel. These activities may have seemed disjoint, but there was in fact a unifying theme: our expansion at Joyent from a public cloud operator to a software company — from a company that operates cloud software to one that also makes software for cloud operators.

This transition — from service to product — is an arduous one, for the simple reason that a service can make up for any shortcomings with operational heroics that are denied to a product. Some of these gaps may appear superficial (e.g., documentation), but many are deep and architectural; that one is unable to throw meat at a system changes many fundamental engineering decisions. (It should be said that some products don’t internalize these differences and manage to survive despite their labor-intensive shortcomings, but this is thanks to the caulking called “professional services” — and is usually greased by an enterprise software sales rep with an easy rapport and a hell of a golf handicap.) As the constraints of being a product affect engineering decisions, so do the constraints of operating as a service; if a product has no understanding or empathy for how it will be deployed into production, it will be at best difficult to operate, and at worst unacceptable for operation.

So at Joyent, we have long believed that we need both a product that can be used to provide a kick-ass service and a kick-ass service to prove to ourselves that we hit the mark. While our work over the past year has been to develop that product — SmartDataCenter — we have all known that we wouldn’t be done until that product was itself operating not just emerging services like no.de, but also the larger Joyent public cloud, and over the summer we began to plan a broad-based stand-up of SmartDataCenter. With our SmartOS-based KVM solid and on schedule, and with the substantial enabling work up the stack falling into place, we set for ourselves an incredibly ambitious target: to not only ship this revolutionary new version of the software, but to also — at the same time — stand up that product in production in the Joyent public cloud. Any engineer’s natural conservatism would resist such a plan — conflating shipping a product with a production deployment feels chancy at best and foolish at worst — but we saw an opportunity to summon a singular, company-wide focus that would force us to deliver a much better product, and would give us and our software customers the satisfaction of knowing that SmartDataCenter as shipped runs Joyent. More viscerally, we were simply too hot on getting SmartOS-based hardware virtualization in production — and oh, those DTrace-based KVM analytics! — to contemplate any path less aggressive.

Of course, it wasn’t easy; there were many intense late-night Jabber sessions as we chased bugs and implemented solutions. But as time went on — and as we cut release candidates and began to pound on the product in staging environments — it became clear that the product was converging, and a giddy optimism began to spread. Especially for our operations team, many of whom are long-time Joyent vets, this was a special kind of homecoming, as they saw their vision and wisdom embodied in the design decisions of the product. This sentiment was epitomized by one of the senior members of the Joyent public cloud operations team — an engineer who has tremendous respect in the company, and who is known for his candor — as he wrote:

It seems the past ideas of a better tomorrow have certainly come to light. I think I speak for all of the JPC when I say “Bravo” to the dev team for a solid version of the software. We have many rivers (hopefully small creeks) to cross but I am ultimately confident in what I have seen and deployed so far. It’s incredible to see a unified convergence on what has been the Joyent dream since I have been here.

For us in development, these words carried tremendous weight; we had endeavored to build the product that our team would want to deploy and operate, and it was singularly inspiring to see them so enthusiastic about it! (Though I will confess that this kind of overwhelmingly positive sentiment from such a battle-hardened operations engineer is so rare as to almost make me nervous — and yes, I am knocking on wood as I write this.)

With that, I’m pleased to announce that the (SmartDataCenter-based) Joyent public cloud is open for business! You can provision (via web portal or API) either virtual OS instances (SmartMachines) or full hardware-virtual VMs; they’ll provision like a rocket (ZFS clones FTW!), and you’ll get a high-performing, highly reliable environment with world-beating real-time analytics. And should you wish to stand up your own cloud with the same advantages — either a private cloud within your enterprise or a public cloud to serve your customers — know that you can own and operate your very own SmartDataCenter!

Posted on September 15, 2011 at 2:03 am by bmc