SmartDataCenter and the merits of being opinionated

Recently, Randy Bias of EMC (formerly of CloudScaling) wrote an excellent piece on Why “Vanilla OpenStack” doesn’t exist and never will. If you haven’t read it and you are anywhere near a private cloud effort, you should consider it a must-read: Randy debunks the myth of a vanilla OpenStack in great detail. And it apparently does need debunking; as Randy outlines, those who are deploying an on-premises cloud expect:

  • A uniform, monolithic cloud operating system (like Linux)
  • A set of well-integrated and interoperable components
  • Interoperability with their own vendors of choice in hardware, software, and public cloud

We at Joyent can vouch for these expectations, because years ago we had the same aspirations for our own public cloud. Though perhaps unlike others, we have also believed in the operating system as differentiator — and specifically, that OS containers are the foundation of elastic infrastructure — so we didn’t wait for a system to emerge, but rather endeavored to write our own. That is, given the foundation of our own container-based operating system — SmartOS — we set out to build exactly what Randy describes: a set of well-integrated, interoperable components on top of a uniform, monolithic cloud operating system that would allow us to leverage the economics of commodity hardware. This became SmartDataCenter, a container-centric distributed system upon which we built our own cloud and which we open sourced this past November.

The difference between SmartDataCenter and OpenStack mirrors the difference between the expectations for OpenStack and the reality that Randy outlines: where OpenStack is accommodating of many different visions for the cloud, SmartDataCenter is deliberately opinionated. In SmartDataCenter you don’t pick the storage substrate (it’s ZFS) or the hypervisor (it’s SmartOS) or the network virtualization (it’s Crossbow). While OpenStack deliberately accommodates swapping in different architectural models, SmartDataCenter deliberately rejects it: we designed it for commodity storage (shared-nothing — for good reason), commodity network equipment (no proprietary SDN) and (certainly) commodity compute. So while we’re agnostic with respect to hardware (as long as it’s x86-based and Ethernet-based), we are prescriptivist with respect to the software foundation that runs upon it. The upshot is that the integrator/operator retains control over hardware (and the different economic tradeoffs that that control allows), but needn’t otherwise design the system themselves — which we know from experience can greatly reduce deployment time. (Indeed, one of the great prides of SmartDataCenter is our ease of install: provided you’re racked, stacked and cabled, you can get a cloud stood up in a matter of hours rather than days, weeks or longer.)

So in apparent contrast to OpenStack, SmartDataCenter only comes in “vanilla” (in Randy’s parlance). This is not to say that SmartDataCenter is in any way plain; to the contrary, by having such a deliberately designed foundation, we can unlock rapid innovation, viz. our emerging Docker integration with SmartDataCenter that allows for Docker containers to be deployed securely and directly on the metal. We are very excited about the prospects of Docker on SmartDataCenter, and so are other people. So inasmuch as SmartDataCenter is vanilla, it definitely comes with whipped cream and a cherry on top!

Posted on February 5, 2015 at 5:45 pm by bmc · Permalink · Comments Closed
In: Uncategorized

Predicteria 2015

Fifteen years ago, I initiated a time-honored tradition among my colleagues in kernel development at Sun: shortly after the first of every year, we would get together at our favorite local restaurant to form predictions for the coming year. We made one-year, three-year and six-year predictions for both our technologies and more broadly for the industry. We did this for nine years running — from 2000 to 2008 inclusive — and came to know the annual ritual as a play on the restaurant name: Predicteria.

I have always been interested in our past notions of the future (hoverboards and self-lacing shoes FTW!), and looking back now at nearly a decade of our predictions has led me to an inescapable (and perhaps obvious) conclusion: predictions tell you more about the present than the future. That is, predictions reflect the zeitgeist of the day — both in substance and in tone: in good years, people predict halcyon days; in bad ones, the apocalypse. And when a particular company or technology happened to be in the news or otherwise on the collective mind, predictions tended to be centered around it: it was often the case that several people would predict that a certain company would be acquired or that a certain technology would flourish — or perish. (Let the record reflect that the demise of Itanium was accurately predicted many times over.)

Which is not to say that we never made gutsy predictions; in 2006, a colleague made a one-year prediction that “GOOG embarrassed by revelation of unauthorized US government spying at Gmail.” The timing may have been off, but the concern was disturbingly prescient. Sometimes the predictions were right, but for the wrong reasons: in 2003, one of my three-year predictions was that “Apple develops new ‘must-have’ gadget called the iPhone, a digital camera/MP3 player/cell phone.” This turned out to be stunningly accurate, even down to the timing (and it was by far my most accurate big prediction over the years), but if you can’t tell by the snide tone, I thought that such a thing would be Glass-like in its ludicrousness; I had not an inkling as to its transformative power. (And indeed, when the iPhone did in fact emerge a few years later, several at Predicteria predicted that it would be a disappointing flop.)

But accurate predictions were the exception, not the rule; our predictions were usually wrong — often wildly so. Evergreen wildly wrong predictions included: the rise of carbon nanotube-based memory, the relevance of quantum computing, and the death of tape, disk or volatile DRAM (each predicted several times over). We were also wrong by our omissions: as a group, we entirely failed to predict cloud computing — or even the rise of hardware-based virtualization.

I give all of this as a backdrop to some predictions for the coming year. If my experience taught me anything, it’s that these predictions may very well be right on trajectory, but wrong on timing — and that they may well capture current thinking more than they meaningfully predict the future. They also may be (rightfully) criticized for, as they say, talking our book — but we have made our bets based on where we think things are going, not vice versa. And finally, I apologize that these are somewhat milquetoast predictions; I’m afraid that practical concerns muffle the gutsy predictions that name names and boldly predict their fates!

Without further ado, looking forward to 2015:

Right or wrong, these predictions point to an exciting 2015. And if nothing else, you can rely on me for a candid self-assessment of my predictions — you’ll just need to wait fifteen years or so!

Posted on January 6, 2015 at 2:14 pm by bmc · Permalink · Comments Closed
In: Uncategorized

2014 in review: Docker rising

When looking back on 2014 from an infrastructure perspective, it’s hard not to have one word on the lips: Docker. (Or, as we are wont to do in Silicon Valley when a technology is particularly hot, have the same word on the lips three times over à la Gabbo: “Docker, Docker, DOCKER!”) While Docker has existed since 2013, 2014 was indisputably the year in which it went from an interesting project to a transformative technology — a shift that had profound ramifications for us at Joyent.

The enthusiasm for Docker has been invigorating: it validates Joyent’s core hypothesis that OS-based virtualization is the infrastructure substrate of the future. That said, going into 2014, there was also a clear impedance mismatch: while Docker was refreshingly open to being cross-platform, the reality is that it was being deployed exclusively on Linux — and that the budding encyclopedia of Docker images was exclusively Linux-based. Our operating system, SmartOS, is an illumos derivative that in many ways is similar to Linux (they’re both essentially Unix, after all), but it’s also different enough to be an impediment. So the arrival of Docker in 2013 left us headed into 2014 with a kind of dilemma: how can we enable Docker on our proven SmartOS-based substrate for OS containers while still allowing existing Linux-based images to function?

Into this quandary came a happy accident: David Mackay, an illumos community member, revived lx branded zones, work that had been explored some number of years ago to execute complete Linux binary environments in an illumos zone. This work was so old that, frankly, we didn’t feel it was likely to be salvageable — but we were pleasantly surprised when it seemed to still function for some modern binaries. (If it needs to be said, this is yet another example of why we so fervently believe in open source: it allows for others to explore ideas that may seem too radical for commercial entities with mortgages to pay and mouths to feed.)

Energized by the community, Joyent engineer Jerry Jelinek went to work in the spring, bolstering the emulation layer and getting it to work with progressively more and more modern Linux systems. By late summer, 32-bit was working remarkably well on Ubuntu 14.04 (an odyssey that I detailed in my illumos day Surge presentation) and we were ready to make an attempt at the summit: 64-bit Linux emulation. Like much bringup work, the 64-bit work was excruciating because it was very hard to forecast: you can be one bug away from a functioning system or a hundred — and the only way to really know is to grind through them all. Fortunately, we are nothing if not persistent, and by late fall we had 64-bit working on most stuff — and thanks to early adopter community members like Jorge Schrauwen, we were able to quickly find increasingly obscure software to validate it against. (Notes to self: (1) “Cabal hell” is a thing and (2) I bet HHVM is unaware of the implicit dependency they have on Linux address space layout.)

With the LX branded zone work looking very promising, Joyent engineer Josh Wilsdon led a team studying Docker to determine the best way to implement it on SmartOS for our orchestration software, SmartDataCenter. In doing this, we learned about a great Docker strength: its remote API. This API allows us to do exactly what robust APIs have allowed us to do since time immemorial: replace one implementation with a different one without breaking upstack software. Implementing a Docker API endpoint would also allow for a datacenter-wide Docker view that would solve many other problems for us as well; in late autumn, we set out building sdc-docker, a Docker engine for SDC that we have been developing in the open. As with the LX branded zone work, we are far enough along to validate the approach: we know that we can make this work.
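The principle at work here — a stable remote API lets one implementation be swapped for another without breaking upstack software — can be sketched in miniature. Everything below is a hypothetical illustration of that principle, not sdc-docker's actual code:

```python
# Sketch: a stable API lets two very different backends serve the same
# clients. All names here are hypothetical, purely for illustration.

class LocalEngine:
    """Stands in for a single-host container engine."""
    def list_containers(self):
        return [{"id": "abc123", "image": "ubuntu"}]

class DatacenterEngine:
    """Stands in for a datacenter-wide endpoint: same API, broader scope."""
    def __init__(self, hosts):
        self.hosts = hosts

    def list_containers(self):
        # Aggregate the container view across every host in the datacenter.
        return [c for h in self.hosts for c in h.list_containers()]

def client_ps(engine):
    """Upstack 'docker ps'-style client: knows only the API, not the backend."""
    return [c["id"] for c in engine.list_containers()]

# The same unmodified client works against either implementation:
print(client_ps(LocalEngine()))
print(client_ps(DatacenterEngine([LocalEngine(), LocalEngine()])))
```

The client never learns which backend answered it — which is exactly the property that lets a datacenter-wide endpoint stand in for a single-host daemon.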

In parallel to these two bodies of work, a third group of Joyent engineers led by Robert Mustacchi was tackling a long-standing problem: extending the infrastructure present in SmartOS for robust (and secure!) network virtualization for OS containers to the formation of virtual layer two networks that can span an entire datacenter (that is, finally breaking the shackles of .1q VLANs). We have wanted to do this for quite some time, but the rise of Docker has given this work a new urgency: of the Linux problems with respect to OS-based containers, network virtualization is clearly among the most acute — and we have heard over and over again that it has become an impediment to Docker in production. Robert and team have made great progress and by the end of 2014 had the first signs of life from the SDC integration point for this work.

The SmartDataCenter-based aspects of our Docker and network virtualization work embody an important point of distinction: while OpenStack has been accused of being “a software particle-board designed by committee”, SDC has been deliberately engineered based on our experience actually running a public cloud at scale. That said, OpenStack has had one (and arguably, only one) historic advantage: it is open source. While the components of SDC (namely, SmartOS and node.js) have been open, SDC itself was not. The rise of Docker — and the clear need for an open, container-based stack instead of some committee-designed VMware retread — allowed us to summon the organizational will to take an essential leap: on November 6th, we open sourced SDC and Manta.

Speaking of Manta: with respect to containers, Joyent has been living in the future (which, in case it sounds awesome, is actually very difficult; being ahead of the vanguard is a decidedly mixed blessing). If the broader world is finally understanding the merits of OS-based virtualization with respect to standing compute, it still hasn’t figured out that it has profound ramifications for scale-out storage. However, with the rise of Docker in 2014, we have more confidence than ever that this understanding will come in due time — and by open sourcing Manta we hope to accelerate it. (And certainly, you can imagine that we’ll help connect the dots by allowing Manta jobs to be phrased as Docker containers in 2015.)

Add it all up — the enthusiasm for Docker, the great progress of the LX-branded zone work, the Docker engine for SDC, the first-class network virtualization that we’re building into the system — and then give it the kicker of an entirely open source SmartDataCenter and Manta, and you can see that it’s been a hell of a 2014 for us. Indeed, it’s been a hell of a 2014 for the entire Docker community, and we believe that Matt Asay got it exactly right when he wrote that “Docker, hot as it was in 2014, will be even hotter in 2015.”

So here’s to a hot 2014 — and even hotter 2015!

Posted on January 2, 2015 at 4:03 pm by bmc · Permalink · Comments Closed
In: Uncategorized

SmartDataCenter and Manta are now open source

Today we are announcing that we are open sourcing the two systems at the heart of our business: SmartDataCenter and the Manta object storage platform. SmartDataCenter is the container-based orchestration software that runs the Joyent public cloud; we have used it for the better part of a decade to run on-the-metal OS containers — securely and at scale. Manta is our multi-tenant ZFS-based object storage platform that provides first-class compute by allowing OS containers to be spun up directly upon objects — effecting arbitrary computation at scale without data movement. The unifying technological foundation beneath both SmartDataCenter and Manta is OS-based virtualization, a technology that Joyent pioneered in the cloud way back in 2006. We have long known the transformative power of OS containers, so it has been both exciting and validating for us to see the rise of Docker and the broadening of appreciation for OS-based virtualization. SmartDataCenter and Manta show that containers aren’t merely a fad or developer plaything but rather a fundamental technological advance that represents the foundation for the next generation of computing — and we believe that open sourcing them advances the adoption of container-based architectures more broadly.

Without any further ado — and to assure that we don’t fall into the most prominent of my own corporate open source anti-patterns — here is the source for SmartDataCenter and the source for Manta. These are sophisticated systems with many moving parts, and you’ll see that these two repositories are in fact meta-repositories that explain the design of each of the systems and then point to the (many) components that comprise them (all now open source, natch). We believe that some of these subcomponents will likely find use entirely outside of SDC and Manta. For example, Manatee is a ZooKeeper-based system that manages Postgres replication and automates failover; Moray is a key-value service that lives on top of Postgres. Taken together, Manatee and Moray implement a highly-available key-value service that we use as the foundation for many other components in SDC and Manta — and one that we think others will find useful as well.

In terms of source code mechanics, you’ll see that many of the components are implemented either in node.js or by extending C-based systems. This is not by fiat but rather by the choices of individual engineers; over the past four years, as we learned about the nuances of node.js error handling and as we invested heavily in tooling for running node.js in production, node.js became the right tool for many of our jobs — and we used it for many of the services that constitute SDC and Manta.

And because any conversation about open source has to address licensing at some point or another, let’s get that out of the way: we opted for the Mozilla Public License 2.0. While relatively new, there is a lot to like about this license: its file-based copyleft allows it to be proprietary-friendly while also forcing certain kinds of derived work to be contributed back; its explicit patent license discourages litigation, offering some measure of troll protection; its explicit warranting of original work obviates the need for a contributor license agreement (we’re not so into CLAs); and (best of all, in my opinion), it has been explicitly designed to co-exist with other open source licenses in larger derived works. Mozilla did terrific work on MPL 2.0, and we hope to see it adopted by other companies that share our thinking around open source!

In terms of the business ramifications, at Joyent we have long been believers in open source as a business model; as the leaders of the node.js and SmartOS projects, we have seen the power of open source to start new conversations, open up new markets and (importantly) yield new customers. Ten years ago, I wrote that open source is “a loss leader — minus the loss, of course”; after a decade of experience with open source business models, I would add that open source also serves as sales outreach without cold calls, as a channel without loss of margin, and as a marketing campaign without advertisements. But while we have directly experienced the business advantages of open source, we at Joyent have also lived something of a dual life: node.js and SmartOS have been open source, but the distributed systems that we have built using these core technologies have remained largely behind our walls. So that these systems are now open source does not change the fundamentals of our business model: if you would like to consume SmartDataCenter or Manta as a service, you can spin up an instance on the public cloud or use our Manta storage service. Similarly, if you want a support contract and/or professional services to run either SmartDataCenter or Manta on-premises, we’ll sell them to you. Based on our past experiences with open source, we do know that there will be one important change: these technologies will find their way into the hands of those that we have no other way of reaching — and that some fraction of these will become customers. Also based on past experience, we know that some (presumably much smaller) fraction of these new technologists will — by merit of their interest in and contributions to these projects — one day join us as engineers at Joyent. Bluntly, open source is our farm system, and broadening our hiring channel during a blazingly hot market for software talent is playing no small role in our decision here.
In short, this is not an act of altruism: it is a business decision — if a multifaceted one that we believe has benefits beyond the balance sheet.

Welcome to open source SDC and Manta — and long-live the container revolution!

Posted on November 3, 2014 at 4:16 pm by bmc · Permalink · Comments Closed
In: Uncategorized

Broadening node.js contributions

Several years ago, I gave a presentation on corporate open source anti-patterns. Several of my anti-patterns were clear and unequivocal (e.g., don’t announce that you’re open sourcing something without making the source code available, dummy!), but others were more complicated. One of the more nuanced anti-patterns was around copyright assignment and contributor license agreements: while I believe these constructs to be well-intended (namely, to preserve relicensing options for the open source project and to protect that project from third-party claims of copyright and patent infringement), I believe that they are not without significant risks with respect to the health of the community. Even at their very best, CLAs and copyright assignments act as a drag on contributions as new corporate contributors are forced to seek out their legal department — which seems like asking people to go to the dentist before their pull request can be considered. And that’s the very best case; at worst, these agreements and assignments grant a corporate entity (or, as I have personally learned the hard way, its acquirer) the latitude for gross misbehavior. Because this very worst scenario had burned us in the illumos community, illumos has been without CLA and copyright assignment since its inception: as with Linux, contributors hold copyright to their own contributions and agree to license them under the prevailing terms of the source base. Further, we at Joyent have also adopted this approach in the many open source components we develop in the node.js ecosystem: like many (most?) GitHub-hosted projects, there is no CLA or copyright assignment for node-bunyan, node-restify, ldap.js, node-vasync, etc. But while many Joyent-led projects have been without copyright assignment and CLA, one very significant Joyent-led project has had a CLA: node.js itself.

While node.js is a Joyent-led project, I also believe that communities must make their own decisions — and a CLA is a sufficiently nuanced issue that reasonable people can disagree on its ultimate merits. That is, despite my own views on a CLA, I have viewed the responsibility for the CLA as residing with the node.js leadership team, not with me. The upshot has been that the node.js status quo of a CLA (one essentially inherited from Google’s CLA for V8) has remained in place for several years.

Given this background, you can imagine that I found it very heartwarming that when node.js core lead TJ Fontaine returned from his recent Node on the Road tour, one of the conclusions he came to was that the CLA had outlived its usefulness — and that we should simply obliterate it. I am pleased to announce that today, we are doing just that: we have eliminated the CLA for node.js. Doing this lowers the barrier to entry for node.js contributors, thereby broadening the contributor base. It also brings node.js in line with other projects that Joyent leads and (not unimportantly!) assures that we ourselves are not falling into corporate open source anti-patterns!

Posted on June 11, 2014 at 9:15 am by bmc · Permalink · Comments Closed
In: Uncategorized

From VP of Engineering to CTO

If you search for “cto vs. vp of engineering”, one of the top hits is a presentation that I gave with Jason Hoffman at Monki Gras 2012. Aside from some exceptionally apt clip art, the crux of our talk was that these two roles should not be thought of as caricatures (e.g. the CTO as a silver tongue with grand vision but lacking practical know-how and the VP of Engineering as a technocrat who makes the TPS reports run on time), but rather as a team that together leads a company’s technical endeavors. Yes, one is more outward- and future-looking and the other more team- and product-focused — but if the difference becomes too stark (that is, if the CTO and VP of Engineering can’t fill in for one another in a pinch) there may be a deeper cultural divide between vision and execution. As such, the CTO and the VP of Engineering must themselves represent the balance present in every successful engineer: they must be able to both together understand the world as it is — and envision the world as it could be.

This presentation has been on my mind recently because today my role at Joyent is changing: I am transitioning from VP of Engineering to CTO, and Mark Cavage is taking on the role of VP of Engineering. For me, this is an invigorating change in a couple of dimensions. First and foremost, I am excited to be working together with Mark in a formalized leadership capacity. The vitality of the CTO/VP of Engineering dynamic stems from the duo’s ability to function as a team, and I believe that Mark and I will be an effective one in this regard. (And Mark apparently forgives me for cussing him out when he conceived of what became Manta.)

Secondly, I am looking forward to talking to customers a bit more. Joyent is in a terrific position in that our vision for cloud computing is not mere rhetoric, but actual running service and shipping product. We are uniquely differentiated by the four technical pillars of our stack: SmartOS, node.js, SmartDataCenter and — as newly introduced last year — our revolutionary Manta storage service. These are each deep technologies in their own right, and especially at their intersections, they unlock capabilities that the market wants and needs — and our challenge now is as much communicating what we’ve done (and why we’ve done it) as it is continuing to execute. So while I have always engaged directly with customers, the new role will likely mean more time on planes and trains as I visit more customers (and prospective customers) to better understand how our technologies can help them solve their thorniest problems.

Finally, I am looking forward to the single most important role of the CTO: establishing the broader purpose of our technical endeavor. This purpose becomes the root of a company’s culture, as culture without purpose is mere costume. For Joyent and Joyeurs our purpose is simple: we’re here to change computing. As I mentioned in my Surge 2013 talk on technical leadership (video), superlative technologists are drawn to mission, team and problem — and in Joyent's case, the mission of changing computing (and the courage to tackle whatever problems that entails) has attracted an exceptionally strong team that I consider myself blessed to call my peers. I consider it a great honor to be Joyent's CTO, and I look forward to working with Mark and the team to continue to — in Steve Jobs' famous words — kick dents in the universe!

Posted on April 15, 2014 at 8:07 am by bmc · Permalink · Comments Closed
In: Uncategorized

agghist, aggzoom and aggpack

As I have often remarked, DTrace is a workhorse, not a show horse: the features that we have added to DTrace over the years come not from theoretical notions but rather from actual needs in production environments. This is as true now as it was a decade ago, and even in core abstractions, extensive use of DTrace seems to give rise to new ways of thinking about them. In particular, Brendan recently had a couple of feature ideas for aggregation processing that have turned out to be really interesting…

agghist

When one performs a count() or sum() aggregating action, the result is a table that consists of keys and their corresponding numeric values, e.g.:

# dtrace -n syscall:::entry'{@[execname] = count()}'
dtrace: description 'syscall:::entry' matched 233 probes
^C

  utmpd                                                             2
  metadata                                                          5
  pickup                                                           14
  ur-agent                                                         18
  epmd                                                             20
  fmd                                                              20
  jsontool.js                                                      33
  vmadmd                                                           33
  master                                                           41
  mdnsd                                                            41
  intrd                                                            45
  devfsadm                                                         52
  bridged                                                          81
  mkdir                                                            83
  ipmgmtd                                                          95
  cat                                                              96
  ls                                                              100
  truss                                                           101
  ipmon                                                           106
  sendmail                                                        111
  dirname                                                         115
  svc.startd                                                      125
  svc.configd                                                     153
  ksh93                                                           164
  zoneadmd                                                        165
  svcprop                                                         258
  sshd                                                            262
  bash                                                            291
  zpool                                                           436
  date                                                            448
  cron                                                            470
  lldpd                                                           584
  dump-minutely-sd                                                611
  haproxy                                                         760
  nscd                                                           1275
  zfs                                                            1966
  zoneadm                                                        2414
  java                                                           2916
  tail                                                           3750
  postgres                                                       6702
  redis-server                                                  21283
  dtrace                                                        28308
  node                                                          39836
  beam.smp                                                      55940

Brendan’s observation was that it would be neat to (optionally) use the histogram-style aggregation printing with these count()/sum() aggregations to be able to more easily differentiate values. For that, we have the new “-x agghist” option:

# dtrace -n syscall:::entry'{@[execname] = count()}' -x agghist
dtrace: description 'syscall:::entry' matched 233 probes
^C

              key  ------------- Distribution ------------- count
            utmpd |                                         2
         metadata |                                         5
           pickup |                                         14
             epmd |                                         20
              fmd |                                         23
         ur-agent |                                         25
             init |                                         27
            mdnsd |                                         30
      jsontool.js |                                         33
           vmadmd |                                         36
           master |                                         41
            intrd |                                         45
         devfsadm |                                         52
          bridged |                                         78
            mkdir |                                         83
              cat |                                         96
          ipmgmtd |                                         97
               ls |                                         100
         sendmail |                                         101
            ipmon |                                         102
          dirname |                                         115
       svc.startd |                                         125
            truss |                                         140
      svc.configd |                                         153
            ksh93 |                                         164
             sshd |                                         248
          svcprop |                                         258
             bash |                                         290
         zoneadmd |                                         387
            zpool |                                         436
             date |                                         448
             cron |                                         470
            lldpd |                                         562
 dump-minutely-sd |                                         615
          haproxy |                                         622
             nscd |                                         808
              zfs |                                         1966
          zoneadm |@                                        2414
             java |@                                        3003
             tail |@                                        3607
         postgres |@                                        5728
     redis-server |@@@@@                                    20611
           dtrace |@@@@@@                                   26966
             node |@@@@@@@@@@                               39825
         beam.smp |@@@@@@@@@@@@@                            55942    

It’s obviously the same information, but presented with a quicker visual cue as to the distribution. Now, one may well note that this output has a lot of dead whitespace in it — read on.

aggzoom

The DTrace histogram-style output came directly from lockstat, which preceded it by several years. lockstat was important for DTrace in several ways: aside from showing the power of production instrumentation, lockstat pointed to the need for first-class aggregations, for disjoint instrumentation providers and for multiplexed consumers (for which it was repaid by having its guts ripped out, being reimplemented as both a DTrace provider and a DTrace consumer). Due to some combination of my laziness and my reverence for the lockstat-style histogram, I simply lifted the lockstat processing code — which means I inherited (or stole, depending on your perspective) the decisions Jeff had made with regard to histogram rendering. In particular, lockstat determines the height of a bar by taking the bucket’s share of the total and multiplying it by the full height of the histogram. That is, if you add up the heights of all of the bars, they will sum to the full height of the histogram. The benefit of this is that it quickly tells you relative distribution: if you see a full-height bar, you know that that bucket represents essentially all of the values.

But the problem is that if the number of buckets is large and/or the values of those buckets are relatively evenly distributed, the result is a bunch of very short bars (or worse, zero-height bars) and dead whitespace. This can become so annoying to DTrace users that Brendan has been known to observe that this is the one improvement over DTrace that he can point to in SystemTap: instead of dividing the height of the histogram among all of the bars, each bar is scaled in proportion to the bucket of greatest value — that is, the bar for the largest bucket spans the full height of the histogram. Of course, the shape of the distribution doesn’t change — one is simply scaling the highest bar to the height of the histogram automatically and adjusting all other bars accordingly.
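The difference between the two scaling rules can be sketched in a few lines. (This is an illustrative Python sketch, not the actual lockstat/DTrace C code, which differs in its rounding details; the height constant and helper names here are made up.)

```python
HEIGHT = 40  # width of a histogram bar in '@' characters

def lockstat_bars(counts):
    # lockstat-style (the DTrace default): each bucket gets its share of
    # the total, so the bar heights sum to (at most) the full height --
    # with many evenly distributed buckets, most bars round down to
    # nothing, yielding dead whitespace.
    total = sum(counts)
    return [c * HEIGHT // total for c in counts]

def aggzoom_bars(counts):
    # aggzoom-style: the largest bucket is scaled to the full height and
    # every other bar is scaled proportionally -- the shape of the
    # distribution is unchanged, but the tallest bar always spans the
    # histogram.
    largest = max(counts)
    return [c * HEIGHT // largest for c in counts]

counts = [2, 1966, 2414, 55940]  # e.g. utmpd, zfs, zoneadm, beam.smp above
print(lockstat_bars(counts))     # small buckets vanish entirely
print(aggzoom_bars(counts))      # the largest bucket fills the full width
```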

To allow for this behavior, I added “-x aggzoom”; running the same example as above with this new option:

# dtrace -n syscall:::entry'{@[execname] = count()}' -x agghist -x aggzoom
dtrace: description 'syscall:::entry' matched 233 probes
^C

              key  ------------- Distribution ------------- count
            utmpd |                                         2
         metadata |                                         7
           pickup |                                         14
         ur-agent |                                         20
              fmd |                                         25
             epmd |                                         26
      jsontool.js |                                         33
            mdnsd |                                         39
           master |                                         41
           vmadmd |                                         44
         rsyslogd |                                         57
         devfsadm |                                         66
            mkdir |                                         83
            intrd |                                         90
              cat |                                         96
          bridged |                                         99
               ls |                                         100
          dirname |                                         115
            truss |                                         125
            ipmon |                                         130
         sendmail |                                         131
          ipmgmtd |                                         133
            ksh93 |                                         164
       svc.startd |                                         179
             init |                                         189
             sshd |                                         253
          svcprop |                                         258
             bash |                                         283
            zpool |                                         436
             date |                                         448
             cron |                                         470
 dump-minutely-sd |                                         611
            lldpd |                                         716
      svc.configd |@                                        939
          haproxy |@                                        1097
         zoneadmd |@                                        1455
              zfs |@                                        1966
          zoneadm |@@                                       2414
             nscd |@@                                       2867
             java |@@@                                      4277
             tail |@@@                                      4579
         postgres |@@@@@@                                   9389
     redis-server |@@@@@@@@@@@@@@@@@@                       26177
           dtrace |@@@@@@@@@@@@@@@@@@@@@@@                  34530
             node |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        48151
         beam.smp |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   55940

It’s zoomtastic!

aggpack

If you follow Brendan’s blog, you know that he’s always experimenting with ways of visually communicating a maximal amount of system data. One recent visualization that has been particularly interesting is his frequency trails — a kind of stacked sparkline. Thinking about density of presentation prompted Brendan to observe that he often needs to visually correlate multiple quantize()/lquantize() aggregation keys, e.g. this classic DTrace “one-liner” to show the distribution of system call time (in nanoseconds) by system call:

# dtrace -n syscall:::entry'{self->ts = timestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(timestamp - self->ts); self->ts = 0}'
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

  sigpending
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
            2048 |                                         0        

  llseek
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@                     1
            1024 |@@@@@@@@@@@@@@@@@@@@                     1
            2048 |                                         0        

  fstat
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
            4096 |                                         0        

  setcontext
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
            4096 |                                         0        

  fstat64
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@                     2
            1024 |@@@@@@@@@@@@@@@@@@@@                     2
            2048 |                                         0        

  sigaction
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@                     2
            1024 |@@@@@@@@@@@@@@@@@@@@                     2
            2048 |                                         0        

  getuid
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@@                     1
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     1
            8192 |                                         0        

  accept
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@                     1
            4096 |@@@@@@@@@@@@@@@@@@@@                     1
            8192 |                                         0        

  sysconfig
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@                            2
            1024 |@@@@@@@@@@@@@@@@@@@@                     3
            2048 |@@@@@@@                                  1
            4096 |                                         0        

  bind
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
            8192 |                                         0        

  lwp_sigmask
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       11
            1024 |@@@                                      1
            2048 |@@@                                      1
            4096 |                                         0        

  recvfrom
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5
            4096 |                                         0        

  setsockopt
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@                               1
            2048 |@@@@@@@@@@                               1
            4096 |@@@@@@@@@@@@@@@@@@@@                     2
            8192 |                                         0        

  fcntl
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@                                      1
             512 |@@@@@@@@@@@@@@@@@@@@                     8
            1024 |@@@@@@@@@@@@@@@                          6
            2048 |@@@                                      1
            4096 |                                         0        

  systeminfo
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@                          3
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@                5
            4096 |                                         0        

  schedctl
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
           32768 |                                         0        

  getsockopt
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          11
            2048 |@@@@@@@@@                                3
            4096 |                                         0        

  stat
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              4
            4096 |                                         0
            8192 |@@@@@@@@@@@@@                            2
           16384 |                                         0        

  recvmsg
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        15
            4096 |@@@@                                     2
            8192 |@@                                       1
           16384 |                                         0        

  connect
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              4
            8192 |                                         0
           16384 |@@@@@@@@@@@@@                            2
           32768 |                                         0        

  brk
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@                               2
            2048 |@@@@@@@@@@@@@@@                          3
            4096 |                                         0
            8192 |@@@@@@@@@@                               2
           16384 |                                         0
           32768 |@@@@@                                    1
           65536 |                                         0        

  so_socket
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@@@@@@@@@@                   5
            4096 |                                         0
            8192 |@@@@@@@@@                                2
           16384 |@@@@@@@@@                                2
           32768 |                                         0        

  mmap
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
          131072 |                                         0        

  putmsg
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
          131072 |                                         0        

  p_online
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  250
             512 |@                                        5
            1024 |                                         0
            2048 |                                         1
            4096 |                                         0        

  getpid
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@                               9
            2048 |@@@@@@@@@@@@@@@@@@@@@@@                  21
            4096 |@@                                       2
            8192 |@@@@                                     4
           16384 |                                         0        

  close
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@                      13
            2048 |@@@@@@@                                  5
            4096 |@@@@@@@                                  5
            8192 |@@@                                      2
           16384 |@@@@                                     3
           32768 |                                         0        

  lseek
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@                           22
            2048 |@@@@@@@@@@@@@@@@@@@@@@@                  37
            4096 |@@                                       3
            8192 |@                                        2
           16384 |                                         0        

  open
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@@@@                            6
            4096 |@@@@@@@                                  3
            8192 |@@@@@@@@@@@@@                            6
           16384 |@@                                       1
           32768 |@@@@                                     2
           65536 |                                         0        

  lwp_cond_signal
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@                                   2
            8192 |                                         0
           16384 |@@@@@@@@@@@@@@@@@@@@@@@                  8
           32768 |@@@@@@@@@@@                              4
           65536 |                                         0        

  sendmsg
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@@@@                                     1
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              6
           65536 |@@@@@@@@@                                2
          131072 |                                         0        

  gtime
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@@                     576
             512 |@@@@@@@@@@@@@@@@                         452
            1024 |@@                                       56
            2048 |@                                        34
            4096 |                                         3
            8192 |                                         8
           16384 |                                         1
           32768 |                                         1
           65536 |                                         0        

  read
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |                                         1
            1024 |@@@@@@                                   19
            2048 |@@@@@@@@@@                               34
            4096 |@@@@@                                    15
            8192 |@@@@@@@@@@@@@@@@@@                       61
           16384 |@                                        2
           32768 |                                         0        

  stat64
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        2
            4096 |@                                        1
            8192 |@@@                                      5
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          52
           32768 |@@@@                                     7
           65536 |                                         0        

  send
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@                                       2
            4096 |@                                        1
            8192 |@@@@@@@@@@@@@@@@@@                       22
           16384 |@@@@@@@@@@                               12
           32768 |@@@@@                                    6
           65536 |@@@                                      4
          131072 |@@                                       3
          262144 |                                         0        

  write
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@                                       5
            1024 |@@@@                                     8
            2048 |                                         0
            4096 |@@@                                      6
            8192 |@@@@@@@@@@@                              23
           16384 |@@@@@@@@                                 16
           32768 |@@@@@@@@@@                               22
           65536 |@@                                       4
          131072 |                                         0
          262144 |                                         1
          524288 |                                         0        

  doorfs
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@                     1
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |@@@@@@@@@@@@@@@@@@@@                     1
         4194304 |                                         0        

  yield
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@                    198
            2048 |@                                        5
            4096 |                                         1
            8192 |@@@@@@                                   61
           16384 |@@@@@@@@@@@@                             110
           32768 |                                         2
           65536 |                                         2
          131072 |                                         0
          262144 |                                         1
          524288 |                                         0        

  recv
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@                             17
            1024 |@@@@                                     6
            2048 |@@@@@@@@@@@@@                            19
            4096 |@@@@                                     6
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |@@                                       3
          262144 |@@@@                                     5
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         0
        16777216 |@                                        1
        33554432 |                                         0        

  nanosleep
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 201
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         0
       268435456 |                                         0
       536870912 |                                         2
      1073741824 |                                         0        

  ioctl
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@                                        50
             512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   2157
            1024 |                                         17
            2048 |                                         8
            4096 |                                         10
            8192 |                                         13
           16384 |                                         6
           32768 |                                         3
           65536 |                                         7
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         2
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         6
       134217728 |                                         0
       268435456 |                                         1
       536870912 |                                         4
      1073741824 |                                         0        

  lwp_cond_wait
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          46
        67108864 |@@@@                                     6
       134217728 |                                         0
       268435456 |@                                        2
       536870912 |@@@@                                     6
      1073741824 |                                         0        

  pollsys
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@                                        2
            2048 |                                         1
            4096 |                                         1
            8192 |                                         1
           16384 |                                         0
           32768 |                                         0
           65536 |@                                        2
          131072 |                                         0
          262144 |@@                                       6
          524288 |@                                        2
         1048576 |@@                                       5
         2097152 |@                                        2
         4194304 |                                         0
         8388608 |@@@                                      8
        16777216 |                                         0
        33554432 |                                         1
        67108864 |@@@@@@@@@@@@@@@@@@                       56
       134217728 |@@@@@@@@@                                28
       268435456 |                                         0
       536870912 |@@@                                      8
      1073741824 |                                         1
      2147483648 |                                         0        

  lwp_park
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |                                         6
            4096 |@                                        21
            8192 |@                                        20
           16384 |@@@@@@@@@                                134
           32768 |@@@@@@@@@@                               145
           65536 |@@@@@@@@@@                               153
          131072 |@                                        11
          262144 |                                         3
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |                                         0
         8388608 |@@@@@@                                   89
        16777216 |                                         0
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         2
       268435456 |                                         4
       536870912 |@                                        18
      1073741824 |                                         2
      2147483648 |                                         0        

  portfs
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |                                         4
            1024 |@@@                                      41
            2048 |@@@@                                     44
            4096 |@                                        14
            8192 |                                         2
           16384 |                                         1
           32768 |                                         2
           65536 |                                         1
          131072 |                                         3
          262144 |@                                        10
          524288 |                                         3
         1048576 |                                         6
         2097152 |                                         3
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@                  272
        16777216 |                                         1
        33554432 |                                         1
        67108864 |                                         4
       134217728 |                                         2
       268435456 |                                         5
       536870912 |@@@@@                                    58
      1073741824 |                                         3
      2147483648 |                                         1
      4294967296 |                                         0        

For those keeping score, that would be 482 lines of output — and that was for just a couple of seconds. Brendan’s observation was that it would be great to tip these aggregations on their side (so their bars point up-and-down, not left-to-right) — with the output for each key appearing on one line. This is clearly a great idea, and “-x aggpack” was born:

# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

              key  min .------------------. max     | count
       sigpending    8 :     X            : 1048576 | 1
     lwp_continue    8 :      X           : 1048576 | 1
           uucopy    8 :      X           : 1048576 | 1
           getuid    8 :  x   x           : 1048576 | 2
         lwp_kill    8 :       X          : 1048576 | 1
        sigaction    8 :   xx x           : 1048576 | 4
           llseek    8 :     x x          : 1048576 | 2
          fstat64    8 :     xx           : 1048576 | 4
       setcontext    8 :      xx          : 1048576 | 2
      lwp_sigmask    8 :   xx _           : 1048576 | 10
       systeminfo    8 :     xx           : 1048576 | 4
        sysconfig    8 :   ____x          : 1048576 | 6
           accept    8 :       xx         : 1048576 | 2
             bind    8 :         X        : 1048576 | 2
            fstat    8 :     x    x       : 1048576 | 2
         recvfrom    8 :        X         : 1048576 | 5
           doorfs    8 :         xx       : 1048576 | 2
              brk    8 :     ___xx        : 1048576 | 8
       lwp_create    8 :           X      : 1048576 | 1
         schedctl    8 :       x   x      : 1048576 | 2
         p_online    8 :   X_  _          : 1048576 | 256
           getpid    8 :      xx_ _       : 1048576 | 28
          recvmsg    8 :       xx_        : 1048576 | 18
             open    8 :         xx x     : 1048576 | 4
            lseek    8 :       xx _ _     : 1048576 | 9
       getsockopt    8 :      _xx         : 1048576 | 43
             stat    8 :        x_x _     : 1048576 | 7
             mmap    8 :             X    : 1048576 | 1
           putmsg    8 :             X    : 1048576 | 1
             recv    8 :     _xxx__       : 1048576 | 65
            gtime    8 :   X_____         : 1048576 | 899
            fcntl    8 :   ___xx          : 1048576 | 208
       setsockopt    8 :      __x__       : 1048576 | 55
          sendmsg    8 :           xxx    : 1048576 | 9
  lwp_cond_signal    8 :       _x__x      : 1048576 | 33
            close    8 :       __xx_      : 1048576 | 65
    lwp_cond_wait    8 :        _xx_      : 1048576 | 69
           munmap    8 :                X : 1048576 | 1
             read    8 :    ___x_x_ _ _   : 1048576 | 137
           stat64    8 :        ___X      : 1048576 | 49
            yield    8 : _    X___        : 1048576 | 1314
        so_socket    8 :        ___x_     : 1048576 | 60
        nanosleep    8 :         _x__     : 1048576 | 77
             send    8 :      ____xx__    : 1048576 | 66
          pollsys    8 :       _ _xx_     : 1048576 | 86
            ioctl    8 :    X_________    : 1048576 | 3259
          connect    8 :         _ _xx_   : 1048576 | 57
           mmap64    8 :        __x _   x : 1048576 | 14
            write    8 :     __ __x_____  : 1048576 | 115
           portfs    8 :     ______x_     : 1048576 | 619
         lwp_park    8 :        _xx_x_    : 1048576 | 654

This communicates essentially the same amount of information in just a fraction of the output (55 lines versus 482) — and makes it (much) easier to quickly compare distributions across aggregation keys. That said, one of the challenges here is that ASCII can be very limiting when trying to come up with characters of clearly different height. When I demonstrated a prototype of this at the illumos BOF at Surge, a couple of folks volunteered that I should look into using the Unicode Block Elements for this output. Here’s the same enabling, but rendered using these elements:

# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

              key  min .-------------------. max     | count
         lwp_self   32 :   █               : 8388608 | 1
        sysconfig   32 : ▃▃▃               : 8388608 | 3
       sigpending   32 :    █              : 8388608 | 1
             pset   32 :  ▆ ▃              : 8388608 | 3
      getsockname   32 :     █             : 8388608 | 1
       systeminfo   32 :   ▅▄              : 8388608 | 5
            fstat   32 :      █            : 8388608 | 1
           llseek   32 :    ▆▃             : 8388608 | 3
          memcntl   32 :      █            : 8388608 | 1
      resolvepath   32 :      █            : 8388608 | 1
           semsys   32 :      █            : 8388608 | 1
       setcontext   32 :     █             : 8388608 | 2
          setpgrp   32 :      █            : 8388608 | 1
          fstat64   32 :   ▃▃▃             : 8388608 | 6
        sigaction   32 : ▃▃▂▃              : 8388608 | 17
           accept   32 :       █           : 8388608 | 1
           access   32 :       █           : 8388608 | 1
           munmap   32 :       █           : 8388608 | 1
        setitimer   32 :    ▅▃▃            : 8388608 | 4
      lwp_sigmask   32 : ▄▃▃▂              : 8388608 | 26
             pipe   32 :        █          : 8388608 | 1
          waitsys   32 :   ▅    ▅          : 8388608 | 2
             stat   32 :     ▃ ▆           : 8388608 | 3
             bind   32 :      ▅▄           : 8388608 | 5
          mmapobj   32 :         █         : 8388608 | 1
           open64   32 :         █         : 8388608 | 1
         p_online   32 : █▁ ▁              : 8388608 | 256
           getpid   32 :   ▁▆▂             : 8388608 | 45
         schedctl   32 :         █         : 8388608 | 2
       getsockopt   32 :    ▃▅▂            : 8388608 | 32
       setsockopt   32 :   ▁▁▃▆            : 8388608 | 31
            fcntl   32 : ▂▁▃▃▂             : 8388608 | 114
          recvmsg   32 :    ▁▄▃▂▁▁         : 8388608 | 17
           putmsg   32 :           █       : 8388608 | 1
             mmap   32 :       ▅   ▅       : 8388608 | 2
           writev   32 :         ▆ ▃       : 8388608 | 3
            lseek   32 :   ▁▃▃▄▁▁          : 8388608 | 76
              brk   32 :  ▃▂▃▃▂▁▁ ▁        : 8388608 | 124
            gtime   32 : ▇▁▁▁▁▁ ▁          : 8388608 | 1405
          sendmsg   32 :        ▂▅▄        : 8388608 | 10
             recv   32 :   ▂▃▃▂▂▁          : 8388608 | 159
            yield   32 :    █▁▁▁           : 8388608 | 409
             open   32 :      ▃▂▄▁▂        : 8388608 | 30
            close   32 :    ▂▃▂▃▂▁         : 8388608 | 60
        so_socket   32 :       ▁▃▆         : 8388608 | 29
  lwp_cond_signal   32 :    ▁▃▂▂▁▃▁        : 8388608 | 59
        nanosleep   32 :       ▂▆▂         : 8388608 | 74
    lwp_cond_wait   32 :      ▂▃▄▂         : 8388608 | 120
          connect   32 :         ▂▅▃       : 8388608 | 24
           stat64   32 :       ▁▂▇▁        : 8388608 | 77
          pollsys   32 :   ▁▁▁ ▁▅▄         : 8388608 | 148
             send   32 :    ▁▂▂▂▂▃▂▁       : 8388608 | 137
             read   32 :   ▁▂▃▂▃▂▁▁▁▁      : 8388608 | 283
            ioctl   32 :  ▇▁▁▁▁▁▁▁▁▁▁      : 8388608 | 3697
            write   32 :   ▁▁▁▂▂▃▂▁▁ ▁▁    : 8388608 | 207
          forksys   32 :                 █ : 8388608 | 1
           portfs   32 :   ▁▂▂▁▁▂▄▁ ▁      : 8388608 | 807
         lwp_park   32 :  ▁ ▁▁▁▃▃▂▃▁       : 8388608 | 919

Delicious! (Assuming, of course, that these look right for you — which they may or may not, depending on how your monospaced font renders the Unicode Block Elements.) After one look at the Unicode Block Elements, I knew they had to be the default behavior — but if your terminal is rendered in a font that can’t display the UTF-8 encodings of these characters (less common), or if they render in a way that is not monospaced despite being in a putatively monospaced font (more common), I also added a “-x encoding” option that can be set to “ascii” to force the ASCII output.
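For example, to force the same enabling back to the ASCII rendering:

```shell
# same packed aggregation, but rendered with ASCII characters only
dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(vtimestamp - self->ts); self->ts = 0}' \
    -x aggpack -x encoding=ascii
```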

Returning to our example, if the above is too much dead space for you, you can combine it with aggzoom:

# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
    -n syscall:::return'/self->ts/{@[probefunc] = \
    quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack -x aggzoom
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C

              key  min .----------------. max     | count
     lwp_continue   32 :    █           : 1048576 | 1
       sigpending   32 :    █           : 1048576 | 1
           uucopy   32 :    █           : 1048576 | 1
         lwp_kill   32 :     █          : 1048576 | 1
        sigaction   32 : ▄▄ █           : 1048576 | 4
             pset   32 :  █  ▄          : 1048576 | 3
        sysconfig   32 : ███ █          : 1048576 | 4
       setcontext   32 :    ██          : 1048576 | 2
            fstat   32 :      █         : 1048576 | 1
         recvfrom   32 :      █         : 1048576 | 1
       systeminfo   32 :   ▅█           : 1048576 | 5
      lwp_sigmask   32 : ██▂▅           : 1048576 | 14
           accept   32 :     █ █        : 1048576 | 2
          recvmsg   32 :     █ █        : 1048576 | 2
             stat   32 :        █       : 1048576 | 2
         p_online   32 : █▁ ▁           : 1048576 | 256
         schedctl   32 :      █  █      : 1048576 | 2
              brk   32 :    ▄███▄       : 1048576 | 8
       getsockopt   32 :    ▅██         : 1048576 | 21
           getpid   32 :    █▇          : 1048576 | 39
       setsockopt   32 :     ▃█▂        : 1048576 | 16
       lwp_create   32 :          █     : 1048576 | 1
          sendmsg   32 :          █     : 1048576 | 1
            fcntl   32 : ▃▄▃▆█▁         : 1048576 | 60
           writev   32 :           █    : 1048576 | 1
            lseek   32 :    ▆▇█▂        : 1048576 | 65
             recv   32 :  ▁▆▃█▆▅        : 1048576 | 62
           putmsg   32 :           █    : 1048576 | 2
             open   32 :      █ ▅▂▃     : 1048576 | 16
            gtime   32 : █▁▁▁▁  ▁       : 1048576 | 1365
            close   32 :    ▂█▂█▄▃      : 1048576 | 36
  lwp_cond_signal   32 :     ▂▂█▅▅▂     : 1048576 | 18
        so_socket   32 :      ▁▂▁█      : 1048576 | 19
             mmap   32 :     ▆▃ █▄ ▆    : 1048576 | 13
        nanosleep   32 :       ▃█▃      : 1048576 | 34
             read   32 :   ▁▂▅▄█▅▂      : 1048576 | 144
    lwp_cond_wait   32 :       ▃█▂▁     : 1048576 | 74
            yield   32 :    █▁▁▁ ▁      : 1048576 | 1187
             send   32 :     ▁▂▅█▅▄▂    : 1048576 | 50
          connect   32 :      ▂▂ ▅██▃   : 1048576 | 20
           stat64   32 :     ▁▁▁▃█▁     : 1048576 | 80
          pollsys   32 :     ▁▁▁█▆      : 1048576 | 136
            write   32 :  ▂▄ ▃▂▄█▆▃▂ ▂  : 1048576 | 77
            ioctl   32 :  █▂▁▁▁▁▁▁ ▁    : 1048576 | 4587
           mmap64   32 :      ▂▄▄▂▅ ▂ █ : 1048576 | 20
           portfs   32 :  ▁▁▂▂▁▂▂█▂     : 1048576 | 601
         lwp_park   32 :    ▁▁▂▆▇▃█▁▁   : 1048576 | 787

Here’s another fun one:

# dtrace -n BEGIN'{start = timestamp}' \
    -n sched:::on-cpu'{@[execname] = \
    lquantize((timestamp - start) / 1000000000, 0, 30, 1)}' \
    -x aggpack -n tick-1sec'/++i >= 30/{exit(0)}' -q
              key     min .--------------------------------. max      | count
             sshd     < 0 : █                              : >= 30    | 1
            zpool     < 0 : █                              : >= 30    | 8
              zfs     < 0 : █                              : >= 30    | 19
          zoneadm     < 0 : █                              : >= 30    | 19
            utmpd     < 0 :     █                          : >= 30    | 1
            intrd     < 0 :                             █  : >= 30    | 2
             epmd     < 0 :   ▂    ▂    ▂    ▂    ▂    ▂   : >= 30    | 6
            mdnsd     < 0 :   ▂    ▂    ▂    ▂    ▂    ▂   : >= 30    | 6
         sendmail     < 0 :   ▂    ▂    ▂    ▂    ▂    ▂   : >= 30    | 6
         ur-agent     < 0 :   ▂▂         ▂         ▂   ▃▂  : >= 30    | 14
           vmadmd     < 0 :     ▃    ▂    ▂    ▂    ▂    ▃ : >= 30    | 22
         devfsadm     < 0 : ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁  : >= 30    | 30
          bridged     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 30
          fsflush     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 32
              fmd     < 0 :        ▁▁        ▁▇        ▁▁  : >= 30    | 41
            ipmon     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 51
          ipmgmtd     < 0 :         ▃         ▃         ▃  : >= 30    | 66
            lldpd     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 90
           dtrace     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 111
          haproxy     < 0 : ▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁ : >= 30    | 113
      svc.configd     < 0 :         ▂▁        ▄▁        ▂▁ : >= 30    | 126
       svc.startd     < 0 :         ▃▁        ▃▁        ▃▁ : >= 30    | 150
         beam.smp     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 209
             nscd     < 0 : ▁       ▃▂        ▃▂        ▃▂ : >= 30    | 192
         postgres     < 0 : ▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 598
     redis-server     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 598
      zpool-zones     < 0 :     ▂  ▁ ▂    ▃  ▁ ▂    ▂  ▁ ▃ : >= 30    | 623
             java     < 0 : ▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 1211
             tail     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 1178
             node     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁ : >= 30    | 11195
            sched     < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30    | 22101

This (quickly) shows you the second offset for CPU scheduling events from the start of the D script for various applications, and points to some obvious conclusions: there are clearly a bunch of programs running once every ten seconds, a couple more running once every five seconds, and so on.

encoding

Seeing this, why should aggpack hog all that Unicode hawtness? While only aggpack will use the UTF-8 encodings by default, I also extended the encoding option to allow a “utf8” setting that forces the UTF-8 encoding of the Unicode Block Elements to be used for all aggregation display. For example, to return to our earlier agghist example:

# dtrace -n syscall:::entry'{@[execname] = count()}' \
    -x agghist -x aggzoom -x encoding=utf8
dtrace: description 'syscall:::entry' matched 233 probes
^C

              key  ------------- Distribution ------------- count
            utmpd |                                         2
             sshd |                                         10
           pickup |                                         14
              fmd |                                         20
         ur-agent |                                         21
             epmd |                                         22
           vmadmd |                                         31
      jsontool.js |                                         33
            mdnsd |                                         33
           master |                                         41
            intrd |                                         45
         devfsadm |                                         54
          ipmgmtd |                                         55
          bridged |                                         81
            mkdir |                                         83
              cat |                                         96
         sendmail |                                         101
            ipmon |                                         108
          dirname |                                         115
       svc.startd |                                         129
            ksh93 |                                         164
               ls |                                         182
          svcprop |▏                                        258
             bash |▏                                        290
            zpool |▎                                        436
             date |▎                                        448
             cron |▎                                        470
            lldpd |▍                                        595
 dump-minutely-sd |▍                                        609
          haproxy |▍                                        654
      svc.configd |▌                                        907
         zoneadmd |█                                        1554
              zfs |█▎                                       1979
          zoneadm |█▋                                       2414
             nscd |█▋                                       2522
             java |██▏                                      3205
             tail |██▌                                      3775
         postgres |████▏                                    6083
     redis-server |██████████████▋                          21710
           dtrace |███████████████████▎                     28509
             node |█████████████████████████▍               37427
         beam.smp |██████████████████████████████████████   55948

To paraphrase Kent Brockman, I for one welcome our new Unicode Block Element overlords — and I look forward to toiling in their underground sugar caves!

Availability

All of these new options have been integrated into SmartOS (and we aim to get them upstreamed into illumos). If you’re a Joyent public cloud customer, you either already have these improvements or they will be coming to you when your platform is next upgraded (depending on the compute node that you have landed on). And if you, like Brendan, are a DTrace user with some new way that you’d like to see data visualized (or have any other DTrace feature request), don’t hesitate to speak up — DTrace only improves when those of us who depend on it every day envision a way that it could be even better!

Posted on November 10, 2013 at 9:30 pm by bmc

Happy 10th Birthday, DTrace!

Ten years ago this morning — at 10:27a local time, to be exact — DTrace was integrated. On the occasion of DTrace’s fifth birthday a half-decade ago, I reflected on much of the drama of the final DTrace splashdown, but much has happened in the DTrace community in the last five years; some highlights:

And a bunch of other stuff that’s either more esoteric (e.g., llquantize) or that I now use so frequently that I’ve forgotten that there was a time that we didn’t have it (-x temporal FTW!). But if I could go back to my five-year-ago self, the most surprising development to my past-self may be how smoothly DTrace and its community survived the collapse of Sun and the craven re-closing of Solaris. It’s a testament to the power of open source and open communities that DTrace has — along with its operating system of record, illumos — emerged from these events empowered and invigorated. So I would probably show my past-self my LISA talk on the rise of illumos, and he and I would both get a good chuckle out of it all — and then no doubt erupt in fisticuffs over the technical impossibility of fds[] in the non-global zone.

And lest anyone think that we’re done, the future is bright, as the El Dorado of user-land types in DTrace may be gleaming on the horizon courtesy of Adam and Robert Mustacchi. I think I speak on behalf of all of us in the broader DTrace community when I say that I’m very much looking forward to continuing to use, improve and expand DTrace for the next decade and beyond!

Posted on September 3, 2013 at 7:00 am by bmc

Serving up disaster porn with Manta

Years ago, Ben Fried liberated me by giving me the words to describe myself: I am a disaster porn addict. The sordid specifics of my addiction would surely shock the unafflicted: there are well-thumbed NTSB final accident reports hidden under my mattress; I prowl the internet late at night for pictures of dermoid cysts; and I routinely binge on the vexed Soviet missions to Mars.

In terms of my software work, my predilection for (if not addiction to) systems failure has manifested itself in an acute interest in postmortem debugging and debuggability. You can see this through my career, be it postmortem memory leak detection in 1999, postmortem object type identification in 2002, postmortem DTrace in 2003 or postmortem debugging of JavaScript in 2012.

That said, postmortem debugging does have a dark side — or at least, a tedious side: the dumps themselves are annoyingly large artifacts to deal with. Small ones are only tens of megabytes, but a core dump from a modestly large VM — and certainly a crash dump of a running operating system — will easily be gigabytes or tens of gigabytes in size. This problem is not new, and my career has been pockmarked by centralized dump servers that were constantly running low on space[1]. To free up space in the virtual morgue, dumps from resolved issues are inevitably torched on a regular basis. This act strikes me as a kind of desecration; every dump — even one whose immediate failure is understood — is a sacred snapshot of a running system, and we can’t know what questions we might have in the future that may be answerable by the past. There have been many times in my career when debugging a new problem has led to my asking new questions of old dumps.

At Joyent, we have historically gotten by on the dump management problem with traditional centralized dump servers managed with creaky shell scripts — but it wasn’t pretty, and I lived in fear of dumps slipping through the cracks. Ironically, it was the development of Manta that made our kludged-together solution entirely untenable: while in the Joyent public cloud we may have the luxury of (broadly) ignoring core dumps that may correspond to errant user code, in Manta, we care about every core dump from every service — we always want to understand why any component fails. But because it’s a distributed service, a single bug could cause many components to fail — and generate many, many dumps. In the earlier stages of Manta development, we quickly found ourselves buried under an avalanche of hundreds of dumps. Many of them were due to known issues, of course, but I was concerned that a hitherto unseen issue would remain hidden among the rubble — and that we would lose our one shot to debug a problem that would return to haunt us in production.

A sexy solution

Of course, the answer to the Manta dump problem was clear: Manta may have been presenting us with a big data problem, but it also provided us a big data solution. Indeed, Manta is perfectly suited for the dump problem: dumps are big, but they also require computation (if in modest measure) to make any real use of them, and Manta’s compute abstraction of the operating system is perfect for running the debugger. There were just two small wrinkles, the first being that Manta as designed only allowed computation access to an object via a (non-interactive) job, posing a problem for the fundamentally interactive act of debugging. This wasn’t a technological limitation per se (after all, the operating system naturally includes interactive elements like the shell and so on), but allowing an interactive shell to be easily created from a Manta job was going to require some new plumbing. Fortunately, some lunchtime moaning on my part managed to convince Joshua Clulow to write this (if only to quiet my bleating), and the unadulterated awesomeness that is mlogin was born.

The second wrinkle was even smaller: for historical (if not antiquated) reasons, kernel crash dumps don’t comprise just one file, but two — a crash dump and a “name list” that contains the symbol table. But for over a decade we have also had the symbol table in the dump itself, and it was clear that the vestigial appendix that was unix.0 needed to be removed and libkvm modified to dig the symbol table out of the dump — a straightforward fix.

Wrinkles smoothed, we had the necessary foundation in Manta to implement Thoth, a Manta-based system for dump management. Using the node-manta SDK to interact with Manta, Thoth allows for uploading dumps to Manta, tracking auxiliary information about each dump, debugging a dump (natch), querying dumps, and (most interestingly) automatically analyzing many dumps in parallel. (If you are in the SmartOS or broader illumos community and you want to use Thoth, it’s open source and available on GitHub.)

While the problem Thoth solves may have but limited appeal, some of the patterns it uses may be more broadly applicable to those building Manta-based systems. As you might imagine, Thoth uploads a dump by calculating a unique hash for the dump on the client[2]. Once the hash is calculated, a Manta directory is created that contains the dump itself. For the metadata associated with the dump (machine, application, datacenter and so on), I took a quick-and-dirty approach: Thoth stores the metadata about a dump in a JSON payload that lives beside the dump in the same directory. Here is what a JSON payload might look like:

% thoth info 6433c1ccfb41929d
{
  "name": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea",
  "dump": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea/core.svc.startd.70377",
  "pid": "70377",
  "cmd": "svc.startd",
  "psargs": "/lib/svc/bin/svc.startd",
  "platform": "joyent_20130625T221319Z",
  "node": "eb9ca020-77a6-41fd-aabb-8f95305f9aa6",
  "version": "1",
  "time": 1373566304,
  "stack": [
    "libc.so.1`_lwp_kill+0x15()",
    "libc.so.1`raise+0x2b()",
    "libc.so.1`abort+0x10e()",
    "utmpx_postfork+0x44()",
    "fork_common+0x186()",
    "fork_configd+0x8d()",
    "fork_configd_thread+0x2ca()",
    "libc.so.1`_thrp_setup+0x88()",
    "libc.so.1`_lwp_start()"
  ],
  "type": "core"
}
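The client-side hashing that names that directory can be sketched as follows, with md5 as a stand-in digest (Thoth’s actual digest choice is an assumption here, though the 32-character hex directory name above is the flavor of the result):

```shell
# name the dump's Manta directory after a content hash of the dump
# itself (md5 is a stand-in; Thoth's actual digest may differ)
dump=core.svc.startd.70377
hash=$(md5sum "$dump" | awk '{ print $1 }')
echo "/thoth/stor/thoth/$hash"
```

One nice property of content-derived names is that the same dump always maps to the same directory, no matter who uploads it or when.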

To actually debug a dump, one can use the “debug” command, which simply uses mlogin to allow interactive debugging of the dump via mdb:

% thoth debug 6433c1ccfb41929d
thoth: debugging 6433c1ccfb41929dedf3257a8d9160ea
 * created interactive job -- 8ea7ce47-a58d-4862-a070-a220ae23e7ce
 * waiting for session... - established
thoth: dump info in $THOTH_INFO
Loading modules: [ svc.startd libumem.so.1 libc.so.1 libuutil.so.1 libnvpair.so.1 ld.so.1 ]
>

Importantly, this allows you to share a dump with others without moving the dump and without needing to grant ssh access or punch holes in firewalls — they need only be able to have access to the object. For an open source system like ours, this alone is huge: if we see a panic in (say) ZFS, I would like to enlist (say) Matt or George to help debug it — even if it’s to verify a hypothesis that we have developed. Transporting tens of gigs around becomes so painful that we simply don’t do it — and the quality of software suffers.

To query dumps, Thoth runs simple Manta compute jobs on the (small) metadata objects. For example, if you wanted to find all of the dumps from a particular machine, you would run:

% thoth ls node=eb9ca020-77a6-41fd-aabb-8f95305f9aa6
thoth: creating job to list
thoth: created job 2e8af859-ffbf-4757-8b54-b7865307d4d9
thoth: waiting for completion of job 2e8af859-ffbf-4757-8b54-b7865307d4d9
DUMP             TYPE  TIME                NODE/CMD         TICKET
0ccf27af56fe18b9 core  2013-07-11T18:11:43 svc.startd       OS-2359
6433c1ccfb41929d core  2013-07-11T18:11:44 svc.startd       OS-2359

This kicks off a job that looks at all of the metadata objects in parallel in a map phase, pulls out those that match the specified machine and passes them to a reduce phase that simply serves to coalesce the results into a larger JSON object. We can use the mjob command along with Trent Mick’s excellent json tool to see the first phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[0].exec
mls /thoth/stor/thoth | awk '{ printf("/thoth/stor/thoth/%s/info.json\n", $1) }' | xargs mcat

So this, as it turns out, is a tad fancy: in the context of a job, it runs mls on the thoth directory, formats that as an absolute path name to the JSON object, and then uses the venerable xargs to send that output as arguments to mcat. mcat interprets its arguments as Manta objects, and runs the next phase of a job in parallel on those objects. We could — if we wanted — do the mls on the client and then pass in the results, but it would require an unnecessary roundtrip between server and client for each object; the power of Manta is that you can get the server to do work for you! Now let’s take a look at the next phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[1].exec
json -c 'node=="eb9ca020-77a6-41fd-aabb-8f95305f9aa6"'

This is very simple: it just uses json’s -c option to pass the entire JSON blob if the “node” property matches the specified value. Finally, the last phase:

% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[2].exec
json -g

This uses json (again) to aggregate all of the JSON blobs into one array that contains all of them — and this array is what is returned as the output of the job. Now, as one might imagine, all of this is a slight abuse of Manta, as it’s using Manta to implement a database (albeit crudely) — but Manta is high performing enough that this works well, and it’s much simpler than having to spin up infrastructure dedicated to Thoth (which would require sizing a database, making it available, backing it up, etc). Manta loves it quick and dirty!
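The shape of that first phase (enumerate, format, fan out) is easy to play with locally; here is a sketch with printf standing in for mls and echo standing in for mcat, using the hashes from the ls output above:

```shell
# the enumerate/format/fan-out pattern with local stand-ins:
# printf plays the role of mls, echo the role of mcat
printf '0ccf27af56fe18b9\n6433c1ccfb41929d\n' \
    | awk '{ printf("/thoth/stor/thoth/%s/info.json\n", $1) }' \
    | xargs echo
```

xargs gathers the formatted names into a single argument list, just as the real phase hands the object names to mcat.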

Getting kinky

Leveraging Manta’s unique strengths, Thoth also supports a notion of analyzers, simple shell scripts that run in the context of a Manta job. For example, here is an analyzer that determines if a dump is a duplicate of a known svc.startd issue:

#
# This analyzer only applies to core files
#
if [[ "$THOTH_TYPE" != "core" ]]; then
	exit 0;
fi

#
# This is only relevant for svc.startd
#
if [[ `cat $THOTH_INFO | json cmd` != "svc.startd" ]]; then
	exit 0;
fi

#
# This is only OS-2359 if we have utmpx_postfork in our stack
#
if ( ! echo ::stack | mdb $THOTH_DUMP | grep utmpx_postfork > /dev/null ); then
	exit 0;
fi

#
# We have a winner! Set the ticket.
#
thoth_ticket OS-2359
echo $THOTH_NAME: successfully diagnosed as OS-2359

Thoth’s “analyze” command can then be used to run this against one dump, any dumps that match a particular specification, or all dumps.
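An invocation might look something like this (a sketch only: the analyzer name is hypothetical, and the filter syntax is assumed to mirror the ls example above rather than taken from Thoth’s documented interface):

```shell
# hypothetical: run the OS-2359 analyzer against all svc.startd dumps
thoth analyze os-2359-dup cmd=svc.startd
```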

The money shot

Here’s a more involved analyzer that pulls an SMF FMRI out of the dump and — looking at the time of the dump — finds the log (in Manta!) that corresponds to that service at that hour and looks for a string (“PANIC”) in that log that corresponds to service failure. This log snippet is then attached to the dump via a custom property (“paniclog”) that can then be queried by other analyzers:

if [[ "$THOTH_TYPE" != "core" ]]; then
	exit 0
fi

if ( ! pargs -e $THOTH_DUMP | grep -w SMF_FMRI > /dev/null ); then
	exit 0
fi

fmri=`pargs -e $THOTH_DUMP | grep -w SMF_FMRI | cut -d= -f2-`
time=`cat $THOTH_INFO | json time`
node=`cat $THOTH_INFO | json node`

path=`basename $fmri | cut -d: -f1`/`date -d @$time +%Y/%m/%d/%H`/${node}.log

echo === $THOTH_NAME ===

echo log: $path

if ( ! mget /$MANTA_USER/stor/logs/$path > log.out ); then
	echo "  (not found)"
	exit 0
fi

grep -B 10 -A 10 PANIC log.out > panic.out

echo paniclog:
cat panic.out

thoth_set paniclog < panic.out
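To see how the log path comes together, here is the derivation in isolation, with invented sample values standing in for what the analyzer pulls from the dump and its info blob (-u is added here only to make the example timezone-independent):

```shell
#!/bin/bash
# Invented sample values standing in for the real dump metadata
fmri="svc:/system/filesystem/local:default"
time=1374595200
node=RA101

# basename strips the svc:/... prefix (leaving "local:default"), cut
# drops the ":default" instance name, and GNU date buckets the
# timestamp into the per-hour log directory.
path=`basename $fmri | cut -d: -f1`/`date -u -d @$time +%Y/%m/%d/%H`/${node}.log
echo $path
```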

In a distributed system, logs are a critical component of system state — being able to quickly (and automatically) couple those logs with core dumps has been huge for us in terms of quickly diagnosing issues.

Pillow talk

Thoth is definitely a quick-and-dirty system — I was really designing it to satisfy our own immediate needs (and yes, addictions). If the usage patterns changed (if, for example, one were going to have millions of dumps instead of thousands), one would clearly want a proper database fronting Manta. Of course, it must also be said that one would then need to deal with annoying infrastructure issues like sizing that database, making it available, backing it up, etc. And if (when?) this needs to be implemented, my first approach would be to broadly keep the structure as it is, but add a Manta job that iterates over all metadata and adds the necessary rows in the appropriate tables of a centralized reporting database. This allows Manta to store the canonical data, leveraging the in situ compute to populate caching tiers and/or reporting databases.
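As a sketch of what such a metadata-iterating phase might emit — with an invented dumps table, and sed standing in for the json tool — each info blob could be turned into a row insertion along these lines:

```shell
#!/bin/bash
# Hypothetical: turn one Thoth info blob into an INSERT for an invented
# reporting table; a real phase would read $THOTH_INFO and use the json
# tool rather than sed.
info='{"name":"core.svc.startd.1234","node":"RA101","time":1374595200}'

name=`echo "$info" | sed 's/.*"name":"\([^"]*\)".*/\1/'`
node=`echo "$info" | sed 's/.*"node":"\([^"]*\)".*/\1/'`
time=`echo "$info" | sed 's/.*"time":\([0-9]*\).*/\1/'`

echo "INSERT INTO dumps (name, node, time) VALUES ('$name', '$node', $time);"
```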

As an engineer, it’s always gratifying to build something that you yourself want to use. Strange as it may sound to some, Thoth epitomizes that for me: it is the dump management and analysis framework that I never dreamed possible — and thanks to Manta, it was amazingly easy (and fun!) to build. Most importantly, in just the short while that we’ve had Thoth fully wired up, we have already caught a few issues that would have likely escaped notice before. (Thanks to a dump harvested by Thoth, we caught a pretty nasty OS bug that had managed to escape detection for many years.) Even though Thoth’s purpose is admittedly narrow, it’s an example of the kinds of things that you can build on top of Manta. We have many more such ideas in the works — and we’re not the only ones; Manta is inspiring many with their own big data problems — from e-commerce companies to scientific computing and much in between!


[1] Pouring one out for my dead homie, cores2.central

[2] Shout-out to Robert’s excellent node-ctype which allows for binary dumps to be easily parsed client-side

Posted on July 23, 2013 at 5:05 pm by bmc

Manta: From revelation to product

If you haven’t seen it, you should read Mark’s blog entry on Manta, the revolutionary new object storage service we announced today that features in situ compute. The idea for this is beautifully simple: couple a ZFS-based distributed object storage system with the OS-level virtualization in SmartOS to deliver a system that allows arbitrary compute to be spun up where objects live. That is, not only can you store and retrieve objects (as you can with any internet-facing object store), you can also specify compute jobs to be operated upon those objects without requiring data motion. (If you still need to be convinced that this represents a new paradigm of object storage, check out this screencast.)

Sometimes simple ideas can seem obvious in hindsight — especially as the nuance of historical context is lost to time — so for the record let me be clear on this point: this idea wasn’t obvious. I say this unequivocally because I myself was trying to think about how we could use the technological differentiators in SmartOS to yield a better object store — and it was very hard not to seek that differentiator in ZFS. As myopic as it may seem in retrospect, I simply couldn’t look beyond ZFS — it just had to hold the key to a next-generation object store!

But while ZFS is essential to Manta, it is ultimately an implementation detail; the technology that serves as the essential enabler is the OS-level virtualization provided by Zones, which allow us to easily, quickly and securely spin up on-the-metal, multi-tenant compute on storage nodes without fear of compromising the integrity of the system. Zones hit the sweet spot: hardware virtualization (e.g. KVM) is at too low a level of abstraction to allow this efficiently, and higher levels of virtualization (e.g. PaaS offerings) sacrifice expressive power and introduce significant multi-tenancy complexity and risk.

Of course, all of this was obvious once Mark had the initial insight to build on Zones; what we needed to build was instantly self-evident. This flash of total clarity is rare in a career; I have only felt it a handful of times and it’s such an intense moment that it becomes locked in memory. I remember exactly where I was when Bonwick described to me the first ideas for what became ZFS (my dimly lit office in MPK17, circa 2000) — and I remember exactly where I was when I described to Bonwick my first ideas for what became DTrace (in Bart’s old blue van on Willow Road crossing the 101, February 1996). Given this, there was one thing about Manta that troubled me: I couldn’t remember where I was when Mark described the idea to me. I knew that it came from Mark, and I knew that it was sometime in the fall of 2011, but I couldn’t remember the details of the conversation. In talking to Mark about this, he couldn’t remember either — so I decided to go through old IM logs to determine when we first started talking about it to help us both date it.

And in going through my logs, it became clear why I couldn’t remember that initial conversation — because there wasn’t one, at least not in the traditional sense: it happened over IM. This (accidentally) captured for posterity a moment of which one has so few: having one’s brain blasted by enlightenment. (I know that it will disappoint my mother that I dealt with this essentially by swearing, so let me pause to explain that this isn’t her fault; as she herself points out from time to time, I wasn’t raised this way.) So here is my initial conversation with Mark, with some sensitive details redacted:

Of course, a flash of insight is a long way from a product — and that conversation over a year and a half ago was just the beginning of a long journey. As Mark mentioned, shortly after this conversation, he was joined by Dave who led the charge on Manta computation. To this duo, we added Yunong, Nate and Fred. Beyond that core Manta team, Keith developed the hardware architecture for Manta, Jerry developed the first actual Manta code with hyprlofs, and Bill and Matt wrote a deployment management system for Manta — and Matt further developed the DNS service that is at the heart of the system.

As Mark mentioned, Manta is built on top of SmartDataCenter, and as such the engineering work behind it was crucial to Manta: Orlando developed the compute node service that is involved in the provisioning of every Manta service, Pedro built the workflow engine that actually implements provisioning and Kevin developed the operator console that you’ll have to just trust me is awesome. John developed the auth mechanism that many first-time users will use to create their Joyent accounts today, and Andrés developed the name service replication that assures that those new users will be able to store to Manta.

In terms of SDKs, Marsell developed the ruby-manta SDK, Trent developed both the Python SDK (including mantash!) and Bunyan — the node.js logging library that has proven indispensable time and time again when debugging Manta issues. Speaking of node.js, no one will be surprised to learn that Manta is largely implemented in node — and that TJ and Isaac were both clutch in helping us debug some nasty issues down the stretch, reinforcing our conviction that the entire stack should be under one (virtual) roof!

Josh Wilsdon developed the vmadm that lies at the heart of SmartOS provisioning — and deserves special mention for the particularly heavy hand he applied to Manta in some of our final load testing; any system that can survive Josh is production-ready! Robert and Rob both jumped in on countless networking issues — they were both critical for implementing some necessary but complicated changes to Manta’s physical networking topology. Brendan provided the performance analysis that is his hallmark, joining up with Robert to form Team Tunable — from whom no TCP tunable is safe!

Jonathan developed the multi-architecture package support which became necessary for the Manta implementation and Filip made sure all sorts of arcane software ran on SmartOS in the Manta compute zone. (When you find that what you’re looking for is already installed in a Manta compute zone, you have Filip to thank!) Finally, Josh Clulow developed mlogin — which really must be tried to be believed. If you’re trying to understand the Manta compute model or if you just want to play around, give mlogin a whirl!

From Mark’s initial flash of insight (and my barrage of swearing) to finished product, it has been a long road, but we are proud to deliver it to you today; welcome to Manta — and dare we say to the next generation of object storage!

Posted on June 25, 2013 at 6:16 am by bmc