Triton: Docker and the “best of all worlds”
When Docker first rocketed into the nerdosphere in 2013, some wondered how we at Joyent felt about its popularity. Having run OS containers in multi-tenant production for nearly a decade (and being one of the most vocal proponents of OS-based virtualization), did we somehow resent the relatively younger Docker? Some were surprised to learn that (to the contrary!) we have been elated to see the rise of Docker: we share with Docker a vision for a containerized future, and we love that Docker has brought the technology to a much broader audience — and via an entirely different vector (namely, emphasizing developer agility instead of merely operational efficiency). Given our enthusiasm, you can imagine the question we posed to ourselves over a year ago: could we somehow combine the operational strength of SmartOS containers with the engaging developer experience of Docker? Importantly, we had no desire to develop a “better” Docker — we merely wanted to use SmartOS and SmartDataCenter as a substrate upon which to deploy Docker containers directly onto the metal. Doing this would leverage over a decade of deep operating systems engineering with technologies like Crossbow, ZFS, DTrace and (of course) Zones — and would deliver all of the operational advantages of pure OS-based virtualization to Docker containers: performance, elasticity, security and density.
That said, there was an obvious hurdle: while designed to be cross-platform, Docker is a Linux-borne technology — and the repository of Docker images is today a collection of Linux binaries. While SmartOS is Unix, it (somewhat infamously) isn’t Linux: applications need to be at least recompiled (if not ported) to work on SmartOS. Into this gap came a fortuitous accident: David Mackay, a member of the illumos community, attempted to revive LX-branded zones, an old Sun project that provided Linux emulation in a zone. While this project had been very promising when it was first done years ago, it had also been restricted to emulating a 2.4 Linux kernel for 32-bit binaries — and it was clear at the time that modernizing it was going to be significant work. As a result, the work sat unattended in the system for a while before being unceremoniously ripped out in 2010. It seemed clear that with the passage of time, this work would hardly be revivable: it had been so long, any resurrection was going to be tantamount to a rewrite.
But fortunately, David didn’t ask us our opinion before he attempted to revive it — he just did it. (As an aside: a tremendous advantage of open source is that the community can perform experiments that you might deem too risky or too expensive in terms of opportunity cost!) When David reported his results, we were taken aback: yes, this had the same limitations that it had always had (namely, 32-bit and lacking many modern Linux facilities), but given how many modern binaries still worked, it was also clear that this was a more viable path than we had thought. Energized by David’s results, Joyent’s Jerry Jelinek picked it up from there, reintegrating the Linux brand into SmartOS in March of last year. There was still much to do of course, but Jerry’s work was a start — and reflected the constraints we imposed on ourselves: do it all in the open; do it all on SmartOS master; develop general-purpose illumos facilities wherever possible; and aim to upstream it all when we were done.
Around this time, I met with Docker CTO Solomon Hykes to share our (new) vision. Honestly, I didn’t know what his reaction would be; I had great respect for what Docker had done and was doing, but didn’t know how he would react to a system bold enough to go its own way at such a fundamental level. Somewhat to my surprise, Solomon was incredibly supportive: not only was he aware of SmartOS, but he was also intimately familiar with zones — and he didn’t need to be convinced of the merits of our approach. Better, he asked a question near and dear to my heart: “Does this mean that I’ll be able to DTrace my Linux apps in a Docker container?” When I indicated that yes, that’s exactly what it would mean, he responded: “It will be the best of all worlds!” That Solomon (and by extension, Docker) was not merely willing but actually eager to see Docker on SmartOS was hugely inspirational to us, and we redoubled our efforts.
Back at Joyent, we worked assiduously under Jerry’s leadership over the spring and summer, and by the fall, we were ready for an attempt on the summit: 64-bit. Like other bringup work we’ve done, this work was terrifying in that we had very little forward visibility, and little ability to parallelize. As if he were Obi-Wan Kenobi meeting Darth Vader in the Death Star, Jerry had to face 64-bit — alone. Fortunately, Jerry didn’t suffer Ben Kenobi’s fate; by late October, he had 64-bit working! With the project significantly de-risked, everything kicked into high gear: Josh Wilsdon, Trent Mick and their team went to work understanding how to integrate SmartDataCenter with Docker; Josh Clulow, Patrick Mooney and I attacked some of the nasty LX-branded zone issues that remained; and Robert Mustacchi and Rob Gulewich worked towards completing their vision for network virtualization. Knowing what we were going to do — and how important open source is to modern infrastructure software in general and Docker in particular — we also took an important preparatory step: we open sourced SmartDataCenter and Manta.
Charged by having all of our work in the open and with a clear line of sight on what we wanted to deliver, progress was rapid. One major question: where to run the Docker daemon? In digging into Docker, we saw that much of what the actual daemon did would need to be significantly retooled to be phrased in terms of not only SmartOS but also SmartDataCenter. However, our excavations also unearthed a gem: the Docker Remote API. Discovering a robust API was a pleasant surprise, and it allowed us to take a different angle: instead of running a (heavily modified) Docker daemon, we could implement a new SDC service to provide a Docker Remote API endpoint. To Docker users, this would look and feel like Docker — and it would give us a foundation that we knew we could develop. At this point, we’re pretty good at developing SDC-based services (microservices FTW!), and progress on the service was quick. Yes, there were some thorny issues to resolve (and definitely note differences between our behavior and the stock Docker behavior!), but broadly speaking we have been able to get it to work without violating the principle of least surprise. And from a Docker developer perspective, having a Docker host that represents an entire datacenter — that is, a (seemingly) galactic Docker host — feels like an important step forward. (Many are as excited by this work as we are, but I think my favorite reaction is the back-handed compliment from Jeff Waugh of Canonical fame; somehow a compliment that is tied to an insult feels indisputably earnest.)
With everything coming together, and with new hardware being stood up for the new service, there was one important task left: we needed to name this thing. (Somehow, “SmartOS + LX-branded zones + SmartDataCenter + sdc-portolan + sdc-docker” was a bit of a mouthful.) As we thought about names, I turned back to Solomon’s words a year ago: if this represented the best of two different worlds, what mythical creatures were combinations of different animals? While this search yielded many fantastic concoctions (a favorite being Manticore — and definitely don’t mess with Typhon!), there was one that stood out: Triton, son of Poseidon. As half-human and half-fish and a god of the deep, Triton represents the combination of two similar but different worlds — and as a bonus, the name rolls off the tongue and fits nicely with the marine metaphor that Docker has pioneered.
So it gives me great pleasure to introduce Triton to the world — a piece of (open source!) engineering brought to you by a cast of thousands, over the course of decades. In a sentence (albeit a wordy one), Triton lets you run secure Linux containers directly on bare metal via an elastic Docker host that offers tightly integrated software-defined networking. The service is live, so if you want to check it out, sign up! If you’re looking for more technical details, check out both Casey’s blog entry and my Future of Docker in Production presentation. If you’d like it on-prem, get in touch. And if you’d prefer to DIY, start with sdc-docker. Finally, forgive me one shameless plug: if you happen to be in the New York City area in early April, be sure to join us at the Container Summit, where we’ll hear perspectives from analysts like Gartner, enterprise users of containers like Lucera and Walmart, and key Docker community members like Tutum, Shopify, and Docker themselves. Should make for an interesting afternoon!
Welcome to Triton — and to the best of all worlds!
SmartDataCenter and the merits of being opinionated
Recently, Randy Bias of EMC (formerly of CloudScaling) wrote an excellent piece on Why “Vanilla OpenStack” doesn’t exist and never will. If you haven’t read it and you are anywhere near a private cloud effort, you should consider it a must-read: Randy debunks the myth of a vanilla OpenStack in great detail. And it apparently does need debunking; as Randy outlines, those who are deploying an on-premises cloud expect:
- A uniform, monolithic cloud operating system (like Linux)
- Set of well-integrated and interoperable components
- Interoperability with their own vendors of choice in hardware, software, and public cloud
We at Joyent can vouch for these expectations, because years ago we had the same aspirations for our own public cloud. Though perhaps unlike others, we have also believed in the operating system as differentiator — and specifically, that OS containers are the foundation of elastic infrastructure — so we didn’t wait for a system to emerge, but rather endeavored to write our own. That is, given the foundation of our own container-based operating system — SmartOS — we set out to build exactly what Randy describes: a set of well-integrated, interoperable components on top of a uniform, monolithic cloud operating system that would allow us to leverage the economics of commodity hardware. This became SmartDataCenter, a container-centric distributed system upon which we built our own cloud and which we open sourced this past November.
The difference between SmartDataCenter and OpenStack mirrors the difference between the expectations for OpenStack and the reality that Randy outlines: where OpenStack is accommodating of many different visions for the cloud, SmartDataCenter is deliberately opinionated. In SmartDataCenter you don’t pick the storage substrate (it’s ZFS) or the hypervisor (it’s SmartOS) or the network virtualization (it’s Crossbow). While OpenStack deliberately accommodates swapping in different architectural models, SmartDataCenter deliberately rejects it: we designed it for commodity storage (shared-nothing — for good reason), commodity network equipment (no proprietary SDN) and (certainly) commodity compute. So while we’re agnostic with respect to hardware (as long as it’s x86-based and Ethernet-based), we are prescriptivist with respect to the software foundation that runs upon it. The upshot is that the integrator/operator retains control over hardware (and the different economic tradeoffs that that control allows), but needn’t otherwise design the system themselves — which we know from experience can result in greatly reduced times of deployment. (Indeed, one of the great prides of SmartDataCenter is our ease of install: provided you’re racked, stacked and cabled, you can get a cloud stood up in a matter of hours rather than days, weeks or longer.)
So in apparent contrast to OpenStack, SmartDataCenter only comes in “vanilla” (in Randy’s parlance). This is not to say that SmartDataCenter is in any way plain; to the contrary, by having such a deliberately designed foundation, we can unlock rapid innovation, viz. our emerging Docker integration with SmartDataCenter that allows for Docker containers to be deployed securely and directly on the metal. We are very excited about the prospects of Docker on SmartDataCenter, and so are other people. So in as much as SmartDataCenter is vanilla, it definitely comes with whipped cream and a cherry on top!
Predicteria 2015
Fifteen years ago, I initiated a time-honored tradition among my colleagues in kernel development at Sun: shortly after the first of every year, we would get together at our favorite local restaurant to form predictions for the coming year. We made one-year, three-year and six-year predictions for both our technologies and more broadly for the industry. We did this for nine years running — from 2000 to 2008 inclusive — and came to know the annual ritual as a play on the restaurant name: Predicteria.
I have always been in interested in our past notions of the future (hoverboards and self-lacing shoes FTW!), and looking back now at nearly a decade of our predictions has led me to an inescapable (and perhaps obvious) conclusion: predictions tell you more about the present than the future. That is, predictions reflect the zeitgeist of the day — both in substance and in tone: in good years, people predict halcyon days; in bad ones, the apocalypse. And when a particular company or technology happened to be in the news or otherwise on the collective mind, predictions tended to be centered around it: it was often the case that several people would predict that a certain company would be acquired or that a certain technology would flourish — or perish. (Let the record reflect that the demise of Itanium was accurately predicted many times over.)
Which is not to say that we never made gutsy predictions; in 2006, a colleague made a one-year prediction that “GOOG embarrassed by revelation of unauthorized US government spying at Gmail.” The timing may have been off, but the concern was disturbingly prescient. Sometimes the predictions were right, but for the wrong reasons: in 2003, one of my three-year predictions was that “Apple develops new ‘must-have’ gadget called the iPhone, a digital camera/MP3 player/cell phone.” This turned out to be stunningly accurate, even down to the timing (and it was by far my most accurate big prediction over the years), but if you can’t tell by the snide tone, I thought that such a thing would be Glass-like in its ludicrousness; I had not an inkling as to its transformative power. (And indeed, when the iPhone did in fact emerge a few years later, several at Predicteria predicted that it would be a disappointing flop.)
But accurate predictions were the exception, not the rule; our predictions were usually wrong — often wildly so. Evergreen wildly wrong predictions included: the rise of carbon nanotube-based memory, the relevance of quantum computing, and the death of tape, disk or volatile DRAM (each predicted several times over). We were also wrong by our omissions: as a group, we entirely failed to predict cloud computing — or even the rise of hardware-based virtualization.
I give all of this as a backdrop to some predictions for the coming year. If my experience taught me anything, it’s that these predictions may very well be right on trajectory, but wrong on timing — and that they may well capture current thinking more than they meaningfully predict the future. They also may be (rightfully) criticized for, as they say, talking our book — but we have made our bets based on where we think things are going, not vice versa. And finally, I apologize that these are somewhat milquetoast predictions; I’m afraid that practical concerns muffle the gutsy predictions that name names and boldly predict their fates!
Without further ado, looking forward to 2015:
- 2015 is the year of the container. If you’ve read our 2014 in review, you won’t be at all surprised to read this and I don’t really think that it’s controversial. Thanks to Docker, the world is finally figuring out that OS-based virtualization is actually highly disruptive (better performance and lower cost!), and I think that this realization will become mainstream in 2015. I don’t think that Docker will necessarily be the only substrate, but I believe that its momentum is such that it will continue to dominate in 2015.
- The impedance mismatch between containers in development and containers in production will be a wellspring of innovation. Currently, containers have a ton of developer enthusiasm — but limited production deployments, in part because of production concerns around security, persistence and network virtualization. All of these problems have multiple solutions — and there will be still more in 2015 (not least from us). It’s hard to predict winners (or if winners will even emerge in 2015), but it’s a sure bet that there will be (many) players tackling the problems in interesting ways.
- The on-prem versus public cloud debate will be increasingly seen as a false dichotomy. As I outlined at GigaOm Structure in June, the future is heterogenous: the public versus on-prem distinction will be be a rent-versus-buy decision based on economics, risk management and latency. Entities will not have one answer here, but many: one is not going to vanquish the other. That said, we also believe that choosing one versus the other shouldn’t involve having to change the underlying technology substrate — a major factor for us when we open sourced SmartDataCenter in November.
- The internet-of-things broadens beyond mere trend. I continue to believe that IoT is an absolute monster of a shift — one that will have profound changes for nearly every economic endeavor. As Marc Andreessen famously observed, “software is eating the world” — and ubiquitous computing will increasingly be the buffet. At Velocity 2014, I described architecting for the deluge of data, and I think that the arrival of machine-generated data (and the analytics to turn that data into business decisions) will be a defining characteristic of the next decade(s) — part of why we open sourced Manta when we opened SmartDataCenter. To what degree this happens in 2015 is unclear, but at some point we will stop thinking of IoT as a “trend” and realize that it is simply part of the rising tide of the information society.
- Wearable computing fizzles. If wearable computing ever has a time, it isn’t now: devices out there today are gimmicky, brittle, and — to put it euphemistically — overly flashy. These devices will continue to enjoy some life among the one-percenter Sharper Image set, but they won’t actually make the leap to broad consumer devices. (An exception may be the smart onesie; speaking from my own experience, first-time parents are suckers for anything that seems to relate to infant safety.)
- Security concerns become focused on intrusion and fraud detection. If the Target breach didn’t wake people up, the Sony Pictures fracas sure as hell did. As we head into 2015, it is being more broadly recognized that security is not a public vs. on-prem issue (after all, the biggest breaches have been on-prem) — and that disgruntled insiders may represent the greatest single threat to an entity. Security (and specifically, intrusion and fraud detection) will be viewed increasingly as a real-time big data problem — one that could save a business from an existential threat.
- Cryptocurrencies begin their long journey to laughingstock. In 2015, cryptocurrencies will stop being viewed as having “one bad year” — and start being viewed for what they are: modern-day Beanie Babies for ultra-libertarian math nerds. It’s stunning to me how many otherwise intelligent people fell for cryptocurrency, having forgotten (or failed to understand in the first place) what a currency actually is: a medium for exchange and a store of value. As such, currencies that experience acute inflation or (worse) acute deflation tautologically fail. We actually have a lot of (willfully ignored) macroeconomic history that tells us this; central banking was only invented after we had had disastrous experiences with every other way of managing currency (viz. the Panic of 1837). And this isn’t just like, my opinion, man: no one purchasing or selling goods is going to opt into a currency that introduces volatility risk in an orthogonal (and often low margin) commercial transaction. (Indeed, in the you-can’t-make-this-up department, Bitcoin has become too volatile to be reliably used for ransom.) Of course, proponents are now saying that “the value isn’t in the currency but in the blockchain application stack” — which is like saying the value of Beanie Babies is in their soft and cuddly nature: yes the value of Punchers the Lobster is non-zero, but it doesn’t mean that it’s actually worth 2,600 clams.
Right or wrong, these predictions point to an exciting 2015. And if nothing else you can rely on my for a candid self-assessment of my predictions — you’ll just need to wait fifteen years or so!
2014 in review: Docker rising
When looking back on 2014 from an infrastructure perspective, it’s hard not to have one word on the lips: Docker. (Or, as we are wont to do in Silicon Valley when a technology is particularly hot, have the same word on the lips three times over à la Gabbo: “Docker, Docker, DOCKER!”) While Docker has existed since 2013, 2014 was indisputably the year in which it transcended from an interesting project to a transformative technology — a shift which had profound ramifications for us at Joyent.
The enthusiasm for Docker has been invigorating: it validates Joyent’s core hypothesis that OS-based virtualization is the infrastructure substrate of the future. That said, going into 2014, there was also a clear impedance mismatch: while Docker was refreshingly open to being cross-platform, the reality is that it was being deployed exclusively on Linux — and that the budding encyclopedia of Docker images was exclusively Linux-based. Our operating system, SmartOS, is an illumos derivative that it many ways is similar to Linux (they’re both essentially Unix, after all), but it’s also different enough to be an impediment. So the arrival of Docker in 2013 left us headed into 2014 with a kind of dilemma: how can we enable Docker on our proven SmartOS-based substrate for OS containers while still allowing existing Linux-based images to function?
Into this quandary came a happy accident: David Mackay, an illumos community member, revived lx branded zones, work that had been explored some number of years ago to execute complete Linux binary environments in an illumos zone. This work was so old that, frankly, we didn’t feel it was likely to be salvageable — but we were pleasantly surprised when it seemed to still function for some modern binaries. (If it needs to be said, this is yet another example of why we so fervently believe in open source: it allows for others to explore ideas that may seem too radical for commercial entities with mortgages to pay and mouths to feed.)
Energized by the community, Joyent engineer Jerry Jelinek went to work in the spring, bolstering the emulation layer and getting it to work with progressively more and more modern Linux systems. By late summer, 32-bit was working remarkably well on Ubuntu 14.04 (an odyssey that I detailed in my illumos day Surge presentation) and we were ready to make an attempt at the summit: 64-bit Linux emulation. Like much bringup work, the 64-bit work was excruciating because it was very hard to forecast: you can be one bug away from a functioning system or a hundred — and the only way to really know is to grind through them all. Fortunately, we are nothing if not persistent, and by late fall we had 64-bit working on most stuff — and thanks to early adopter community members like Jorge Schrauwen, we were able to quickly find increasingly obscure software to validate it against. (Notes to self: (1) “Cabal hell” is a thing and (2) I bet HHVM is unaware of the implicit dependency they have on Linux address space layout.)
With the LX branded zone work looking very promising, Joyent engineer Josh Wilsdon led a team studying Docker to determine the best way to implement it on SmartOS for our orchestration software, SmartDataCenter. In doing this, we learned about a great Docker strength: its remote API. This API allows us to do exactly what robust APIs have allowed us to do for time immemorial: replace one implementation with a different one without breaking upstack software. Implementing a Docker API endpoint would also allow for a datacenter-wide Docker view that would solve many other problems for us as well; in late autumn, we set out building sdc-docker, a Docker engine for SDC that we have been developing in the open. As with the LX branded zone work, we are far enough along to validate the approach: we know that we can make this work.
In parallel to these two bodies of work, a third group of Joyent engineers led by Robert Mustacchi was tackling a long-standing problem: extending the infrastructure present in SmartOS for robust (and secure!) network virtualization for OS containers to the formation of virtual layer two networks that can span an entire datacenter (that is, finally breaking the shackles of .1q VLANs). We have wanted to do this for quite some time, but the rise of Docker has given this work a new urgency: of the Linux problems with respect to OS-based containers, network virtualization is clearly among the most acute — and we have heard over and over again that it has become an impediment to Docker in production. Robert and team have made great progress and by the end of 2014 had the first signs of life from the SDC integration point for this work.
The SmartDataCenter-based aspects of our Docker and network virtualization work embody an important point of distinction: while OpenStack has been accused of being “a software particle-board designed by committee”, SDC has been deliberately engineered based on our experience actually running a public cloud at scale. That said, OpenStack has had one (and arguably, only one) historic advantage: it is open source. While the components of SDC (namely, SmartOS and node.js) have been open, SDC itself was not. The rise of Docker — and the clear need for an open, container-based stack instead of some committee-designed VMware retread — allowed us to summon the organizational will to take an essential leap: on November 6th, we open sourced SDC and Manta.
Speaking of Manta: with respect to containers, Joyent has been living in the future (which, in case it sounds awesome, is actually very difficult; being ahead of the vanguard is a decidedly mixed blessing). If the broader world is finally understanding the merits of OS-based virtualization with respect to standing compute, it still hasn’t figured out that it has profound ramifications for scale-out storage. However, with the rise of Docker in 2014, we have more confidence than ever that this understanding will come in due time — and by open sourcing Manta we hope to accelerate it. (And certainly, you can imagine that we’ll help connect the dots by allowing Manta jobs to be phrased as Docker containers in 2015.)
Add it all up — the enthusiasm for Docker, the great progress of the LX-branded zone work, the Docker engine for SDC, the first-class network virtualization that we’re building into the system — and then give it the kicker of an entirely open source SmartDataCenter and Manta, and you can see that it’s been a hell of a 2014 for us. Indeed, it’s been a hell of a 2014 for the entire Docker community, and we believe that Matt Asay got it exactly right when he wrote that “Docker, hot as it was in 2014, will be even hotter in 2015.”
So here’s to a hot 2014 — and even hotter 2015!
SmartDataCenter and Manta are now open source
Today we are announcing that we are open sourcing the two systems at the heart of our business: SmartDataCenter and the Manta object storage platform. SmartDataCenter is the container-based orchestration software that runs the Joyent public cloud; we have used it for the better half of a decade to run on-the-metal OS containers — securely and at scale. Manta is our multi-tenant ZFS-based object storage platform that provides first-class compute by allowing OS containers to be spun up directly upon objects — effecting arbitrary computation at scale without data movement. The unifying technological foundation beneath both SmartDataCenter and Manta is OS-based virtualization, a technology that Joyent pioneered in the cloud way back in 2006. We have long known the transformative power of OS containers, so it has been both exciting and validating for us to see the rise of Docker and the broadening of appreciation for OS-based virtualization. SmartDataCenter and Manta show that containers aren’t merely a fad or developer plaything but rather a fundamental technological advance that represents the foundation for the next generation of computing — and we believe that open sourcing them advances the adoption of container-based architectures more broadly.
Without any further ado — and to assure that we don’t fall into the most prominent of my own corporate open source anti-patterns — here is the source for SmartDataCenter and the source for Manta. These are sophisticated systems with many moving parts, and you’ll see that these two repositories are in fact meta-repositories that explain the design of each of the systems and then point to the (many) components that comprise them (all now open source, natch). We believe that some of these subcomponents will likely find use entirely outside of SDC and Manta. For example, Manatee is a ZooKeeper-based system that manages Postgres replication and automates failover; Moray is a key-value service that lives on top of Postgres. Taken together, Manatee and Moray implement a highly-available key-value service that we use as the foundation for many other components in SDC and Manta — and one that we think others will find useful as well.
In terms of source code mechanics, you’ll see that many of the components are implemented in either node.js or by extending C-based systems. This is not by fiat but rather by the choices of individual engineers; over the past four years, as we learned about the nuances of node.js error handling and as we invested heavily in tooling for running node.js in production, node.js became the right tool for many of our jobs — and we used it for many of the services that constitute SDC and Manta.
And because any conversation about open source has to address licensing at some point or another, let’s get that out of the way: we opted for the Mozilla Public License 2.0. While relatively new, there is a lot to like about this license: its file-based copyleft allows it to be proprietary-friendly while also forcing certain kinds of derived work to be contributed back; its explicit patent license discourages litigation, offering some measure of troll protection; its explicit warranting of original work obviates the need for a contributor license agreement (we’re not so into CLAs); and (best of all, in my opinion), it has been explicitly designed to co-exist with other open source licenses in larger derived works. Mozilla did terrific work on MPL 2.0, and we hope to see it adopted by other companies that share our thinking around open source!
In terms of the business ramifications, at Joyent we have long been believers in open source as a business model; as the leaders of the node.js and SmartOS projects, we have seen the power of open source to start new conversations, open up new markets and (importantly) yield new customers. Ten years ago, I wrote that open source is “a loss leader — minus the loss, of course”; after a decade of experience with open source business models, I would add that open source also serves as sales outreach without cold calls, as a channel without loss of margin, and as a marketing campaign without advertisements. But while we have directly experienced the business advantages of open source, we at Joyent have also lived something of a dual life: node.js and SmartOS have been open source, but the distributed systems that we have built using these core technologies have remained largely behind our walls. So that these systems are now open source does not change the fundamentals of our business model: if you would like to consume SmartDataCenter or Manta as a service, you can spin up an instance on the public cloud or use our Manta storage service. Similarly, if you want a support contract and/or professional services to run either SmartDataCenter or Manta on-premises, we’ll sell them to you. Based on our past experiences with open source, we do know that there will be one important change: these technologies will find their way into the hands of those that we have no other way of reaching — and that some fraction of these will become customers. Also based on past experience, we know that some (presumably much smaller) fraction of these new technologists will — by merits of their interest in and contributions to these projects — one day join us as engineers at Joyent. Bluntly, open source is our farm system, and broadening our hiring channel during a blazingly hot market for software talent is playing no small role in our decision here. In short, this is not an act of altruism: it is a business decision — if a multifaceted one that we believe has benefits beyond the balance sheet.
Welcome to open source SDC and Manta — and long-live the container revolution!
Broadening node.js contributions
Several years ago, I gave a presentation on corporate open source anti-patterns. Several of my anti-patterns were clear and unequivocal (e.g., don’t announce that you’re open sourcing something without making the source code available, dummy!), but others were more complicated. One of the more nuanced anti-patterns was around copyright assignment and contributor license agreements: while I believe these constructs to be well-intended (namely, to preserve relicensing options for the open source project and to protect that project from third-party claims of copyright and patent infringement), I believe that they are not without significant risks with respect to the health of the community. Even at their very best, CLAs and copyright assignments act as a drag on contributions as new corporate contributors are forced to seek out their legal department — which seems like asking people to go to the dentist before their pull request can be considered. And that’s the very best case; at worst, these agreements and assignments grant a corporate entity (or, as I have personally learned the hard way, its acquirer) the latitude for gross misbehavior. Because this very worst scenario had burned us in the illumos community, illumos has been without CLA and copyright assignment since its inception: as with Linux, contributors hold copyright to their own contributions and agree to license it under the prevailing terms of the source base. Further, we at Joyent have also adopted this approach in the many open source components we develop in the node.js ecosystem: like many (most?) GitHub-hosted projects, there is no CLA or copyright assignment for node-bunyan, node-restify, ldap.js, node-vasync, etc. But while many Joyent-led projects have been without copyright assignment and CLA, one very significant Joyent-led project has had a CLA: node.js itself.
While node.js is a Joyent-led project, I also believe that communities must make their own decisions — and a CLA is a sufficiently nuanced issue that reasonable people can disagree on its ultimate merits. That is, despite my own views on a CLA, I have viewed the responsibility for the CLA as residing with the node.js leadership team, not with me. The upshot has been that the node.js status quo of a CLA (one essentially inherited from Google’s CLA for V8) has remained in place for several years.
Given this background you can imagine that I found it very heartwarming that when node.js core lead TJ Fontaine returned from his recent Node on the Road tour, one of the conclusions he came to was that the CLA had outlived its usefulness — and that we should simply obliterate it. I am pleased to announce that today, we are doing just that: we have eliminated the CLA for node.js. Doing this lowers the barrier to entry for node.js contributors thereby broadening the contributor base. It also brings node.js in line with other projects that Joyent leads and (not unimportantly!) assures that we ourselves are not falling into corporate open source anti-patterns!
From VP of Engineering to CTO
If you search for “cto vs. vp of engineering”, one of the top hits is a presentation that I gave with Jason Hoffman at Monki Gras 2012. Aside from some exceptionally apt clip art, the crux of our talk was that these two roles should not be thought of as caricatures (e.g. the CTO as a silver tongue with grand vision but lacking practical know-how and the VP of Engineering as a technocrat who makes the TPS reports run on time), but rather as a team that together leads a company’s technical endeavors. Yes, one is more outward- and future-looking and the other more team- and product-focused — but if the difference becomes too stark (that is, if the CTO and VP of Engineering can’t fill in for one another in a pinch) there may be a deeper cultural divide between vision and execution. As such, the CTO and the VP of Engineering must themselves represent the balance present in every successful engineer: they must be able to both together understand the world as it is — and envision the world as it could be.
This presentation has been on my mind recently because today my role at Joyent is changing: I am transitioning from VP of Engineering to CTO, and Mark Cavage is taking on the role of VP of Engineering. For me, this is an invigorating change in a couple of dimensions. First and foremost, I am excited to be working together with Mark in a formalized leadership capacity. The vitality of the CTO/VP of Engineering dynamic stems from the duo’s ability to function as a team, and I believe that Mark and I will be an effective one in this regard. (And Mark apparently forgives me for cussing him out when he conceived of what became Manta.)
Secondly, I am looking forward to talking to customers a bit more. Joyent is in a terrific position in that our vision for cloud computing is not mere rhetoric, but actual running service and shipping product. We are uniquely differentiated by the four technical pillars of our stack: SmartOS, node.js, SmartDataCenter and — as newly introduced last year — our revolutionary Manta storage service. These are each deep technologies in their own right, and especially at their intersections, they unlock capabilities that the market wants and needs — and our challenge now is as much communicating what we’ve done (and why we’ve done it) as it is continuing to execute. So while I have always engaged directly with customers, the new role will likely mean more time on planes and trains as I visit more customers (and prospective customers) to better understand how our technologies can help them solve their thorniest problems.
Finally, I am looking forward to the single most important role of the CTO: establishing the broader purpose of our technical endeavor. This purpose becomes the root of a company’s culture, as culture without purpose is mere costume. For Joyent and Joyeurs our purpose is simple: we’re here to change computing. As I mentioned in my Surge 2013 talk on technical leadership (video), superlative technologists are drawn to mission, team and problem — and in Joyent's case, the mission of changing computing (and the courage to tackle whatever problems that entails) has attracted an exceptionally strong team that I consider myself blessed to call my peers. I consider it a great honor to be Joyent's CTO, and I look forward to working with Mark and the team to continue to — in Steve Jobs' famous words — kick dents in the universe!
agghist, aggzoom and aggpack
As I have often remarked, DTrace is a workhorse, not a show horse: the features that we have added to DTrace over the years come not from theoretical notions but rather from actual needs in production environments. This is as true now as it was a decade ago, and even in core abstractions, extensive use of DTrace seems to give rise to new ways of thinking about them. In particular, Brendan recently had a couple of feature ideas for aggregation processing that have turned out to be really interesting…
agghist
When one performs a count() or sum() aggregating action, the result is a table that consists of values and numbers, e.g.:
# dtrace -n syscall:::entry'{@[execname] = count()}'
dtrace: description 'syscall:::entry' matched 233 probes
^C
utmpd 2
metadata 5
pickup 14
ur-agent 18
epmd 20
fmd 20
jsontool.js 33
vmadmd 33
master 41
mdnsd 41
intrd 45
devfsadm 52
bridged 81
mkdir 83
ipmgmtd 95
cat 96
ls 100
truss 101
ipmon 106
sendmail 111
dirname 115
svc.startd 125
svc.configd 153
ksh93 164
zoneadmd 165
svcprop 258
sshd 262
bash 291
zpool 436
date 448
cron 470
lldpd 584
dump-minutely-sd 611
haproxy 760
nscd 1275
zfs 1966
zoneadm 2414
java 2916
tail 3750
postgres 6702
redis-server 21283
dtrace 28308
node 39836
beam.smp 55940
Brendan’s observation was that it would be neat to (optionally) use the histogram-style aggregation printing with these count()/sum() aggregations to be able to more easily differentiate values. For that, we have the new “-x agghist” option:
# dtrace -n syscall:::entry'{@[execname] = count()}' -x agghist
dtrace: description 'syscall:::entry' matched 233 probes
^C
key ------------- Distribution ------------- count
utmpd | 2
metadata | 5
pickup | 14
epmd | 20
fmd | 23
ur-agent | 25
init | 27
mdnsd | 30
jsontool.js | 33
vmadmd | 36
master | 41
intrd | 45
devfsadm | 52
bridged | 78
mkdir | 83
cat | 96
ipmgmtd | 97
ls | 100
sendmail | 101
ipmon | 102
dirname | 115
svc.startd | 125
truss | 140
svc.configd | 153
ksh93 | 164
sshd | 248
svcprop | 258
bash | 290
zoneadmd | 387
zpool | 436
date | 448
cron | 470
lldpd | 562
dump-minutely-sd | 615
haproxy | 622
nscd | 808
zfs | 1966
zoneadm |@ 2414
java |@ 3003
tail |@ 3607
postgres |@ 5728
redis-server |@@@@@ 20611
dtrace |@@@@@@ 26966
node |@@@@@@@@@@ 39825
beam.smp |@@@@@@@@@@@@@ 55942
It’s obviously the same information, but presented with a quicker visual cue as to the distribution. Now, one may well note that this output has a lot of dead whitespace in it — read on.
aggzoom
The DTrace histogram-style output came directly from lockstat, which proceeded it by several years. lockstat was important for DTrace in several ways: aside from showing the power of production instrumentation, lockstat pointed to the need for first-class aggregations, for disjoint instrumentation providers and for multiplexed consumers (for which it was repaid by having its guts ripped out, being reimplemented as both a DTrace provider and a DTrace consumer). Due to some combination of my laziness and my reverence for the lockstat-style histogram, I simply lifted the lockstat processing code — which means I inherited (or stole, depending on your perspective) the decisions Jeff had made with regard to histogram rendering. In particular, lockstat determines the height of a bar by taking the bucket’s share of the total and multiplying it by the full height of the histogram. That is, if you add up all of the heights of all of the bars, they will add up to the full height of the histogram. The benefit of this is that it quickly tells you relative distribution: if you see a full-height bar, you know that that bucket represents essentially all of the values. But the problem is that if the number of buckets is large and/or the values of those buckets are relatively evenly distributed, the result is a bunch of very short bars (or worse, zero height bars) and dead whitespace. This can become so annoying to DTrace users that Brendan has been known to observe that this is the one improvement over DTrace that he can point to in SystemTap: that instead of dividing the height of the histogram among all of the bars, each bar has a height in proportion to the histogram height that is its value in proportion to the bucket of greatest value. Of course, the shape of the distribution doesn’t change — one simply is automatically scaling the height of the highest bar to the height of the histogram and adjusting all other bars accordingly.
To allow for this behavior, I added “-x aggzoom“; running the same example as above with this new option:
# dtrace -n syscall:::entry'{@[execname] = count()}' -x agghist -x aggzoom
dtrace: description 'syscall:::entry' matched 233 probes
^C
key ------------- Distribution ------------- count
utmpd | 2
metadata | 7
pickup | 14
ur-agent | 20
fmd | 25
epmd | 26
jsontool.js | 33
mdnsd | 39
master | 41
vmadmd | 44
rsyslogd | 57
devfsadm | 66
mkdir | 83
intrd | 90
cat | 96
bridged | 99
ls | 100
dirname | 115
truss | 125
ipmon | 130
sendmail | 131
ipmgmtd | 133
ksh93 | 164
svc.startd | 179
init | 189
sshd | 253
svcprop | 258
bash | 283
zpool | 436
date | 448
cron | 470
dump-minutely-sd | 611
lldpd | 716
svc.configd |@ 939
haproxy |@ 1097
zoneadmd |@ 1455
zfs |@ 1966
zoneadm |@@ 2414
nscd |@@ 2867
java |@@@ 4277
tail |@@@ 4579
postgres |@@@@@@ 9389
redis-server |@@@@@@@@@@@@@@@@@@ 26177
dtrace |@@@@@@@@@@@@@@@@@@@@@@@ 34530
node |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 48151
beam.smp |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 55940
It’s zoomtastic!
aggpack
If you follow Brendan’s blog, you know that he’s always experimenting with ways of visually communicating a maximal amount of system data. One recent visualization that has been particularly interesting are his frequency trails — a kind of stacked sparkline. Thinking about density of presentation sparked Brendan to observe that he often needs to try to visually correlate multiple quantize()/lquantize() aggregation keys, e.g. this classic DTrace “one-liner” to show the distribution of system call time (in nanoseconds) by system call:
# dtrace -n syscall:::entry'{self->ts = timestamp}' \
-n syscall:::return'/self->ts/{@[probefunc] = \
quantize(timestamp - self->ts); self->ts = 0}'
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C
sigpending
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
2048 | 0
llseek
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@@@@@@@@@ 1
1024 |@@@@@@@@@@@@@@@@@@@@ 1
2048 | 0
fstat
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
4096 | 0
setcontext
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
4096 | 0
fstat64
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@@@@@@@@@ 2
1024 |@@@@@@@@@@@@@@@@@@@@ 2
2048 | 0
sigaction
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@@@@@@@@@ 2
1024 |@@@@@@@@@@@@@@@@@@@@ 2
2048 | 0
getuid
value ------------- Distribution ------------- count
128 | 0
256 |@@@@@@@@@@@@@@@@@@@@ 1
512 | 0
1024 | 0
2048 | 0
4096 |@@@@@@@@@@@@@@@@@@@@ 1
8192 | 0
accept
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@ 1
4096 |@@@@@@@@@@@@@@@@@@@@ 1
8192 | 0
sysconfig
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@@ 2
1024 |@@@@@@@@@@@@@@@@@@@@ 3
2048 |@@@@@@@ 1
4096 | 0
bind
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
8192 | 0
lwp_sigmask
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 11
1024 |@@@ 1
2048 |@@@ 1
4096 | 0
recvfrom
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5
4096 | 0
setsockopt
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@ 1
2048 |@@@@@@@@@@ 1
4096 |@@@@@@@@@@@@@@@@@@@@ 2
8192 | 0
fcntl
value ------------- Distribution ------------- count
128 | 0
256 |@@@ 1
512 |@@@@@@@@@@@@@@@@@@@@ 8
1024 |@@@@@@@@@@@@@@@ 6
2048 |@@@ 1
4096 | 0
systeminfo
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@ 3
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@ 5
4096 | 0
schedctl
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
32768 | 0
getsockopt
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 11
2048 |@@@@@@@@@ 3
4096 | 0
stat
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
4096 | 0
8192 |@@@@@@@@@@@@@ 2
16384 | 0
recvmsg
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 15
4096 |@@@@ 2
8192 |@@ 1
16384 | 0
connect
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
8192 | 0
16384 |@@@@@@@@@@@@@ 2
32768 | 0
brk
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@ 2
2048 |@@@@@@@@@@@@@@@ 3
4096 | 0
8192 |@@@@@@@@@@ 2
16384 | 0
32768 |@@@@@ 1
65536 | 0
so_socket
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@@@@@@@@@@ 5
4096 | 0
8192 |@@@@@@@@@ 2
16384 |@@@@@@@@@ 2
32768 | 0
mmap
value ------------- Distribution ------------- count
32768 | 0
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
131072 | 0
putmsg
value ------------- Distribution ------------- count
32768 | 0
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
131072 | 0
p_online
value ------------- Distribution ------------- count
128 | 0
256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 250
512 |@ 5
1024 | 0
2048 | 1
4096 | 0
getpid
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@ 9
2048 |@@@@@@@@@@@@@@@@@@@@@@@ 21
4096 |@@ 2
8192 |@@@@ 4
16384 | 0
close
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@ 13
2048 |@@@@@@@ 5
4096 |@@@@@@@ 5
8192 |@@@ 2
16384 |@@@@ 3
32768 | 0
lseek
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@ 22
2048 |@@@@@@@@@@@@@@@@@@@@@@@ 37
4096 |@@ 3
8192 |@ 2
16384 | 0
open
value ------------- Distribution ------------- count
1024 | 0
2048 |@@@@@@@@@@@@@ 6
4096 |@@@@@@@ 3
8192 |@@@@@@@@@@@@@ 6
16384 |@@ 1
32768 |@@@@ 2
65536 | 0
lwp_cond_signal
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@ 2
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@ 8
32768 |@@@@@@@@@@@ 4
65536 | 0
sendmsg
value ------------- Distribution ------------- count
8192 | 0
16384 |@@@@ 1
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 6
65536 |@@@@@@@@@ 2
131072 | 0
gtime
value ------------- Distribution ------------- count
128 | 0
256 |@@@@@@@@@@@@@@@@@@@@ 576
512 |@@@@@@@@@@@@@@@@ 452
1024 |@@ 56
2048 |@ 34
4096 | 3
8192 | 8
16384 | 1
32768 | 1
65536 | 0
read
value ------------- Distribution ------------- count
256 | 0
512 | 1
1024 |@@@@@@ 19
2048 |@@@@@@@@@@ 34
4096 |@@@@@ 15
8192 |@@@@@@@@@@@@@@@@@@ 61
16384 |@ 2
32768 | 0
stat64
value ------------- Distribution ------------- count
1024 | 0
2048 |@ 2
4096 |@ 1
8192 |@@@ 5
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 52
32768 |@@@@ 7
65536 | 0
send
value ------------- Distribution ------------- count
1024 | 0
2048 |@@ 2
4096 |@ 1
8192 |@@@@@@@@@@@@@@@@@@ 22
16384 |@@@@@@@@@@ 12
32768 |@@@@@ 6
65536 |@@@ 4
131072 |@@ 3
262144 | 0
write
value ------------- Distribution ------------- count
256 | 0
512 |@@ 5
1024 |@@@@ 8
2048 | 0
4096 |@@@ 6
8192 |@@@@@@@@@@@ 23
16384 |@@@@@@@@ 16
32768 |@@@@@@@@@@ 22
65536 |@@ 4
131072 | 0
262144 | 1
524288 | 0
doorfs
value ------------- Distribution ------------- count
2048 | 0
4096 |@@@@@@@@@@@@@@@@@@@@ 1
8192 | 0
16384 | 0
32768 | 0
65536 | 0
131072 | 0
262144 | 0
524288 | 0
1048576 | 0
2097152 |@@@@@@@@@@@@@@@@@@@@ 1
4194304 | 0
yield
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@@@ 198
2048 |@ 5
4096 | 1
8192 |@@@@@@ 61
16384 |@@@@@@@@@@@@ 110
32768 | 2
65536 | 2
131072 | 0
262144 | 1
524288 | 0
recv
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@ 17
1024 |@@@@ 6
2048 |@@@@@@@@@@@@@ 19
4096 |@@@@ 6
8192 | 0
16384 | 0
32768 | 0
65536 | 0
131072 |@@ 3
262144 |@@@@ 5
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 | 0
16777216 |@ 1
33554432 | 0
nanosleep
value ------------- Distribution ------------- count
4194304 | 0
8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 201
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 2
1073741824 | 0
ioctl
value ------------- Distribution ------------- count
128 | 0
256 |@ 50
512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2157
1024 | 17
2048 | 8
4096 | 10
8192 | 13
16384 | 6
32768 | 3
65536 | 7
131072 | 0
262144 | 0
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 | 2
16777216 | 0
33554432 | 0
67108864 | 6
134217728 | 0
268435456 | 1
536870912 | 4
1073741824 | 0
lwp_cond_wait
value ------------- Distribution ------------- count
16777216 | 0
33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 46
67108864 |@@@@ 6
134217728 | 0
268435456 |@ 2
536870912 |@@@@ 6
1073741824 | 0
pollsys
value ------------- Distribution ------------- count
512 | 0
1024 |@ 2
2048 | 1
4096 | 1
8192 | 1
16384 | 0
32768 | 0
65536 |@ 2
131072 | 0
262144 |@@ 6
524288 |@ 2
1048576 |@@ 5
2097152 |@ 2
4194304 | 0
8388608 |@@@ 8
16777216 | 0
33554432 | 1
67108864 |@@@@@@@@@@@@@@@@@@ 56
134217728 |@@@@@@@@@ 28
268435456 | 0
536870912 |@@@ 8
1073741824 | 1
2147483648 | 0
lwp_park
value ------------- Distribution ------------- count
1024 | 0
2048 | 6
4096 |@ 21
8192 |@ 20
16384 |@@@@@@@@@ 134
32768 |@@@@@@@@@@ 145
65536 |@@@@@@@@@@ 153
131072 |@ 11
262144 | 3
524288 | 0
1048576 | 0
2097152 | 0
4194304 | 0
8388608 |@@@@@@ 89
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 2
268435456 | 4
536870912 |@ 18
1073741824 | 2
2147483648 | 0
portfs
value ------------- Distribution ------------- count
256 | 0
512 | 4
1024 |@@@ 41
2048 |@@@@ 44
4096 |@ 14
8192 | 2
16384 | 1
32768 | 2
65536 | 1
131072 | 3
262144 |@ 10
524288 | 3
1048576 | 6
2097152 | 3
4194304 | 0
8388608 |@@@@@@@@@@@@@@@@@@@@@@@ 272
16777216 | 1
33554432 | 1
67108864 | 4
134217728 | 2
268435456 | 5
536870912 |@@@@@ 58
1073741824 | 3
2147483648 | 1
4294967296 | 0
For those keeping score, that would be 482 lines of output — and that was for just a couple of seconds. Brendan’s observation was that it would be great to tip these aggregations on their side (so their bars point up-and-down, not left-to-right) — with the output for each key appearing on one line. This is clearly a great idea, and “-x aggpack” was born:
# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
-n syscall:::return'/self->ts/{@[probefunc] = \
quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C
key min .------------------. max | count
sigpending 8 : X : 1048576 | 1
lwp_continue 8 : X : 1048576 | 1
uucopy 8 : X : 1048576 | 1
getuid 8 : x x : 1048576 | 2
lwp_kill 8 : X : 1048576 | 1
sigaction 8 : xx x : 1048576 | 4
llseek 8 : x x : 1048576 | 2
fstat64 8 : xx : 1048576 | 4
setcontext 8 : xx : 1048576 | 2
lwp_sigmask 8 : xx _ : 1048576 | 10
systeminfo 8 : xx : 1048576 | 4
sysconfig 8 : ____x : 1048576 | 6
accept 8 : xx : 1048576 | 2
bind 8 : X : 1048576 | 2
fstat 8 : x x : 1048576 | 2
recvfrom 8 : X : 1048576 | 5
doorfs 8 : xx : 1048576 | 2
brk 8 : ___xx : 1048576 | 8
lwp_create 8 : X : 1048576 | 1
schedctl 8 : x x : 1048576 | 2
p_online 8 : X_ _ : 1048576 | 256
getpid 8 : xx_ _ : 1048576 | 28
recvmsg 8 : xx_ : 1048576 | 18
open 8 : xx x : 1048576 | 4
lseek 8 : xx _ _ : 1048576 | 9
getsockopt 8 : _xx : 1048576 | 43
stat 8 : x_x _ : 1048576 | 7
mmap 8 : X : 1048576 | 1
putmsg 8 : X : 1048576 | 1
recv 8 : _xxx__ : 1048576 | 65
gtime 8 : X_____ : 1048576 | 899
fcntl 8 : ___xx : 1048576 | 208
setsockopt 8 : __x__ : 1048576 | 55
sendmsg 8 : xxx : 1048576 | 9
lwp_cond_signal 8 : _x__x : 1048576 | 33
close 8 : __xx_ : 1048576 | 65
lwp_cond_wait 8 : _xx_ : 1048576 | 69
munmap 8 : X : 1048576 | 1
read 8 : ___x_x_ _ _ : 1048576 | 137
stat64 8 : ___X : 1048576 | 49
yield 8 : _ X___ : 1048576 | 1314
so_socket 8 : ___x_ : 1048576 | 60
nanosleep 8 : _x__ : 1048576 | 77
send 8 : ____xx__ : 1048576 | 66
pollsys 8 : _ _xx_ : 1048576 | 86
ioctl 8 : X_________ : 1048576 | 3259
connect 8 : _ _xx_ : 1048576 | 57
mmap64 8 : __x _ x : 1048576 | 14
write 8 : __ __x_____ : 1048576 | 115
portfs 8 : ______x_ : 1048576 | 619
lwp_park 8 : _xx_x_ : 1048576 | 654
This communicates essentially the same amount of information in just a fraction of the output (55 lines versus 482) — and makes it (much) easier to quickly compare distributions across aggregation keys. That said, one of the challenges here is that ASCII can be very limiting when trying to come with characters of clearly different height. When I demonstrated a prototype of this at the illumos BOF at Surge, a couple of folks volunteered that I should look into using the Unicode Block Elements for this output. Here’s the same enabling, but rendered using these elements:
# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
-n syscall:::return'/self->ts/{@[probefunc] = \
quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C
key min .-------------------. max | count
lwp_self 32 : █ : 8388608 | 1
sysconfig 32 : ▃▃▃ : 8388608 | 3
sigpending 32 : █ : 8388608 | 1
pset 32 : ▆ ▃ : 8388608 | 3
getsockname 32 : █ : 8388608 | 1
systeminfo 32 : ▅▄ : 8388608 | 5
fstat 32 : █ : 8388608 | 1
llseek 32 : ▆▃ : 8388608 | 3
memcntl 32 : █ : 8388608 | 1
resolvepath 32 : █ : 8388608 | 1
semsys 32 : █ : 8388608 | 1
setcontext 32 : █ : 8388608 | 2
setpgrp 32 : █ : 8388608 | 1
fstat64 32 : ▃▃▃ : 8388608 | 6
sigaction 32 : ▃▃▂▃ : 8388608 | 17
accept 32 : █ : 8388608 | 1
access 32 : █ : 8388608 | 1
munmap 32 : █ : 8388608 | 1
setitimer 32 : ▅▃▃ : 8388608 | 4
lwp_sigmask 32 : ▄▃▃▂ : 8388608 | 26
pipe 32 : █ : 8388608 | 1
waitsys 32 : ▅ ▅ : 8388608 | 2
stat 32 : ▃ ▆ : 8388608 | 3
bind 32 : ▅▄ : 8388608 | 5
mmapobj 32 : █ : 8388608 | 1
open64 32 : █ : 8388608 | 1
p_online 32 : █▁ ▁ : 8388608 | 256
getpid 32 : ▁▆▂ : 8388608 | 45
schedctl 32 : █ : 8388608 | 2
getsockopt 32 : ▃▅▂ : 8388608 | 32
setsockopt 32 : ▁▁▃▆ : 8388608 | 31
fcntl 32 : ▂▁▃▃▂ : 8388608 | 114
recvmsg 32 : ▁▄▃▂▁▁ : 8388608 | 17
putmsg 32 : █ : 8388608 | 1
mmap 32 : ▅ ▅ : 8388608 | 2
writev 32 : ▆ ▃ : 8388608 | 3
lseek 32 : ▁▃▃▄▁▁ : 8388608 | 76
brk 32 : ▃▂▃▃▂▁▁ ▁ : 8388608 | 124
gtime 32 : ▇▁▁▁▁▁ ▁ : 8388608 | 1405
sendmsg 32 : ▂▅▄ : 8388608 | 10
recv 32 : ▂▃▃▂▂▁ : 8388608 | 159
yield 32 : █▁▁▁ : 8388608 | 409
open 32 : ▃▂▄▁▂ : 8388608 | 30
close 32 : ▂▃▂▃▂▁ : 8388608 | 60
so_socket 32 : ▁▃▆ : 8388608 | 29
lwp_cond_signal 32 : ▁▃▂▂▁▃▁ : 8388608 | 59
nanosleep 32 : ▂▆▂ : 8388608 | 74
lwp_cond_wait 32 : ▂▃▄▂ : 8388608 | 120
connect 32 : ▂▅▃ : 8388608 | 24
stat64 32 : ▁▂▇▁ : 8388608 | 77
pollsys 32 : ▁▁▁ ▁▅▄ : 8388608 | 148
send 32 : ▁▂▂▂▂▃▂▁ : 8388608 | 137
read 32 : ▁▂▃▂▃▂▁▁▁▁ : 8388608 | 283
ioctl 32 : ▇▁▁▁▁▁▁▁▁▁▁ : 8388608 | 3697
write 32 : ▁▁▁▂▂▃▂▁▁ ▁▁ : 8388608 | 207
forksys 32 : █ : 8388608 | 1
portfs 32 : ▁▂▂▁▁▂▄▁ ▁ : 8388608 | 807
lwp_park 32 : ▁ ▁▁▁▃▃▂▃▁ : 8388608 | 919
Delicious! (Assuming, of course, that these look right for you — which they may or may not, depending on how your monospaced font renders the Unicode Block Elements.) After one look at the Unicode Block Elements, it clearly had to be the default behavior — but if your terminal is rendered in a font that can’t display the UTF-8 encodings of these characters (less common) or if they render in a way that is not monospaced despite being in a putatively monospaced font (more common), I also added a “-x encoding” option that can be set to “ascii” to force the ASCII output.
Returning to our example, if the above is too much dead space for you, you can combine it with aggzoom:
# dtrace -n syscall:::entry'{self->ts = vtimestamp}' \
-n syscall:::return'/self->ts/{@[probefunc] = \
quantize(vtimestamp - self->ts); self->ts = 0}' -x aggpack -x aggzoom
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::return' matched 233 probes
^C
key min .----------------. max | count
lwp_continue 32 : █ : 1048576 | 1
sigpending 32 : █ : 1048576 | 1
uucopy 32 : █ : 1048576 | 1
lwp_kill 32 : █ : 1048576 | 1
sigaction 32 : ▄▄ █ : 1048576 | 4
pset 32 : █ ▄ : 1048576 | 3
sysconfig 32 : ███ █ : 1048576 | 4
setcontext 32 : ██ : 1048576 | 2
fstat 32 : █ : 1048576 | 1
recvfrom 32 : █ : 1048576 | 1
systeminfo 32 : ▅█ : 1048576 | 5
lwp_sigmask 32 : ██▂▅ : 1048576 | 14
accept 32 : █ █ : 1048576 | 2
recvmsg 32 : █ █ : 1048576 | 2
stat 32 : █ : 1048576 | 2
p_online 32 : █▁ ▁ : 1048576 | 256
schedctl 32 : █ █ : 1048576 | 2
brk 32 : ▄███▄ : 1048576 | 8
getsockopt 32 : ▅██ : 1048576 | 21
getpid 32 : █▇ : 1048576 | 39
setsockopt 32 : ▃█▂ : 1048576 | 16
lwp_create 32 : █ : 1048576 | 1
sendmsg 32 : █ : 1048576 | 1
fcntl 32 : ▃▄▃▆█▁ : 1048576 | 60
writev 32 : █ : 1048576 | 1
lseek 32 : ▆▇█▂ : 1048576 | 65
recv 32 : ▁▆▃█▆▅ : 1048576 | 62
putmsg 32 : █ : 1048576 | 2
open 32 : █ ▅▂▃ : 1048576 | 16
gtime 32 : █▁▁▁▁ ▁ : 1048576 | 1365
close 32 : ▂█▂█▄▃ : 1048576 | 36
lwp_cond_signal 32 : ▂▂█▅▅▂ : 1048576 | 18
so_socket 32 : ▁▂▁█ : 1048576 | 19
mmap 32 : ▆▃ █▄ ▆ : 1048576 | 13
nanosleep 32 : ▃█▃ : 1048576 | 34
read 32 : ▁▂▅▄█▅▂ : 1048576 | 144
lwp_cond_wait 32 : ▃█▂▁ : 1048576 | 74
yield 32 : █▁▁▁ ▁ : 1048576 | 1187
send 32 : ▁▂▅█▅▄▂ : 1048576 | 50
connect 32 : ▂▂ ▅██▃ : 1048576 | 20
stat64 32 : ▁▁▁▃█▁ : 1048576 | 80
pollsys 32 : ▁▁▁█▆ : 1048576 | 136
write 32 : ▂▄ ▃▂▄█▆▃▂ ▂ : 1048576 | 77
ioctl 32 : █▂▁▁▁▁▁▁ ▁ : 1048576 | 4587
mmap64 32 : ▂▄▄▂▅ ▂ █ : 1048576 | 20
portfs 32 : ▁▁▂▂▁▂▂█▂ : 1048576 | 601
lwp_park 32 : ▁▁▂▆▇▃█▁▁ : 1048576 | 787
Here’s another fun one:
# dtrace -n BEGIN'{start = timestamp}' \
-n sched:::on-cpu'{@[execname] = \
lquantize((timestamp - start) / 1000000000, 0, 30, 1)}' \
-x aggpack -n tick-1sec'/++i >= 30/{exit(0)}' -q
key min .--------------------------------. max | count
sshd < 0 : █ : >= 30 | 1
zpool < 0 : █ : >= 30 | 8
zfs < 0 : █ : >= 30 | 19
zoneadm < 0 : █ : >= 30 | 19
utmpd < 0 : █ : >= 30 | 1
intrd < 0 : █ : >= 30 | 2
epmd < 0 : ▂ ▂ ▂ ▂ ▂ ▂ : >= 30 | 6
mdnsd < 0 : ▂ ▂ ▂ ▂ ▂ ▂ : >= 30 | 6
sendmail < 0 : ▂ ▂ ▂ ▂ ▂ ▂ : >= 30 | 6
ur-agent < 0 : ▂▂ ▂ ▂ ▃▂ : >= 30 | 14
vmadmd < 0 : ▃ ▂ ▂ ▂ ▂ ▃ : >= 30 | 22
devfsadm < 0 : ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ : >= 30 | 30
bridged < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 30
fsflush < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 32
fmd < 0 : ▁▁ ▁▇ ▁▁ : >= 30 | 41
ipmon < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 51
ipmgmtd < 0 : ▃ ▃ ▃ : >= 30 | 66
lldpd < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 90
dtrace < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 111
haproxy < 0 : ▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁ : >= 30 | 113
svc.configd < 0 : ▂▁ ▄▁ ▂▁ : >= 30 | 126
svc.startd < 0 : ▃▁ ▃▁ ▃▁ : >= 30 | 150
beam.smp < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 209
nscd < 0 : ▁ ▃▂ ▃▂ ▃▂ : >= 30 | 192
postgres < 0 : ▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 598
redis-server < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 598
zpool-zones < 0 : ▂ ▁ ▂ ▃ ▁ ▂ ▂ ▁ ▃ : >= 30 | 623
java < 0 : ▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 1211
tail < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 1178
node < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁ : >= 30 | 11195
sched < 0 : ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ : >= 30 | 22101
This (quickly) shows you the second offset for CPU scheduling events from the start of the D script for various applications, and points to some obvious conclusions: there are clearly a bunch of programs running once every ten seconds, a couple more running once every five seconds, and so on.
encoding
Seeing this, why should aggpack hog all that Unicode hawtness? While only aggpack will use the UTF-8 encodings by default, I also extended the encoding option to allow a “utf8” setting that forces the UTF-8 encoding of the Unicode Block Elements to be used for all aggregation display. For example, to return to our earlier agghist example:
# dtrace -n syscall:::entry'{@[execname] = count()}' \
-x agghist -x aggzoom -x encoding=utf8
dtrace: description 'syscall:::entry' matched 233 probes
dtrace: description 'syscall:::entry' matched 233 probes
^C
key ------------- Distribution ------------- count
utmpd | 2
sshd | 10
pickup | 14
fmd | 20
ur-agent | 21
epmd | 22
vmadmd | 31
jsontool.js | 33
mdnsd | 33
master | 41
intrd | 45
devfsadm | 54
ipmgmtd | 55
bridged | 81
mkdir | 83
cat | 96
sendmail | 101
ipmon | 108
dirname | 115
svc.startd | 129
ksh93 | 164
ls | 182
svcprop |▏ 258
bash |▏ 290
zpool |▎ 436
date |▎ 448
cron |▎ 470
lldpd |▍ 595
dump-minutely-sd |▍ 609
haproxy |▍ 654
svc.configd |▌ 907
zoneadmd |█ 1554
zfs |█▎ 1979
zoneadm |█▋ 2414
nscd |█▋ 2522
java |██▏ 3205
tail |██▌ 3775
postgres |████▏ 6083
redis-server |██████████████▋ 21710
dtrace |███████████████████▎ 28509
node |█████████████████████████▍ 37427
beam.smp |██████████████████████████████████████ 55948
To paraphrase Kent Brockman, I for one welcome our new Unicode Block Element overlords — and I look forward to toiling in their underground sugar caves!
Availability
All of these new options have been integrated into SmartOS (and we aim to get them up into illumos). If you’re a Joyent public cloud customer, you already all have these improvements or they will be coming to you when your platform is next upgraded (depending on the compute node that you have landed on). And if you’re a DTrace user and like Brendan and have some new way that you’d like to see data visualized (or have any other DTrace feature request), don’t hesitate to speak up — DTrace only improves when those of us who depend on it every day envision a way that it could be even better!
Happy 10th Birthday, DTrace!
Ten years ago this morning — at 10:27a local time, to be exact — DTrace was integrated. On the occasion of DTrace’s fifth birthday a half-decade ago, I reflected on much of the drama of the final DTrace splashdown, but much has happend in the DTrace community in the last five years; some highlights:
- DTrace was ported to Linux — twice
- Brendan‘s DTrace book became the canonical guide to using DTrace in practice, bringing DTrace to a new generation of practitioners
- Chris Andrews lit the way for bringing USDT to dyanamic languages with his terrific libusdt, using it to bring DTrace to node.js, bring DTrace to Lua, bring DTrace to Perl and bring DTrace to Ruby.
- Dave Pacheco became a master of the black art of DTrace ustack helpers, and developed a DTrace ustack helper for node.js, allowing for Brendan Gregg‘s amazing flamegraphs for node.js programs on SmartOS. Dave’s work was both extended to 64-bit and explained beautifully by the incomparable Fedor Indutny.
- In part shamed by Joyent customers, I expanded DTrace in the non-global zone — which was embarrassingly a problem that I had previously ducked by claiming it was “technically impossible.” (If you don’t know by now, this is often correctly translated as “I really, really don’t feel like it.”)
- Eric Schrock added the print action — which saved my butt as recently as this past weekend.
- We got the DTrace community together at the first DTrace unconference — dtrace.conf(08) — and then followed that up at an Olympiad cadence with dtrace.conf(12). The International DTrace Committee will soon be selecting the host city for dtrace.conf(16), so get those bribes ready!
And a bunch of other stuff that’s either more esoteric (e.g., llquantize) or that I now use so frequently that I’ve forgotten that there was a time that we didn’t have it (-x temporal FTW!). But if I could go back to my five-year-ago self, the most surprising development to my past-self may be how smoothly DTrace and its community survived the collapse of Sun and the craven re-closing of Solaris. It’s a testament to the power of open source and open communities that DTrace has — along with its operating system of record, illumos — emerged from these events empowered and invigorated. So I would probably show my past-self my LISA talk on the rise of illumos, and he and I would both get a good chuckle out of it all — and then no doubt erupt in fisticuffs over the technical impossibility of fds[] in the non-global zone…
And lest anyone think that we’re done, the future is bright, as the El Dorado of user-land types in DTrace may be gleaming on the horizon courtesy of Adam and Robert Mustacchi. I think I speak on behalf of all of us in the broader DTrace community when I say that I’m very much looking forward to continuing to use, improve and expand DTrace for the next decade and beyond!
Serving up disaster porn with Manta
Years ago, Ben Fried liberated me by giving me the words to describe myself: I am a disaster porn addict. The sordid specifics of my addiction would surely shock the unafflicted: there are well-thumbed NTSB final accident reports hidden under my matress; I prowl the internet late at night for pictures of dermoid cysts; and I routinely binge on the vexed Soviet missions to Mars.
In terms of my software work, my predilection for (if not addiction to) systems failure has manifested itself in an acute interest in postmortem debugging and debuggability. You can see this through my career, be it postmortem memory leak detection in 1999, postmortem object type identifcation in 2002, postmortem DTrace in 2003 or postmortem debugging of JavaScript in 2012.
That said, postmortem debugging does have a dark side — or at least, a tedious side: the dumps themselves are annoyingly large artifacts to deal with. Small ones are only tens of megabytes, but a core dump from a modestly large VM — and certainly a crash dump of a running operating system — will easily be gigabytes or tens of gigabyes in size. This problem is not new, and my career has been pockmarked by centralized dump servers that were constantly running low of space[1]. To free up space in the virtual morgue, dumps from resolved issues are inevitably torched on a regular basis. This act strikes me as a a kind of desecration; every dump — even one whose immediate failure is understood — is a sacred snapshot of a running system, and we can’t know what questions we might have in the future that may be answerable by the past. There have been many times in my career when debugging a new problem has led to my asking new questions of old dumps.
At Joyent, we have historically gotten by on the dump management problem with the traditional centralized dump servers managed with creaky shell scripts — but it wasn’t pretty, and I lived in fear of dumps slipping through the cracks. Ironically, it was the development of Manta that made our kludged-together solution entirely untenable: while in the Joyent public cloud we may have the luxury of (broadly) ignoring core dumps that may correspond to errant user code, in Manta, we care about every core dump from every service — we always want to understand why any component fails. But because it’s a distributed service, a single bug could cause many components to fail — and generate many, many dumps. In the earlier stages of Manta development, we quickly found ourselves buried under an avalanche of hundreds of dumps. Many of them were due to known issues, of course, but I was concerned that a hithertofore unseen issue would remain hidden among the rubble — and that we would lose our one shot to debug a problem that would return to haunt us in production.
A sexy solution
Of course, the answer to the Manta dump problem was clear: Manta may have been presenting us with a big data problem, but it also provided us a big data solution. Indeed, Manta is perfectly suited for the dump problem: dumps are big, but they also require computation (if in modest measure) to make any real use of them, and Manta’s compute abstraction of the operating system is perfect for running the debugger. There were just two small wrinkles, the first being that Manta as designed only allowed computation access to an object via a (non-interactive) job, posing a problem for the fundamentally interactive act of debugging. This wasn’t a technological limitation per se (after all, the operating system naturally includes interactive elements like the shell and so on), but allowing an interactive shell to be easily created from a Manta job was going to require some new plumbing. Fortunately, some lunchtime moaning on my part managed to convince Joshua Clulow to write this (if only to quiet my bleating), and the unadulterated awesomeness that is mlogin was born.
The second wrinkle was even smaller: for historical (if not antiquated) reasons, kernel crash dumps don’t have just one file, but two — a crash dump and a “name list” that contained the symbol table. But for over a decade we have also had the symbol table in the dump itself, and it was clear the vestigial appendix that was unix.0 needed to be removed and libkvm modified to dig it out of the dump — a straightforward fix.
Wrinkles smoothed, we had the necessary foundation in Manta to implement Thoth, a Manta-based system for dump management. Using the node-manta SDK to interact with Manta, Thoth allows for uploading dumps to Manta, tracking auxiliary information about each dump, debugging a dump (natch), querying dumps, and (most interestingly) automatically analyzing many dumps in parallel. (If you are in the SmartOS or broader illumos community and you want to use Thoth, it’s open source and available on GitHub.)
While the problem Thoth solves may have but limited appeal, some of the patterns it uses may be more broadly applicable to those building Manta-based systems. As you might imagine, Thoth uploads a dump by calculating a unique hash for the dump on the client[2]. Once the hash is calculated, a Manta directory is created that contains the dump itself. For the metadata associated with the dump (machine, application, datacenter and so on), I took a quick-and-dirty approach: Thoth stores the metadata about a dump in a JSON payload that lives beside the dump in the same directory. Here is what a JSON payload might look like:
% thoth info 6433c1ccfb41929d
{
"name": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea",
"dump": "/thoth/stor/thoth/6433c1ccfb41929dedf3257a8d9160ea/core.svc.startd.70377",
"pid": "70377",
"cmd": "svc.startd",
"psargs": "/lib/svc/bin/svc.startd",
"platform": "joyent_20130625T221319Z",
"node": "eb9ca020-77a6-41fd-aabb-8f95305f9aa6",
"version": "1",
"time": 1373566304,
"stack": [
"libc.so.1`_lwp_kill+0x15()",
"libc.so.1`raise+0x2b()",
"libc.so.1`abort+0x10e()",
"utmpx_postfork+0x44()",
"fork_common+0x186()",
"fork_configd+0x8d()",
"fork_configd_thread+0x2ca()",
"libc.so.1`_thrp_setup+0x88()",
"libc.so.1`_lwp_start()"
],
"type": "core",
}
To actually debug a dump, one can use the “debug” command, which simply uses mlogin to allow interactive debugging of the dump via mdb:
% thoth debug 6433c1ccfb41929d thoth: debugging 6433c1ccfb41929dedf3257a8d9160ea * created interactive job -- 8ea7ce47-a58d-4862-a070-a220ae23e7ce * waiting for session... - established thoth: dump info in $THOTH_INFO Loading modules: [ svc.startd libumem.so.1 libc.so.1 libuutil.so.1 libnvpair.so.1 ld.so.1 ] >
Importantly, this allows you to share a dump with others without moving the dump and without needing to grant ssh access or punch holes in firewalls — they need only be able to have access to the object. For an open source system like ours, this alone is huge: if we see a panic in (say) ZFS, I would like to enlist (say) Matt or George to help debug it — even if it’s to verify a hypothesis that we have developed. Transporting tens of gigs around becomes so painful that we simply don’t do it — and the quality of software suffers.
To query dumps, Thoth runs simple Manta compute jobs on the (small) metadata objects. For example, if you wanted to find all of the dumps from a particular machine, you would run:
% thoth ls node=eb9ca020-77a6-41fd-aabb-8f95305f9aa6 thoth: creating job to list thoth: created job 2e8af859-ffbf-4757-8b54-b7865307d4d9 thoth: waiting for completion of job 2e8af859-ffbf-4757-8b54-b7865307d4d9 DUMP TYPE TIME NODE/CMD TICKET 0ccf27af56fe18b9 core 2013-07-11T18:11:43 svc.startd OS-2359 6433c1ccfb41929d core 2013-07-11T18:11:44 svc.startd OS-2359
This kicks off a job that looks at all of the metadata objects in parallel in a map phase, pulls out those that match the specified machine and passes them to a reduce phase that simply serves to coalesce the results into a larger JSON object. We can use the mjob command along with Trent Mick‘s excellent json tool to see the first phase:
% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[0].exec
mls /thoth/stor/thoth | awk '{ printf("/thoth/stor/thoth/%s/info.json\n", $1) }' | xargs mcat
So this, as it turns out, is a tad fancy: in the context of a job, it runs mls on the thoth directory, formats that as an absolute path name to the JSON object, and then uses the venerable xargs to send that output as arguments to mcat. mcat interprets its arguments as Manta objects, and runs the next phase of a job in parallel on those objects. We could — if we wanted — do the mls on the client and then pass in the results, but it would require an unnecessary roundtrip between server and client for each object; the power of Manta is that you can get the server to do work for you! Now let’s take a look at the next phase:
% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[1].exec json -c 'node=="eb9ca020-77a6-41fd-aabb-8f95305f9aa6"'
This is very simple: it just uses json‘s -c option to pass the entire JSON blob if the “node” property matches the specified value. Finally, the last phase:
% mjob get 2e8af859-ffbf-4757-8b54-b7865307d4d9 | json phases[2].exec json -g
This uses json (again) to aggregate all of the JSON blobs into one array that contains all of them — and this array is what is returned as the output of the job. Now, as one might imagine, all of this is a slight abuse of Manta, as it’s using Manta to implement a database (albeit crudely) — but Manta is high performing enough that this works well, and it’s much simpler than having to spin up infrastructure dedicated to Thoth (which would require sizing a database, making it available, backing it up, etc). Manta loves it quick and dirty!
Getting kinky
Leveraging Manta’s unique strengths, Thoth also supports a notion of analyzers, simple shell scripts that run in the context of a Manta job. For example, here is an analyzer that determines if a dump is a duplicate of a known svc.startd issue:
# # This analyzer only applies to core files # if [[ "$THOTH_TYPE" != "core" ]]; then exit 0; fi # # This is only relevant for svc.startd # if [[ `cat $THOTH_INFO | json cmd` != "svc.startd" ]]; then exit 0; fi # # This is only OS-2359 if we have utmpx_postfork in our stack # if ( ! echo ::stack | mdb $THOTH_DUMP | grep utmpx_postfork > /dev/null ); then exit 0; fi # # We have a winner! Set the ticket. # thoth_ticket OS-2359 echo $THOTH_NAME: successfully diagnosed as OS-2359
Thoth’s “analyze” command can then be used to run this against one dump, any dumps that match a particular specification, or all dumps.
The money shot
Here’s a more involved analyzer that pulls an SMF FMRI out of the dump and — looking at the time of the dump — finds the log (in Manta!) that corresponds to that service at that hour and looks for a string (“PANIC”) in that log that corresponds to service failure. This log snippet is then attached to the dump via a custom property (“paniclog”) that can then be queried by other analyzers:
if [[ "$THOTH_TYPE" != "core" ]]; then
exit 0
fi
if ( ! pargs -e $THOTH_DUMP | grep -w SMF_FMRI > /dev/null ); then
exit 0
fi
fmri=`pargs -e $THOTH_DUMP | grep -w SMF_FMRI | cut -d= -f2-`
time=`cat $THOTH_INFO | json time`
node=`cat $THOTH_INFO | json node`
path=`basename $fmri | cut -d: -f1`/`date -d @$time +%Y/%m/%d/%H`/${node}.log
echo === $THOTH_NAME ===
echo log: $path
if ( ! mget /$MANTA_USER/stor/logs/$path > log.out ); then
echo " (not found)"
exit 0
fi
grep -B 10 -A 10 PANIC log.out > panic.out
echo paniclog:
cat panic.out
thoth_set paniclog < panic.out
In a distributed system, logs are a critical component of system state — being able to quickly (and automatically) couple those logs with core dumps has been huge for us in terms of quickly diagnosing issues.
Pillow talk
Thoth is definitely a quick-and-dirty system — I was really designing it to satisfy our own immediate needs (and yes, addictions). If the usage patterns changed (if, for example, one were going to have millions of dumps instead of thousands), one would clearly want a proper database fronting Manta. Of course, it must also be said that one would then need to deal with annoying infrastructure issues like sizing that database, making it available, backing it up, etc. And if (when?) this needs to be implemented, my first approach would be to broadly keep the structure as it is, but add a Manta job that iterates over all metadata and adds the necessary rows in the appropriate tables of a centralized reporting database. This allows Manta to store the canonical data, leveraging the in situ compute to populate caching tiers and/or reporting databases.
As an engineer, it’s always gratifying to build something that you yourself want to use. Strange as it may sound to some, Thoth epitomizes that for me: it is the dump management and analysis framework that I never dreamed possible — and thanks to Manta, it was amazingly easy (and fun!) to build. Most importantly, in just the short while that we’ve had Thoth fully wired up, we have already caught a few issues that would have likely escaped notice before. (Thanks to a dump harvested by Thoth, we caught a pretty nasty OS bug that had managed to escape detection for many years.) Even though Thoth’s purpose is admittedly narrow, it’s an example of the kinds of things that you can build on top of Manta. We have many more such ideas in the works — and we’re not the only ones; Manta is inspiring many with their own big data problems — from e-commerce companies to to scientific computing and much in between!
[1] Pouring one out for my dead homie, cores2.central
[2] Shout-out to Robert‘s excellent node-ctype which allows for binary dumps to be easily parsed client-side
