Reflecting on The Soul of a New Machine

Long ago as an undergraduate, I found myself back home on a break from school, bored and with eyes wandering idly across a family bookshelf. At school, I had started to find a calling in computing systems, and now in the den, an old book suddenly caught my eye: Tracy Kidder’s The Soul of a New Machine. Taking it off the shelf, the book grabbed me from its first descriptions of Tom West, captivating me with the epic tale of the development of the Eagle at Data General. I — like so many before and after me — found the book to be life changing: by telling the stories of the people behind the machine, the book showed the creative passion among engineers that might otherwise appear anodyne, inspiring me to chart a course that might one day allow me to make a similar mark.

Since reading it over two decades ago, I have recommended The Soul of a New Machine at essentially every opportunity, believing that it is a part of computing’s literary foundation — that it should be considered our Odyssey. Recently, I suggested it as beach reading to Jess Frazelle, and apparently with perfect timing: when I saw the book at the top of her vacation pile, I knew a fuse had been lit. I was delighted (though not at all surprised) to see Jess livetweet her admiration of the book, starting with the compelling prose, the lucid technical explanations and the visceral anecdotes — but then moving on to the deeper technical inspiration she found in the book. And as she reached the book’s crescendo, Jess felt its full power, causing her to reflect on the nature of engineering motivation.

Excited to see the effect of the book on Jess, I experienced a kind of reflected recommendation: I was inspired to (re-)read my own recommendation! Shortly after I started reading, I began to realize that (contrary to what I had been telling myself over the years!) I had not re-read the book in full since that first reading so many years ago. Rather, over the years I had merely revisited those sections that I remembered fondly. On the one hand, these sections are singular: the saga of engineers debugging a nasty I-cache data corruption issue; the young engineer who implements the simulator in an impossibly short amount of time because no one wanted to tell him that he was being impossibly ambitious; the engineer who, frustrated with a nanosecond-scale timing problem in the ALU that he designed, moved to a commune in Vermont, claiming a desire to deal with “no unit of time shorter than a season”. But by limiting myself to these passages, I was succumbing to the selection bias of my much younger self; re-reading the book now from start to finish has given new parts depth and meaning. Aspects that were more abstract to me as an undergraduate — from the organizational rivalries and absurdities of the industry to the complexities of West’s character and the tribulations of the team down the stretch — are now deeply evocative of concrete episodes of my own career.

As a particularly visceral example: early in the book, a feud between two rival projects boils over in an argument at a Howard Johnson’s that becomes known among the Data General engineers as “the big shoot-out at HoJo’s.” Kidder has little more to say about it (the organizational civil war serves merely as a backdrop for Eagle), and I don’t recall this having any effect on me when I first read it — but reading it now, it resonates with a grim familiarity. In any engineering career of sufficient length, a beloved project will at some point be shelved or killed — and the moment will be sufficiently traumatic to be seared into collective memory and memorialized in local lore. So if you haven’t yet had your own shoot-out at HoJo’s, it is regrettably coming; may your career be blessed with few such firefights!

Another example of a passage with new relevance pertains to Tom West developing his leadership style on Eagle:

That fall West had put a new term in his vocabulary. It was trust. “Trust is risk, and risk avoidance is the name of the game in business,” West said once, in praise of trust. He would bind his team with mutual trust, he had decided. When a person signed up to do a job for him, he would in turn trust that person to accomplish it; he wouldn’t break it down into little pieces and make the task small, easy and dull.

I can’t imagine that this paragraph really affected me much as an undergraduate, but reading it now I view it as a one-paragraph crash-course in engineering leadership; those who deeply internalize it will find themselves blessed with high-functioning, innovative engineering teams. And lest it seem obvious, it is in fact more subtle than it looks: West is right that trust is risk — and risk-averse organizations can really struggle to foster mutual trust, despite its obvious advantages.

This passage also serves as a concrete example of the deeper currents that the book captures: it is not merely about the specific people or the machine they built, but about why we build things — especially things that so ardently resist being built. Kidder doesn’t offer a pat answer, though the engineers in the book repeatedly emphasize that their motivations are not so simple as ego or money. In my experience, engineers take on problems for lots of reasons: the opportunity to advance the state of the art; the potential to innovate; the promise of making a difference in everyday lives; the opportunity to learn about new technologies. Over the course of the book, each of these can be seen at times among the Eagle team. But, in my experience (and reflected in Kidder’s retelling of Eagle), when the problem is challenging and success uncertain, the endurance of engineers cannot be explained by these factors alone; regardless of why they start, engineers persist because they are bonded by a shared mission. In this regard, engineers endure not just for themselves, but for one another; tackling thorny problems is very much a team sport!

More than anything, my re-read of the book leaves me with the feeling that on teams that have a shared mission, mutual trust and ample grit, incredible things become possible. Over my career, I have had the pleasure of experiencing this several times, and they form my fondest memories: teams in which individuals banded together to build something larger than the sum of themselves. So The Soul of a New Machine serves to remind us that the soul of what we build is, above all, shared — that we do not endeavor alone but rather with a group of like-minded individuals.

Thanks to Jess for inspiring my own re-read of this classic — and may your read (or re-read!) of the book be as invigorating!

Posted on February 10, 2019 at 5:20 pm by bmc

A EULA in FOSS clothing?

There was a tremendous amount of reaction to and discussion about my blog entry on the midlife crisis in open source. As part of this discussion on HN, Jay Kreps of Confluent took the time to write a detailed response — which he shortly thereafter elevated into a blog entry.

Let me be clear that I hold Jay in high regard, as both a software engineer and an entrepreneur — and I appreciate the time he took to write a thoughtful response. That said, there are aspects of his response that I found troubling enough to closely re-read the Confluent Community License — and that in turn has led me to a deeply disturbing realization about what is potentially going on here.

Here is what Jay said that I found troubling:

The book analogy is not accurate; for starters, copyright does not apply to physical books and intangibles like software or digital books in the same way.

Now, what Jay said is true to a degree in that (as with many different kinds of expression) copyright law has provisions specific to software; these can be found in 17 U.S.C. § 117. But the fact that Jay also made reference to digital books was odd; digital books really have nothing to do with software (or not any more so than any other kind of creative expression). That said, digital books and proprietary software do actually share one thing in common, though it’s horrifying: in both cases their creators have maintained that you don’t actually own the copy you paid for. That is, unlike a physical book, you don’t actually buy a copy of a digital book; you merely acquire a license to use it under the creator’s terms. But how do they do this? Because when you access the digital book, you click “agree” on a license — an End User License Agreement (EULA) — that makes clear that you don’t actually own anything. The exact language varies; take (for example) VMware’s end user license agreement:

2.1 General License Grant. VMware grants to You a non-exclusive, non-transferable (except as set forth in Section 12.1 (Transfers; Assignment) license to use the Software and the Documentation during the period of the license and within the Territory, solely for Your internal business operations, and subject to the provisions of the Product Guide. Unless otherwise indicated in the Order, licenses granted to You will be perpetual, will be for use of object code only, and will commence on either delivery of the physical media or the date You are notified of availability for electronic download.

That’s a bit wordy and oblique; in this regard, Microsoft’s Windows 10 license is refreshingly blunt:

(2)(a) License. The software is licensed, not sold. Under this agreement, we grant you the right to install and run one instance of the software on your device (the licensed device), for use by one person at a time, so long as you comply with all the terms of this agreement.

That’s pretty concise: “The software is licensed, not sold.” So why do this at all? EULAs are an attempt to get out of copyright law — where the copyright owner is quite limited in the rights afforded to them as to how the content is consumed — and into contract law, where there are many fewer such limits. And EULAs have accordingly historically restricted (or tried to restrict) all sorts of uses like benchmarking, reverse engineering, running with competitive products (or, say, being used by a competitor to make competitive products), and so on.

Given the onerous restrictions, it is not surprising that EULAs are very controversial. They are also legally dubious: when you are forced to click through or (as it used to be back in the day) forced to unwrap a sealed envelope on which the EULA is printed to get to the actual media, it’s unclear how much you are actually “agreeing” to — and it may be considered a contract of adhesion. And this is just one of many legal objections to EULAs.

Suffice it to say, EULAs have long been considered open source poison, so with Jay’s frightening reference to EULA’d content, I went back to the Confluent Community License — and proceeded to kick myself for having missed it all on my first quick read. First, there’s this:


You will notice that this looks nothing like any traditional source-based license — but it is exactly the kind of boilerplate that you find on EULAs, terms-of-service agreements, and other contracts that are being rammed down your throat. And then there’s this:

1.1 License. Subject to the terms and conditions of this Agreement, Confluent hereby grants to Licensee a non-exclusive, royalty-free, worldwide, non-transferable, non-sublicenseable license during the term of this Agreement to: (a) use the Software; (b) prepare modifications and derivative works of the Software; (c) distribute the Software (including without limitation in source code or object code form); and (d) reproduce copies of the Software (the “License”).

On the one hand, this looks like the opening of an open source license like (say) the Apache License (albeit missing important words like “perpetual” and “irrevocable”), but the next two sentences are the difference that is the focus of the license:

Licensee is not granted the right to, and Licensee shall not, exercise the License for an Excluded Purpose. For purposes of this Agreement, “Excluded Purpose” means making available any software-as-a-service, platform-as-a-service, infrastructure-as-a-service or other similar online service that competes with Confluent products or services that provide the Software.

But how can you later tell me that I can’t use my copy of the software because it competes with a service that Confluent started to offer? Or is that copy not in fact mine? This is answered in section 3:

Confluent will retain all right, title, and interest in the Software, and all intellectual property rights therein.

Okay, so my copy of the software isn’t mine at all. On the one hand, this is (literally) proprietary software boilerplate — but I was given the source code and the right to modify it; how do I not own my copy? Again, proprietary software is built on the notion that — unlike the book you bought at the bookstore — you don’t own anything: rather, you license the copy that is in fact owned by the software company. And again, as it stands, this is dubious, and courts have ruled against “licensed, not sold” software. But how can a license explicitly allow me to modify the software and at the same time tell me that I don’t own the copy that I just modified?! And to be clear: I’m not asking who owns the copyright (that part is clear, as it is for open source) — I’m asking who owns the copy of the work that I have modified? How can one argue that I don’t own the copy of the software that I downloaded, modified and built myself?!

This prompts the following questions, which I also asked Jay via Twitter:

  1. If I git clone software covered under the Confluent Community License, who owns that copy of the software?
  2. Do you consider the Confluent Community License to be a contract?
  3. Do you consider the Confluent Community License to be a EULA?

To Confluent: please answer these questions, and put the answers in your FAQ. Again, I think it’s fine for you to be an open core company; just make this software proprietary and be done with it. (And don’t let yourself be troubled about the fact that it was once open source; there is ample precedent for reproprietarizing software.) What I object to (and what I think others object to) is trying to be at once open and proprietary; you must pick one.

To GitHub: Assuming that this is in fact a EULA, I think it is perilous to allow EULAs to sit in public repositories. It’s one thing to have someone click through to accept a license (though again, that itself is dubious), but to say that a git clone is an implicit acceptance of a contract that happens to be sitting somewhere in the repository beggars belief. GitHub has been a model in guiding projects with respect to licensing; it would be helpful for GitHub’s counsel to weigh in on their view of this new strain of source-available proprietary software and the degree to which it comes into conflict with GitHub’s own terms of service.

To foundations concerned with software liberties, including the Apache Foundation, the Linux Foundation, the Free Software Foundation, the Electronic Frontier Foundation, the Open Source Initiative, and the Software Freedom Conservancy: the open source community needs your legal review on this! I don’t think I’m being too alarmist when I say that this is potentially a dangerous new precedent being set; it would be very helpful to have your lawyers offer their perspectives on this, even if they disagree with one another. We seem to be in some terrible new era of frankenlicenses, where the worst of proprietary licenses are bolted on to the goodwill created by open source licenses; we need your legal voices before these creatures destroy the village!

Posted on December 16, 2018 at 6:01 pm by bmc

Open source confronts its midlife crisis

Midlife is tough: the idealism of youth has faded, as has inevitably some of its fitness and vigor. At the same time, the responsibilities of adulthood have grown: the kids that were such a fresh adventure when they were infants and toddlers are now grappling with their own transition into adulthood — and you try to remind yourself that the kids that you have sacrificed so much for probably don’t actually hate your guts, regardless of that post they just liked on the ‘gram. Making things more challenging, while you are navigating the turbulence of teenagers, your own parents are likely entering life’s twilight, needing help in new ways from their adult children. By midlife, in addition to the singular joys of life, you have also likely experienced its terrible sorrows: death, heartbreak, betrayal. Taken together, the fading of youth, the growth in responsibility and the endurance of misfortune can lead to cynicism or (worse) drastic and poorly thought-out choices. Add in a little fear of mortality and some existential dread, and you have the stuff of which midlife crises are made…

I raise this not because of my own adventures at midlife, but because it is clear to me that open source — now several decades old and fully adult — is going through its own midlife crisis. This has long been in the making: for years, I (and others) have been critical of service providers’ parasitic relationship with open source, as cloud service providers turn open source software into a service offering without giving back to the communities upon which they implicitly depend. At the same time, open source has been (rightfully) entirely unsympathetic to the proprietary software models that have been burned to the ground — but also seemingly oblivious to the larger economic waves that have buoyed them.

So it seemed like only a matter of time before the companies built around open source software would have to confront their own crisis of confidence: open source business models are really tough, selling software-as-a-service is one of the most natural of them, the cloud service providers are really good at it — and their commercial appetites seem boundless. And, like a new cherry red two-seater sports car next to a minivan in a suburban driveway, some open source companies are dealing with this crisis exceptionally poorly: they are trying to restrict the way that their open source software can be used. These companies want it both ways: they want the advantages of open source — the community, the positivity, the energy, the adoption, the downloads — but they also want to enjoy the fruits of proprietary software companies in software lock-in and its concomitant monopolistic rents. If this were entirely transparent (that is, if some bits were merely being made explicitly proprietary), it would be fine: we could accept these companies as essentially proprietary software companies, albeit with an open source loss-leader. But instead, these companies are trying to license their way into this self-contradictory world: continuing to claim to be entirely open source, but perverting the license under which portions of that source are available. Most gallingly, they are doing this by hijacking open source nomenclature. Of these, the laughably named commons clause is the worst offender (it is plainly designed to be confused with the purely virtuous creative commons), but others (including CockroachDB’s Community License, MongoDB’s Server Side Public License, and Confluent’s Community License) are little better. And in particular, as it apparently needs to be said: no, “community” is not the opposite of “open source” — please stop sullying its good name by attaching it to licenses that are deliberately not open source! But even if they were more aptly named (e.g. “the restricted clause” or “the controlled use license” or — perhaps most honest of all — “the please-don’t-put-me-out-of-business-during-the-next-reInvent-keynote clause”), these licenses suffer from a serious problem: they are almost certainly asserting rights that the copyright holder doesn’t in fact have.

If I sell you a book that I wrote, I can restrict your right to read it aloud for an audience, or sell a translation, or write a sequel; these restrictions are rights afforded the copyright holder. I cannot, however, tell you that you can’t put the book on the same bookshelf as that of my rival, or that you can’t read the book while flying a particular airline I dislike, or that you aren’t allowed to read the book and also work for a company that competes with mine. (Lest you think that last example absurd, that’s almost verbatim the language in the new Confluent Community (sic) License.) I personally think that none of these licenses would withstand a court challenge, but I also don’t think it will come to that: because the vendors behind these licenses will surely fear that they wouldn’t survive litigation, they will deliberately avoid inviting such challenges. In some ways, this netherworld is even worse, as the license becomes a vessel for unverifiable fear of arbitrary liability.

Legal dubiousness aside, as with that midlife hot rod, the licenses aren’t going to address the underlying problem. To be clear, the underlying problem is not the licensing, it’s that these companies don’t know how to make money — they want open source to be its own business model, and seeing that the cloud service providers have an entirely viable business model, they want a piece of the action. But as a result of these restrictive riders, one of two things will happen with respect to a cloud services provider that wants to build a service offering around the software:

  1. The cloud services provider will build their service not based on the software, but rather on another open source implementation that doesn’t suffer from the complication of a lurking company with brazenly proprietary ambitions.

  2. The cloud services provider will build their service on the software, but will use only the truly open source bits, reimplementing (and keeping proprietary) any of the surrounding software that they need.

In the first case, the victory is strictly pyrrhic: yes, the cloud services provider has been prevented from monetizing the software — but the software will now have less of the adoption that is the lifeblood of a thriving community. In the second case, there is no real advantage over the current state of affairs: the core software is still being used without the open source company being explicitly paid for it. Worse, the software and its community have been harmed: where one could previously appeal to the social contract of open source (namely, that cloud service providers have a social responsibility to contribute back to the projects upon which they depend), now there is little to motivate such reciprocity. Why should the cloud services provider contribute anything back to a company that has declared war on it? (Or, worse, implicitly accused it of malfeasance.) Indeed, as long as fights are being picked with them, cloud service providers will likely clutch their bug fixes in the open core as a differentiator, cackling to themselves over the gnarly race conditions that they have fixed of which the community is blissfully unaware. Is this in any way a desired end state?

So those are the two cases, and they are both essentially bad for the open source project. Now, one may notice that there is a choice missing, and for those open source companies that still harbor magical beliefs, let me put this to you as directly as possible: cloud services providers are emphatically not going to license your proprietary software. I mean, you knew that, right? The whole premise with your proprietary license is that you are finding that there is no way to compete with the operational dominance of the cloud services providers; did you really believe that those same dominant cloud services providers can’t simply reimplement your LDAP integration or whatever? The cloud services providers are currently reproprietarizing all of computing — they are making their own CPUs for crying out loud! — reimplementing the bits of your software that they need in the name of the service that their customers want (and will pay for!) won’t even move the needle in terms of their effort.

Worse than all of this (and the reason why this madness needs to stop): licenses that are vague with respect to permitted use are corporate toxin. Any company that has been through an acquisition can speak of the peril of the due diligence license audit: the acquiring entity is almost always deep pocketed and (not unrelatedly) risk averse; the last thing that any company wants is for a deal to go sideways because of concern over unbounded liability to some third-party knuckle-head. So companies that engage in license tomfoolery are doing worse than merely not solving their own problem: they are potentially poisoning the wellspring of their own community.

So what to do? Those of us who have been around for a while — who came up in the era of proprietary software and saw the merciless transition to open source software — know that there’s no way to cross back over the Rubicon. Open source software companies need to come to grips with that uncomfortable truth: their business model isn’t their community’s problem, and they should please stop trying to make it one. And while they’re at it, it would be great if they could please stop making outlandish threats about the demise of open source; they sound like shrieking proprietary software companies from the 1990s, warning that open source will be ridden with nefarious backdoors and unspecified legal liabilities. (Okay, yes, a confession: just as one’s first argument with their teenager is likely to give their own parents uncontrollable fits of smug snickering, those of us who came up in proprietary software may find companies decrying the economically devastating use of their open source software to be amusingly ironic — but our schadenfreude cups runneth over, so they can definitely stop now.)

So yes, these companies have a clear business problem: they need to find goods and services that people will exchange money for. There are many business models that are complementary with respect to open source, and some of the best open source software (and certainly the least complicated from a licensing drama perspective!) comes from companies that simply needed the software and open sourced it because they wanted to build a community around it. (There are many examples of this, but the outstanding Envoy and Jaeger both come to mind — the former from Lyft, the latter from Uber.) In this regard, open source is like a remote-friendly working policy: it’s something that you do because it makes economic and social sense; even as it may be core to your business, it’s not a business model in and of itself.

That said, it is possible to build business models around the open source software that is a company’s expertise and passion! Even though the VC that led the last round wants to puke into a trashcan whenever they hear it, business models like “support”, “services” and “training” are entirely viable! (That’s the good news; the bad news is that they may not deliver the up-and-to-the-right growth that these companies may have promised in their pitch deck — and they may come at too low a margin to pay for large teams, lavish perks, or outsized exits.) And of course, making software available as a service is also an entirely viable business model — but I’m pretty sure they’ve heard about that one in the keynote.

As part of their quest for a business model, these companies should read Adam Jacob’s excellent blog entry on sustainable free and open source communities. Adam sees what I see (and Stephen O’Grady sees and Roman Shaposhnik sees), and he has taken a really positive action by starting the Sustainable Free and Open Source Communities project. This project has a lot to be said for it: it explicitly focuses on building community; it emphasizes social contracts; it seeks longevity for the open source artifacts; it shows the way to viable business models; it rejects copyright assignment to a corporate entity. Adam’s efforts can serve to clear our collective head, and to focus on what’s really important: the health of the communities around open source. By focusing on longevity, we can plainly see restrictive licensing as the death warrant that it is, shackling the fate of a community to that of a company. (Viz. after the company behind AGPL-licensed RethinkDB capsized, it took the Linux Foundation buying the assets and relicensing them to rescue the community.) Best of all, it’s written by someone who has built a business that has open source software at its heart. Adam has endured the challenges of the open core model, and is refreshingly frank about its economic and psychic tradeoffs. And if he doesn’t make it explicit, Adam’s fundamental optimism serves to remind us, too, that any perceived “danger” to open source is overblown: open source is going to endure, as no company is going to be able to repeal the economics of software. That said, as we collectively internalize that open source is not a business model on its own, we will likely see fewer VC-funded open source companies (though I’m honestly not sure that that’s a bad thing).

I don’t think that it’s an accident that Adam, Stephen, Roman and I see more or less the same thing and are more or less the same age: not only have we collectively experienced many sides of this, but we are at once young enough to still recall our own idealism, yet old enough to know that coercion never endures in the limit. In short, this too shall pass — and in the end, open source will survive its midlife questioning just as people in midlife get through theirs: by returning to its core values and by finding rejuvenation in its communities. Indeed, we can all find solace in the fact that while life is finite, our values and our communities survive us — and that our engagement with them is our most important legacy.

Posted on December 14, 2018 at 10:50 pm by bmc

Assessing software engineering candidates

Note: This blog entry reproduces RFD 151. Comments are encouraged in the discussion for RFD 151.

How does one assess candidates for software engineering positions? This is an age-old question without a formulaic answer: software engineering is itself too varied to admit a single archetype.

Most obviously, software engineering is intellectually challenging; it demands minds that not only enjoy the thrill of solving puzzles, but can also stay afloat in a sea of numbing abstraction. This raw capacity, however, is insufficient; there are many more nuanced skills that successful software engineers must possess. For example, software engineering is an almost paradoxical juxtaposition of collaboration and isolation: successful software engineers are able to work well with (and understand the needs of!) others, but are also able to focus intensely on their own. This contrast extends to the conveyance of ideas, where they must be able to express their own ideas well enough to persuade others, but also be able to understand and be persuaded by the ideas of others — and be able to implement all of these on their own. They must be able to build castles of imagination, and yet still understand the constraints of a grimy reality: they must be arrogant enough to see the world as it isn’t, but humble enough to accept the world as it is. Each of these is a balance, and for each, long-practicing software engineers will cite colleagues who have been ineffective because they have erred too greatly on one side or another.

The challenge is therefore to assess prospective software engineers, without the luxury of firm criteria. This document is an attempt to pull together accumulated best practices; while it shouldn’t be inferred to be overly prescriptive, where it is rigid, there is often a painful lesson behind it.

In terms of evaluation mechanism: using in-person interviewing alone can be highly unreliable and can select predominantly for surface aspects of a candidate’s personality. While we advocate (and indeed, insist upon) interviews, they should come relatively late in the process; as much assessment as possible should be done by allowing the candidate to show themselves as software engineers truly work: on their own, in writing.

Traits to evaluate

How does one select for something so nuanced as balance, especially when the road ahead is unknown? We must look at a wide variety of traits, presented here in the order in which they are traditionally assessed:

  1. Aptitude

  2. Education

  3. Motivation

  4. Values

  5. Integrity

Aptitude

As the ordering implies, there is a temptation in traditional software engineering hiring to focus on aptitude exclusively: to use an interview solely to assess a candidate’s pure technical pulling power. While this might seem to be a reasonable course, it in fact leads down the primrose path to pop quizzes about algorithms seen primarily in interview questions. (Red-black trees and circular linked list detection: looking at you.) These assessments of aptitude are misplaced: software engineering is not, in fact, a spelling bee, and one’s ability to perform during an arbitrary oral exam may or may not correlate to one’s ability to actually develop production software. We believe that aptitude is better assessed where software engineers are forced to exercise it: based on the work that they do on their own. As such, candidates should be asked to provide three samples of their work: a code sample, a writing sample, and an analysis sample.

Code sample

Software engineers are ultimately responsible for the artifacts that they create, and as such, a code sample can be the truest way to assess a candidate’s ability.

Candidates should be guided to present code that they believe best reflects them as a software engineer. If this seems too broad, it can be easily focused: what is some code that you’re proud of and/or code that took you a while to get working?

If candidates do not have any code samples because all of their code is proprietary, they should write some: they should pick something that they have always wanted to write but have lacked an excuse to write — and they should go write it! On such a project, the guideline to the candidate should be to spend at least (say) eight hours on it, but no more than twenty-four — and over no longer than a two-week period.

If the candidate is to write something de novo and/or there is a new or interesting technology that the organization is using, it may be worth guiding the candidate to use it (e.g., to write it in a language that the team has started to use, or using a component that the team is broadly using). This constraint should be uplifting to the candidate (e.g., “You may have wanted to explore this technology; here’s your chance!”). At Joyent in the early days of node.js, this was what we called “the node test”, and it yielded many fun little projects — and many great engineers.

Writing sample

Writing good code and writing good prose seem to be found together in the most capable software engineers. That these skills are related is perhaps unsurprising: both types of writing are difficult; both require one to create wholly new material from a blank page; both demand the ability to revise and polish.

To assess a candidate’s writing ability, they should be asked to provide a writing sample. Ideally, this will be technical writing, e.g.:

If a candidate has all of these, they should be asked to provide one of each; if a candidate has none of them, they should be asked to provide a writing sample on something else entirely, e.g. a thesis, dissertation or other academic paper.

Analysis sample

Part of the challenge of software engineering is dealing with software when it doesn’t, in fact, work correctly. At this moment, a software engineer must flip their disposition: instead of an artist creating something new, they must become a scientist, attempting to reason about a foreign world. In having candidates only write code, analytical skills are often left unexplored. And while this can be explored conversationally (e.g., asking for “debugging war stories” is a classic — and often effective — interview question), an oral description of recalled analysis doesn’t necessarily allow the true depths of a candidate’s analytical ability to be plumbed. For this, candidates should be asked to provide an analysis sample: a written analysis of software behavior from the candidate. This may be difficult for many candidates: for many engineers, these analyses may be most often found in defect reports, which may not be public. If the candidate doesn’t have such an analysis sample, the scope should be deliberately broadened to any analytical work they have done on any system (academic or otherwise). If this broader scope still doesn’t yield an analysis sample, the candidate should be asked to generate one to the best of their ability by writing down their analysis of some aspect of system behavior. (This can be as simple as asking them to write down the debugging story that would be their answer to the interview question — giving the candidate the time and space to answer the question once, and completely.)


Education

We are all born uneducated — and our own development is a result of the informal education of experience and curiosity, as well as a more structured and formal education. To assess a candidate’s education, both its formal and informal aspects should be considered.

Formal education

Formal education is easier to assess by its very formality: a candidate’s education is relatively easily evaluated if they had the good fortune of discovering their interest and aptitude at a young age, had the opportunity to pursue and complete their formal education in computer science, and had the further good luck of attending an institution that one knows and has confidence in.

But one should not be bigoted by familiarity: there are many terrific software engineers who attended little-known schools or who took otherwise unconventional paths. The completion of a formal education in computer science is much more important than the institution: the strongest candidate from a little-known school is almost assuredly stronger than the weakest candidate from a well-known school.

In other cases, it’s even more nuanced: there have been many later-in-life converts to the beauty and joy of software engineering, and such candidates should emphatically not be excluded merely because they discovered software later than others. For those that concentrated in entirely non-technical disciplines, further probing will likely be required, with greater emphasis on their technical artifacts.

The most important aspect of one’s formal education may not be its substance so much as its completion. Like software engineering, there are many aspects of completing a formal education that aren’t necessarily fun: classes that must be taken to meet requirements; professors that must be endured rather than enjoyed; subject matter that resists quick understanding or appeal. In this regard, completion of a formal education represents the completion of a significant task. Inversely, the failure to complete one’s formal education may constitute an area of concern. There are, of course, plausible life reasons to abandon one’s education prematurely (especially in an era when higher education is so expensive), but there are also many paths and opportunities to resume and complete it. The failure to complete formal education may indicate deeper problems, and should be understood.

Informal education

Learning is a life-long endeavor, and much of one’s education will be informal in nature. Assessing this informal education is less clear, especially because (by its nature) there is little formally to show for it — but candidates should have a track record of being able to learn on their own, even when this self-education is arduous. One way to probe this may be with a simple question: what is an example of something that you learned that was a struggle for you? As with other questions posed here, the question should have a written answer.


Motivation

Motivation is often not assessed in the interview process, which is unfortunate because it dictates so much of what we do and why. For many companies, it will be important to find those who are intrinsically motivated — those who do what they do primarily for the value of doing it.

Selecting for motivation can be a challenge, and defies formula. Here, open source and open development can be a tremendous asset: it allows others to see what is being done, and, if they are excited by the work, to join the effort and to make their motivation clear.


Values

Values are often not evaluated formally at all in the software engineering process, but they can be critical in determining the “fit” of a candidate. To differentiate values from principles: values represent relative importance, in contrast to the absolute importance of principles. Values matter in a software engineering context because we so frequently make tradeoffs in which our values dictate our disposition. (For example, consider the relative importance of speed of development versus rigor: both are clearly important and positive attributes, but there is often a tradeoff to be had between them.) Different engineering organizations may have different values at different times or for different projects, but it’s also true that individuals tend to develop their own values over their careers — and it’s essential that the values of a candidate do not clash with the values of the team that they are to join.

But how to assess one’s values? Many will speak to values that they don’t necessarily hold (e.g., rigor), so simply asking someone what’s important to them may or may not yield their true values. One observation is that one’s values — and one’s adherence to or divergence from them — will often be reflected in happiness and satisfaction with work. When work strongly reflects one’s values, one is much more likely to find it satisfying; when values are compromised (even if for a good reason), work is likely to be unsatisfying. As such, the specifics of one’s values may be ascertained by asking candidates some probing questions, e.g.:

Our values can also be seen in the way we interact with others. As such, here are some questions that may have revealing answers:

The answers to these questions should be written down to allow them to be answered thoughtfully and in advance — and then to serve as a starting point for conversation in an interview.

Some questions, however, are more amenable to a live interview. For example, it may be worth asking some situational questions like:


Integrity

In an ideal world, integrity would not be something we would need to assess in a candidate: we could trust that everyone is honest and trustworthy. This view, unfortunately, is naïve with respect to how malicious bad actors can be; for any organization — but especially for one that is biased towards trust and transparency — it is essential that candidates be of high integrity: an employee who operates outside the bounds of integrity can do nearly unbounded damage to an organization that assumes positive intent.

There is no easy or single way to assess integrity for people with whom one hasn’t endured difficult times. By far the most accurate way of assessing integrity in a candidate is for them to already be in the circle of one’s trust: for them to have worked deeply with (and be trusted by) someone that is themselves deeply trusted. But even in these cases where the candidate is trusted, some basic verification is prudent.

Criminal background check

The most basic integrity check involves a criminal background check. While local law dictates how these checks are used, the check should be performed for a simple reason: it verifies that the candidate is who they say they are. If someone has made criminal mistakes, these mistakes may or may not disqualify them (much will depend on the details of the mistakes, and on local law on how background checks can be used), but if a candidate fails to be honest or remorseful about those mistakes, it is a clear indicator of untrustworthiness.

Credential check

A hidden criminal background in software engineering candidates is unusual; much more common is a slight “fudging” of credentials or other elements of one’s past: degrees that were not in fact earned; grades or scores that have been exaggerated; awards that were not in fact bestowed; gaps in employment history that are quietly covered up by changing the dates that one was at a previous employer. These transgressions may seem slight, but they can point to something quite serious: a candidate’s willingness or desire to mislead others to advance themselves. To protect against this, a basic credential check should be performed. This can be confined to degrees, honors, and employment.


References

References can be very tricky, especially for someone coming from a difficult situation (e.g., fleeing poor management). Ideally, a candidate is well known by someone inside the company who is trusted — but even this poses challenges: sometimes we don’t truly know people until they are in difficult situations, and someone “known” may not, in fact, be known at all. Worse, references are most likely to break down when they are most needed: dishonest, manipulative people are, after all, dishonest and manipulative; they can easily fool people — and even references — into thinking that they are something that they are not. So while references can provide value (and shouldn’t be eliminated as a tool), they should also be used carefully and kept in perspective.


For individuals outside of that circle of trust, checking integrity is probably still best done in person. There are several potential mechanisms here:

Mechanics of evaluation

Interviews should begin with phone screens to assess the most basic viability, especially with respect to motivation. This initial conversation might include some basic (and unstructured) homework to gauge that motivation: the candidate should be pointed to material about the company, along with sources that describe its methods of work and the specifics of what that work entails, and should be encouraged to review some of this material and send written thoughts as a quick test of motivation. If one is not motivated enough to learn about a potential employer, it’s hard to see how they will suddenly gain the motivation to see themselves through difficult problems.

If and when a candidate is interested in deeper interviews, everyone should be expected to provide the same written material.

Candidate-submitted material

The candidate should submit the following:

  1. A code sample

  2. A writing sample

  3. An analysis sample

  4. Written answers to the questions posed in the sections above

Candidate-submitted material should be collected and distributed to everyone on the interview list.

Before the interview

Everyone on the interview schedule should read the candidate-submitted material, and a pre-meeting should then be held to discuss approach: based on the written material, what are the things that the team wishes to better understand? And who will do what?

Pre-interview job talk

For senior candidates, it can be effective to ask them to start the day by giving a technical presentation to those who will interview them. On the one hand, it may seem cruel to ask a candidate to present to a roomful of people who will be later interviewing them, but to the candidate this should be a relief: this allows them to start the day with a home game, where they are talking about something that they know well and can prepare for arbitrarily. The candidate should be allowed to present on anything technical that they’ve worked on, and it should be made clear that:

  1. Confidentiality will be respected (that is, they can present on proprietary work)

  2. The presentation needn’t be novel — it is fine for the candidate to give a talk that they have given before

  3. Slides are fine but not required

  4. The candidate should assume that the audience is technical, but not necessarily familiar with the domain that they are presenting on

  5. The candidate should assume about 30 minutes for presentation and 15 minutes for questions.

The aim here is severalfold.

First, this lets everyone get the same information at once: it is not unreasonable that the talk that a candidate would give would be similar to a conversation that they would have otherwise had several times over the day as they are asked about their experience; this minimizes that repetition.

Second, it shows how well the candidate teaches. Assuming that the candidate is presenting on a domain that isn’t intimately known by every member of the audience, the candidate will be required to instruct. Teaching requires both technical mastery and empathy — and a pathological inability to teach may point to deeper problems in a candidate.

Third, it shows how well the candidate fields questions about their work. It should go without saying that the questions themselves shouldn’t be trying to find flaws with the work, but should be entirely in earnest; seeing how a candidate answers such questions can be very revealing about character.

All of that said: a job talk likely isn’t appropriate for every candidate — and shouldn’t be imposed on (for example) those still in school. One guideline may be: those with more than seven years of experience are expected to give a talk; those with fewer than three are not expected to give a talk (but may do so); those in between can use their own judgement.


Interviews

Interviews shouldn’t necessarily take one form; interviewers should feel free to adopt a variety of styles and approaches — but should generally refrain from “gotcha” questions and/or questions that conflate surface aspects of intellect with deeper qualities (e.g., Microsoft’s infamous “why are manhole covers round?”). Mixing interview styles over the course of the day can also be helpful for the candidate.

After the interview

After the interview (usually the next day), the candidate should be discussed by those who interviewed them. The objective isn’t necessarily to come to a consensus (though that is, ultimately, the goal), but rather to surface areas of concern. In this regard, the post-interview conversation must be handled carefully: the interview is deliberately constructed to allow broad contact with the candidate, and it is possible that someone relatively junior or otherwise inexperienced will see something that others will miss. The meeting should be constructed to assure that this important data isn’t suppressed; bad hires can happen when reservations aren’t shared out of fear of disappointing a larger group!

One way to do this is to structure the meeting this way:

  1. All participants are told to come in with one of three decisions: Hire, Do not hire, or Insufficient information. Every participant must take one of these positions, and while one’s position on a candidate may change over the course of the meeting, the initial position shouldn’t be retroactively changed. If it helps, this position can be privately recorded before the meeting starts.

  2. The meeting starts with everyone who believes Do not hire explaining their position. While starting with the Do not hire positions may seem to give the meeting a negative disposition, it is extremely important that the meeting start with the reservations lest they be silenced — especially when and where they are so great that someone believes a candidate should not be hired.

  3. Next, those who believe Insufficient information should explain their position. These positions may be relatively common, and each means that the interview left the interviewer with unanswered questions. By presenting these unanswered questions, there is a possibility that others can provide answers learned in their own interactions with the candidate.

  4. Finally, those who believe Hire should explain their position, perhaps filling in missing information for others who are less certain.

If there are any Do not hire positions, these should be treated very seriously, for it is saying that the aptitude, education, motivation, values and/or integrity of the candidate are in serious doubt or are otherwise unacceptable. Those who believe Do not hire should be asked for the dimensions that most substantiate their position. Especially where these reservations are around values or integrity, a single Do not hire should raise serious doubts about a candidate: the risks of bad hires around values or integrity are far too great to ignore someone’s judgement in this regard!

Ideally, however, no one has the position of Do not hire, and through a combination of screening and candidate self-selection, everyone believes Hire and the discussion can be brief, positive and forward-looking!

If, as is perhaps most likely, there is some mix of Hire and Insufficient information, the discussion should focus on the information that is missing about the candidate. If other interviewers cannot fill in the information about the candidate (and if it can’t be answered by the corpus of material provided by the candidate), the group should together brainstorm about how to ascertain it. Should a follow-up conversation be scheduled? Should the candidate be asked to provide some missing information? Should some aspect of the candidate’s background be explored? The collective decision should not move to Hire as long as there remain unanswered questions preventing everyone from reaching the same decision.

Assessing the assessment process

It is tautologically challenging to evaluate one’s process for assessing software engineers: one lacks data on the candidates that one doesn’t hire, and therefore can’t know which candidates should have been extended offers of employment but weren’t. As such, hiring processes can induce a kind of ultimate survivorship bias in that it is only those who have survived (or instituted) the process who are present to assess it — which can lead to deafening echo chambers of smug certitude. One potential way to assess the assessment process: ask candidates for their perspective on it. Candidates are in a position to be evaluating many different hiring processes concurrently, and likely have the best perspective on the relative merits of different ways of assessing software engineers.

Of course, there is peril here too: while many organizations would likely be very interested in a candidate who is bold enough to offer constructive criticism on the process being used to assess them while it is being used to assess them, the candidates themselves might not realize that — and may instead offer bland bromides for fear of offending a potential employer. Still, it has been our experience that a thoughtful process will encourage a candidate’s candor — and we have found that the processes described here have been strengthened by listening carefully to the feedback of candidates.

Posted on October 5, 2018 at 11:02 am by bmc

Should KubeCon be double-blind?

With a paltry 13% acceptance rate, KubeCon is naturally going to generate a lot of disappointment — the vast, vast majority of proposals aren’t being accepted. But as several have noted, a small number of vendors account for a significant number of accepted talks. Is this an issue? In particular, review for KubeCon isn’t double-blind; should it be?

In terms of my own perspective here, I view conferences for practitioners (and especially their concomitant hallway tracks) as essential for the community of our craft. Historically, I have been troubled by the strangulation of practitioner conferences by academic computer science: after we presented DTrace at USENIX 2004, I publicly wondered about the fate of USENIX — which engendered some thoughtful discussion. When USENIX had me keynote their annual technical conference twelve years later, I used the opportunity to express my concerns with the conference model, and wondered about finding the right solution both for practitioners and for academic computer science. That evening, we had a birds-of-a-feather session, which (encouragingly) was very well attended. There were many interesting perspectives, but the one that stood out to me was from Kathryn McKinley, who makes a compelling case that reviews should be double-blind. In the BOF, McKinley was emphatic and persuasive that conferences absolutely must be double-blind in their review — and that anything less is a disservice to the community and the discipline.

Wanting to take that advice, when we organized Systems We Love later that year, we ran it double-blind with a very large (and, if I may say, absolutely awesome!) program committee. We had many, many submissions — well over ten times the number of slots! We were double-blind for the first few stages of review, until the number of submissions had been reduced by a factor of five. Once we had reduced the number of submissions to “merely” double the number of slots, we de-blinded to get the rest of the way to a program. (Which was agonizing — too many great submissions!) By de-blinding, we were essentially using factors about the submitter as a tie-breaker to differentiate submissions that were all high quality — and as a way to get voices we might not otherwise hear from.

Personally, I feel that we were able to hit a sweet spot by doing it this way — and there were quite a few surprises when we de-blinded. Of note, at least a quarter of the speakers (and perhaps more, as I didn’t ask everyone) were presenting for the first time. Equally surprising: several “big names” had submissions that we rejected while blinded — but looking at their submissions, the submissions themselves just weren’t that great! (Which isn’t to say that they don’t have a ton of terrific work to their name — just that every swing of the bat is not going to be a home run.)

So: should KubeCon be double-blind? I consider myself firmly in McKinley’s camp in that I believe that any oversubscribed conference needs to be double-blind to a very significant degree. That said, I also think our challenges as practitioners don’t exactly map to the challenges in academic computer science. (For example, because we aren’t using conferences as a publishing vector, I don’t think we need to be double-blind-until-accept — I think we can de-blind ourselves to our rejections.) I also don’t even think we need to be double-blind all the way through the process: we should be double-blind until the program committee has reduced the number of submissions to the point that every remaining submission is deemed one that the program committee wants to accept. (That is, to the point that were it not for the physical limits of the conference, the program committee would want to accept the remaining submissions.) De-blinding at this point assures that the quality of the content is primarily due to the merit of the submission — not due to the particulars of the submitter. (That is, not based on what they’ve done in the past — or who their employer happens to be.) That said, de-blinding at the point of quality does allow these other factors to be used to mold the final program.

For KubeCon — and for other practitioner conferences — I think a hybrid model is the best approach: double-blind for a significant fraction of review, de-blinded for a final program formulation, and then perhaps “invited talks” for talks that were rejected when blind, but that the program committee wishes to accept based on the presenter. This won’t lead to less disappointment at KubeCon (13% is too low an acceptance rate to not be rejecting high-quality submissions), but I believe that a significantly double-blind process will give the community the assurance of a program that best represents it!
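The hybrid model above can be sketched in code. This is a hypothetical illustration only — the types, scores, and tie-breaking policy are all invented for the example, and real program committee tooling works however it works — but it makes the two stages explicit: a blind cut on merit alone, then a de-blinded fit to the slot count.

```rust
// Hypothetical sketch of the hybrid review model: blind review first cuts
// the field to only accept-worthy submissions; de-blinding is then used
// only to fit the physical slot count.

#[derive(Debug, Clone)]
struct Submission {
    id: u32,          // reviewers see only this during blind stages
    blind_score: f64, // aggregate PC score from blinded review
    speaker: String,  // hidden until the de-blinded stage
}

/// Stage 1 (blind): keep every submission the PC would accept on merit.
fn blind_cut(mut subs: Vec<Submission>, accept_threshold: f64) -> Vec<Submission> {
    subs.retain(|s| s.blind_score >= accept_threshold);
    subs
}

/// Stage 2 (de-blind): reduce to the slot count. The tie-breaker here is
/// just the score; a real PC would weigh first-time speakers, topic
/// balance, employer diversity, etc. at this point.
fn deblind_fit(mut subs: Vec<Submission>, slots: usize) -> Vec<Submission> {
    subs.sort_by(|a, b| b.blind_score.partial_cmp(&a.blind_score).unwrap());
    subs.truncate(slots);
    subs
}
```

The important property is that nothing about the submitter can rescue a submission that failed the blind cut — submitter factors only shape the final program among already accept-worthy work.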

Posted on October 3, 2018 at 11:41 am by bmc

The relative performance of C and Rust

My blog post on falling in love with Rust got quite a bit of attention — with many being surprised by what had surprised me as well: the high performance of my naive Rust versus my (putatively less naive?) C. However, others viewed it as irresponsible to report these performance differences, believing that these results would be blown out of proportion or worse. The concern is not entirely misplaced: system benchmarking is one of those areas where — in Jonathan Swift’s words from three centuries ago — “falsehood flies, and the truth comes limping after it.”

There are myriad reasons why benchmarking is so vulnerable to leaving truth behind. First, it’s deceptively hard to quantify the performance of a system simply because the results are so difficult to verify: the numbers we get must be validated (or rejected) according to the degree that they comport with our expectations. As a result, if our expectations are incorrect, the results can be wildly wrong. To see this vividly, please watch (or rewatch!) Brendan Gregg’s excellent (and hilarious) lightning talk on benchmarking gone wrong. Brendan recounts his experience dealing with a particularly flawed approach, and it’s a talk that I always show anyone who is endeavoring to benchmark the system: it shows how easy it is to get it totally wrong — and how important it is to rigorously validate results.

Second, even if one gets an entirely correct result, it’s really only correct within the context of the system. As we succumb to the temptation of applying a result more universally than this context merits — as the asterisks and the qualifiers on a performance number are quietly amputated — a staid truth is transmogrified into a flying falsehood. Worse, some of that context may have been implicit in that the wrong thing may have been benchmarked: in trying to benchmark one aspect of the system, one may inadvertently merely quantify an otherwise hidden bottleneck.

So take all of this as disclaimer: I am not trying to draw large conclusions about “C vs. Rust” here. To the contrary, I think that it is a reasonable assumption that, for any task, a lower-level language can always be made to outperform a higher-level one. But with that said, a pesky fact remains: I reimplemented a body of C software in Rust, and it performed better for the same task; what’s going on? And is there anything broader we can say about these results?

To explore this, I ran some statemap rendering tests on SmartOS on a single-socket Haswell server (Xeon E3-1270 v3) running at 3.50GHz. The C version was compiled with GCC 7.3.0 at the -O2 optimization level; the Rust version was compiled with Rust 1.29.0 with --release. All of the tests were run bound to a processor set containing a single core; all were bound to one logical CPU within that core, with the other logical CPU forced to be idle. cpustat was used to gather CPU performance counter data, with each number denoting one run with pic0 programmed to that CPU performance counter. The input file (~30MB compressed) contains 3.5M state changes, and in the default config will generate a ~6MB SVG.

Here are the results for a subset of the counters relating to the cache performance:

Counter                                    statemap-gcc    statemap-rust    Δ
cpu_clk_unhalted.thread_p 32,166,437,125 23,127,271,226 -28.1%
inst_retired.any_p 49,110,875,829 48,752,136,699 -0.7%
cpu_clk_unhalted.ref_p 918,870,673 660,493,684 -28.1%
mem_uops_retired.stlb_miss_loads 8,651,386 2,353,178 -72.8%
mem_uops_retired.stlb_miss_stores 268,802 1,000,684 272.3%
mem_uops_retired.lock_loads 7,791,528 51,737 -99.3%
mem_uops_retired.split_loads 107,969 52,745,125 48752.1%
mem_uops_retired.split_stores 196,934 41,814,301 21132.6%
mem_uops_retired.all_loads 11,977,544,999 9,035,048,050 -24.6%
mem_uops_retired.all_stores 3,911,589,945 6,627,038,769 69.4%
mem_load_uops_retired.l1_hit 9,337,365,435 8,756,546,174 -6.2%
mem_load_uops_retired.l2_hit 1,205,703,362 70,967,580 -94.1%
mem_load_uops_retired.l3_hit 66,771,301 33,323,740 -50.1%
mem_load_uops_retired.l1_miss 1,276,311,911 105,524,579 -91.7%
mem_load_uops_retired.l2_miss 69,671,774 34,616,966 -50.3%
mem_load_uops_retired.l3_miss 2,544,750 1,364,435 -46.4%
mem_load_uops_retired.hit_lfb 1,393,631,815 157,897,686 -88.7%
mem_load_uops_l3_hit_retired.xsnp_miss 435 526 20.9%
mem_load_uops_l3_hit_retired.xsnp_hit 1,269 740 -41.7%
mem_load_uops_l3_hit_retired.xsnp_hitm 820 517 -37.0%
mem_load_uops_l3_hit_retired.xsnp_none 67,846,758 33,376,449 -50.8%
mem_load_uops_l3_miss_retired.local_dram 2,543,699 1,301,381 -48.8%


So the Rust version is issuing a remarkably similar number of instructions (within less than one percent!), but with a decidedly different mix: just three quarters of the loads of the C version and (interestingly) many more stores. The cycles per instruction (CPI) drops from 0.65 to 0.47, indicating much better memory behavior — and indeed the L1 misses, L2 misses and L3 misses are all way down. The L1 hits as an absolute number are actually quite high relative to the loads, giving Rust a 96.9% L1 hit rate versus the C version’s 77.9% hit rate. Rust also lives much better in the L2, where it has half the L2 misses of the C version.

Okay, so Rust has better memory behavior than C? Well, not so fast. In terms of what this thing is actually doing: the core of statemap processing is coalescing a large number of state transitions in the raw data into a smaller number of rectangles for the resulting SVG. When presented with a new state transition, it picks the “best” two adjacent rectangles to coalesce based on a variety of properties. As a result, this code spends all of its time constantly updating an efficient data structure to be able to make this decision. For the C version, this is a binary search tree (an AVL tree), but Rust (interestingly) doesn’t offer a binary search tree — and it is instead implemented with a BTreeSet, which implements a B-tree. B-trees are common when dealing with on-disk state, where the cost of loading a node contained in a disk block is much, much less than the cost of searching that node for a desired datum, but they are less common as a replacement for an in-memory BST. Rust makes the (compelling) argument that, given the modern memory hierarchy, the cost of getting a line from memory is far greater than the cost of reading it out of a cache — and B-trees make sense as a replacement for BSTs, albeit with a much smaller value for B. (Cache lines are somewhere between 64 and 512 bytes; disk blocks start at 512 bytes and can be much larger.)
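To make the neighbor-query pattern concrete, here is a minimal sketch (not the statemap code itself; the keys and the neighbors function are hypothetical) of using a BTreeSet to find the adjacent entries for a given key — the kind of ordered lookup that rectangle coalescing depends on, and which the B-tree serves from cache-dense nodes:

```rust
use std::collections::BTreeSet;

// Hypothetical sketch: rectangle start times kept in a BTreeSet; for a
// given start time, find its adjacent entries in O(log n) per lookup.
fn neighbors(starts: &BTreeSet<u64>, start: u64) -> (Option<u64>, Option<u64>) {
    // Nearest entry strictly below, and nearest entry strictly above.
    let prev = starts.range(..start).next_back().copied();
    let next = starts.range(start + 1..).next().copied();
    (prev, next)
}

fn main() {
    let starts: BTreeSet<u64> = [100, 250, 400, 900].into_iter().collect();
    assert_eq!(neighbors(&starts, 400), (Some(250), Some(900)));
    assert_eq!(neighbors(&starts, 100), (None, Some(250)));
    println!("neighbors of 400: {:?}", neighbors(&starts, 400));
}
```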

Could the performance difference that we’re seeing simply be Rust’s data structure being — per its design goals — more cache efficient? To explore this a little, I varied the value of the number of rectangles in the statemap, as this will affect both the size of the tree (more rectangles will be a larger tree, leading to a bigger working set) and the number of deletions (more rectangles will result in fewer deletions, leading to less compute time).

The results were pretty interesting:

A couple of things to note here: first, there are 3.5M state transitions in the input data; as soon as the number of rectangles exceeds the number of states, there is no reason for any coalescing, and some operations (namely, deleting from the tree of rectangles) go away entirely. So that explains the flatline at roughly 3.5M rectangles.

Also not surprisingly, the worst performance for both approaches occurs when the number of rectangles is set at more or less half the number of state transitions: the tree is huge (and therefore has relatively poorer cache performance for either approach) and each new state requires a deletion (so the computational cost is also high).

So far, this seems consistent with the BTreeSet simply being a more efficient data structure. But what is up with that lumpy Rust performance?! In particular there are some strange spikes; e.g., zooming in on the rectangle range up to 100,000 rectangles:

Just from eyeballing it, they seem to appear at roughly logarithmic frequency with respect to the number of rectangles. My first thought was perhaps some strange interference relationship with respect to the B-tree and the cache size or stride, but this is definitely a domain where an ounce of data is worth much more than a pound of hypotheses!

Fortunately, because Rust is static (and we have things like, say, symbols and stack traces!), we can actually just use DTrace to explore this. Take this simple D script, rustprof.d:

#pragma D option quiet

profile-997hz
/pid == $target && arg1 != 0/
{
        @[usym(arg1)] = count();
}

END
{
        trunc(@, 10);
        printa("%10@d %A\n", @);
}

I ran this against two runs: one at a peak (e.g., 770,000 rectangles) and then another at the adjacent trough (e.g., 840,000 rectangles), demangling the resulting names by sending the output through rustfilt. Results for 770,000 rectangles:

# dtrace -s ./rustprof.d -c "./statemap --dry-run -c 770000 ./pg-zfs.out" | rustfilt
3943472 records processed, 769999 rectangles
      1043 statemap`<alloc::collections::btree::map::BTreeMap<K, V>>::remove
      1180 statemap`<std::collections::hash::map::DefaultHasher as core::hash::Hasher>::finish
      1253 statemap`<serde_json::read::StrRead<'a> as serde_json::read::Read<'a>>::parse_str
      1320 statemap`<std::collections::hash::map::HashMap<K, V, S>>::remove
      2558 statemap`statemap::statemap::Statemap::ingest
      4123 statemap`<std::collections::hash::map::HashMap<K, V, S>>::insert
      4503 statemap`<std::collections::hash::map::HashMap<K, V, S>>::get
     26640 statemap`alloc::collections::btree::search::search_tree

And now the same thing, but against the adjacent valley of better performance at 840,000 rectangles:

# dtrace -s ./rustprof.d -c "./statemap --dry-run -c 840000 ./pg-zfs.out" | rustfilt
3943472 records processed, 839999 rectangles
       971 statemap`<std::collections::hash::map::DefaultHasher as core::hash::Hasher>::write
      1071 statemap`<alloc::collections::btree::map::BTreeMap<K, V>>::remove
      1158 statemap`<std::collections::hash::map::DefaultHasher as core::hash::Hasher>::finish
      1348 statemap`<serde_json::read::StrRead<'a> as serde_json::read::Read<'a>>::parse_str
      2524 statemap`statemap::statemap::Statemap::ingest
      2948 statemap`<std::collections::hash::map::HashMap<K, V, S>>::insert
      4125 statemap`<std::collections::hash::map::HashMap<K, V, S>>::get
     26359 statemap`alloc::collections::btree::search::search_tree

The samples in btree::search::search_tree are roughly the same — but the poorly performing one has many more samples in HashMap<K, V, S>::insert (4123 vs. 2948). What is going on? The HashMap implementation in Rust uses Robin Hood hashing and linear probing — which means that hash maps must be resized when they hit a certain load factor. (By default, the hash map load factor is 90.9%.) And note that I am using hash maps to effectively implement a doubly linked list: I will have a number of hash maps that — between them — will contain the specified number of rectangles. Given that we only see this at particular sizes (and given that the distance between peaks increases exponentially with respect to the number of rectangles), it seems entirely plausible that at some numbers of rectangles, the hash maps will grow large enough to induce quite a bit more probing, but not quite large enough to be resized.
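To see the resize behavior directly, here is a small sketch (independent of the statemap code) that watches a HashMap's reported capacity as elements are inserted; the table grows only when the load-factor threshold is crossed, so between resizes every insert lands in an increasingly full table — which is where the extra probing cost comes from:

```rust
use std::collections::HashMap;

// Sketch: observe when the backing table actually grows. The element
// counts at which resizes occur are spaced roughly geometrically, much
// like the spacing of the performance spikes described above.
fn main() {
    let mut map: HashMap<u64, u64> = HashMap::new();
    let mut capacity = map.capacity();
    for i in 0..100_000u64 {
        map.insert(i, i);
        if map.capacity() != capacity {
            println!("resize at {} elements: {} -> {}", i + 1, capacity, map.capacity());
            capacity = map.capacity();
        }
    }
    assert_eq!(map.len(), 100_000);
}
```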

To explore this hypothesis, it would be great to vary the hash map load factor, but unfortunately the load factor isn’t currently dynamic. Failing that, we could explore it by using with_capacity to preallocate our hash maps, but the statemap code doesn’t necessarily know how much to preallocate because the rectangles themselves are spread across many hash maps.
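For reference, a minimal sketch of the preallocation approach (the capacity figure here is arbitrary, not from the statemap code): with_capacity guarantees room for at least the requested number of elements, so a correctly sized map never resizes mid-run:

```rust
use std::collections::HashMap;

// Sketch: with_capacity preallocates, so no resize occurs until the map
// outgrows the requested capacity. (770_000 is an arbitrary figure.)
fn main() {
    let mut map: HashMap<u64, u64> = HashMap::with_capacity(770_000);
    let initial = map.capacity();
    assert!(initial >= 770_000);
    for i in 0..770_000u64 {
        map.insert(i, i);
    }
    // Filling up to the requested capacity triggered no reallocation.
    assert_eq!(map.capacity(), initial);
    println!("capacity: {}", map.capacity());
}
```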

Another option is to replace our use of HashMap with a different data structure — and in particular, we can use a BTreeMap in its place. If the load factor isn’t the issue (that is, if there is something else going on for which the additional compute time in HashMap<K, V, S>::insert is merely symptomatic), we would expect a BTreeMap-based implementation to have a similar issue at the same points.

With Rust, conducting this experiment is absurdly easy:

diff --git a/src/ b/src/
index a44dc73..5b7073d 100644
--- a/src/
+++ b/src/
@@ -109,7 +109,7 @@ struct StatemapEntity {
     last: Option<u64>,                           // last start time
     start: Option<u64>,                          // current start time
     state: Option<u32>,                          // current state
-    rects: HashMap<u64, RefCell<StatemapRect>>,  // rectangles for this entity
+    rects: BTreeMap<u64, RefCell<StatemapRect>>, // rectangles for this entity

@@ -151,6 +151,7 @@ use std::str;
 use std::error::Error;
 use std::fmt;
 use std::collections::HashMap;
+use std::collections::BTreeMap;
 use std::collections::BTreeSet;
 use std::str::FromStr;
 use std::cell::RefCell;
@@ -306,7 +307,7 @@ impl StatemapEntity {
             description: None,
             last: None,
             state: None,
-            rects: HashMap::new(),
+            rects: BTreeMap::new(),
             id: id,

That’s it: because the two (by convention) have the same interface, there is nothing else that needs to be done! And the results, with the new implementation in light blue:

Our lumps are gone! In general, the BTreeMap-based implementation performs a little worse than the HashMap-based implementation, but without as much variance. Which isn’t to say that this is devoid of strange artifacts! It’s especially interesting to look at the variation at lower levels of rectangles, when the two implementations seem to alternate in the pole position:

I don’t know what happens to the BTreeMap-based implementation at about ~2,350 rectangles (where it degrades by nearly 10% but then recovers when the number of rectangles hits ~2,700 or so), but at this point, the effects are only academic for my purposes: for statemaps, the default number of rectangles is 25,000. That said, I’m sure that digging there would yield interesting discoveries!

So, where does all of this leave us? Certainly, Rust’s foundational data structures perform very well. Indeed, it might be tempting to conclude that, because a significant fraction of the delta here is the difference in data structures (i.e., BST vs. B-tree), the difference in language (i.e., C vs. Rust) doesn’t matter at all.

But that would be overlooking something important: part of the reason that using a BST (and in particular, an AVL tree) was easy for me is that we have an AVL tree implementation built as an intrusive data structure. This is a pattern we use a bunch in C: the data structure is embedded in a larger, containing structure — and it is the caller’s responsibility to allocate, free and lock this structure. That is, implementing a library as an intrusive data structure completely sidesteps both allocation and locking. This allows for an entirely robust, arbitrarily embeddable library, and it also makes it really easy for a single data structure to be in many different data structures simultaneously. For example, take ZFS’s zio structure, in which a single contiguous chunk of memory is on (at least) two different lists and three different AVL trees! (And if that leaves you wondering how anything could possibly be so complicated, see George Wilson’s recent talk explaining the ZIO pipeline.)

Implementing a B-tree this way, however, would be a mess. The value of a B-tree is in the contiguity of nodes — that is, it is the allocation that is a core part of the win of the data structure. I’m sure it isn’t impossible to implement an intrusive B-tree in C, but it would require so much more caller cooperation (and therefore a more complicated and more error-prone interface) that I do imagine that it would have you questioning life choices quite a bit along the way. (After all, a B-tree is a win — but it’s a constant-time win.)

Contrast this to Rust: intrusive data structures are possible in Rust, but they are essentially an anti-pattern. Rust really, really wants you to have complete orthogonality of purpose in your software. This leads you to having multiple disjoint data structures with very clear trees of ownership — where before you might have had a single more complicated data structure with graphs of multiple ownership. This clear separation of concerns in turn allows for these implementations to be both broadly used and carefully optimized. For an in-depth example of the artful implementation that Rust allows, see Alexis Beingessner’s excellent blog entry on the BTreeMap implementation.

All of this adds up to the existential win of Rust: powerful abstractions without sacrificing performance. Does this mean that Rust will always outperform C? No, of course not. But it does mean that you shouldn’t be surprised when it does — and that if you care about performance and you are implementing new software, it is probably past time to give Rust a very serious look!

Update: Several have asked if Clang would result in materially better performance. My apologies for not mentioning it earlier: when I did my initial analysis, I had included Clang and knew that (at the default of 25,000 rectangles) it improved things a little, but not enough to approach the performance of the Rust implementation. But for completeness’ sake, I ran the Clang-compiled binary at the same rectangle points:

Despite its improvement over GCC, I don’t think that the Clang results invalidate any of my analysis — but apologies again for not including them in the original post!

Posted on September 28, 2018 at 6:28 pm by bmc

Falling in love with Rust

Let me preface this with an apology: this is a technology love story, and as such, it’s long, rambling, sentimental and personal. Also befitting a love story, it has a When Harry Met Sally feel to it, in that its origins are inauspicious…

First encounters

Over a decade ago, I worked on a technology to which a competitor paid the highest possible compliment: they tried to implement their own knockoff. Because this was done in the open (and because it is uniquely mesmerizing to watch one’s own work mimicked), I spent way too much time following their mailing list and tracking their progress (and yes, taking an especially shameful delight in their occasional feuds). On their team, there was one technologist who was clearly exceptionally capable — and I confess to being relieved when he chose to leave the team relatively early in the project’s life. This was all in 2005; for years for me, Rust was “that thing that Graydon disappeared to go work on.” From the description as I read it at the time, Graydon’s new project seemed outrageously ambitious — and I assumed that little would ever come of it, though certainly not for lack of ability or effort…

Fast forward eight years to 2013 or so. Impressively, Graydon’s Rust was not only still alive, but it had gathered a community and was getting quite a bit of attention — enough to merit a serious look. There seemed to be some very intriguing ideas, but any budding interest that I might have had frankly withered when I learned that Rust had adopted the M:N threading model — including its more baroque consequences like segmented stacks. In my experience, every system that has adopted the M:N model has lived to regret it — and it was unfortunate to have a promising new system appear to be ignorant of the scarred shoulders that it could otherwise stand upon. For me, the implications were larger than this single decision: I was concerned that it may be indicative of a deeper malaise that would make Rust a poor fit for the infrastructure software that I like to write. So while impressed that Rust’s ambitious vision was coming to any sort of fruition at all, I decided that Rust wasn’t for me personally — and I didn’t think much more about it…

Some time later, a truly amazing thing happened: Rust ripped it out. Rust’s reasoning for removing segmented stacks is a concise but thorough damnation; their rationale for removing M:N is clear-eyed, thoughtful and reflective — but also unequivocal in its resolve. Suddenly, Rust became very interesting: all systems make mistakes, but few muster the courage to rectify them; on that basis alone, Rust became a project worthy of close attention.

So several years later, in 2015, it was with great interest that I learned that Adam started experimenting with Rust. On first read of Adam’s blog entry, I assumed he would end what appeared to be excruciating pain by deleting the Rust compiler from his computer (if not by moving to a commune in Vermont) — but Adam surprised me when he ended up being very positive about Rust, despite his rough experiences. In particular, Adam hailed the important new ideas like the ownership model — and explicitly hoped that his experience would serve as a warning to others to approach the language in a different way.

In the years since, Rust has continued to mature and my curiosity (and I daresay, that of many software engineers) has steadily intensified: the more I have discovered, the more intrigued I have become. This interest has coincided with my personal quest to find a programming language for the back half of my career: as I mentioned in my Node Summit 2017 talk on platform as a reflection of values, I have been searching for a language that reflects my personal engineering values around robustness and performance. These values reflect a deeper sense within me: that software can be permanent — that software’s unique duality as both information and machine affords a timeless perfection and utility that stand apart from other human endeavor. In this regard, I have believed (and continue to believe) that we are living in a Golden Age of software, one that will produce artifacts that will endure for generations. Of course, it can be hard to hold such heady thoughts when we seem to be up to our armpits in vendored flotsam, flooded by sloppy abstractions hastily implemented. Among current languages, only Rust seems to share this aspiration for permanence, with a perspective that is decidedly larger than itself.

Taking the plunge

So I have been actively looking for an opportunity to dive into Rust in earnest, and earlier this year, one presented itself: for a while, I have been working on a new mechanism for system visualization that I dubbed statemaps. The software for rendering statemaps needs to inhale a data stream, coalesce it down to a reasonable size, and render it as a dynamic image that can be manipulated by the user. This originally started off as being written in node.js, but performance became a problem (especially for larger data sets) and I did what we at Joyent have done in such situations: I rewrote the hot loop in C, and then dropped that into a node.js add-on (allowing the SVG-rendering code to remain in JavaScript). This was fine, but painful: the C was straightforward, but the glue code to bridge into node.js was every bit as capricious, tedious, and error-prone as it has always been. Given the performance constraint, the desire for the power of a higher level language, and the experimental nature of the software, statemaps made for an excellent candidate to reimplement in Rust; my intensifying curiosity could finally be sated!

As I set out, I had the advantage of having watched (if from afar) many others have their first encounters with Rust. And if those years of being a Rust looky-loo taught me anything, it’s that the early days can be like the first days of snowboarding or windsurfing: lots of painful falling down! So I took a deliberate approach with Rust: rather than do what one is wont to do when learning a new language and tinker a program into existence, I really sat down to learn Rust. This is frankly my bias anyway (I always look for the first principles of a creation, as explained by its creators), but with Rust, I went further: not only did I buy the canonical reference (The Rust Programming Language by Steve Klabnik, Carol Nichols and community contributors), I also bought an O’Reilly book with a bit more narrative (Programming Rust by Jim Blandy and Jason Orendorff). And with this latter book, I did something that I haven’t done since cribbing BASIC programs from Enter magazine back in the day: I typed in the example program in the introductory chapters. I found this to be very valuable: it got the fingers and the brain warmed up while still absorbing Rust’s new ideas — and debugging my inevitable transcription errors allowed me to get some understanding of what it was that I was typing. At the end was something that actually did something, and (importantly), by working with a program that was already correct, I was able to painlessly feel some of the tremendous promise of Rust.

Encouraged by these early (if gentle) experiences, I dove into my statemap rewrite. It took a little while (and yes, I had some altercations with the borrow checker!), but I’m almost shocked by how happy I am with the rewrite of statemaps in Rust. Because I know that many are in the shoes I occupied just a short while ago (namely, intensely wondering about Rust, but also wary of its learning curve — and concerned about the investment of time and energy that climbing it will necessitate), I would like to expand on some of the things that I love about Rust other than the ownership model. This isn’t because I don’t love the ownership model (I absolutely do) or that the ownership model isn’t core to Rust (it is rightfully thought of as Rust’s epicenter), but because I think its sheer magnitude sometimes dwarfs other attributes of Rust — attributes that I find very compelling! In a way, I am writing this for my past self — because if I have one regret about Rust, it’s that I didn’t see beyond the ownership model to learn it earlier.

I will discuss these attributes in roughly the order I discovered them with the (obvious?) caveat that this shouldn’t be considered authoritative; I’m still very much new to Rust, and my apologies in advance for any technical details that I get wrong!

1. Rust’s error handling is beautiful

The first thing that really struck me about Rust was its beautiful error handling — but to appreciate why it so resonated with me requires some additional context. Despite its obvious importance, error handling is something we haven’t really gotten right in systems software. For example, as Dave Pacheco observed with respect to node.js, we often conflate different kinds of errors — namely, programmatic errors (i.e., my program is broken because of a logic error) with operational errors (i.e., an error condition external to my program has occurred and it affects my operation). In C, this conflation is unusual, but you see it with the infamous SIGSEGV signal handler that has been known to sneak into more than one undergraduate project moments before a deadline to deal with an otherwise undebuggable condition. In the Java world, this is slightly more common with the (frowned upon) behavior of catching java.lang.NullPointerException or otherwise trying to drive on in light of clearly broken logic. And in the JavaScript world, this conflation is commonplace — and underlies one of the most serious objections to promises.

Beyond the ontological confusion, error handling suffers from an infamous mechanical problem: for a function that may return a value but may also fail, how is the caller to delineate the two conditions? (This is known as the semipredicate problem after a Lisp construct that suffers from it.) C handles this as it handles so many things: by leaving it to the programmer to figure out their own (bad) convention. Some use sentinel values (e.g., Linux system calls cleave the return space in two and use negative values to denote the error condition); some return defined values on success and failure and then set an orthogonal error code; and of course, some just silently eat errors entirely (or even worse).

C++ and Java (and many other languages before them) tried to solve this with the notion of exceptions. I do not like exceptions: for reasons not dissimilar to Dijkstra’s in his famous admonition against “goto”, I consider exceptions harmful. While they are perhaps convenient from a function signature perspective, exceptions allow errors to wait in ambush, deep in the tall grass of implicit dependencies. When the error strikes, higher-level software may well not know what hit it, let alone from whom — and suddenly an operational error has become a programmatic one. (Java tries to mitigate this sneak attack with checked exceptions, but while well-intentioned, they have serious flaws in practice.) In this regard, exceptions are a concrete example of trading the speed of developing software for its long-term operability. One of our deepest, most fundamental problems as a craft is that we have enshrined “velocity” above all else, willfully blinding ourselves to the long-term consequences of gimcrack software. Exceptions optimize for the developer by allowing them to pretend that errors are someone else’s problem — or perhaps that they just won’t happen at all.

Fortunately, exceptions aren’t the only way to solve this, and other languages take other approaches. Closure-heavy languages like JavaScript afford environments like node.js the luxury of passing an error as an argument — but this argument can be ignored or otherwise abused (and it’s untyped regardless), making this solution far from perfect. And Go uses its support for multiple return values to (by convention) return both a result and an error value. While this approach is certainly an improvement over C, it is also noisy, repetitive and error-prone.

By contrast, Rust takes an approach that is unique among systems-oriented languages: leveraging first algebraic data types — whereby a thing can be exactly one of an enumerated list of types and the programmer is required to be explicit about its type to manipulate it — and then combining it with its support for parameterized types. Together, this allows functions to return one thing that’s one of two types: one type that denotes success and one that denotes failure. The caller can then pattern match on the type of what has been returned: if it’s of the success type, it can get at the underlying thing (by unwrapping it), and if it’s of the error type, it can get at the underlying error and either handle it, propagate it, or improve upon it (by adding additional context) and propagating it. What it cannot do (or at least, cannot do implicitly) is simply ignore it: it has to deal with it explicitly, one way or the other. (For all of the details, see Recoverable Errors with Result.)

To make this concrete, in Rust you end up with code that looks like this:

fn do_it(filename: &str) -> Result<(), io::Error> {
    let stat = match fs::metadata(filename) {
        Ok(result) => { result },
        Err(err) => { return Err(err); }
    };

    let file = match File::open(filename) {
        Ok(result) => { result },
        Err(err) => { return Err(err); }
    };

    /* ... */
}
Already, this is pretty good: it’s cleaner and more robust than multiple return values, return sentinels and exceptions — in part because the type system helps you get this correct. But it’s also verbose, so Rust takes it one step further by introducing the propagation operator: if your function returns a Result, when you call a function that itself returns a Result, you can append a question mark on the call to the function denoting that upon Ok, the result should be unwrapped and the expression becomes the unwrapped thing — and upon Err the error should be returned (and therefore propagated). This is easier seen than explained! Using the propagation operator turns our above example into this:

fn do_it_better(filename: &str) -> Result<(), io::Error> {
    let stat = fs::metadata(filename)?;
    let file = File::open(filename)?;

    /* ... */
}

This, to me, is beautiful: it is robust; it is readable; it is not magic. And it is safe in that the compiler helps us arrive at this and then prevents us from straying from it.

Platforms reflect their values, and I daresay the propagation operator is an embodiment of Rust’s: balancing elegance and expressiveness with robustness and performance. This balance is reflected in a mantra that one hears frequently in the Rust community: “we can have nice things.” Which is to say: while historically some of these values were in tension (i.e., making software more expressive might implicitly be making it less robust or more poorly performing), through innovation Rust is finding solutions that don’t compromise one of these values for the sake of the other.

2. The macros are incredible

When I was first learning C, I was (rightly) warned against using the C preprocessor. But like many of the things that we are cautioned about in our youth, this warning was one that the wise give to the enthusiastic to prevent injury; the truth is far more subtle. And indeed, as I came of age as a C programmer, I not only came to use the preprocessor, but to rely upon it. Yes, it needed to be used carefully — but in the right hands it could generate cleaner, better code. (Indeed, the preprocessor is very core to the way we implement DTrace’s statically defined tracing.) So if anything, my problems with the preprocessor were not its dangers so much as its many limitations: because it is, in fact, a preprocessor and not built into the language, there were all sorts of things that it would never be able to do — like access the abstract syntax tree.

With Rust, I have been delighted by its support for hygienic macros. This not only solves the many safety problems with preprocessor-based macros, it allows them to be outrageously powerful: with access to the AST, macros are afforded an almost limitless expansion of the syntax — but invoked with an indicator (a trailing bang) that makes it clear to the programmer when they are using a macro. For example, one of the fully worked examples in Programming Rust is a json! macro that allows for JSON to be easily declared in Rust. This gets to the ergonomics of Rust, and there are many macros (e.g., format!, vec!, etc.) that make Rust more pleasant to use.
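As a small taste, here is a hypothetical pairs! macro (my own illustration, not from any real crate) in the style of vec! — note the trailing bang at the call site, and that the macro’s internal binding cannot collide with the caller’s names thanks to hygiene:

```rust
// A vec!-style declarative macro: builds a Vec of (key, value) tuples
// from `k => v` pairs, with an optional trailing comma.
macro_rules! pairs {
    ($($k:expr => $v:expr),* $(,)?) => {{
        let mut v = Vec::new();
        $( v.push(($k, $v)); )*
        v
    }};
}

fn main() {
    let v = pairs!["a" => 1, "b" => 2];
    assert_eq!(v, vec![("a", 1), ("b", 2)]);
    println!("{:?}", v);
}
```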

Another advantage of macros: they are so flexible and powerful that they allow for effective experimentation. For example, the propagation operator that I love so much actually started life as a try! macro; that this macro was being used ubiquitously (and successfully) allowed a language-based solution to be considered. Languages can be (and have been!) ruined by too much experimentation happening in the language rather than in how it’s used; through its rich macros, it seems that Rust can enable the core of the language to remain smaller — and to make sure that when it expands, it is for the right reasons and in the right way.

3. format! is a pleasure

Okay, this is a small one but it’s (another) one of those little pleasantries that has made Rust really enjoyable. Many (most? all?) languages have an approximation or equivalent of the venerable sprintf, whereby variable input is formatted according to a format string. Rust’s variant of this is the format! macro (which is in turn invoked by println!, panic!, etc.), and (in keeping with one of the broader themes of Rust) it feels like it has learned from much that came before it. It is type-safe (of course) but it is also clean in that the {} format specifier can be used on any type that implements the Display trait. I also love that the {:?} format specifier denotes that the argument’s Debug trait implementation should be invoked to print debug output. More generally, all of the format specifiers map to particular traits, allowing for an elegant approach to an historically grotty problem. There are a bunch of other niceties, and it’s all a concrete example of how Rust uses macros to deliver nice things without sullying syntax or otherwise special-casing. None of the formatting capabilities are unique to Rust, but that’s the point: in this (small) domain (as in many) Rust feels like a distillation of the best work that came before it. As anyone who has had to endure one of my talks can attest, I believe that appreciating history is essential both to understand our present and to map our future. Rust seems to have that perspective in the best ways: it is reverential of the past without being incarcerated by it.
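For instance — with a made-up Temp type of my own for illustration — implementing Display is all it takes for the {} specifier to work on your own type:

```rust
use std::fmt;

// A hypothetical newtype around a temperature in Celsius.
struct Temp(f64);

// Implementing Display lets {} format our type; {:?} would separately
// require a Debug implementation (often just #[derive(Debug)]).
impl fmt::Display for Temp {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:.1}°C", self.0)
    }
}

fn main() {
    let s = format!("water boils at {}", Temp(100.0));
    assert_eq!(s, "water boils at 100.0°C");
    println!("{}", s);
}
```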

4. include_str! is a godsend

One of the filthy aspects of the statemap code is that it is effectively encapsulating another program — a JavaScript program that lives in the SVG to allow for the interactivity of the statemap. This code lives in its own file, which the statemap code should pass through to the generated SVG. In the node.js/C hybrid, I am forced to locate the file in the filesystem — which is annoying because it has to be delivered along with the binary and located, etc. Now Rust — like many languages (including ES6) — has support for raw-string literals. As an aside, it’s interesting to see the discussion leading up to its addition, and in particular, how a group of people really looked at every language that does this to see what should be mimicked versus what could be improved upon. I really like the syntax that Rust converged on: r followed by one or more octothorpes followed by a quote to begin a raw string literal, and a quote followed by a matching number of octothorpes to end it, e.g.:

    let str = r##""What a curious feeling!" said Alice"##;

This alone would have allowed me to do what I want, but it would still be a tad gross in that it’s a bunch of JavaScript living inside a raw literal in a .rs file. Enter include_str!, which allows me to tell the compiler to find the specified file in the filesystem during compilation, and statically drop it into a string variable that I can manipulate:

        /*
         * Now drop in our in-SVG code.
         */
        let lib = include_str!("statemap-svg.js");

So nice! Over the years I have wanted this many times over for my C, and it’s another one of those little (but significant!) things that make Rust so refreshing.

5. Serde is stunningly good

Serde is a Rust crate that allows for serialization and deserialization, and it’s just exceptionally good. It uses macros (and, in particular, Rust’s procedural macros) to generate structure-specific routines for serialization and deserialization. As a result, Serde requires remarkably little programmer lift to use and performs eye-wateringly well — a concrete embodiment of Rust’s repeated defiance of the conventional wisdom that programmers must choose between abstractions and performance!

For example, in the statemap implementation, the input is concatenated JSON that begins with a metadata payload. To read this payload in Rust, I define the structure, and denote that I wish to derive the Deserialize trait as implemented by Serde:

#[derive(Deserialize, Debug)]
struct StatemapInputMetadata {
    start: Vec<u64>,
    title: String,
    host: Option<String>,
    entityKind: Option<String>,
    states: HashMap<String, StatemapInputState>,
}

Then, to actually parse it:

     let metadata: StatemapInputMetadata = serde_json::from_str(payload)?;

That’s… it. Thanks to the magic of the propagation operator, the errors are properly handled and propagated — and it has handled tedious, error-prone things for me like the optionality of certain members (itself beautifully expressed via Rust’s ubiquitous Option type). With this one line of code, I now (robustly) have a StatemapInputMetadata instance that I can use and operate upon — and this performs incredibly well on top of it all. In this regard, Serde represents the best of software: it is a sophisticated, intricate implementation making available elegant, robust, high-performing abstractions; as legendary White Sox play-by-play announcer Hawk Harrelson might say, MERCY!

6. I love tuples

In my C, I have been known to declare anonymous structures in functions. More generally, in any strongly typed language, there are plenty of times when you don’t want to have to fill out paperwork to be able to structure your data: you just want a tad more structure for a small job. For this, Rust borrows an age-old construct from ML in tuples. Tuples are expressed as a parenthetical list, and they basically work as you expect them to work in that they are static in size and type, and you can index into any member. For example, in some test code that needs to make sure that names for colors are correctly interpreted, I have this:

        let colors = vec![
            ("aliceblue", (240, 248, 255)),
            ("antiquewhite", (250, 235, 215)),
            ("aqua", (0, 255, 255)),
            ("aquamarine", (127, 255, 212)),
            ("azure", (240, 255, 255)),
            /* ... */
        ];

Then colors[2].0 (say) will be the string “aqua”, and (colors[1].1).2 will be the integer 215. Don’t let the absence of a type declaration in the above deceive you: tuples are strongly typed; it’s just that Rust is inferring the type for me. So if I accidentally try to (say) add an element to the above vector that contains a tuple of mismatched signature (e.g., the tuple “((188, 143, 143), ("rosybrown"))“, which has the order reversed), Rust will give me a compile-time error.

The full integration of tuples makes them a joy to use. For example, if a function returns a tuple, you can easily assign its constituent parts to disjoint variables, e.g.:

fn get_coord() -> (u32, u32) {
    (1, 2)
}

fn do_some_work() {
    let (x, y) = get_coord();
    /* x has the value 1, y has the value 2 */
}

Great stuff!

7. The integrated testing is terrific

One of my regrets on DTrace is that we didn’t start on the DTrace test suite at the same time we started the project. And even after we started building it (too late, but blessedly before we shipped it), it still lived away from the source for several years. And even now, it’s a bit of a pain to run — you really need to know it’s there.

This represents everything that’s wrong with testing in C: because it requires bespoke machinery, too many people don’t bother — even when they know better! Viz.: in the original statemap implementation, there is zero testing code — and not because I don’t believe in it, but just because it was too much work for something relatively small. Yes, there are plenty of testing frameworks for C and C++, but in my experience, the integrated frameworks are too constrictive — and again, not worth it for a smaller project.

With the rise of test-driven development, many languages have taken a more integrated approach to testing. For example, Go has a rightfully lauded testing framework, Python has unittest, etc. Rust takes a highly integrated approach that combines the best of all worlds: test code lives alongside the code that it’s testing — but without having to make the code bend to a heavyweight framework. The workhorses here are conditional compilation and Cargo, which together make it so easy to write tests and run them that I found myself doing true test-driven development with statemaps — namely writing the tests as I develop the code.
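To make the mechanics concrete, here is a sketch of what this looks like (the rgb_to_hex helper is hypothetical, not the actual statemap code): the tests sit in the same file as the code under test, guarded by #[cfg(test)] so they cost nothing in a normal build, and cargo test finds and runs them.

```rust
// A hypothetical helper of the kind a statemap needs.
fn rgb_to_hex(rgb: (u8, u8, u8)) -> String {
    format!("#{:02x}{:02x}{:02x}", rgb.0, rgb.1, rgb.2)
}

fn main() {
    println!("{}", rgb_to_hex((240, 248, 255))); // aliceblue: #f0f8ff
}

// Compiled only when testing: `cargo test` builds and runs this
// module; `cargo build` omits it entirely.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn hex_of_aliceblue() {
        assert_eq!(rgb_to_hex((240, 248, 255)), "#f0f8ff");
    }
}
```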

8. The community is amazing

In my experience, the best communities are ones that are inclusive in their membership but resolute in their shared values. When communities aren’t inclusive, they stagnate, or rot (or worse); when communities don’t share values, they feud and fracture. This can be a very tricky balance, especially when so many open source projects start out as the work of a single individual: it’s very hard for a community not to reflect the idiosyncrasies of its founder. This is important because in the open source era, community is critical: one is selecting a community as much as one is selecting a technology, as each informs the future of the other. One factor that I value a bit less is strictly size: some of my favorite communities are small ones — and some of my least favorite are huge.

For purposes of a community, Rust has a luxury of clearly articulated, broadly shared values that are featured prominently and reiterated frequently. If you head to the Rust website this is the first sentence you’ll read:

Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.

That gets right to it: it says that as a community, we value performance and robustness — and we believe that we shouldn’t have to choose between these two. (And we have seen that this isn’t mere rhetoric, as so many Rust decisions show that these values are truly the lodestar of the project.)

And with respect to inclusiveness, it is revealing that you will likely read that statement of values in your native tongue, as the Rust web page has been translated into thirteen languages. Just the fact that it has been translated into so many languages makes Rust nearly unique among its peers. But perhaps more interesting is where this globally inclusive view likely finds its roots: among the sites of its peers, only Ruby is similarly localized. Given that several prominent Rustaceans like Steve Klabnik and Carol Nichols came from the Ruby community, it would not be unreasonable to guess that they brought this globally inclusive view with them. This kind of inclusion is one that one sees again and again in the Rust community: different perspectives from different languages and different backgrounds. Those who come to Rust bring with them their experiences — good and bad — from the old country, and the result is a melting pot of ideas. This is an inclusiveness that runs deep: by welcoming such disparate perspectives into a community and then uniting them with shared values and a common purpose, Rust achieves a rich and productive heterogeneity of thought. That is, because the community agrees about the big things (namely, its fundamental values), it has room to constructively disagree (that is, achieve consensus) on the smaller ones.

Which isn’t to say this is easy! Check out Ashley Williams in the opening keynote from RustConf 2018 for how exhausting it can be to hash through these smaller differences in practice. Rust has taken a harder path than the “traditional” BDFL model, but it’s a qualitatively better one — and I believe that many of the things that I love about Rust are a reflection of (and a tribute to) its robust community.

9. The performance rips

Finally, we come to the last thing I discovered in my Rust odyssey — but in many ways, the most important one. As I described in an internal presentation, I had experienced some frustrations trying to implement in Rust the same structure I had had in C. So I mentally gave up on performance, resolving to just get something working first, and then optimize it later.

I did get it working, and was able to benchmark it, but to give some context for the numbers, here is the time to generate a statemap in the old (slow) pure node.js implementation for a modest trace (229M, ~3.9M state transitions) on my 2.9 GHz Core i7 laptop:

% time ./statemap-js/bin/statemap ./pg-zfs.out > js.svg

real	1m23.092s
user	1m21.106s
sys	0m1.871s

This is bad — and larger input will cause it to just run out of memory. And here’s the version reimplemented as a C/node.js hybrid:

% time ./statemap-c/bin/statemap ./pg-zfs.out > c.svg

real	0m11.800s
user	0m11.414s
sys	0m0.330s

This was (as designed) a 10X improvement in performance, and represents speed-of-light numbers in that this seems to be an optimal implementation. Because I had written my Rust naively (and my C carefully), my hope was that the Rust would be no more than 20% slower — but I was braced for pretty much anything. Or at least, I thought I was; I was actually genuinely taken aback by the results:

$ time ./statemap ./pg-zfs.out > rs.svg
3943472 records processed, 24999 rectangles

real	0m8.072s
user	0m7.828s
sys	0m0.186s

Yes, you read that correctly: my naive Rust was ~32% faster than my carefully implemented C. This blew me away, and in the time since, I have spent some time on a real lab machine running SmartOS (where I have reproduced these results and been able to study them a bit). My findings are going to have to wait for another blog entry, but suffice it to say that despite executing a shockingly similar number of instructions, the Rust implementation has a different load/store mix (it is much more store-heavy than C) — and is much better behaved with respect to the cache. Given the degree that Rust passes by value, this makes some sense, but much more study is merited.

It’s also worth mentioning that there are some easy wins that will make the Rust implementation even faster: after I had publicized the fact that I had a Rust implementation of statemaps working, I was delighted when David Tolnay, one of the authors of Serde, took the time to make some excellent suggestions for improvement. For a newcomer like me, it’s a great feeling to have someone with such deep expertise as David’s take an interest in helping me make my software perform even better — and it is revealing as to the core values of the community.

Rust’s shockingly good performance — and the community’s desire to make it even better — fundamentally changed my disposition towards it: instead of seeing Rust as a language to augment C and replace dynamic languages, I’m looking at it as a language to replace both C and dynamic languages in all but the very lowest layers of the stack. C — like assembly — will continue to have a very important place for me, but it’s hard to not see that place as getting much smaller relative to the barnstorming performance of Rust!

Beyond the first impressions

I wouldn’t want to imply that this is an exhaustive list of everything that I have fallen in love with about Rust. That list is much longer and would include at least the ownership model; the trait system; Cargo; the type inference system. And I feel like I have just scratched the surface; I haven’t waded into known strengths of Rust like the FFI and the concurrency model! (Despite having written plenty of multithreaded code in my life, I haven’t so much as created a thread in Rust!)

Building a future

I can say with confidence that my future is in Rust. As I have spent my career doing OS kernel development, a natural question would be: do I intend to rewrite the OS kernel in Rust? In a word, no. To understand my reluctance, take some of my most recent experience: this blog entry was delayed because I needed to debug (and fix) a nasty problem with our implementation of the Linux ABI. As it turns out, Linux and SmartOS make slightly different guarantees with respect to the interaction of vfork and signals, and our code was fatally failing on a condition that should be impossible. Any old Unix hand (or quick study!) will tell you that vfork and signal disposition are each semantic superfund sites in their own right — and that their horrific (and ill-defined) confluence can only be unimaginably toxic. But the real problem is that actual software implicitly depends on these semantics — and any operating system that is going to want to run existing software will itself have to mimic them. You don’t want to write this code, because no one wants to write this code.

Now, one option (which I honor!) is to rewrite the OS from scratch, as if legacy applications essentially didn’t exist. While there is a tremendous amount of good that can come out of this (and it can find many use cases), it’s not a fit for me personally.

So while I may not want to rewrite the OS kernel in Rust, I do think that Rust is an excellent fit for much of the broader system. For example, at the recent OpenZFS Developers Summit, Matt Ahrens and I were noodling the notion of user-level components for ZFS in Rust. Specifically: zdb is badly in need of a rewrite — and Rust would make an excellent candidate for it. There are many such examples spread throughout ZFS and the broader system, including a few in kernel. Might we want to have a device driver model that allows for Rust drivers? Maybe! (And certainly, it’s technically possible.) In any case, you can count on a lot more Rust from me and into the indefinite future — whether in the OS, near the OS, or above the OS.

Taking your own plunge

I wrote all of this up not only to explain why I took the plunge, but also to encourage others to take their own. If you were as I was and are contemplating diving into Rust, a couple of pieces of advice, for whatever they’re worth:

I’m sure that there’s a bunch of stuff that I missed; if there’s a particular resource that you found useful when learning Rust, message me or leave a comment here and I’ll add it.

Let me close by offering a sincere thanks to those in the Rust community who have been working so long to develop such a terrific piece of software — and especially those who have worked so patiently to explain their work to us newcomers. You should be proud of what you’ve accomplished, both in terms of a revolutionary technology and a welcoming community — thank you for inspiring so many of us about what infrastructure software can become, and I look forward to many years of implementing in Rust!

Posted on September 18, 2018 at 3:31 pm by bmc

Talks I have given (including bonus tracks!)

Increasingly, some people have expressed the strange urge to binge-watch my presentations. This potentially self-destructive behavior seems likely to have unwanted side-effects like spontaneous righteous indignation, superfluous historical metaphor, and near-lethal exposure to tangential anecdote — and yet I find myself compelled to enable it by collecting my erstwhile scattered talks. While this blog entry won’t link to every talk I’ve ever given, there should be enough here to make anyone blotto. That said, should you find yourself thirsty for more, check out the bonus tracks that consist of recorded conversations that I have had over the last few years.

To accommodate the more recreational watcher as well as the hardened addict, I have also broken my talks up into a series of trilogies, with each following a particular subject area or theme. In the future, as I give talks and they become available, I will update this blog entry. And if you find that a link here is dead, please let me know!

Before we get to the list: if you only watch one talk of mine, please watch Principles of Technology Leadership (slides) presented at Monktoberfest 2017. This is the only talk that I have asked family and friends to watch, as it represents my truest self — or what I aspire that self to be, anyway.

The talks

Talks I have given, in reverse chronological order:

Trilogies of talks

As with anyone, there are themes that run through my career. While I don’t necessarily give talks in explicit groups of three, looking back on my talks I can see some natural groupings that make for related sequences of talks.

The Software Values Trilogy

In late 2016 and through 2017, it felt like fundamental values like decency and integrity were under attack; it seems appropriate that these three talks were born during this turbulent time:

The Debugging Trilogy

While certainly not the only three talks I’ve given on debugging, these three talks present a sequence on aspects of debugging that we don’t talk about as much:

The Beloved Trilogy

A common theme across my Papers We Love and Systems We Love talks is (obviously?) an underlying love for the technology. These three talks represent a trilogy of beloved aspects of the system that I have spent two decades in:

The Open Source Trilogy

While my career started developing proprietary software, I am blessed that most of it has been spent in open source. This trilogy reflects on my experiences in open source, from the dual perspective of both a commercial entity and as an individual contributor:

The Container Trilogy

I have given many (too many!) talks on containers and containerization, but these three form a reasonable series (with hopefully not too much overlap!):

The DTrace Trilogy

Another area where I have given many more than three talks, but these three form a reasonable narrative:

The Surge Lightning Trilogy

For its six year run, Surge was a singular conference — and the lightning talks were always a highlight. My lightning talks were not deliberately about archaic Unixisms; it just always seemed to work out that way — an accidental narrative arc across several years.

Bonus tracks

If you’ve somehow been left looking for more (!), here are (again, in reverse chronological order) various conversations that I’ve had in various places. This list isn’t exhaustive, so if I’m missing a favorite, please let me know!

Posted on February 3, 2018 at 9:43 pm by bmc

The sudden death and eternal life of Solaris

As had been rumored for a while, Oracle effectively killed Solaris on Friday. When I first saw this, I had assumed that this was merely a deep cut, but in talking to Solaris engineers still at Oracle, it is clearly much more than that. It is a cut so deep as to be fatal: the core Solaris engineering organization lost on the order of 90% of its people, including essentially all management.

Of note, among the engineers I have spoken with, I heard two things repeatedly: “this is the end” and (from those who managed to survive Friday) “I wish I had been laid off.” Gone is any of the optimism (however tepid) that I have heard over the years — and embarrassed apologies for Oracle’s behavior have been replaced with dismay about the clumsiness, ineptitude and callousness with which this final cut was handled. In particular, that employees who had given their careers to the company were told of their termination via a pre-recorded call — “robo-RIF’d” in the words of one employee — is both despicable and cowardly. To their credit, the engineers affected saw themselves as Sun to the end: they stayed to solve hard, interesting problems and out of allegiance to one another — not out of any loyalty to the broader Oracle. Oracle didn’t deserve them and now it doesn’t have them — they have been liberated, if in a depraved act of corporate violence.

Assuming that this is indeed the end of Solaris (and it certainly looks that way), it offers a time for reflection. Certainly, the demise of Solaris is at one level not surprising, but on the other hand, its very suddenness highlights the degree to which proprietary software can suffer by the vicissitudes of corporate capriciousness. Vulnerable to executive whims, shareholder demands, and a fickle public, organizations can simply change direction by fiat. And because — in the words of the late, great Roger Faulkner — “it is easier to destroy than to create,” these changes in direction can have lasting effect when they mean stopping (or even suspending!) work on a project. Indeed, any engineer in any domain with sufficient longevity will have one (or many!) stories of exciting projects being cancelled by foolhardy and myopic management. For software, though, these cancellations can be particularly gutting because (in the proprietary world, anyway) so many of the details of software are carefully hidden from the users of the product — and much of the innovation of a cancelled software project will likely die with the project, living only in the oral tradition of the engineers who knew it. Worse, in the long run — to paraphrase Keynes — proprietary software projects are all dead. However ubiquitous at their height, this lonely fate awaits all proprietary software.

There is, of course, another way — and befitting its idiosyncratic life and death, Solaris shows us this path too: software can be open source. In stark contrast to proprietary software, open source does not — cannot, even — die. Yes, it can be disused or rusty or fusty, but as long as anyone is interested in it at all, it lives and breathes. Even should the interest wane to nothing, open source software survives still: its life as machine may be suspended, but it becomes as literature, waiting to be discovered by a future generation. That is, while proprietary software can die in an instant, open source software perpetually endures by its nature — and thrives by the strength of its communities. Just as the existence of proprietary software can be surprisingly brittle, open source communities can be crazily robust: they can survive neglect, derision, dissent — even sabotage.

In this regard, I speak from experience: from when Solaris was open sourced in 2005, the OpenSolaris community survived all of these things. By the time Oracle bought Sun five years later in 2010, the community had decided that it needed true independence — illumos was born. And, it turns out, illumos was born at exactly the right moment: shortly after illumos was announced, Oracle — in what remains to me a singularly loathsome and cowardly act — silently re-proprietarized Solaris on August 13, 2010. We in illumos were indisputably on our own, and while many outsiders gave us no chance of survival, we ourselves had reason for confidence: after all, open source communities are robust because they are often united not only by circumstance, but by values, and in our case, we as a community never lost our belief in ZFS, Zones, DTrace and myriad other technologies like MDB, FMA and Crossbow.

Indeed, since 2010, illumos has thrived; illumos is not only the repository of record for technologies that have become cross-platform like OpenZFS, but we have also advanced our core technologies considerably, while still maintaining highest standards of quality. Learning some of the mistakes of OpenSolaris, we have a model that allows for downstream innovation, experimentation and differentiation. For example, Joyent’s SmartOS has always been focused on our need for a cloud hypervisor (causing us to develop big features like hardware virtualization and Linux binary compatibility), and it is now at the heart of a massive buildout for Samsung (who acquired Joyent a little over a year ago). For us at Joyent, the Solaris/illumos/SmartOS saga has been formative in that we have seen both the ill effects of proprietary software and the amazing resilience of open source software — and it very much informed our decision to open source our entire stack in 2014.

Judging merely by its tombstone, the life of Solaris can be viewed as tragic: born out of wedlock between Sun and AT&T and dying at the hands of a remorseless corporate sociopath a quarter century later. And even that may be overstating its longevity: Solaris may not have been truly born until it was made open source, and — certainly to me, anyway — it died the moment it was again made proprietary. But in that shorter life, Solaris achieved the singular: immortality for its revolutionary technologies. So while we can mourn the loss of the proprietary embodiment of Solaris (and we can certainly lament the coarse way in which its technologists were treated!), we can rejoice in the eternal life of its technologies — in illumos and beyond!

Posted on September 4, 2017 at 12:30 pm by bmc

Reflections on Systems We Love

Last Tuesday, several months of preparation came to fruition in the inaugural Systems We Love. You never know what’s going to happen the first time you get a new kind of conference together (especially one as broad as this one!) but it was, in a word, amazing. The content was absolutely outstanding, with attendee after attendee praising the uniformly high quality. (For guided tours, check out both Ozan Onay’s excellent exegesis and David Cassel’s thorough New Stack story — and don’t miss Sarah Huffman’s incredible illustrations!) It was such a great conference that many were asking about when we would do it again — and there is already interest in replicating it elsewhere. As an engineer, this makes me slightly nervous as I believe that success often teaches you nothing: luck becomes difficult to differentiate from design. But at the risk of taunting the conference gods with the arrogance of a puny mortal, here’s some stuff I do think we did right:

Okay, so that’s a pretty long list of things that worked; what didn’t work so well? I would say that there was basically only a single issue: the packed schedule. We had 19 (!!) 20-minute talks, and there simply wasn’t time for the length or quantity of breaks that one might like. I think it worked out better than it sounds like it would (thanks to our excellent and varied presenters!), but it was nonetheless exhausting and I think everyone would have appreciated at least one more break. Still, there were essentially no complaints about the number of presentations, so we wouldn’t want to overshoot by slimming down too much; perhaps the optimal number is 16 talks spread over four sessions of four talks apiece?

So where to go from here? We know now that there is a ton of demand and a bunch of great content to match (I’m still bummed about the terrific submissions we turned away!), so we know that we can (and will) easily have this be an annual event. But it seems like we can do more: maybe an event on the east coast? Perhaps one in Europe? Maybe as a series of meetups in the style of Papers We Love? There are a lot of possibilities, so please let us know what you’d like to see!

Finally, I would like to reflect on the most personally satisfying bit of Systems We Love: simply by bringing so many like-minded people together in the same room and having them get to know one another, we know that lives have been changed; new connections have been made, new opportunities have been found, and new journeys have begun. We knew that this would happen in the abstract, but in recent days, we have seen it ourselves: in the new year, you will see new faces on the Joyent engineering team that we met at Systems We Love. (If it needs to be said, the love of systems is a unifying force across Joyent; if you find yourself captivated by the content and you’re contemplating a career change, we’re hiring!) Like most (if not all) of us, the direction of my life has been significantly changed by meeting or hearing the right person at the right moment; that we have helped facilitate similar changes in our own small way is intensely gratifying — and is very much at the heart of what Systems We Love is about!

Posted on December 21, 2016 at 12:16 pm by bmc