Below All the Turtles

Fin

Note: This is the least useful article I’ve ever written.  If you haven’t read my technical articles, do yourself a favour: skip this and read those other articles instead.  There are 27,492 vaguely (or not so vaguely) angsty, bitter, pseudo-introspective articles published to Hacker News every day; there’s barely one marginally useful technical article a week.  I’ve just linked you a month’s worth and change.  Trust me on this.  But of course, you won’t, which in part is why we’re here.  On with it, then.

After 2 moderately successful careers spanning over 17 years, I’m retiring today.

Retirement does not mean idleness; my small ranch in North Idaho needs enough work to keep me busy until even the Social Security Administration, in its wisdom, thinks it fit that I retire (which statistically speaking would offer approximately 3 weeks in which to move to Arizona and learn to hate golf before beginning one’s terminal career as a font of Medicare reimbursement chits).  There are ditches to dig, outbuildings to erect, trees to fell, slash to burn, livestock to husband, fences to string, game to hunt, and garden plots to dig, plant, and harvest, to say nothing of the ordinary joys of home ownership and the added labours of independent off-grid living.  In the unlikely event that I ever “finish” all of that, there is no end of work to be done restoring, conserving, and improving open-pollinated plant varieties and heritage livestock breeds, the shockingly depleted remnants of our 12,000-year-old agricultural endowment and the alternative to patented genetically-modified monocultures.  I look forward to doing my part for those efforts, and I will not be bored.  There’s plenty more to be said about the positive reasons I’m hanging it up and what’s next, but they’re off-topic in this venue and for this audience, so I’ll stop there.  If you’re interested — and if you consume food or water, or know someone who does, you should be — there is no shortage of resources out there with which to educate yourself on these topics.

Not Why I’m Retiring

I did not win the lottery, literally or figuratively.  While my first career did include the tail end of the dot-com boom (the VERY tail end: the first post-layoff all-hands I attended was four months after I was hired out of college; 11 more followed in a crushing depression that made 2008 look like the go-go 80s), I’m not retiring because I struck it rich.  None of the companies I’ve ever worked for has made the kind of splashy exit that has the Facebook kids buying up Caribbean islands on which to store their less-favoured Lotuses.  The first two have vanished into the dustbin of history, the third was sold to its customers for peanuts (then apparently sold again), the fourth was sold by raging assholes to another raging asshole for an insignificant fraction of its worth, and the fifth is about to be taken private for considerably less than the market thought it was worth when I joined for a cup of coffee.  The jury’s still out on Joyent, but anyone who figures to retire on their winnings from the tech industry options lotto may as well just buy tickets in the real one and enjoy the advantage of being fully vested upon leaving the bodega.  That’s never been me.  I’m thankful that as a practitioner of one of the very few remaining trades in the industrialised world to offer a degree of economic independence our grandparents would have called “middle-class”, I had the opportunity to choose this path and make it a reality.  Beyond that, luck played no role.

Why I’m Retiring

It’s important not to understate the positives; I’ve achieved a major goal years in the planning and couldn’t be more excited by what’s next.  At the same time, I don’t think bucolic country living is terribly interesting to most of the admittedly small potential audience at this venue.  So instead I’ll talk about the problems with the industry and trade I’m leaving behind.  Maybe, with some luck, a few of you will drain this festering cesspool and rebuild on it something worth caring about.  Until then, it’s safe to say I’m running away from it as much as toward something else.  Most big moves are like that; it’s never one thing.

At the most basic level, the entire industry (like, I suspect, most others) plays to employees’ desire to keep their jobs.  Whether the existence of those jobs and the manner in which they are presently filled are in the best interests of the stockholders is irrelevant.  A very cozy industry has resulted.  Real change is slow, creative destruction slower still.  Artificial network effects arise, in which employers want people experienced with the currently perceived “winning” technology and job-seekers want to list the currently perceived “winning” technology among their experience.  Whether that technology is actually superior or even appropriate to the task at hand is at most an afterthought.  Vendors line up behind whatever technology or company the hype wheel of fortune has landed on that year.  Products are built with obvious and easily corrected shortcomings that exist to preserve “partnerships” with other vendors whose products exist solely to remedy them.  Backs are scratched all around.

Worse, from the perspective of someone who appreciates downstack problems and the value created by solving them, is that the de facto standard technology stack is ossifying upward.  As one looks at each progressively lower layer, the industry’s collective willingness to contemplate, much less sponsor, work at that layer diminishes.  Interest among potential partners in Dogpatch was nonexistent.  Joyent is choosing not to pursue it in favour of moving upstack.  There is no real interest anywhere in alternatives to the PC, even though the basic architecture was developed for use cases that look nothing like the ones that are most relevant today.  Every server still ships with enormous blobs of firmware so deeply proprietary that even the specifications required to write it come from Intel in binders with red covers reminding you that absolutely no disclosure, implicit or explicit, of the contents is permitted to anyone, ever.  So much for the Open Source revolution.  Every server still comes with a VGA port, and in most cases still has tasks that cannot be performed without using it.  No “Unix server” vendor would even have tried to sell such a machine 20 years ago.  The few timid efforts to improve hardware, such as OpenCompute, do nothing to address any of these problems, declining to address hardware architecture or firmware at all other than to specify that the existing de facto standards be followed.  At most, OpenCompute is a form factor play; it’s about bending sheet metal for slightly greater density, not making a better system from a management, scalability, or software interface perspective.  Meanwhile the most trumpeted “advance” of the last 10 years at the bottom of the stack is UEFI, which replaces parts of the system firmware… with a slightly modernised version of MS-DOS.  It’s painfully obvious that the sole purpose of UEFI is to enable Microsoft to continue collecting royalties on every computer sold, a brilliant move on their part given the steady decline of Windows, but an abomination for everyone else.  UEFI solves no problems for the operator, customer, or OS vendor.  If anything, it creates more of them.  There’s a better way to do this, but my central observation is that the solutions that would be better for everyone else are not those that would be best for the vendors: AMI, Microsoft, and Intel are quite happy with their cozy little proprietary royalty machine and have no incentive to engineer, or even enable others to engineer, anything better.  The bottom of the stack is designed to serve vendors, not customers.

The state of reality at the OS layer is almost as bad.  World-class jackass Rob Pike insists that OS research is dead.  As an observation, it’s deliberately ignorant; as an assertion, it’s abominable.  Mr. Pike and his ilk seem to believe that the way to improve the operating system is to throw it away and write a new one based on aesthetic appraisal of the previously-completed one.  He spent years dithering around with an academic second system while the rest of the world went out and ran Unix in production and learned the lessons required to build a better system (be it a better Unix or something else entirely).  Unsurprisingly, Plan9 is a colossal failure, and one look at the golang toolchain will satisfy even the casual observer as to why; it was clearly written by people who wasted the last 30 years.  There are plenty of other problems with Mr. Pike and his language runtime, but I’ve covered those elsewhere.  The salient aspect is that the industry, collectively, is excited about his work but not about anything I consider useful, well-done, or worthwhile.  Looking at the development of operating systems that have learned the lessons of life in production, what’s really going on?  More than you think, but not where you’re looking for it.  The overwhelmingly dominant operating systems in use today are GNU/Linux (servers and mobile devices) and Microsoft Windows (all the legacy systems that aren’t mainframes).  Neither has any of the features you need to manage large-scale deployments, and no one is doing anything about that.  Microsoft, to its credit, has found a cash cow and clearly intends to milk it dry.  There’s no shame in that, but it hardly advances the state of the art; their OS is actually well-suited to its niche on corporate desktops and doesn’t need to advance.  GNU/Linux has been written by people with no sense of smell, no concept of architecture, and no ability to advance any large-scale piece of work.  When they do try something big, it’s never the right thing; instead, it’s systemd, written by another world-class jackass.  And even if systemd were properly scoped and executed well, it would be only a modest improvement on SMF… which has been running in production on illumos for 10 years.  GNU/Linux is definitely not the place to look for exciting OS work.

The need for OS development and innovation is extreme.  A great deal of interesting work is still being done in the illumos community, and elsewhere as well in several instances.  But there seems to be little interest in using anything that’s not GNU/Linux.  We come back to the basic principle driving everything: people only care about keeping their jobs.  GNU/Linux is a trash fire, but it’s the de facto standard trash fire, just like Microsoft was in the 90s and IBM in the 70s.  If you choose it and your project fails, it must have been the application developers’ fault; if you choose illumos and your project fails, it must be the OS’s — and thus YOUR — fault.  Never mind that illumos is better.  Never mind that it offers you better ways to debug, observe, and improve your (let’s face it, buggy as sin) application, or that it is designed and built for data centre deployment rather than desktop use.  Your application developers won’t bother learning how to do any of that anyway, and they in turn probably resent the very presence of better tooling.  After all, if they can’t explain why their software doesn’t work, it’s much better for them (they want to keep their jobs too) if they can blame shoddy or missing tooling.  Considering how thin the qualifications to be an application developer are today, I can’t say I blame them.  Most people in that role are hopelessly out of their depth, and better tooling only helps people who understand how computers work to begin with.

The net result of all this is that we have data centres occupying many hectares, filled with computers that are architecturally identical to a Packard Bell 486 desktop running MS-DOS long enough to boot a crippled and amateurish clone of Unix circa 1987, and an endless array of complex, kludged-up hardware and software components intended to hide all of that.  I would ask for a show of hands from those who consider this progress, but too many are happily riding the gravy train this abomination enables.

Where Have All the Systems Companies Gone?

When I joined Joyent nearly 3 years ago, our then-CTO Jason Hoffman insisted that I was joining a systems company.  While he was oft given to aspirational assertions, if nothing else it was a good aspiration to have.  One could readily imagine stamping out turnkey data centre building blocks, racks of carefully selected — or even engineered! — hardware managed cradle-to-grave by SDC and SmartOS, an advanced Unix variant tailored for the data centre.  Part of this vision has been fulfilled; the software stack has been in production on commodity hardware for years.  Realising the rest of this exciting concept that still has no serious competition (probably because only a systems company would have the necessary array of technologies and the vision to combine them this way) requires a great deal of systems work: hardware, firmware, OS, and orchestration.  The pieces all have to fit together; there is no place for impedance mismatches at 10,000-server scale.  Meanwhile, everyone wants to rave about OpenStack, a software particle-board designed by committee, consisting of floor sweepings and a lot of toxic glue.  Not surprisingly, since it’s not a proper system, few OpenStack deployments exceed a dozen nodes.  And in this, the world falls all over itself to “invest” billions of dollars.  Considering what Joyent did with a tiny fraction of that, it’s not hard to imagine the result if a single systems company were to invest even a tenth of it in building a coherent, opinionated, end-to-end system.  That’s the problem I was so excited to tackle, a logical extension of our work at Fishworks to distributed systems.

I’m not retiring because I’m angry with Joyent, and of course I wish my friends and colleagues there well, but it’s fair to say I’m disappointed that we never really went down that road.  I’d like to imagine that someone will, but I’ve seen no evidence for it.  The closest thing that exists in the world is the bespoke (and DEEPLY proprietary) work Google has done for its own systems.  Considering the “quality” of the open work they do, I can’t imagine they’ve solved the problem in a manner I’d appreciate, but the mere fact that there’s no opportunity to share it with the world and build a genuine technology company around it is sufficient to eliminate any thought of going there to work on it.  Perhaps someday there will once again be systems companies solving these problems.

And Thus

It gets difficult to come to work each day filled with bitterness and disappointment.  The industry is broken, or at least it feels broken to me.  If I hadn’t saved up a small nest egg, or if I had expensive tastes, or if there were no other way to occupy my time that seemed any better, I’d probably just do what most people do everywhere: keep showing up and going through the motions.  But that’s not the case.

It’s easy to assert sour grapes here.  I get that.  An alternative explanation is that everything is basically hunky-dory and I’m impossibly embittered by a lifetime of my own failure.  I accept the plausibility of either, and offer no counterargument should you choose the latter.  Fortunately, for me at least, either possibility dictates the same conclusion: it’s time to move on.  The ability to predict the future, even (perhaps especially) one’s own, is a claim rooted in hubris, so I can’t say for certain that I’ll never be employed again, but I can state without reservation that I don’t plan to be.  Maybe the industry will change for the better.  Maybe I will.  Maybe something else exciting will come along of which I want to be a part.  Maybe not.

Thank You

I’m not going to do the thing where I try to name everyone I’ve enjoyed working with, or even everyone who has helped me out over the years.  If we’ve worked together and we’d work together again, you know who you are.  Thanks for everything, and my best to you all.  May you build something great.

In the process of working on getting cgo to work on illumos, I learned that golang is trash.  I should mention that I have no real opinion on the language itself; I’m not a languages person (C will do, thanks), and that’s not the part of the system I have experience with anyway.  Rather, the runtime and toolchain are the parts on which I wish to comment.

Fundamentally, it’s apparent that gccgo was always the right answer.  The amount of duplication of effort in the toolchain is staggering.  Instead of creating (or borrowing from Plan9) an “assembly language” with its own assembler, “C” compiler (but it’s not really C), and an entire “linker” (that’s not really a linker nor a link-editor but does a bunch of other stuff), it would have been much better to simply reuse what already exists.  While that would have been true anyway, a look at the quality of the code involved makes it even clearer.  For example, the “linker” is extremely crude and is incapable of handling many common link-editing tasks such as mapfile processing, .dynamic manipulation, and even in some cases simply linking archive libraries containing objects with undefined external references.  There’s no great shame in that if it’s 1980 and we don’t already have full-featured, fairly well debugged link-editors, but we do.  Use them.

But I think the bit that really captures the essence of golang, as well as the pseudointellectual arrogance of Rob Pike and everything he stands for, is this little gem:

Instructions, registers, and assembler directives are always in UPPER CASE to remind you that assembly programming is a fraught endeavor.

Wait, what?  Are you being paternalistic or are you just an amateur?  Writing in normal (that is, adult) assembly language is not fraught at all.  While Mr. Pike was busying himself with Plan9, the rest of us were establishing ABIs, writing thorough processor manuals, and creating good tools that make writing and debugging assembly no more difficult (if still somewhat slower) than C.  That said, however, writing in the Fisher-Price “assembly language” that golang uses may very well be a fraught endeavor.  For starters, there’s this little problem:

The most important thing to know about Go’s assembler is that it is not a direct representation of the underlying machine.

Um, ok.  So what you’re telling me is that this is actually not assembly language at all but some intermediate compiler representation you’ve invented.  That’s perfectly acceptable, but there’s a good reason that libgcc isn’t written in RTL.  It gets better, though: if you’re going to have an intermediate representation, you’d think you’d want it to be both convenient for the tools to consume and sufficiently distinct from anything else that no human could possibly confuse it with any other representation, right?  Not if you’re working on Plan9!  Without the benefit of decades of lessons learned running Unix in production (because Unix is terrible and why would anyone want that instead of Plan9?), apparently such obvious thoughts never occurred to them, because the intermediate representation is almost a dead ringer for amd64 assembly!  For example:

TEXT runtime·munmap(SB),NOSPLIT,$0
    MOVQ addr+0(FP), DI // arg 1 addr
    MOVQ n+8(FP), SI // arg 2 len
    MOVL $73, AX
    SYSCALL
    JCC 2(PC)
    MOVL $0xf1, 0xf1 // crash
    RET

This is a classic product of technical hubris: it borrows enough from adult x86 assembly to seem familiar to someone knowledgeable in the field, but has enough pointless differences to be confusing.  Clearly these are well-known instructions, with the familiar addressing modes and general syntax.  But wait: DI is a register?  How is that distinct from the memory location referred to by the symbol DI?  Presumably these registers simply become reserved words and all actual variables must be lower-case.  That would be fine if not for the fact that the author of a module does not own the namespace of symbols in other objects with which he may need to link his.  What then?  Oh, of course; I need to use the fake SB register for that, just like I don’t in any real assembly language.  But it gets worse: what’s FP?  Your natural assumption, knowing the ABI as one should, is that it’s a genericised reference to %rbp, the conventional frame pointer.  WRONG!  The documentation instead refers to it as a “virtual frame pointer”; in fact, rbp is usually used by the compiler as a general register, just as if you had foolishly built your code with gcc using -fomit-frame-pointer.  Thanks, guys: confusing and undebuggable!  We could go on, detailing the pointless divergence from actual x86 assembly and the failure to genuinely abstract registers and instructions in a way that would allow this “intermediate representation” to be generic across ISAs, but I think by this point it’s plain enough that this entire chunk of the toolchain is simply rubbish.  The real toolchains everyone else uses were not invented at Lucent nor Google, so obviously they needed their own, written in seclusion with all the benefits of a 1980 worldview.

The last fun bit I wish to discuss is that funny little character between “runtime” and “munmap” in our previous example.  You see, despite having written their own entire toolchain (including a compiler identifying itself as accepting C that does no such thing), the authors decided that the normal “.” character was simply too special to be repurposed in source code.  Instead, it would retain its existing meaning as the customary dot operator.  But this means some other character will be needed to identify symbols that should have a dot in their names.  So obviously the natural choice here is some high Unicode character.  Obviously.  And equally obviously, when such code is compiled, the character is replaced in symbol names with an ordinary dot.  Of course!

It’s no surprise that the golang people want to replace the “C” parts of the runtime.  I would, too; the “C” language accepted by the Plan9 compiler is not really C.  The compiler has no concept of a function pointer being equivalent to the function itself, or for that matter such similarly obscure aspects of the C standard as the constant identifier NULL (instead of NULL, one must write “nil”, quite possibly the most obnoxiously spurious product of NIH thinking I have ever seen in my life).  But the problems with the golang toolchain and runtime go far beyond an idiosyncratic C dialect; the same thinking behind that oddity permeates the entire work.  Everything about the implementation of the language environment feels amateurish.  The best thing they could do at this point is start working on golang 2.0, with the intent to completely discard the entire toolchain and much of the runtime.  Rewriting more of the runtime in Go is fine, too, but it’s critical that the language and compilers be mature enough to enable bootstrapping in some sensible way (perhaps the way that gcc builds libgcc with the new compiler, not relying in any way on installed tooling to build the finished artifact).  There’s no obvious reason to turf the entire language, but reimplementing it sanely would be a huge benefit to their ecosystem.  Every system already has good tools for assembling and linking code, and those tools support ABIs that enable easy reuse of external software.  The Plan9 crowd needs to spend time appreciating why this is so instead of arrogantly ignoring it.  A sane implementation that leverages those tools would make the Go language far more attractive.
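
The complaint about the Plan9 “C” dialect may be abstract if you’ve never used it, so here is a minimal sketch of the standard C semantics in question.  Both constructs below are ordinary ISO C accepted by any conforming compiler; the claim above is that the Plan9-derived compiler does not handle them as the standard requires.

#include <stddef.h>
#include <stdio.h>

static int
answer(void)
{
    return (42);
}

int
main(void)
{
    /*
     * In ISO C a function designator decays to a pointer to the function,
     * so no '&' is needed here and the call below needs no explicit
     * dereference.
     */
    int (*fp)(void) = answer;

    /* NULL, not "nil", is the standard null pointer constant. */
    char *p = NULL;

    if (fp != NULL && p == NULL)
        (void) printf("%d\n", fp());

    return (0);
}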

The recent hubbub around CVE-2014-0160 (aka “heartbleed”) has led to a few questions about SmartOS’s use of OpenSSL in the platform image.  This is actually a very interesting side trip that has nothing to do with cryptography and very little to do with security; instead, it affords an opportunity to talk about ensuring correctness in dynamically-linked library references.  Just to get this out of the way, I’ll note that one of the two versions of OpenSSL delivered by the platform prior to Robert Mustacchi’s change was in fact vulnerable to this attack, but that library is not used to provide any TLS services so there is no way to exploit it.  As we will see, that library is not usable at all by any software other than the platform itself, greatly limiting the potential scope of the problem.  If you’re using SmartOS, you don’t need to worry about the OpenSSL in the platform image; OpenSSL in zones or KVM instances is another matter, and very likely does need attention.

Of greater interest to me is the presence of these files in the platform image:

/lib/64/libsunw_crypto.so.1.0.0
/lib/64/libsunw_ssl.so.1.0.0
/lib/libsunw_crypto.so.1.0.0
/lib/libsunw_ssl.so.1.0.0

You’ll note that there are no compilation symlinks (e.g., libsunw_ssl.so), nor are there any OpenSSL headers in /usr/include; together, that makes it very difficult for anyone to compile software that consumes the interfaces these libraries provide.  We have however gone two steps beyond even that in our efforts to prevent customer software from using these libraries, and our reasons for doing so stem from many years of miserable experience delivering and using third-party libraries in Solaris and more recently illumos.

A Brief History of Sadness

In the very distant past, the Unix operating system was entirely self-contained: everything it relied upon was part of itself, and things that were not part of Unix were simply third-party software that operators could install or not as they saw fit.  This was mostly fine, except that inevitably Bell Labs would not have a monopoly on the creation of software useful as part of an operating system.  In time, the broad spread of innovation combined with changing expectations about what an operating system ought to provide led to the incorporation of various previously third-party software into many common Unix distributions.  To the extent that Unix took over this software as repository of record, this was goodness: Unix grew new capabilities by adding the best software developed by others.  Often, however, that was not the case, and the seeds of sadness were sown.

Fast-forward to the Solaris 10 era.  By then, Solaris was delivering a fairly broad range of software for which the repository of record was outside Sun.  There were two basic reasons for this:

  • The architecture of a few popular GNU/Linux distributions had changed customer expectations; the definition of “an operating system” had expanded to include a huge range of random third-party software packages that are not needed to use or manage the system itself but could be installed using OS tooling for the customer’s use.
  • Several third-party software packages, such as OpenSSL and libxml2, were being consumed directly by system software.

With the retirement of static linking in Solaris 9, these upstream libraries were being delivered as dynamic libraries just like the rest.  Many of them, in order to accommodate the first use case in our list above, were delivered with headers and compilation symlinks as well.  And with those headers and symlinks, the great sadness burst forth and thrived.

Architecturally, these libraries (and other software, though we’re concerned primarily with libraries here) provided interfaces that were not under Sun’s control.  PSARC made some effort to communicate this to customers by requiring that these third-party interfaces generally be classified as External, later amended to Volatile before being collapsed back to Uncommitted.  The gist of this, regardless of the precise terminology, was that customers consuming these interfaces were being told that they could not rely on them to remain compatible across minor releases (or even patches and updates).  In a world in which the would-be consumers of these interfaces were mainly customers writing their own software from scratch, as was often the case with respect to other Solaris libraries in the past, that was usually adequate to address the problem.  But the world had changed.  Most of the software consuming these interfaces now is other third-party software, most of it originally developed on GNU/Linux and often built and tested nowhere else.  Customers, somewhat understandably, wanted to be able to build and use (if not simply install via the OS’s packaging tools) those consumers — and have them work.  Warning people that the interfaces were not to be relied upon was of little help; most customers were not directly aware that they were consuming them at all.  People who were used to building software on a Solaris 2.4 system and running their binaries on everything from 2.4 through 10 were in for a rude surprise when patch releases broke their third-party software.  Worse still, others were frustrated by the lag between injection of a piece of third-party software into the OS and its delivery; the versions of this software included with the operating system were invariably months or even years older than those currently available on the Internet.  As a result, many pieces of third-party software that consumed these interfaces (but were not delivered with the OS) would not build or work correctly.  The sadness raged throughout the land, ultimately contributing significantly to Solaris’s demise.  Presumably Solaris still has this problem today.

Inside the Sadness

It should be apparent that this architectural model is untenable.  An all-inclusive OS delivers, or attempts to deliver, almost every conceivable library as well as every remotely popular consumer.  Since every piece of software is built consistently and packaged afresh for each release of the distribution, incompatible change in these third-party libraries (or even the OS itself) is of little importance to most customers.  The primary measure of value is how many packages the distribution incorporates, not how well software built from third-party sources works across OS upgrades.  While there are serious problems with this model, customers who stick with the packages provided by their distributor and upgrade frequently will at least in principle get the best of both worlds: recent software and a hassle-free upgrade path.

Similarly, an OS that is entirely self-contained and delivers no extra software (that is, software for which the repository of record is not the OS distributor’s own) is also safe.  While users will be forced to obtain and build whatever software they desire, there is no architectural conflict: the interfaces provided by the OS remain compatible over time, so that the third-party software built by customers continues to work.  This is the historical Unix model; it works well technically but often fails to meet customer expectations.  Specifically, third-party software tends to be of exceptionally bad quality and is often developed by people with no understanding of portability whatsoever (or worse, an active disdain for it); as a result, getting it to build and work correctly is often a massive chore.  Not surprisingly, customers aren’t enthusiastic about doing that themselves.

The origin of the sadness lies in seeking the middle ground: delivering a small subset of the extra software that customers would like to use.  This “third way” is an architectural disaster, especially when combined with dynamic linking.  Early Windows users may recall this class of problems as “DLL Hell”.  Instead of offering the best of both worlds (plenty of extra software packages, low maintenance burden for the vendor, flexibility for the operator), it not only delivers the worst of both but introduces additional problems all its own:

  • Assuming that upstream library developers do not offer a usable backward-compatibility guarantee (or do, but new major releases are being made and consumed by other software that customers want to build and use), there is an unresolvable tension between providing customers the latest and greatest for their own use and avoiding breakage across patch or minor releases.
  • The vendor takes on a significant maintenance burden; if an upstream library consumed by the OS itself changes incompatibly, the vendor must recognise this and adapt the OS consumers accordingly.  Otherwise, the only option is to fix the version of that library the OS delivers forever — which falls afoul of customer expectations.
  • Because the libraries delivered with the OS are often too old to be consumed by the latest revisions of third-party software, it’s common for the customer to build their own copies of the upstream library.  There are other reasons for this practice as well, ranging from customisation to self-contained application deployment models.  While the other aspects of the sadness are problematic from a business perspective, this problem, as we will see, is a clear and present danger to correct operation — of both the OS itself and customer-built software.
  • In Solaris, most of this upstream software was delivered in /usr/sfw, and software that consumed it had /usr/sfw/lib added to its DT_RPATH.  In order to make the delivered gcc work properly, gcc itself then added this entry as well.  In some releases.  Meanwhile, GNU autoconf scripts throughout the world were chaotically updated to add it themselves (and/or to look in /usr/sfw/lib for libraries, whether the person building the software wanted it to or not).  The end result was that reliably finding libraries at runtime became hit or miss.

At the core of the sadness are two fundamental problems:

  • Ensuring that the linker finds the library version against which the consumer was built, when loaded at runtime.
  • Preventing two incompatible library versions from occupying the same address space.

The first objective is relatively easy to achieve.  If there is only one version of a library on the system, it’s simply a matter of ensuring that some combination of the consumer’s DT_RPATH and the system’s crle(1) configuration contains the path to the library, that the library’s filename matches the DT_SONAME of the library against which the consumer was built (because the consumer’s DT_NEEDED entry is recorded from it), and that the library is actually present on the system.  It is easy to satisfy these constraints for all OS-delivered software; illumos’s build system generally does all of this, as does SmartOS’s.  All that’s left for the end user to do is make sure LD_LIBRARY_PATH is not set, which sadly seems to be more difficult than one would expect (this is further complicated by third-party software that delivers shell scripts that explicitly set this environment variable even though it’s almost never necessary or appropriate).
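
When in doubt about which copy of a library a process actually bound to, it’s easy to check from the program itself.  The sketch below assumes a hypothetical upstream function a_init() from libA.so.1; dladdr(3C) reports the object from which the running process resolved the symbol, which is a quick way to confirm that DT_RPATH, DT_SONAME, and crle(1) did what you intended and that LD_LIBRARY_PATH hasn’t redirected you somewhere else.

#include <stdio.h>
#include <dlfcn.h>

extern void a_init(void);    /* hypothetical symbol provided by libA.so.1 */

int
main(void)
{
    Dl_info info;

    /* Ask the runtime linker which object supplied a_init. */
    if (dladdr((void *)a_init, &info) != 0)
        (void) printf("a_init resolved from %s\n", info.dli_fname);
    else
        (void) printf("a_init not found in any loaded object\n");

    return (0);
}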

The second objective is much more problematic.  Consider the following library dependencies:

fooprog (RPATH: /usr/local/lib)
 |
 + DT_NEEDED: libA.so.1 => /usr/local/lib/libA.so.1
 |
 + DT_NEEDED: libB.so.1 => /lib/libB.so.1 (RPATH: /lib:/usr/lib)
                            |
                            + DT_NEEDED: libA.so.1 => /lib/libA.so.1

In our example, libB is an illumos-specific library, while libA is an upstream library.  The copy in /lib is delivered with the operating system, while the one in /usr/local/lib has been built by the customer (perhaps because fooprog requires a different version of it from the one delivered by the OS in /lib).  It is very easy to end up in a situation in which both copies of libA will occupy this fooprog process’s address space.  Chaos will ensue; initialisation code may reference the wrong static data, functions with incompatible signatures may be called, and a generally difficult to debug core dump is the likely eventual outcome.  Direct binding can alleviate some of these effects, but few customers build third-party software using the correct options.  In general, much stronger medicine is required.

Avoiding the Sadness with OpenSSL

In our example above, libA.so.1 is an OpenSSL library.  There are two incompatible versions of OpenSSL in general circulation: 0.9.8 and 1.0.1.  Historically, illumos used 0.9.8, and most distributions delivered that version along with compilation symlinks and headers.  Because SmartOS did so as well prior to May 2012, simply removing OpenSSL 0.9.8 from the platform was not an option; doing so could easily break customer binaries accidentally built against it in the past.  While we will eventually remove this library, it will be some time before we can safely assume that no one is still using binaries built prior to the removal of its compilation symlinks.  This, then, is why the OpenSSL 0.9.8 libraries are still delivered by SmartOS.  The platform software itself does not use this version, however.

Instead, the platform uses the new libsunw_ssl.so and libsunw_crypto.so.  In order to avoid the problem described above, these libraries are protected in four ways:

  • They have no compilation symlinks, so that linking in ‘-lsunw_crypto’ or similar will fail at build time.
  • There are no associated headers, making it impossible to accidentally build third-party software using the provided interfaces.
  • They have different names from those expected of OpenSSL libraries by third-party software (the “sunw_” prefix).
  • The globally-visible symbols within the library are different from those delivered by normal OpenSSL libraries; they too are prefixed with “sunw_”.

All of these changes are needed both to avoid accidental use of these Private libraries by customer software and to allow safe coexistence with customer- or pkgsrc-delivered OpenSSL libraries in a customer process’s address space.  As a user of a SmartOS instance (whether in the Joyent Public Cloud, your own private cloud based on SmartDataCenter, or on your SmartOS system at home), you don’t have to worry about the operating system’s copy of OpenSSL.  The platform software linked with our Private copy will always use that copy; your software will always use the copy provided by pkgsrc or your own application deployment package.  And if they do end up in the same address space, their names and symbols will not conflict.

If you’re curious how this is achieved, take a look at our upstream software build system.  Because the rest of the platform software is built against a set of headers that include our special sunw_prefix.h, there is normally no change required to consumers.  A few upstream consumers relying on GNU autoconf or similar mechanisms bypass headers in their attempts to detect the presence or version of OpenSSL; in these cases, a few modest changes are required.  All told, the maintenance burden associated with this approach has been very modest; my colleague Robert Mustacchi was able to upgrade our Private OpenSSL from 1.0.1d+ to 1.0.1g with less than a few hours of work and a very simple set of changes.
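
Conceptually, the renaming header is very simple.  The sketch below is not the actual sunw_prefix.h, just an illustration of the idea under the assumption of a straightforward #define-based rename: each globally-visible OpenSSL symbol is mapped to its “sunw_”-prefixed name at compile time, so platform consumers bind to the Private library without any source changes.

/*
 * Illustrative only; the real header covers the full set of exported
 * OpenSSL symbols.  Each rename applies both to the declarations pulled
 * in from the OpenSSL headers and to every call site in platform code.
 */
#ifndef _SUNW_PREFIX_SKETCH_H
#define _SUNW_PREFIX_SKETCH_H

#define SSL_library_init    sunw_SSL_library_init
#define SSL_CTX_new         sunw_SSL_CTX_new
#define SSL_new             sunw_SSL_new
#define SSL_read            sunw_SSL_read
#define SSL_write           sunw_SSL_write
#define EVP_EncryptInit_ex  sunw_EVP_EncryptInit_ex
/* ...and so on for every globally-visible symbol... */

#endif /* _SUNW_PREFIX_SKETCH_H */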

Other Software

There are a few other pieces of upstream software in SmartOS that will eventually require similar treatment.  For now, because we have not modified the versions of libxml2, libexpat, and other such software in the platform from the last revision that was delivered with compilation symlinks, the existing libraries are doing double duty: they provide both backward-compatibility for customer software and important functionality consumed by the platform.  As these are upgraded, we will take the same approach: the existing version will continue to be delivered for compatibility, while the new version will have its name and globally-visible symbols mangled.  In all such cases, we already do not deliver compilation symlinks and headers.

As a user of a SmartOS instance, all you need to know is that software you build yourself should always depend on pkgsrc-delivered libraries, never those in the platform.  It is of course safe to rely on platform libraries for which SmartOS or illumos is the repository of record, such as libc; this discussion is relevant only to software that is delivered by the platform but is also available from third parties.  We’ll do the rest.

And we never failed to fail
It was the easiest thing to do

— Stephen Stills, Rick and Michael Curtis; “Southern Cross” (1981)

With Brian Beach’s article on disk drive failure continuing to stir up popular press and criticism, I’d like to discuss a much-overlooked facet of disk drive failure.  Namely, the failure itself.  Ignoring for a moment whether Beach’s analysis is any good or the sample populations even meaningful, the real highlight for me from the above back-and-forth was this comment from Brian Wilson, CTO of BackBlaze, in response to a comment on Mr. Beach’s article:

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary – maybe $5,000 total? The 30,000 drives costs you $4 million.

The $5k/$4million means the Hitachis are worth 1/10th of 1 percent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. 🙂

After being rightly taken to task by other commenters for, among other things, ignoring the effect of higher component failure rates on overall durability, he went on to disclaim in a followup comment that his observation applies only to his company’s application.  That rang hollow for me.  Here’s why.

The two modern papers on disk reliability that are more or less required reading for anyone in the field are the CMU paper by Schroeder and Gibson and the Google paper by Pinheiro, Weber, and Barroso.  Both repeatedly emphasise the difficulty of assessing failure and the countless ways that devices can fail.  Both settle on the same metric for failure: if the operator decided to replace the disk, it failed.  If you’re looking for a stark damnation of a technology stack, you won’t find a much better example than that: the only really meaningful way we have to assess failure is the decision a human made after reviewing the data (often a polite way of saying “groggily reading a pager at 3am” or “receiving a call from a furious customer”).  Everyone who has actually worked for any length of time for a manufacturer or large-scale consumer of disk-based storage systems knows all of this; it may not make for polite cocktail party conversation, but it’s no secret.  And that, much more than any methodological issues with Mr. Beach’s work, casts doubt on Mr. Wilson’s approach.  Even ignoring for a moment the overall reduction in durability that unreliable components creates in a system, some but not all of which can be mitigated by increasing the spare and parity device counts at increased cost, the assertion that the cost of dealing with a disk drive failure that does not induce permanent data loss is the cost of 15 minutes of one employee’s time is indefensible.  True, it may take only 15 minutes for a data centre technician armed with a box of new disk drives and a list of locations of known-faulty components to wander the data centre verifying that each has its fault LED helpfully lit, replacing each one, and moving on to the next, but that’s hardly the whole story.

"That's just what I wanted you to think, with your soft, human brain!"

Given the failure metric we’ve chosen out of necessity, it seems like we need to account for quite a bit of additional cost.  After all, someone had to assemble the list of faulty devices and their exact locations (or cause their fault indicators to be activated, or both).  Replacements had to be ordered, change control meetings held, inventories updated, and all the other bureaucratic accoutrements made up and filed in triplicate.  The largest operators and their supply chain partners have usually automated some portion of this, but that’s really beside the point: however it’s done, it costs money that’s not accounted for in the delightfully naive “15-minute model” of data centre operations.  Last, but certainly not least, we need to consider the implied cost of risk.  But the most interesting part of the problem, at least to me personally, is the first step in the process: identifying failure.  Just what does that entail, and what does it cost?

Courteous Failure

Most people not intimately familiar with storage serviceability probably assume, as I once did, that disk failure is a very simple matter.  Something inside the disk stops working (probably generating the dreaded clicky-clicky noise in the process), and as a result the disk starts responding to every request made of it with some variant of “I’m broken”.  The operating system notices this, stops using the disk (bringing a spare online if one is available), lights the fault LED on that disk’s bay, and gets on with life until the operator goes on site and replaces the broken device with a fresh one.  Sure enough, that can happen, but it’s exceedingly rare.  I’ll call this disk drive failure mode “courteous failure”; it has three highly desirable attributes:

  • It’s unambiguous; requests fail with an error condition.
  • It’s immediate; requests are terminated with an error condition as soon as they arrive.
  • It’s total; every media-related request fails.

Even in the case of courteous failure, many things still have to happen for the desired behaviour to occur.  All illumos-derived operating systems support FMA, the fault management architecture.  The implementation of this ranges from telemetry sources in both the kernel and userland to diagnosis and retirement engines that determine which component(s) are faulty and remove them from service.  For operators, the main documented components in this system are fmd(1M), fmadm(1M), and syseventd(1M).  With the cooperation of hardware, firmware, device drivers, kernel infrastructure, and the sysevent transport, these components are responsible for determining that a disk (or other component) is broken, informing the operator of this fact, and perhaps taking other actions such as turning on a fault indicator or instructing ZFS to stop using the device in favour of a spare.  While this sounds great, the practical reality leaves much to be desired.  Let’s take a look at the courteous failure case first.

Our HBA will transport the error status (normally CHECK CONDITION for SCSI devices) back to sd(7D), where we will generate a REQUEST SENSE command to obtain further details from the disk drive.  Since we are assuming courteous failure, this command will succeed and provide us with useful detail when we then end up in sd_ssc_ereport_post().  Based on the specific sense data we retrieved from the device, we’ll then generate an appropriate ereport via scsi_fm_ereport_post() and ultimately fm_dev_ereport_postv().  This interface is private; however, DDI-compliant device drivers and many HBA drivers also generate their own telemetry via ddi_fm_ereport_post(9F) which has similar semantics.  The posting of the ereport will result in a message (a sysevent) being delivered by the kernel to syseventd.  One or more of fmd’s modules will subscribe to sysevents in classes relevant to disk devices and other I/O telemetry, and will receive the event.  In our case, the Eversholt module will do this work.  Eversholt is actually a general-purpose diagnosis engine that can diagnose faults in a range of devices; the relevant definitions for disks may be found in disk.esc.  Since we’re assuming courteous failure due to a fatal media flaw, we’ll assume we end up diagnosing a media error fault.  Once we do so, the disk will show up in our list of broken devices from ‘fmadm faulty’, and the sysevent generated, of class ‘fault.io.scsi.cmd.disk.dev.rqs.merr’, will be passed along to the disk-lights engine.  This mechanism is responsible for (don’t laugh) turning on the LED for the faulty disk and turning it back off again when the disk is replaced.  We’re almost done; all that remains is to tell ZFS to stop using the device and replace it with a spare if possible.  The mechanism responsible for this is zfs-retire.  As part of retiring the broken device, ZFS will also select a spare and begin resilvering onto it from the other devices in the same vdev (whether by mirroring or reconstructing from parity).  Of course, all of these steps are reversed when the disk is eventually replaced, with the unfortunate exception of ZFS requiring an operator to execute the zpool(1M) replace command to trigger resilvering onto the new device and return the spare to the spares list.
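
For the sake of concreteness, here is a minimal sketch of what the public, documented side of that telemetry path looks like from a DDI-compliant driver.  This is not the sd(7D) implementation, which goes through the private scsi_fm_ereport_post() path described above; the error class and the “lba” payload member are purely illustrative.

/*
 * Hypothetical driver code posting error telemetry via
 * ddi_fm_ereport_post(9F); the class string and payload are illustrative.
 */
#include <sys/types.h>
#include <sys/nvpair.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>
#include <sys/ddifm.h>
#include <sys/fm/protocol.h>
#include <sys/fm/util.h>

static void
post_media_error(dev_info_t *dip, uint64_t lba)
{
    /* Generate an ENA to tie this ereport to the I/O that produced it. */
    uint64_t ena = fm_ena_generate(0, FM_ENA_FMT1);

    ddi_fm_ereport_post(dip, "cmd.disk.dev.rqs.merr", ena, DDI_NOSLEEP,
        FM_VERSION, DATA_TYPE_UINT8, FM_EREPORT_VERS0,
        "lba", DATA_TYPE_UINT64, lba,
        NULL);
}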

If that sounds like a lot of moving pieces, that’s because it is.  When they all work properly, the determination that a disk has failed is very easy to make: fmd and ZFS agree that the device is broken, all documented tools report that fact, and the device is automatically taken out of use and its fault LED turned on.  If all failures manifested themselves in this way, there’d be little to talk about, and the no-trouble-found rate for disk drive RMAs would be zero.  There would also be little or no customer impact; a few I/O requests might be slightly delayed as they’re retried against the broken device’s mirror or the data rebuilt from other devices in a RAIDZ set, but it’s unlikely anyone other than the operator would even notice.  Unfortunately, the courteous failure mode I’ve just detailed is exceedingly rare.  I can’t actually recall ever seeing it happen this way, although I’m sure that’s a product of selective memory.  Let’s take a look at all the things that can shatter this lovely vision of crisp, effective error handling.

Discourteous Failure, a.k.a. Real-World Failure

First, note that we make this diagnosis only when the underlying telemetry indicates that it’s fatal.  This won’t happen if the request that triggered it is eligible to be retried, and by default, illumos’s SCSI stack will retry most commands many, many times before giving up (we’ve greatly reduced this behaviour in SmartOS).  Since each retry can take seconds, even a minute, it can easily be minutes or even hours before the first error telemetry is generated!  This is perhaps the most common, and also among the worst, disk drive failure modes: the endless timeout.  This failure mode seems to be caused by firmware or controller issues; the drive simply never responds to any request, or to certain requests.  The best approach here is to be much more aggressive about failing requests quickly; for example, the B_FAILFAST option used by ZFS will abort commands immediately if it can be determined that the underlying device has been physically removed from the system or is otherwise known to be unreachable.  But this does not address the case in which the device is clearly present but simply refuses to respond.  A milder variant of this failure mode is the long-retry case, in which the disk will internally retry a read on a marginal sector, trying to position the head accurately enough to recover the data.  Most enterprise drives will retry for a few seconds before giving up; some consumer-grade devices will keep trying more or less forever.  Of course, firmware can fail as well; modern disk drives have a CPU and an OS on them; that OS can panic like any other, and will do so due to bugs or unexpected hardware faults.  Should this occur, requests are lost and will usually time out.  When any of these failures occur, application software is perceived to hang for long periods of time or proceed extremely slowly.  Other, less common, failure modes include returning bad data successfully, which only ZFS can detect; returning inaccurate sense data, precluding correct telemetry generation; and, most infuriatingly of all, working correctly but with excessively high latency.  Most of these failure modes are not handled well, if at all, by current software.
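
As a concrete illustration of the B_FAILFAST behaviour described above, here is a minimal kernel-context sketch, with hypothetical buffer and device arguments, of a driver-level read issued with the flag set.  The only point is to show where the flag lives; real consumers such as ZFS set it on the buf they pass down to the target driver.

#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/buf.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static int
failfast_read(dev_t dev, caddr_t kaddr, size_t len, daddr_t blkno)
{
    struct buf *bp = getrbuf(KM_SLEEP);
    int err;

    /*
     * B_FAILFAST asks the target driver to give up quickly once the
     * device is known to be unreachable, instead of running through
     * its full retry schedule.
     */
    bp->b_flags = B_READ | B_FAILFAST;
    bp->b_edev = dev;
    bp->b_lblkno = blkno;
    bp->b_bcount = len;
    bp->b_un.b_addr = kaddr;

    (void) bdev_strategy(bp);
    err = biowait(bp);
    freerbuf(bp);
    return (err);
}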

One additional discourteous failure mode highlights the fundamental challenge of diagnosis especially well.  Sometimes, whether because of a firmware fault or a hardware defect or fault, a disk drive will simply “go away”.  The visible impact of this from software is exactly the same as if the disk drive were physically removed from the system.  The industry standard practice in storage system serviceability is known as “surprise hotplug”; the system must support the unannounced removal and replacement of disk drives (limited only by the storage system’s redundancy attributes) without failing or indicating an error.  It’s easy to see that satisfying this requirement and diagnosing the vanishing disk drive failure are mutually exclusive.  One option is to ignore the surprise hotplug requirement in favour of something like cfgadm(1M).  While simpler to implement, this approach really transfers the burden onto the operator, one of the hallmarks of poor design.  Another option is to declare that a given hardware configuration requires the presence of disk drives in certain bays, and treat the absence of one as a fault.  In one sense, this has the right semantics: a disk drive that goes away will be diagnosed as faulty in the same way as any other failure mode and its indicator illuminated.  When it’s removed, it will already be faulty, so no new diagnosis will be made; when the replacement is inserted, the repair will be noticed.  But there’s another problem here: let’s think about all the underlying causes that could lead to this failure mode:

  • The disk drive’s power or ground pin is broken.
  • The enclosure’s power or ground pin is broken.
  • The disk drive’s firmware has wedged itself in a state that cannot establish a SAS link.
  • The disk drive’s controller has failed catastrophically.
  • Hardware in the SAS expander on the backplane (or in the HBA’s controller ASIC) has failed.
  • Firmware in the SAS expander on the backplane (or in the HBA’s controller ASIC) has failed.
  • A bug in the HBA firmware is preventing the target from being recognised or enumerated.
  • An HBA device driver bug is causing higher level software to be told that the target is gone.

Notice that many of these root causes actually have nothing to do with the disk drive and will recur (often intermittently) on the same phy or bay, or in some cases on arbitrary phys or bays, even after the “faulty” disk drive is replaced.  These failure modes are not academic; I’ve seen at least three of them in the field.  The simplest way to distinguish mechanical removal of a disk drive from the various electrical and software-related failure modes is to place a microswitch in each disk drive bay and present its state to software.  But few if any enclosures have this feature, and a quick scan of our list of possible root causes shows that it wouldn’t be terribly effective anyway: even distinguishing the removal case doesn’t tell us whether the disk drive, the enclosure, the HBA, the backplane, or one of several firmware and software components is the true source of the problem.  Those that are not specific to the disk can occur intermittently for months before finally being root-caused, and can lead to a great many incorrect “faulty disk” diagnoses in the meantime.  Similar problems can occur when reading self-reported status from the disk, such as via the acceptable temperature range log pages.  Unfortunately, many disks have faulty firmware that will report absurdly incorrect acceptable ranges; this particular issue has resulted in hundreds of incorrect diagnoses in the field against at least two different disk drive models.  So even seemingly reliable data can easily result in false positives, many of them with no provably correct alternative.  Taking the human completely out of the loop is not merely difficult but outright impossible; a skilled storage FE will never lack for work.

Small-scale storage users rarely have occasion to notice any of this.  For starters, disk drives are really quite reliable and few people will experience any of these failure modes if they have only a handful of devices.  For another, most commodity operating systems don’t even make an effort to diagnose and report hardware failures, so at best they are limited to faults self-reported via mechanisms such as SMART.  In the absence of self-reported failure, a disk drive failure on a desktop or laptop system is quite easy to confuse with any of a number of other possible faults: the system becomes excessively slow or simply hangs and stops working altogether.  Larger-scale users, however, are familiar with many of these failure modes.  Users of commercial storage arrays may experience them less often, as the larger established vendors not only have had many years to improve their diagnosis capabilities but also tend to diagnose faults aggressively and rely instead on an extensive burn-in protocol and highly capable RMA processing to manage false positives.  That, unfortunately, leaves the rest of us in something of a tough middle ground.  Fortunately, ZFS also has its own rudimentary error handling mechanism, though it does not deal with slow devices or endless timeouts any better than illumos itself.  It can detect invalid data, however (other filesystems generally cannot), and on operating systems that lack illumos’s FMA mechanisms provides at least some ability to remove faulty disks from service.  ZFS also handles sparing automatically, and will resilver spares and repair blocks with bad data without operator intervention, minimising the window of vulnerability to additional failures.

Taken together, these capabilities are really just a promising start; as should be obvious from the discussion of discourteous failure modes, there’s a lot of open work here.  I’m very happy that illumos has at least some of the infrastructure necessary to improve the situation; other than storage-specific proprietary operating systems, we’re in the best shape of anyone.  Some promising work is being done in this area by the team at Nexenta, for example, building on the mechanism I described above.  Even so, the situation remains ugly.  Almost all of these discourteous failure modes will inevitably be customer-visible to some extent.  Most of them will require an operator to diagnose the problem manually, often from little more than a support ticket stating that “nothing works” (a statement that is often entirely accurate).  Not only does that take time, but it is time during which a customer, or even several, is in pain.  Manual diagnosis of discourteous disk drive failure is a common cause of poor customer experience among all storage providers (whether public or private), and in fact is among the more obnoxious routine challenges operators face.  Independent of actually replacing the faulty devices themselves, operators will often have to spend considerable time observing a system, often at odd hours and under pressure, to determine (a) that a disk drive has failed, and (b) which one is to blame.  As in some of the cases enumerated above, the problem may be intermittent or related to a firmware version or interoperability issue, and these problems can take arbitrarily long to thoroughly root-cause and correct.  We’ve experienced several of the simpler discourteous failure modes ourselves over the past year, and even in the best cases, with many years of experience and deep knowledge of the entire technology stack, it’s rare that I’ve gone from problem statement to confirmed root cause in less than 15 minutes.  An inexperienced first-line operator has no chance, and time will be lost escalating the issue, re-explaining it, gathering data, and so on.  Multiple people will be involved, some to diagnose the problem, others to write incident reports, communicate with affected customers, or set up repairs.  All of these processes can be streamlined to one degree or another, and many can be (and usually are) automated.  But given the complexity I’ve outlined here, the idea that handling a discourteous disk drive failure requires a total of 15 minutes of effort from a single employee and never has any indirect costs is every bit as silly as it sounds.  We have a long way to go before illumos reaches that level of reliability and completeness, and everyone else is still farther behind.

The Implied Cost of Risk

That brings us to the part we’ve been ignoring thus far: the knock-on effects of unreliable components on overall durability and the implied cost of risk.  If one uses manufacturer-supplied AFRs and ignores the possibility of data loss caused by software, it’s very easy to “prove” that the MTTDL of an ordinary double-parity RAID array with a couple of hot spares is in the tens of thousands of years.  But as is discussed at length in the CMU paper, AFR is not a constant; actual failure rates are best described by a Weibull distribution.  To make matters worse, replacement events do not seem to be independent of one another and resilvering times are rising as disks become larger but not faster.  When one further considers that both the CMU and Google researchers concluded that actual failure rates (presumably even among the best manufacturers’ products) are considerably higher than published, suddenly the prospect of data loss does not seem so remote.  But what does data loss cost?  Storage is a trust business; storage systems are in service much longer than most other IT infrastructure, and they occupy the bottom layer in the application stack.  They have to work, and customers are understandably unhappy when they fail.  The direct cost to customers of lost data can be considerable, and it’s not a stretch to suggest that a major data loss incident could doom a business like BackBlaze or Joyent’s Manta object storage service.
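To see where that tens-of-thousands-of-years figure comes from, here is the textbook double-parity approximation worked through with illustrative inputs (the MTTF, group size, array size, and resilver time below are assumptions chosen for the arithmetic, not measurements):

# Naive MTTDL for double parity: MTTF^3 / (N(N-1)(N-2) * MTTR^2), per group.
awk 'BEGIN {
        mttf   = 1.2e6;   # manufacturer-claimed MTTF in hours (about 0.7% AFR)
        n      = 12;      # disks per double-parity group
        groups = 100;     # a 1,200-disk array
        mttr   = 168;     # a week to notice the failure, replace, and resilver
        mttdl  = mttf^3 / (n * (n - 1) * (n - 2) * mttr^2) / groups;
        printf("naive MTTDL: roughly %d years\n", mttdl / 8760);
}'

That works out to some fifty thousand years, and every assumption baked into it (a constant AFR, independent failures, a resilver that actually finishes in a week) is one the CMU and Google data contradict.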

Instead of assessing risk by plugging manufacturer specifications into RAID formulae, I’d like to suggest a thought exercise courtesy of our over-financialised economy: Suppose you’re Ajit Jain and a medium-sized technology service company like BackBlaze or Joyent came to you asking for a policy that would compensate it for all the direct, incidental, and consequential damages that would arise from a major data loss incident induced by disk drive failure(s).  You’ve read the CMU paper.  You’ve read the Google paper.  You’ve interviewed storage system vendors and experienced large-scale operators.  What would you charge for such a policy?  If you really believe that MTTDL is in the tens of thousands of years, you’d argue that such a policy should cost perhaps a thousand dollars a year.  I expect Mr. Jain would demand a premium several orders of magnitude larger, probably on the order of a million dollars a year, implying that the true MTTDL across your entire service is at most a few decades (which it probably is).  A cheap disk drive might easily cut that MTTDL by 80% (remember, failures are not independent and resilvering is not instantaneous!), at least quintupling the cost of our notional insurance policy.  Instead of saving you a few thousand dollars a year, the more reliable drives are now saving you $4 million — roughly the entire purchase cost of your disk drive population.  Mr. Wilson suggests that the marginal cost of 30,000 better disk drives amounts to perhaps $250,000 over the course of a disk’s 5-year lifetime, or $50,000 a year.  We can quibble over the specific numbers, but if you take your operational knowledge of disk drive failure and put yourself in Mr. Jain’s shoes, would you really write this policy for less than that?
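The same comparison in arithmetic form, taking the figures above at face value (the million-dollar baseline premium is the assumed number here):

awk 'BEGIN {
        premium  = 1e6;          # realistic baseline premium, dollars per year (assumed)
        cheap    = 5 * premium;  # cheap drives at least quintuple that premium
        marginal = 250000 / 5;   # the $250,000 Mr. Wilson cites, spread over a 5-year lifetime
        printf("better drives cost an extra $%d/yr and avoid $%d/yr of implied risk\n",
            marginal, cheap - premium);
}'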

The implied cost of risk alone is sufficient to cast considerable doubt on the economic superiority of the worse-is-better approach.  Perhaps when he said this approach is appropriate only to his business, Mr. Wilson really meant his actual metric is not MTTDL but mean time until someone notices that something they lost in the last 30 days was also lost on the backup service after the last upload and then complains loudly enough to cause a PR disaster.  That may be uncharitable, but I can’t otherwise reconcile his position with the facts.  If that’s what he meant, then substantially everyone can safely ignore BackBlaze as a useful example; we’re not seeking to optimise the same metrics.

Not Quite the Easiest Thing to Do

Mr. Stills is one hell of a musician, but we’ll need to look elsewhere for a realistic assessment of the fine art of failure.  Far from being the easiest thing to do, failing is messy, inexact, and often takes far longer than any reasonable person would expect.  Other times it’s so complete as to be indistinguishable from a magical disappearing act.  Disk drives frequently fail to fail, and that’s why Mr. Wilson’s simplistic analysis, methodology aside, is grossly ignorant: it is predicated on a world that looks nothing like reality.  The approach I’ve taken at Joyent is one informed by years of misery at Fishworks building, selling, and above all supporting disk-based storage products.  It’s a multi-pronged approach: like BackBlaze, we design for failure, not only at the application level but also through improvements and innovation at the operating system level and even integration at the hardware and firmware levels; unlike Mr. Wilson, we acknowledge the real-world challenges in identifying failure and the true risks and costs associated with excess failure rates.  As such, we work hard to identify the best components we can afford (which is to say, the best components our customers are willing to pay for) and to populate our data centres with them, including the very HGST disk drives Mr. Beach concludes are superior but his company refuses to purchase.  They’re not the cheapest components on the market, but they make up for their modest additional cost by enabling us to offer a better experience to our customers and reducing overhead for our Operations and Support teams.  At the same time, we continue working toward an ideal future in which fault detection is crisp and response automatic and correct.  Fault detection and response pose a systemic challenge, requiring a systemic response across all layers of the stack; it’s not enough to design for failure at the application layer and ignore opportunities to innovate in the operating system.  Nor is the answer to disregard the implied cost of excess risk and just hope your customers won’t notice.

One of the lesser-used features of DTrace is anonymous tracing: the ability to trace events that occur after early boot has completed but before there is an opportunity to log into the system and start tracing.  As Bryan says: you rarely need this facility, but when you do, you REALLY need it.  Anonymous tracing relies on stuffing DOF into dtrace(7d)‘s driver.conf(4) file, /kernel/drv/dtrace.conf, and adding forceload directives to /etc/system so that DTrace’s modules are loaded as early as possible; the -A option will helpfully do both of these for you.  But what if you’re on SmartOS, where these files are on a ramdisk?  The whole point of anonymous tracing is to collect data during the next boot, and the contents of the ramdisk are lost before they can be used.

Bart Simpson facing his fear of booting

Since the introduction of GRUB via the New Boot project roughly 10 years ago, illumos on x86 has required the use of a single boot module, known as the boot archive.  This module, passed to the kernel via the Multiboot protocol, conventionally contains a UFS filesystem the kernel will use as the root filesystem during the middle phases of boot — after early loading but before enough infrastructure is built to mount the real root filesystem.  A very limited VFS (“bootvfs“) provides access to this temporary root filesystem during this time, and facilities such as krtld can read kernel modules, driver.conf files, and other data such as /etc/system from it using a restricted set of filesystem-specific operations.  Traditional illumos distributions require rebuilding the boot archive any time one of a number of such files has been modified in the real root filesystem, to preserve the fiction that the filesystem used during boot is the same persistent, ordinary one mounted at / when the system is running.  That mechanism, then, picks up the changes needed for anonymous tracing to work and incorporates them into the archive prior to rebooting.
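The /etc/system changes being picked up here are the forceload directives dtrace -A appends, which look something like this (the exact set depends on the providers your enablings use, so treat this as a sketch rather than a definitive list):

* Added by dtrace -A: load DTrace and its providers as early as possible.
forceload: drv/systrace
forceload: drv/sdt
forceload: drv/profile
forceload: drv/fbt
forceload: drv/dtrace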

No similar mechanism exists on SmartOS; the only persistent means of modifying the boot archive is to rebuild the platform image (SmartOS’s name for the boot archive, though unlike the traditional boot archive, the platform image holds the entire contents of the normal root filesystem, not just a few special files used by the kernel).  Building a new platform image can take anywhere from 15 minutes to 3 or 4 hours, depending on how much has changed and the method you’re using to rebuild it.  Certainly it would be nice if there were a better way to override the contents of certain files contained in the platform image.  So, during the recent Joyent hackathon, John Sonnenschein and I implemented a facility to pass arbitrary Multiboot modules to the OS at boot time and have them appear in the filesystem hierarchy.  The first portion of this work has been integrated under OS-2629; it provides facilities for handling any number of boot modules and making them available to the kernel at arbitrary locations in the filesystem hierarchy via a new bootvfs filesystem.  This is sufficient to allow engineers to replace device drivers, driver.conf files, /etc/system, and other files consumed by the kernel prior to mounting the real root filesystem.  The mechanism for passing in boot modules is described by some sketchy documentation we wrote up during the hackathon; better documentation will follow once the second phase of this work, a filesystem providing access to these additional modules after the system has booted, is completed.

With this in hand, we can now enable anonymous tracing relatively easily on a standalone SmartOS host or SDC headnode.  All we need is a location that we can modify from the running system and that GRUB can read at boot time.  If you’re using a USB key to store your SmartOS images, as we do, that’ll do fine.  If you use PXE, you will need to modify the GRUB menu.lst your PXE server sends your SmartOS client.  Using this facility with read-only optical media is left as an exercise for the reader.  Assuming a USB key, there are two basic steps involved.

Setting Up

Modify the GRUB configuration to pass the two required files to the kernel at boot time.  You will need to convert the module declarations in your menu.lst from “old style” (which assumed a single module containing the boot archive) to the “new style” described in our writeup.  This will end up looking something like this:

title Live 64-bit
kernel$ /os/20131227T225949Z/platform/i86pc/kernel/amd64/unix -B console=${os_console},${os_console}-mode="115200,8,n,1,-",headnode=true
module /os/20131227T225949Z/platform/i86pc/amd64/boot_archive type=rootfs name=ramdisk
module /os/20131227T225949Z/platform/i86pc/amd64/boot_archive.hash type=hash name=ramdisk
module /os/bootfs/etc/system type=file name=etc/system
module /os/bootfs/kernel/drv/dtrace.conf type=file name=kernel/drv/dtrace.conf

This need only be done once; you can then iterate as many times as needed. When you are finished using anonymous tracing, you can simply remove the two new entries.

Iterating

After modifying your GRUB configuration, it’s easy to set up a new set of enablings for your next boot.  Simply run dtrace -A as you normally would, then copy the resulting /etc/system and /kernel/drv/dtrace.conf into the location you chose above (in my example above, it would be <path to USB key>/os/bootfs).  When you reboot next, you will be greeted by helpful messages indicating that your enablings have been created:
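Concretely, an iteration looks something like this.  The probe is just an example, and /mnt/usbkey is a hypothetical mount point standing in for wherever your key is mounted:

# Set up anonymous enablings; -A rewrites /etc/system and /kernel/drv/dtrace.conf.
dtrace -A -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'

# Copy the generated files to the paths named by the GRUB module directives.
cp /etc/system /mnt/usbkey/os/bootfs/etc/system
cp /kernel/drv/dtrace.conf /mnt/usbkey/os/bootfs/kernel/drv/dtrace.conf

reboot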

NOTICE: enabling probe 0 (fbt::rts_new_rtsmsg:entry)
NOTICE: enabling probe 1 (proc:::exec-success)
NOTICE: enabling probe 2 (dtrace:::ERROR)

You may then log in normally and use dtrace -a to claim the trace data collected during boot.

Solving a Real-World Problem

A long-running source of annoyance to both SmartOS users and Joyent engineers working on SDC has been a bug causing the NTP service to enter the maintenance state on boot, more or less reliably.  While this is merely annoying on standalone SmartOS systems, it has some very unfortunate knock-on effects on SDC compute nodes.  This problem has resisted efforts to debug it: simply logging in and clearing the service invariably works, and any debugging commands added to the NTP start method chased the problem away entirely.  The proximate cause is simple: in order to avoid ntpd hanging forever (itself an egregious bug in this widely-used software) if none of its servers is reachable, we perform a simple application-level ping on each one prior to starting up.  This is itself a product of countless past headaches, and perhaps a sign that we should simply have fixed the daemon in the first place.  In any case, this check would reliably fail during boot.

Anonymous tracing made this relatively simple to debug.  By observing the generation of routing socket messages (see route(7p)) and correlating them with the execution of various programs during boot, it became fairly clear that the problem boiled down to two basic issues:

  1. ifconfig(1m) does not wait for the kernel’s duplicate address detection to complete before exiting after making a change.
  2. An unneeded service was using ifconfig to reset every interface’s netmask and broadcast addresses just prior to the NTP service starting.

Both conditions are needed to trigger this problem, and the window in which the NTP service needs to start before ifconfig’s changes take effect is only a couple hundred milliseconds wide, explaining the ease of chasing the bug away.  The exceedingly simple solution: get rid of the useless service.  In general, however, new services should be using ipadm(1m) instead of ifconfig; it has its own mechanism that does not suffer from this problem.  Future investigations of apparent “impossible service races” during startup should consider this as a first-class possibility.  Fortunately, with anonymous tracing now back in the toolbox, it’ll be much easier to evaluate this hypothesis.
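For the curious, the anonymous enablings behind the NOTICE lines shown earlier boil down to something like the following (a reconstruction from the probe names, with the output actions simplified):

dtrace -A \
    -n 'fbt::rts_new_rtsmsg:entry { printf("%d rtsmsg\n", timestamp); }' \
    -n 'proc:::exec-success { printf("%d exec %s\n", timestamp, curpsinfo->pr_psargs); }'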

If you haven’t already read Mark’s introduction to Joyent’s new Manta service, you need to.  There are plenty of exciting elements of this service, from basics like strong consistency to the lovingly crafted data processing abstractions that allow you to bring compute to your data.  As with any large system, though, Manta’s visible interface is only a small fraction of the whole; I’d like to offer a few thoughts on the technology at the bottom of the Manta stack, the area where I’ve contributed most.

Prevailing wisdom holds that computers are a commodity, that there is no value to be found or created downstack.  For the past decade or more, the more aggressive pundits have extended this belief into the operating system and even further upstack.  As usual, they’re wrong (being wrong is, after all, the basic function of the pundit).  The creation of another layer of abstraction is one of the few fundamental tools available to software engineers, but as powerful as this tool is, the universe cannot in fact be turtles all the way down.  Every stack has a bottom and at the bottom of the stack (with apologies to particle physicists) lies hardware.  Entwined with that hardware is its evil and seemingly omnipresent companion, firmware.  Together these components provide the foundation on which all software is built.  While the ideal foundation doesn’t exist today, even within the rigid confines of today’s heavily commoditised market there is enough variety available to build foundations that are better or worse for their purpose.  All servers are not created equal.  The basic function of hardware is to provide physical resources to applications, access to which is managed by the operating system.  As such, the most basic way in which hardware configurations differ is in the balance of resources they provide.  But hardware components also have architectural differences: choices made by their designers about the division of labour between hardware and firmware, or between firmware and operating system.  Different components present different abstractions to layers above, and larger-scale hardware choices support or hinder systemic architectural objectives.

Shortly after I came to Joyent in early 2012, I began working on a plan to augment and replace our existing server fleet.  As we began discussing Manta, it became clear that the project would require servers with a different balance than we needed in our public cloud.  Like the Sun Storage 7000 systems I worked on at Fishworks, many of the systems at the heart of Manta would be storing user data.  But there’s a catch!  These same systems also execute user code; hence, bringing compute to data.  What balance among CPU cycles, DRAM size, storage capacity, and storage performance would be required by such an application?  As with any new model, there are unknowns, and this was, and still is, a key one.  It will be your ideas, your use cases, that ultimately dictate how the Joyent Manta fleet is assembled.  Our best guess, based on commodity economics and our experience, is embodied in the Mantis Shrimp, a 4U server capable of storing some 73 TiB of user data (soon to be nearer 100 with the introduction of 4 TB disk drives) and sharing nearly all of its remaining components with the systems comprising our public cloud infrastructure and the more conventional components of the Manta service.  By standardising on components across our fleets, we reduce operational costs and engineering effort; at the same time, we have the flexibility to tune system balance across a broad spectrum while remaining well within the industry-wide price/performance “sweet spots”.  Joyent provides unmatched transparency in our server and component selection: you can read the same BOMs and specifications our manufacturing partners work from in our repository on GitHub, you can purchase these certified systems for your own SmartDataCenter-based private or hybrid clouds, and you can use basic OS tools to inspect the machine on which your software is running, whether in Manta or in the public cloud.  Manta may seem magical, but the systems at the bottom of the stack are no mystery.

As we learn from our Manta customers, we’ll be adjusting our fleet to match demand for storage and compute; a critical part of our own big data strategy is understanding the utilisation of our infrastructure and adapting to customer needs.  One of the great things about working for a systems company is being able to create and use tools at every level of the stack to collect the raw data that drives quality decisions.  Without technology like DTrace and Cloud Analytics, our view of resource consumption would be woefully inadequate to this task; this kind of innovation is technologically impossible to accomplish without downstack support.  More than once I’ve wondered how anyone can build software without the observability and debugging tools SmartOS offers.

The second set of important decisions is architectural.  Storage architectures run the gamut from heavily centralised, vertically-scaled SANs to fully decentralised systems built entirely around local storage.  With the lessons from our Fishworks experience in mind, we’ve chosen the latter for both our cloud management stack and Manta.  Every object in Manta is stored by default on 2 ZFS storage pools, each local to the server on which it is accessed.  There are no SANs, no NAS heads, and no hardware RAID controllers.  ZFS, while not visible to Manta consumers, nevertheless provides both the proven reliability essential to any storage product and the detailed observability required to diagnose and repair faults and assess future resource needs.  This architecture is not itself a differentiator for Manta users, but it enables us to make Manta faster, cheaper, and more reliable than it would otherwise be, and — crucially to our strategy for bringing compute to data — without requiring us to erasure-encode individual objects across the fleet.  While Redmond has finally joined the durable pooled storage party, much of the world is still hamstrung by expensive SANs, opaque and unsafe hardware RAID controllers, or unreliable local storage.  We’ve been working with ZFS so long that we often take it for granted, but it’s tough to overstate how bad the alternatives are, and we’re very thankful to be deploying Manta atop ZFS and basic SAS HBA technology.

So, for the second time in my career, I find myself at the bottom of the stack, focused on technologies that are at once utterly essential and entirely invisible to the end user.  As big a game-changer as Manta’s interface is, it cannot exist without a solid foundation beneath.  No one in the industry has a better foundation than Joyent; Manta offers an example of what becomes possible when that foundation not only supports but actively aids upstack software.  We hope you find it as exciting as we do.
