Concurrency’s Shysters

For as long as I’ve been in computing, the subject of concurrency has always induced a kind of thinking man’s hysteria. When I was coming up, the name of the apocalypse was symmetric multiprocessing — and its arrival was to be the Day of Reckoning for software. There seemed to be no end of doomsayers, even among those who putatively had the best understanding of concurrency. (Of note was a famous software engineer who — despite substantial experience in SMP systems at several different computer companies — confidently asserted to me in 1995 that it was “simply impossible” for an SMP kernel to “ever” scale beyond 8 CPUs. Needless to say, several of his past employers have since proved him wrong…)

There also seemed to be no end of concurrency hucksters and shysters, each eager to peddle their own quack cure for the miasma. Of these, the one that stuck in my craw was the two-level scheduling model, whereby many user-level threads are multiplexed on fewer kernel-level (schedulable) entities. (To paraphrase what has been oft said of little-known computer architectures, you haven’t heard of it for a reason.) The rationale for the model — that it allowed for cheaper synchronization and lightweight thread creation — seemed to me at the time to be long on assertions and short on data. So working with my undergraduate advisor, I developed a project to explore this model both quantitatively and dynamically, work that I undertook in the first half of my senior year. And early on in that work, it became clear that — in part due to intractable attributes of the model — the two-level thread scheduling model was delivering deeply suboptimal performance…

Several months after starting the investigation, I came to interview for a job at Sun with Jeff, and he (naturally) asked me to describe my undergraduate work. I wanted to be careful here: Sun was the major proponent of the two-level model, and while I felt that I had the hard data to assert that the model was essentially garbage, I also didn’t want to make a potential employer unnecessarily upset. So I stepped gingerly: “As you may know,” I began, “the two-level threading model is very… intricate.” “Intricate?!” Jeff exclaimed, “I’d say it’s completely busted!” (That moment may have been the moment that I decided to come work with Jeff and for Sun: the fact that an engineer could speak so honestly spoke volumes for both the engineer and the company. And despite Sun’s faults, this engineering integrity remains at Sun’s core to this day — and remains a draw to so many of us who have stayed here through the ups and downs.) With that, the dam had burst: Jeff and I proceeded to gush about how flawed we each thought the model to be — and how dogmatic its presentation. So paradoxically, I ended up getting a job at Sun in part by telling them that their technology was unsound!

Back at school, I completed my thesis. Like much undergraduate work, it’s terribly crude in retrospect — but I stand behind its fundamental conclusion that the unintended consequences of the two-level scheduling model make it essentially impossible to achieve optimal performance. Upon arriving at Sun, I developed an early proof-of-concept of the (much simpler) single-level model. Roger Faulkner did the significant work of productizing this as an alternative threading model in Solaris 8 — and he eliminated the two-level scheduling model entirely in Solaris 9, thus ending the ill-begotten experiment of the two-level scheduling model somewhere shy of its tenth birthday. (Roger gave me the honor of approving his request to integrate this work, an honor that I accepted with gusto.)

So why this meandering walk through a regrettable misadventure in the history of software systems? Because over a decade later, concurrency is still being used to instill panic in the uninformed. This time, it is chip-level multiprocessing (CMP) instead of SMP that promises to be the End of Days — and the shysters have taken a new guise in the form of transactional memory. The proponents of this new magic tonic are in some ways darker than their forebears: it is no longer enough to warn of Judgement Day — they must also conjure up notions of Original Sin to motivate their perverted salvation. “The heart of the problem is, perhaps, that no one really knows how to organize and maintain large systems that rely on locking” admonished Nir Shavit recently in CACM. (Which gives rise to the natural follow-up question: is the Solaris kernel not large, does it not rely on locking or do we not know how to organize and maintain it? Or is that we do not exist at all?) Shavit continues: “Locks are not modular and do not compose, and the association between locks and data is established mostly by convention.” Again, no data, no qualifiers, no study, no rationale, no evidence of experience trying to develop such systems — just a naked assertion used as a prop for a complicated and dubious solution. Are there elements of truth in Shavit’s claims? Of course: one can write sloppy, lock-based programs that become a galactic, unmaintainable mess. But does it mean that such monstrosities are inevitable? No, of course not.

So fine, the problem statement is (deeply) flawed. Does that mean that the solution is invalid? Not necessarily — but experience has taught me to be wary of crooked problem statements. And in this case (perhaps not surprisingly) I take umbrage with the solution as well. Even if one assumes that writing a transaction is conceptually easier than acquiring a lock, and even if one further assumes that transaction-based pathologies like livelock are easier on the brain than lock-based pathologies like deadlock, there remains a fatal flaw with transactional memory: much system software can never be in a transaction because it does not merely operate on memory. That is, system software frequently takes action outside of its own memory, requesting services from software or hardware operating on a disjoint memory (the operating system kernel, an I/O device, a hypervisor, firmware, another process — or any of these on a remote machine). In much system software, the in-memory state that corresponds to these services is protected by a lock — and the manipulation of such state will never be representable in a transaction. So for me at least, transactional memory is an unacceptable solution to a non-problem.

As it turns out, I am not alone in my skepticism. When we on the Editorial Advisory Board of ACM Queue sought to put together an issue on concurrency, the consensus was twofold: to find someone who could provide what we felt was much-needed dissent on TM (and in particular on its most egregious outgrowth, software transactional memory), and to have someone speak from experience on the rise of CMP and what it would mean for practitioners.

For this first article, we were lucky enough to find Calin Cascaval and colleagues, who ended up writing a must-read article on STM in November’s CACM. Their conclusions are unavoidable: STM is a dog. (Or as Cascaval et al. more delicately put it: “Based on our results, we believe that the road for STM is quite challenging.”) Their work is quantitative and analytical and (best of all, in my opinion) the authors never lose sight of the problem that transactional memory was meant to solve: to make parallel programming easier. This is important, because while many of the leaks in the TM vessel can ultimately be patched, the patches themselves add layer upon layer of complexity. Cascaval et al. conclude:

And because the argument for TM hinges upon its simplicity and productivity benefits, we are deeply skeptical of any proposed solutions to performance problems that require extra work by the programmer.

And while their language is tighter (and the subject of their work a weightier and more active research topic), the conclusions of Cascaval et al. are eerily similar to my final verdict on the two-level scheduling model, over a decade ago:

The dominating trait of the [two-level] scheduling model is its complexity. Unfortunately, virtually all of its complexity is exported to the programmer. The net result is that programmers must have a complete understanding of the model the inner workings of its implementation in order to be able to successfully tap its strengths and avoid its pitfalls.

So TM advocates: if Roger Faulkner knocks on your software’s door bearing a scythe, you would be well-advised to not let him in…

For the second article, we brainstormed potential authors — but as we dug up nothing but dry holes, I found myself coming to an unescapable conclusion: Jeff and I should write this, if nothing else as a professional service to prevent the latest concurrency hysteria from reaching epidemic proportions. The resulting article appears in full in the September issue of Queue, and substantially excerpted in the November issue of CACM. Writing the article was a gratifying experience, and gave us the opportunity to write down much of what we learned the hard way in the 1990s. In particular, it was cathartic to explore the history of concurrency. Having been at concurrency’s epicenter for nearly a decade, I felt that the rise of CMP has been recently misrepresented as a failure of hardware creativity — and it was vindicating to read CMP’s true origins in the original DEC Piranha paper: that given concurrent databases and operating systems, implementing multiple cores on the die was simply the best way to deliver OLTP performance. That is, it was the success of concurrent software — and not the failure of imagination on the part of hardware designers — that gave rise to the early CMP implementations. Hopefully practitioners will enjoy reading the article as much as we enjoyed writing it — and here’s hoping that we live to see a day when concurrency doesn’t attract so many schemers and dreamers!

Posted on November 3, 2008 at 11:24 pm by bmc · Permalink
In: Solaris

10 Responses

Subscribe to comments via RSS

  1. Written by Garrett D'Amore
    on November 4, 2008 at 9:50 am
    Permalink

    Wow, an unbelievably excellent post on a topic that I have found myself wondering about as well. TM has seemed to me at least, to offer additional complexity without really solving the hard problems that I wind up using locks (and other synchronization primitives in the kernel) to solve, and your post puts to words what I’ve been feeling for a while. Thank you!

  2. Written by UX-admin
    on November 5, 2008 at 1:19 am
    Permalink

    "This article is half stern lecture on the merits of abstinence and half Kama Sutra."
    I love your writing style!
    BTW, a carfeul reader will have noted that you already knew about the two-level scheduling model in Solaris. Given that at that time Solaris wasn’t open source, how did you know how threading was implemented in Solaris, before having actually been hired by Jeff to work at Sun?

  3. Written by RNC
    on November 5, 2008 at 5:24 am
    Permalink

    So what is your view on Harware Transactional Memory as implemented in Rock? A CPU upon which Sun Micro’s future appears dependent?

  4. Written by Peter Schow
    on November 5, 2008 at 8:07 am
    Permalink

    @UX-admin: Take a look at reference [9] and [8] in the mentioned thesis.

  5. Written by Bryan Cantrill
    on November 5, 2008 at 2:54 pm
    Permalink

    UX-admin,
    Peter’s right; check out those references for the architecture of the two-level model. It should also be said (and I don’t think I actually said it in my thesis) that the lab in which I worked had a Solaris source license. Having the source was instrumental in me being able to do my research — a fact which helped inform my bias towards open source after I arrived at Sun…

  6. Written by Bryan Cantrill
    on November 5, 2008 at 2:59 pm
    Permalink

    RNC,
    Yes, a good question — and I suppose another way in which the two-level/TM analogue holds up is that Sun had/has a major stake in both. ;) The answer is that I’m not close enough to Rock to answer the implementation issue definitively — but when the TM issue was initially discussed in the CPU architecture committees in which I participated (in 2001), I did not withhold my skepticism. Rock is not — or should not be — "dependent" on TM. They have added support for it, and if some body of researchers or (less likely in my opinion) practitioners find that support useful, great. But TM in Rock should be a sideshow, not the main event…

  7. Written by Keith Adams
    on November 5, 2008 at 3:14 pm
    Permalink

    "Shysters" is, at least, uncharitable. The TM partisans are, in fact, wrong, and it’s becoming increasingly acceptable to say so in polite company. But I suspect that they’ve come to their wrongness in good faith…

  8. Written by Bryan Cantrill
    on November 5, 2008 at 3:37 pm
    Permalink

    Keith,
    First, loved your follow-up to our work:
    http://x86vmm.blogspot.com/2008/11/cantrill-and-bonwick-get-all-concurrent.html
    Yes, the microkernel debate is another interesting analogue, and your observations are spot-on (and I say this as one who did kernel work for a microkernel operating system). Do you think there are enough of these for a book? "Locked in the Cellar: System Software’s Crazy Aunts, 1970-present"?
    In terms of good faith: the problem I have is not that the TM folks are incorrect, it’s the arrogance of the sweeping assertions. Speaking personally, I have attempted to inject a little data/reality into the thinking of some TM partisans, if only to get them to narrow the scope of their assertions a bit. I have been roundly ignored each time. Perhaps not malice, but it is certainly true that there is a point at which malice and incompetence become impossible to distinguish from one another…

  9. Written by David Holmes
    on December 4, 2008 at 6:34 pm
    Permalink

    Here, here Bryan! As part of the core group involved in developing the Java Concurrency Utilities and having been teaching about concurrent programming in Java for over ten years, it was disheartening to read so much nonsense about the "CMT sky is falling". Does CMT impose additional challenges for effective concurrent programming? Sure. But there are so many fallacies in the arguments being put forward: the main one being that a single application needs to keep all of those core’s busy. I’m all for additional parallelism, but even then the simplistic programming models being advocated in a number of languages/platforms don’t even take into account that there are thresholds below which parallelizing a problem just doesn’t make sense. (Just because you can, doesn’t mean you should!).
    As for TM, well I’ve long been a software TM skeptic, for the reasons you outline: TM relies on the ‘M’ and once you have real programs that need to handle other things transactionally (or rather has things that can’t be handled transactionally!) then STM breaks down. I’ve read, and reviewed, a lot of academic papers on STM, and as the authors try to expand their models to cope with things that are inherently non-transactional, the programming model gets more and more complex, to the point where I don’t believe that the resulting model is any better than "threads & locks" – far from it. Plus the performance is terrible too.
    Hardware TM is a slightly different story. You can take advantage of HTM for very small critical sections code (that do only involve memory) and gain performance benefits over alternative techniques. And the programming model is somewhat simpler compared to lock-free techniques (but marginally so given the programming models at this low-level are fairly simple to begin with).
    Regards,
    David Holmes

  10. Written by Bryan Cantrill
    on December 4, 2008 at 9:53 pm
    Permalink

    David,
    Thanks for the thoughtful comments — it’s always a relief when a fellow domain expert agrees! As for HTM, I’m skeptical that they do indeed gain over traditional techniques, especially for small critical sections. After all, I’m still doing read-to-own bus transactions (no way around that), and that’s a much greater cost than the pipeline stalls from a compare&swap. Further: are the scenarios in which HTM results in a higher performing system for the contended case or the uncontended case? If the former (that is, if HTM only putatively shows a benefit when contention is high), HTM is falling into a classic architecture pitfall: optimizing for the wrong case. In the Solaris kernel — as in any mature system with fine-grained parallelism — the vast, vast majority of locks are uncontended. (Indeed, that’s the whole damn point of fine-grained parallelism.) And when we do find a lock that has high contention, we take the steps necessary to defract or eliminate that contention — we don’t optimize for the contention itself.
    My HTM skepticism is also heightened by the fact that the world has already had the opportunity to experiment with a flavor of HTM: namely, load-linked/store-conditional (implemented at least by Alpha and PowerPC). Now, there is obviously a difference between LL/SC and full-blown HTM, but if one is making the argument that HTM is essentially most useful for "very small critical sections", I think one needs to address what HTM would solve that LL/SC didn’t…

Subscribe to comments via RSS