Rebutting a rebuttal
Normally, I’m hesitant to engage people on the internet in an argument. Typically, you get about one or two useful responses before it descends into a complete flame war. So I thought I’d take my one remaining useful response and comment on this blog post, which is a rebuttal to my previous post and accuses me of being a “single Sun misinformed developer”.
To state that the Linux kernel developers don’t believe in those good, sound engineering values, is pretty disingenuous … These [sic] is pretty much a baseless argument just wanting to happen, and as he doesn’t point out anything specific, I’ll ignore it for now.
Sigh. It’s one thing to believe in sound engineering values, and quite another to develop them as an integral part of your OS. I’m not saying the Linux community doesn’t care about these things at all, just that they’re not a high priority. The original goal of my post was not “our technology is better than yours,” only that we have different priorities. But if you want a technology comparison, here are some Solaris examples:
- Reliability – Reliability is more than just “we’re more stable than Windows.” We need to be reliable in the face of hardware failure and service failure. If I get an uncorrectable error on a user process page, predictive self healing can restart the service without rebooting the machine and without risking memory corruption. Fault Management Architecture can offline CPUs in response to hardware errors and retire pages based on the frequency of correctable errors. ZFS provides complete end-to-end checksums, capable of detecting phantom writes and firmware bugs, and automatically repairs bad data without affecting the application. The service management facility can ensure that transient application failures do not result in a loss of availability.
- Serviceability – When things go wrong (and trust me, they will go wrong), we need to be able to solve the problem in as little time as possible with the lowest cost to the customer and Sun. If the kernel crashes, we get a concise file that customers can send to support without having to reproduce the problem on an instrumented kernel or instruct support how to recreate their production environment. With the fault management architecture, an administrator can walk up to any Solaris machine, type a single command, and see a history of all faulty components in the system, when and how they were repaired, and the severity of the problems. All hardware failures are linked to an online knowledge base with recommended repair procedures and best practices. With ZFS, disks exhibiting questionable data integrity can automatically be removed from storage pools without interruption of normal service to prevent outright failure. Dynamic reconfiguration allows entire CPU boards to be removed from the system without rebooting.
- Observability – DTrace allows real-world administrators (not kernel developers) to see exactly what is happening on their system, tracing arbitrary data from user applications and the kernel, aggregating it and coordinating with disjoint events. With kmdb, developers can examine the static state of the kernel, step through kernel functions, and modify kernel memory. Commands like trapstat provide hardware trap statistics, and CPU event counters can be used to gather hardware-assisted profiling data via libcpc.
- Resource management – With Solaris resource management, users can control memory and CPU shares, IPC tunables, and a variety of other constraints on a per-process basis. Processes can be grouped into tasks to allow easy management of a class of applications. Zones allow a system to be partitioned and administered from a central location, dividing the same physical resources amongst OS-like instances. With process rights management, users can be given individual privileges to manage privileged resources without having to have full root access.
That’s just a few features of Solaris off the top of my head. There are Linux projects out there approaching some semblance of these features, but I’d contend that none of them is as polished and comprehensive as those found in Solaris, and most probably live as patches that have not made their way into the mainstream distributions (RedHat), despite years of development. This is simply because these ideas are not given the highest priority, which is a judgement call by the community and perfectly reasonable.
Crash dumps. The main reason this option has been rejected is the lack of a real, working implementation. But this is being fixed. Look at the kexec based crashdump patch that is now in the latest -mm kernel tree. That is the way to go with regards to crash dumps, and is showing real promise. Eventually that feature will make it into the main kernel tree.
I blamed crash dumps on Linus. You blame their failure on poor implementation. Whatever your explanation, the fact remains that this project was started back in 1997-99. Forget kernel development – this is our absolute #1 priority when it comes to serviceability. It has taken the Linux community seven years to get something that is “showing real promise” and still not in the main kernel tree. Not to mention the post-mortem tools are extremely basic (30 different commands, compared with the 700+ available in mdb).
Kernel debuggers. Ah, a fun one. I’ll leave this one alone only to state that I have never needed to use one, in all the years of my kernel development. But I know other developers who swear by them. So, to each their own. For hardware bringup, they are essential. But for the rest of the kernel community, they would be extra baggage.
Yes, kernel debuggers are needed in a relatively small number of situations. But in these situations, they’re absolutely essential. Also, just because you haven’t used one yet doesn’t mean it isn’t necessary. All bugs can be solved simply by looking at the source code long enough. But that doesn’t mean it’s practical. The claim that it’s “extra baggage” is bizarre. Are you worried about additional source code? Binary bloat? How can a kernel debugger be extra baggage if I don’t use it?
Tracing frameworks. Hm, then what do you call the kprobes code that is in the mainline kernel tree right now? This suffered the same issue that the crash dump code suffered, it wasn’t in a good enough state to merge, so it took a long time to get there. But it’s there now, so no one can argue it again.
Yes, the kprobes code that was just accepted into the mainline branch only a week and a half ago (that must be why I’m so misinformed). KProbes seems like a good first step, but it needs to be tied into a framework that administrators can actually use. LTT is a good beginning, but it’s been under development for five years and still isn’t integrated into the main tree. It’s quite obvious that the Linux kernel maintainers don’t perceive tracing as anything more than a semi-useful debugging tool. There’s more to a tracing framework than just KProbes, and any of our avid DTrace customers (administrators) are living proof that this view is false.
We (kernel developers) do not have to accept any feature that we deem is not properly implemented, just because some customer or manager tells us we have to have it. In order to get your stuff into the kernel, you must first tell the community why it is necessary, and so many people often forget this. Tell us why we really need to add this new feature to the kernel, and ensure us that you will stick around to maintain it over time.
First of all, every feature in Solaris 10 was conceived by Solaris kernel developers based on a decade of interactions with real customers solving real problems. We’re not just a bunch of monkeys out to do management’s bidding. Second of all, you don’t implement a feature “just because some customer” wants it? What better reason could you possibly have? Perhaps you’re thinking that because some customer really wants something, we just integrate whatever junk we can come up with in a month’s time. If this were true, don’t you think you’d see Kprobes in Solaris instead of DTrace?
First off, this [binary compatibility] is an argument that no user cares about.
We have customers paying tens of millions of dollars precisely because we claim backwards compatibility. This is an example of where Linux is just competing in a different space than Solaris, hence the different priorities. If your customer is a 25-year-old Linux advocate managing 10 servers for the University, then you’re probably right. But if your customer is a 200,000 employee company with tens of thousands of servers and hundreds of millions of dollars riding on their applications, then you’re dead wrong.
The arguments he makes for binary incompatibility are all ones I’ve heard before. Yes, not having binary compatibility makes it easier to change a kernel interface. But it’s not exactly rocket science to maintain an evolving interface with backwards compatibility. Yes, you can rewrite interfaces. But you can do so in a well defined and well documented manner. How hard is it to have a stable DDI for all of 2.4, without worrying that 2.4.21 is incompatible with 2.4.22-ac? As far as I’m concerned, changing compiler options that break binary compatibility is a bug. Fix your interfaces so they don’t depend on structures that change definition at the drop of a hat. Binary compatibility can be a pain, but it’s very important to a lot of our customers.
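To make "evolving interface with backwards compatibility" a bit more concrete, here is a minimal sketch of one common technique: a size-stamped structure that only ever grows at the end. Every name in it (`drv_ops_v1`, `drv_ops`, `drv_register`, `old_open`) is hypothetical and invented for illustration; this is not the Solaris DDI or any real Linux interface, just one way such an interface could be kept stable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical version 1 of a driver operations table. The caller
 * stamps it with its own sizeof so the kernel side can tell which
 * layout the driver was compiled against. */
struct drv_ops_v1 {
    size_t size;
    int (*open)(int unit);
};

/* Hypothetical version 2: a field is appended, nothing moves. */
struct drv_ops {
    size_t size;
    int (*open)(int unit);
    int (*suspend)(int unit);   /* new in v2 */
};

/* Default used for drivers that predate the suspend entry point. */
static int default_suspend(int unit) { (void)unit; return 0; }

/* Accepts a table of any known version. Copies only the bytes the
 * caller actually provided and fills in defaults for the rest.
 * Returns 0 on success, -1 if the table is too small to be valid. */
int drv_register(const void *caller_ops, struct drv_ops *out)
{
    size_t size;

    /* The size stamp is the first member in every version. */
    memcpy(&size, caller_ops, sizeof size);
    if (size < sizeof(struct drv_ops_v1))
        return -1;
    if (size > sizeof *out)
        size = sizeof *out;     /* newer driver: ignore unknown tail */

    *out = (struct drv_ops){0}; /* missing entries start out NULL */
    memcpy(out, caller_ops, size);
    out->size = sizeof *out;
    if (out->suspend == NULL)
        out->suspend = default_suspend;
    return 0;
}

/* Stand-in for a driver built against the old v1 header. */
static int old_open(int unit) { return unit; }
```

A driver compiled against the v1 header fills in a `struct drv_ops_v1`, sets `size` to `sizeof` that struct, and passes it to `drv_register`. It keeps working after the structure grows, because the new entry point is filled in with a default rather than read from memory the old driver never provided.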
And when Sun realizes the error of their ways, we’ll still be here making Linux into one of the most stable, feature rich, and widely used operating systems in the world.
For some reason, all Linux advocates have an “us or them” philosophy. In the end, we have exactly what I said at the beginning of my first post. Solaris and Linux have different goals and different philosophies. Solaris is better at many things. Linux is better at many things. For our customers, for our business, Linux simply isn’t moving in the right direction for us. There’s only so many “we’re getting better” comments about Linux that I can take: the proof is in the pudding. The Linux community at large just isn’t motivated to accomplish the same goals we have in Solaris, which is perfectly fine. I like Linux; it has many admirable qualities (great hardware support, for example). But it just doesn’t align with what we’re trying to accomplish in Solaris.