Eric Schrock's Blog

Month: September 2004

About a month ago, I added a new pseudo filesystem to Solaris: ‘objfs’, mounted at /system/object. This is an “under the hood” filesystem that no user will interact with directly. But it’s a novel solution to a particularly thorny problem for DTrace, and may be of some interest to the curious techies out there.

When DTrace was first integrated into Solaris, it had a few hacks to get around the problem of accessing kernel module data from userland. In particular, it opened /dev/kmem in order to find the address ranges of module text and data segments, and introduced a private modctl call in order to extract CTF and symbol table information. The end result mostly worked, but with a few drawbacks. Opening /dev/kmem requires all privileges or membership in group sys, so even if you gave a user the dtrace_kernel privilege, they were still unable to access kernel CTF and symbol information. Direct access via modctl necessitated a complicated (and sometimes broken) dance to allow 32-bit DTrace apps to work on a 64-bit kernel.

The solution was to create a new pseudo filesystem which would export information about the currently loaded objects in the kernel as standard ELF files. Choosing a pseudo filesystem over a new system call or modctl has several advantages:

  • Filesystems are great for exporting hierarchical data. We needed to export a collection of named data – a perfect fit for a directory layout.
  • By modeling objects as directories and choosing an ELF file format, we have left room for expansion without having to go back and modify existing implementations.
  • We can leverage our existing toolset for working with ELF files: elfdump(1), nm(1), libelf(3LIB), and libctf. The userland changes to libdtrace(3LIB) were minimal because we already have established interfaces for working with ELF files.
  • Filesystems are easily virtualized in a local zone. DTrace is still not usable from within a local zone for a few small reasons, but we’re one step closer.
  • There are no data model issues. We simply export a 64-bit ELF object, and the gelf_xxx() routines handle the conversion transparently.

The final result is:

$ elfdump -c /system/object/genunix/object
Section Header[1]:  sh_name: .shstrtab
sh_addr:      0xa6eaea30      sh_flags:   [ SHF_STRINGS ]
sh_size:      0x46            sh_type:    [ SHT_STRTAB ]
sh_offset:    0x1c4           sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[2]:  sh_name: .SUNW_ctf
sh_addr:      0xa61f7000      sh_flags:   0
sh_size:      0x2e79d         sh_type:    [ SHT_PROGBITS ]
sh_offset:    0x20a           sh_entsize: 0
sh_link:      3               sh_info:    0
sh_addralign: 0x8
Section Header[3]:  sh_name: .symtab
sh_addr:      0xa61b5050      sh_flags:   0
sh_size:      0x1f7d0         sh_type:    [ SHT_SYMTAB ]
sh_offset:    0x2e9a7         sh_entsize: 0x10
sh_link:      4               sh_info:    0
sh_addralign: 0x8
Section Header[4]:  sh_name: .strtab
sh_addr:      0xa61d96dc      sh_flags:   [ SHF_STRINGS ]
sh_size:      0x1cd5e         sh_type:    [ SHT_STRTAB ]
sh_offset:    0x4e177         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[5]:  sh_name: .text
sh_addr:      0xfe87e4a0      sh_flags:   [ SHF_ALLOC  SHF_EXECINSTR ]
sh_size:      0x198dc0        sh_type:    [ SHT_NOBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[6]:  sh_name: .data
sh_addr:      0xfec3eba0      sh_flags:   [ SHF_WRITE  SHF_ALLOC ]
sh_size:      0x3e1c0         sh_type:    [ SHT_NOBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[7]:  sh_name: .bss
sh_addr:      0xfed7a5f0      sh_flags:   [ SHF_WRITE  SHF_ALLOC ]
sh_size:      0x7664          sh_type:    [ SHT_NOBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[8]:  sh_name: .info
sh_addr:      0x1             sh_flags:   0
sh_size:      0x4             sh_type:    [ SHT_PROGBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[9]:  sh_name: .filename
sh_addr:      0xfec3e8e0      sh_flags:   0
sh_size:      0x10            sh_type:    [ SHT_PROGBITS ]
sh_offset:    0x6aed9         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8

The string table, symbol table, and CTF data are all complete. You’ll notice that we also have text, data, and bss sections, but they’re marked SHT_NOBITS (which means their contents are not present in the file). We use the section headers to extract the address range for each section, but we can’t actually export the data itself, for security reasons. Obviously, letting ordinary users see the data section of loaded modules would be a Bad Thing.
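
Just for illustration, here’s a minimal sketch (mine, not part of the original work) of how a userland consumer might walk these section headers using libelf and the gelf_xxx() routines, much as libdtrace does. Compile with -lelf; error handling is abbreviated:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <gelf.h>

int
main(void)
{
	int fd;
	Elf *elf;
	Elf_Scn *scn = NULL;
	GElf_Ehdr ehdr;
	GElf_Shdr shdr;
	const char *name;

	if (elf_version(EV_CURRENT) == EV_NONE)
		exit(1);
	if ((fd = open("/system/object/genunix/object", O_RDONLY)) == -1)
		exit(1);
	if ((elf = elf_begin(fd, ELF_C_READ, NULL)) == NULL ||
	    gelf_getehdr(elf, &ehdr) == NULL)
		exit(1);

	/* Walk each section header, printing its name, address, and size. */
	while ((scn = elf_nextscn(elf, scn)) != NULL) {
		if (gelf_getshdr(scn, &shdr) == NULL)
			continue;
		name = elf_strptr(elf, ehdr.e_shstrndx, shdr.sh_name);
		(void) printf("%-12s addr=0x%llx size=0x%llx\n",
		    name == NULL ? "?" : name,
		    (unsigned long long)shdr.sh_addr,
		    (unsigned long long)shdr.sh_size);
	}

	(void) elf_end(elf);
	(void) close(fd);
	return (0);
}

Since objfs exports standard ELF, none of this code knows or cares that what it is describing is a live kernel module; the gelf_xxx() routines handle the 32-bit/64-bit conversion transparently.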

To end in typical “Behind the Music” fashion – after a nightmare descent into drug and alcohol abuse, objfs was once again able to take control of its life (thanks mostly to a loving relationship with libdtrace), and now lives a relaxing life on a Montana ranch.

So my last few posts have sparked quite a bit of discussion out there, appearing on Slashdot as well as OSNews. It’s been quite an interesting experience, though it’s had a significant effect on my work productivity today 🙂 While I’m not responding to every post, I promise that I’m reading them (and thanks to those of you sending private mail, I promise to respond soon).

I have to say that I’ve been reasonably impressed with the discussion so far. Slashdot, as usual, leaves something to be desired (even reading at +5), but the comments in my blog and in my email have been for the most part very reasonable. There is a certain amount of typical fanboy drivel (more so on the pro-Linux side, but only because Solaris doesn’t have many fanboys). But there’s also a reasonable contingent on Slashdot fighting down the baseless arguments of the zealots. In the past, the debate has been rather one-sided. Solaris is usually dismissed as an OS for big computers for people with lots of money. Sun has traditionally let our marketing department do all the talking, which works really well for CEOs and CTOs (our paying customers), but not as well for spreading detailed technical knowledge to the developer community. We’re changing our business model – encouraging blogs, releasing Solaris Express, hosting discussions with actual kernel engineers, and eventually open sourcing Solaris – to encourage direct connections with the community at large.

We’ve been listening to the (often one-sided) discussion for a long time now, and it shows in Solaris. Solaris 10 has killer performance, even on single- and dual-processor x86 machines. Hardware support has been greatly improved (S10 installed on my Toshiba laptop without a hitch). We’re focusing on the desktop again, with X.Org integration, Gnome 2.6, Mozilla 1.7, and better open source packages all around. Sure, we’re still playing catchup in a lot of ways, but we’re learning. I only hope the Linux community can learn from Solaris’s strengths, and dismiss many of the Solaris stereotypes that have been implanted (not always without merit) over the course of history. Healthy competition is good, and can only benefit the customer.

As much as I would like to continue this debate forever, I think it’s time I get back to doing what I really love – making Solaris into the best OS it can be. I’ll probably be focusing on more technical posts for a while, but I’ll revive the discussion again at a future point. Until then, feel free to continue posting comments or sending me mail. I do read them, even if I don’t respond publicly.

Normally, I’m hesitant to engage people on the internet in an argument. Typically, you get about one or two useful responses before it descends into a complete flame war. So I thought I’d take my one remaining useful response and comment on this blog which is a rebuttal to my previous post, and accuses me of being a “single Sun misinformed developer”.

To state that the Linux kernel developers don’t believe in those good, sound engineering values, is pretty disingenuous … These [sic] is pretty much a baseless argument just wanting to happen, and as he doesn’t point out anything specific, I’ll ignore it for now.

Sigh. It’s one thing to believe in sound engineering values, and quite another to develop them as an integral part of your OS. I’m not saying the Linux community doesn’t care about these things at all, just that they’re not a high priority. The original goal of my post was not “our technology is better than yours,” only that we have different priorities. But if you want a technology comparison, here are some Solaris examples:

  • Reliability – Reliability is more than just “we’re more stable than Windows.” We need to be reliable in the face of hardware failure and service failure. If I get an uncorrectable error on a user process page, Predictive Self Healing can restart the service without rebooting the machine and without risking memory corruption. The Fault Management Architecture can offline CPUs in response to hardware errors and retire pages based on the frequency of correctable errors. ZFS provides complete end-to-end checksums, capable of detecting phantom writes and firmware bugs, and can automatically repair bad data without affecting the application. The service management facility can ensure that transient application failures do not result in a loss of availability.
  • Serviceability – When things go wrong (and trust me, they will go wrong), we need to be able to solve the problem in as little time as possible, with the lowest cost to the customer and to Sun. If the kernel crashes, we get a concise crash dump that customers can send to support without having to reproduce the problem on an instrumented kernel or teach support how to recreate their production environment. With the Fault Management Architecture, an administrator can walk up to any Solaris machine, type a single command, and see a history of all faulty components in the system, when and how they were repaired, and the severity of the problems. All hardware failures are linked to an online knowledge base with recommended repair procedures and best practices. With ZFS, disks exhibiting questionable data integrity can automatically be removed from storage pools before they fail outright, without interrupting normal service. Dynamic reconfiguration allows entire CPU boards to be removed from the system without rebooting.
  • Observability – DTrace allows real-world administrators (not kernel developers) to see exactly what is happening on their system, tracing arbitrary data from user applications and the kernel, aggregating it and coordinating with disjoint events. With kmdb, developers can examine the static state of the kernel, step through kernel functions, and modify kernel memory. Commands like trapstat provide hardware trap statistics, and CPU event counters can be used to gather hardware-assisted profiling data via libcpc.
  • Resource management – With Solaris resource management, users can control memory and CPU shares, IPC tunables, and a variety of other constraints on a per-process basis. Processes can be grouped into tasks to allow easy management of a class of applications. Zones allow a system to be partitioned and administered from a central location, dividing the same physical resources amongst OS-like instances. With process rights management, users can be given individual privileges to manage privileged resources without needing full root access (a small illustration follows this list).
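
To make that last point concrete, here is a small sketch of my own (not from the original post) using the Solaris 10 privilege interfaces described in privileges(5): a process permanently drops a single privilege from its permitted set.

#include <priv.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	priv_set_t *set;

	/* Fetch the permitted set, remove proc_fork, and apply it. */
	if ((set = priv_allocset()) == NULL)
		exit(1);
	if (getppriv(PRIV_PERMITTED, set) != 0)
		exit(1);
	(void) priv_delset(set, PRIV_PROC_FORK);
	if (setppriv(PRIV_SET, PRIV_PERMITTED, set) != 0)
		exit(1);
	priv_freeset(set);

	/* From here on, fork(2) fails with EPERM, even for root. */
	if (fork() == -1)
		perror("fork");
	return (0);
}

The point is granularity: rights are individual privileges that can be granted or revoked one at a time, rather than the all-or-nothing model of plain root access.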

Those are just a few features of Solaris off the top of my head. There are Linux projects out there approaching some semblance of these features, but I’d contend that none of them is as polished and comprehensive as those found in Solaris, and most live as patches that have not made their way into mainstream distributions (Red Hat, for example), despite years of development. This is simply because these ideas are not given the highest priority, which is a judgment call by the community and perfectly reasonable.

Crash dumps. The main reason this option has been rejected is the lack of a real, working implementation. But this is being fixed. Look at the kexec based crashdump patch that is now in the latest -mm kernel tree. That is the way to go with regards to crash dumps, and is showing real promise. Eventually that feature will make it into the main kernel tree.

I blamed crash dumps on Linus. You blame their failure on poor implementation. Whatever the explanation, the fact remains that this project was started back in 1997-99. Forget kernel development – crash dumps are our absolute #1 priority when it comes to serviceability. It has taken the Linux community seven years to get something that is “showing real promise” and that is still not in the main kernel tree. Not to mention that the post-mortem tools are extremely basic (30 different commands, compared with the 700+ available in mdb).

Kernel debuggers. Ah, a fun one. I’ll leave this one alone only to state that I have never needed to use one, in all the years of my kernel development. But I know other developers who swear by them. So, to each their own. For hardware bringup, they are essential. But for the rest of the kernel community, they would be extra baggage.

Yes, kernel debuggers are needed in a relatively small number of situations. But in those situations, they’re absolutely essential. Also, just because you haven’t used one yet doesn’t mean it isn’t necessary. All bugs can be solved simply by looking at the source code long enough, but that doesn’t mean it’s practical. The claim that it’s “extra baggage” is bizarre. Are you worried about additional source code? Binary bloat? How can a kernel debugger be extra baggage if I don’t use it?

Tracing frameworks. Hm, then what do you call the kprobes code that is in the mainline kernel tree right now? 🙂 This suffered the same issue that the crash dump code suffered, it wasn’t in a good enough state to merge, so it took a long time to get there. But it’s there now, so no one can argue it again.

Yes, the kprobes code that was accepted into the mainline branch only a week and a half ago (that must be why I’m so misinformed). Kprobes seems like a good first step, but it needs to be tied into a framework that administrators can actually use. LTT is a good beginning, but it’s been under development for five years and still isn’t integrated into the main tree. It’s quite obvious that the Linux kernel maintainers don’t perceive tracing as anything more than a semi-useful debugging tool. There’s more to a tracing framework than just kprobes – our avid DTrace customers (administrators, not kernel developers) are living proof to the contrary.

We (kernel developers) do not have to accept any feature that we deem is not properly implemented, just because some customer or manager tells us we have to have it. In order to get your stuff into the kernel, you must first tell the community why it is necessary, and so many people often forget this. Tell us why we really need to add this new feature to the kernel, and ensure us that you will stick around to maintain it over time.

First of all, every feature in Solaris 10 was conceived by Solaris kernel developers based on a decade of interactions with real customers solving real problems. We’re not just a bunch of monkeys out to do management’s bidding. Second of all, you don’t implement a feature “just because some customer” wants it? What better reason could you possibly have? Perhaps you’re thinking that because some customer really wants something, we just integrate whatever junk we can come up with in a month’s time. If that were true, don’t you think you’d see kprobes in Solaris instead of DTrace?

First off, this [binary compatibility] is an argument that no user cares about.

We have customers paying tens of millions of dollars precisely because we claim backwards compatibility. This is an example of where Linux is simply competing in a different space than Solaris, hence the different priorities. If your customer is a 25-year-old Linux advocate managing 10 servers for a university, then you’re probably right. But if your customer is a 200,000-employee company with tens of thousands of servers and hundreds of millions of dollars riding on their applications, then you’re dead wrong.

The arguments he makes for binary incompatibility are all ones I’ve heard before. Yes, not having binary compatibility makes it easier to change a kernel interface. But it’s not exactly rocket science to maintain an evolving interface with backwards compatibility. Yes, you can rewrite interfaces. But you can do so in a well-defined and well-documented manner. How hard is it to have a stable DDI for all of 2.4, without worrying that 2.4.21 is incompatible with 2.4.22-ac? As far as I’m concerned, changing compiler options in a way that breaks binary compatibility is a bug. Fix your interfaces so they don’t depend on structures that change definition at the drop of a hat. Binary compatibility can be a pain, but it’s very important to a lot of our customers.
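
To illustrate, here is a hypothetical sketch of one standard technique (the names and fields are mine, not any actual DDI interface): version the structure, only ever append fields, and never write past what the caller declared.

#define	FOO_VERSION_1	1
#define	FOO_VERSION_2	2	/* v2 appends fi_bar; old fields never move */

typedef struct foo_info {
	int	fi_version;	/* set by the caller before the call */
	int	fi_status;	/* v1 field */
	long	fi_count;	/* v1 field */
	long	fi_bar;		/* v2 field, appended at the end */
} foo_info_t;

int
foo_get_info(foo_info_t *fip)
{
	switch (fip->fi_version) {
	case FOO_VERSION_2:
		fip->fi_bar = 42;	/* only filled in for v2 callers */
		/*FALLTHROUGH*/
	case FOO_VERSION_1:
		fip->fi_status = 0;
		fip->fi_count = 1;
		return (0);
	default:
		return (-1);		/* some future version we don't know */
	}
}

A binary compiled against version 1 keeps working against every later provider, because the provider never touches fields the caller didn’t declare.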

And when Sun realizes the error of their ways, we’ll still be here making Linux into one of the most stable, feature rich, and widely used operating systems in the world.

For some reason, all Linux advocates have an “us or them” philosophy. In the end, we have exactly what I said at the beginning of my first post. Solaris and Linux have different goals and different philosophies. Solaris is better at many things. Linux is better at many things. For our customers, for our business, Linux simply isn’t moving in the right direction for us. There’s only so many “we’re getting better” comments about Linux that I can take: the proof is in the pudding. The Linux community at large just isn’t motivated to accomplish the same goals we have in Solaris, which is perfectly fine. I like Linux; it has many admirable qualities (great hardware support, for example). But it just doesn’t align with what we’re trying to accomplish in Solaris.

So it seems my previous entry has finally started to stir up some controversy. I’ll address some of the technical issues raised here shortly. But first I thought I’d clarify my view of the GPL, with the help of an analogy:

Let’s say that I manufacture wooden two-by-fours, and that I want to make them freely available under an “open source” license. There are several options out there:

  1. You have the right to use and modify my 2x4s to your heart’s content.

    This is the basis for open source software. It protects the rights of the consumer, but imparts few rights to the developer.

  2. You have the right to use my 2x4s however you please, but if you modify one, then you have to make that modification freely available to the public in the same fashion as the original.

    This gives the developer a few more guarantees about what can and cannot be done with his or her contributions. It protects the developer’s rights without infringing on the rights of the consumer.

  3. You have the right to use my 2x4 as-is, but if you decide to build a house with it, then your house must be as freely available as my 2x4.

    This is the provision of the GPL that I don’t agree with, and neither do customers that we’ve talked to. It protects my rights as a developer, but severely limits the rights of the consumer in what can and cannot be done with my public donation.

This analogy has some obvious flaws. Open source software is neither excludable nor rival, unlike the house I just built. There is also a tenuous line between derived works and fair use: in my example, I wouldn’t have any rights to the furniture you put into your house. But I feel it’s a reasonable simplification of my earlier point.

As an open source advocate, I would argue that #1 is the “most free.” This is why, in many ways, the BSD license is the “most open” of all the main licenses. As a developer, I would argue that #2 is the best solution. My contribution is protected – no one can make changes without giving them back to me (and the community at large). But my code is essentially a service, and I feel everyone should have a right to that service, even if they go off and make money from it.

The problems arise when we get to #3, which is the essential controversy of the GPL. To me, this is a personal choice, which is why GPL advocacy often turns into pseudo-religious fanaticism. In many ways, arguing with a GPL zealot is like an atheist arguing with a religious fundamentalist. In the end, they agree on nothing. The atheist leaves understanding the fundamentalist’s beliefs and respecting his or her right to have them. The fundamentalist leaves believing that the atheist will burn in hell for not accepting the one true religion.

This would be fine, except that GPL advocates often blur the line between #2 and #3, and make it seem like the protections of #2 can only be had if you fully embrace the GPL in all its glory. I support the rights provided by #2. You can scream and shout about the benefits of #3 and how it’s an inalienable right of all people, but in the end I just don’t agree. Don’t equate the GPL with open source – if you do want to argue the GPL, make it very clear which points you are arguing for.

One final comment about GPL advocacy. Time and again I see people talk about easing migration, avoiding vendor lock-in, and the necessity of consumer choice. But in the same breath they turn around and scream that you must accept the GPL, and that any other license would be pure evil (or at best, a slow and painful death). Why is it that we have the right to choose everything except our choice of license? I like Linux. I like the GPL. The GPL is not evil. There are a lot of great projects that benefit from the GPL. But it isn’t everything to all people, and in my opinion it’s not what’s best for OpenSolaris.

[ UPDATE ]

As has been enumerated in the comments on this post, the original intent of the analogy was to illustrate the definition of derived works. As mentioned in the comments:

Say I post an example of a function foo() to my website. Oracle goes and uses that function in their software. They make no changes to it whatsoever, and are willing to distribute that function in source code form with their product. If it were GPL, they would have to release all of Oracle under the GPL, even though my code has not been altered. The consumer’s rights are preserved – they still have the same rights to my code as before it was put into Oracle. I just don’t see why they have a right to code that’s not mine.

Though I didn’t explain it well enough, the analogy was never intended to encompass right to use, ownership, distribution, or any of the other qualities of the GPL. It is a specific issue with one part of the GPL, and the analogy is intentionally simplistic in order to demonstrate this fact.

Disclaimer: This is a technical post and not for the faint of heart. You have been warned.

I thought I’d share an interesting debugging session I had a while ago. Personally, I like debugging war stories. They are stories about hard problems solved by talented engineers. Any Solaris kernel engineer will tell you that we love hard problems – a necessary desire for anyone doing operating systems work. Every software discipline has its share of hard problems, but none approach the sheer volume, complexity, and variety of problems we encounter while working on an OS like Solaris.

This was one of those “head scratching” bugs that really had me baffled for a while. While by no means the most difficult bug I have seen (a demographic populated by memory corruption, subtle transient failures, and esoteric hardware interactions), it was certainly not intuitive. What made this bug really difficult was that a simple programming mistake exposed a longstanding and very subtle race condition with timing on multiprocessor x86 machines.

Of course, I didn’t know this at the time. All I knew was that a few of our amd64 build machines were randomly hanging. The system would be up and running fine for a while, and then *poof* – no more keystrokes, no more ping. With kmdb, we just sent a break to the console and took a crash dump. As you’ll see, this problem would have been virtually impossible to solve without a crash dump, due to its unpredictable nature. I loaded mdb on one of the crash dumps to see what was going on:

> ::cpuinfo
ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
0 fec1f1a0  1f   16    0   0  yes    no t-300013 9979c340 make.bin
1 80fdc080  1f   46    0 164  yes    no t-320107 813adde0 sched
2 fec22ca8  1b   19    0  99   no    no t-310905 80aa3de0 sched
3 80fce440  1f   20    0  60  yes    no t-316826 812f79a0 fsflush

We have threads stuck spinning on all CPUs (as evidenced by the long time since last switch). Each thread is stuck waiting for a dispatch lock, except for one:

> 914d4de0::findstack -v
914d4a84 0x2182(914d4de0, 7)
914d4ac0 turnstile_block+0x1b9(a4cc3170, 0, 92f32b40, fec02738, 0, 0)
914d4b0c mutex_vector_enter+0x328()
914d4b2c ipcl_classify_v4+0x30c(a622c440, 6, 14, 0)
914d4b68 ip_tcp_input+0x756(a622c440, 922c80f0, 9178f754, 0, 918cc328,a622c440)
914d4bb8 ip_rput+0x623(918ff350, a622c440)
914d4bf0 putnext+0x2a0(918ff350, a622c440)
914d4d3c gld_recv_tagged+0x108()
914d4d50 gld_recv+0x10(9189d000, a622c440)
914d4d64 bge_passup_chain+0x40(9189d000, a622c440)
914d4d80 bge_receive+0x60(9179c000, 917cf800)
914d4da8 bge_gld_intr+0x10d(9189d000)
914d4db8 gld_intr+0x24(9189d000)
9e90dd64 cas64+0x1a(b143a8a0, 3)
9e90de78 trap+0x101b(9e90de8c, 8ff904bc, 1)
9e90de8c kcopy+0x4a(8ff904bc, 9e90df14, 3)
9e90df00 copyin_nowatch+0x27(8ff904bc, 9e90df14, 3)
9e90df18 instr_is_prefetch+0x15(8ff904bc)
9e90df98 trap+0x6b2(9e90dfac, 8ff904bc, 1)
9e90dfac 0x8ff904bc()

Strangely, we’re stuck in new_mstate():

> turnstile_block+0x1b9::dis
turnstile_block+0x1a2:          adcl   $0x0,%ecx
turnstile_block+0x1a5:          movl   %eax,0x3bc(%edx)
turnstile_block+0x1ab:          movl   %ecx,0x3c0(%edx)
turnstile_block+0x1b1:          pushl  $0x7
turnstile_block+0x1b3:          pushl  %esi
turnstile_block+0x1b4:          call   -0x93dae <new_mstate>
turnstile_block+0x1b9:          addl   $0x8,%esp
turnstile_block+0x1bc:          cmpl   $0x6,(%ebx)
turnstile_block+0x1bf:          jne    +0x1a    
turnstile_block+0x1c1:          movl   -0x20(%ebp),%eax
turnstile_block+0x1c4:          movb   $0x1,0x27b(%eax)
turnstile_block+0x1cb:          movb   $0x0,0x27a(%eax)
turnstile_block+0x1d2:          movl   $0x1,0x40(%esi)

One of the small but important features in Solaris 10 is that microstate accounting is turned on by default. This means we record timestamps at every change of system state, rather than relying on clock-based process accounting. Unfortunately, I can’t yet post the source code to new_mstate() (one more benefit of the coming OpenSolaris – no secrets). But I can say that there is a do/while loop where we spin waiting for the present to catch up to the past. The ia32 architecture was not designed with large MP systems in mind. In particular, the chips have a high-resolution timestamp counter (tsc) with the particularly annoying property that the counters on different CPUs do not have to be in sync. This means that a thread can read tsc on one CPU, get bounced to another, and read a tsc value that appears to be in the past. This is very rare in practice, and we really should never go through this new_mstate() loop more than once. On these machines, we were looping for a very long time.
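
Since I can’t show the real code, here is a purely hypothetical sketch of the shape of that loop (the function name and argument are mine, not the actual new_mstate() source):

#include <sys/time.h>	/* hrtime_t */

extern hrtime_t gethrtime_unscaled(void);	/* kernel's cheap raw-tick clock */

static hrtime_t
mstate_catchup(hrtime_t state_start)
{
	hrtime_t curtime;

	do {
		/* Re-read the present until it passes the recorded past. */
		curtime = gethrtime_unscaled();
	} while (curtime < state_start);
	/* If state_start is bogus and far in the future, this spins forever. */
	return (curtime);
}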

We read this tsc value from gethrtime_unscaled(). Slogging through the assembly for new_mstate(), we see that the result is stored as a 64-bit value in -0x4(%ebp) and -0x1c(%ebp):

new_mstate+0xe3:                call   -0x742d9 <gethrtime_unscaled>
new_mstate+0xe8:                movl   %eax,-0x4(%ebp)
new_mstate+0xeb:                movl   %edx,-0x1c(%ebp)

Looking at our current stack, we are able to reconstruct the value:

> 0x914d4a84-4/X
0x914d4a80:     d53367ab
> 0x914d4a84-1c/X
0x914d4a68:     2182

This gives us the value 0x2182d53367ab. But our microstate accounting data tells us something entirely different:

> 914d4de0::print kthread_t t_lwp->lwp_mstate
{
t_lwp->lwp_mstate.ms_prev = 0x3
t_lwp->lwp_mstate.ms_start = 0x1b3c14bea4be
t_lwp->lwp_mstate.ms_term = 0
t_lwp->lwp_mstate.ms_state_start = 0x3678297e1b0a
t_lwp->lwp_mstate.ms_acct = [ 0, 0x1b3c14bf554d, 0, 0, 0, 0, 0, 0, 0x1052, 0x10ad ]
}

Now we see the problem. We’re trying to catch up to 0x3678297e1b0a clock ticks, but we’re only at 0x2182d53367ab. This means we’re going to stay in this loop for the next 23,043,913,724,767 clock ticks! I wouldn’t count on this routine returning any time soon, even on a 2.2GHz Opteron. Using a complicated MDB pipeline, we can print the starting microstate time for every thread in the system and sort the output to find the highest values:

> ::walk thread | ::print kthread_t t_lwp | ::grep ".!=0" | ::print klwp_t \
lwp_mstate.ms_state_start ! awk -Fx '{print $2}' | perl -e \
'while (<>) {printf "%s\n", hex $_}' | sort -n | tail
30031410702452
30031411732325
30031412153016
30031412976466
30031485108972
30031485108972
36845795578503
59889720105738
59889720105738
59889720105738

The three highest values are identical and belong to the same thread (threads sharing an LWP share this state), and they are clearly way out of line. This is where the head scratching begins. We have a thread which, at one point in time, decided to put approximately double the expected value into ms_state_start, and has since gone back to normal, waiting for a “past” which will (effectively) never arrive.

So I pore through the source code for gethrtime_unscaled(), which is really tsc_gethrtimeunscaled(). After searching for half an hour, with several red herrings, I finally discover that, through some botched indirection, we were actually calling tsc_gethrtime() from gethrtime_unscaled() by mistake. One version returns the number of clock ticks (unscaled), while the other returns the number of nanoseconds (scaled, and more expensive to calculate). This immediately revealed itself as the source of many other weird timing bugs we had been seeing with the amd64 kernel. Undoing the mistake brought much of the system back to normal, but I was worried that there was another bug lurking here. Even if we were getting nanoseconds instead of clock ticks, we still shouldn’t end up spinning in the kernel – something else was fishy.
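
For the curious, the relationship between the two flavors is roughly this (a sketch with a made-up tsc_hz variable; the kernel’s real conversion is more careful about cost and overflow):

#define	NANOSEC	1000000000LL

static long long tsc_hz = 2200000000LL;	/* e.g. a 2.2GHz Opteron */

/*
 * Unscaled: raw TSC ticks, nearly free to read.
 * Scaled: ticks converted to nanoseconds, an extra multiply and divide
 * on every call.  (This naive form also overflows after a few seconds'
 * worth of ticks, which is why the real code does the scaling more
 * cleverly.)
 */
static long long
ticks_to_ns(long long ticks)
{
	return (ticks * NANOSEC / tsc_hz);
}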

The problem was very obviously a nasty race condition. Further mdb analysis, DTrace analysis, or custom instrumentation would be futile, and wouldn’t provide any more information than I already had. So I retreated to the source code, where things finally started to come together.

The implementation of tsc_gethrtime() is actually quite complicated. When CPUs are offlined or onlined, we have to do a little dance to calculate the skew between the CPUs. On top of this, we have to keep the CPUs from drifting too far from one another. We have a cyclic, fired once a second, that records the ‘master’ tsc value for a single CPU, which is used to maintain a running baseline. Subsequent calls to tsc_gethrtime() use the delta from this baseline to create a unified view of time. It turns out there had been a subtle bug in this code for a long time, and only by calling gethrtime() on every microstate transition were we able to expose it.

In order to factor in power management, there is a check in tsc_gethrtime() to see if the present is less than the past, in which case we ignore the delta from the last master tsc value. This doesn’t seem quite right. Because the tsc registers can drift between CPUs, there is the remote possibility that we read tsc on the master CPU 0 at nearly the same instant we read another tsc value on CPU 1. If the CPUs have drifted just the right amount, our new tsc value will appear to be in the past, even though it really isn’t. The end result is that we nearly double the returned value, which is exactly the behavior we were seeing.
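
In pseudo-C, the flawed logic looked something like the following – a heavily simplified sketch, with names of my own invention rather than the kernel’s actual code:

#include <sys/time.h>	/* hrtime_t */

extern hrtime_t tsc_read(void);		/* read this CPU's TSC */
extern hrtime_t tsc_scale(hrtime_t);	/* convert raw ticks to nanoseconds */

static hrtime_t tsc_last;		/* master TSC at the last tsc_tick() */
static hrtime_t tsc_hrtime_base;	/* hrtime corresponding to tsc_last */

hrtime_t
tsc_gethrtime_sketch(void)
{
	hrtime_t tsc = tsc_read();

	if (tsc >= tsc_last)	/* the normal case: scale the baseline delta */
		return (tsc_hrtime_base + tsc_scale(tsc - tsc_last));

	/*
	 * tsc < tsc_last is assumed to mean that power management reset
	 * the counter, so the baseline delta is ignored.  But if this
	 * CPU's TSC has drifted slightly behind the master's, and we read
	 * it in the tiny window right after tsc_tick() updated tsc_last,
	 * we land here by mistake.  The baseline is then effectively
	 * counted twice, nearly doubling the result.
	 */
	return (tsc_hrtime_base + tsc_scale(tsc));
}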

This is one of those “one in a million” race conditions. You have to be on a MP x86 system where the high resolution timers have drifted ever so slightly: too much and a different algorithm kicks in. Then, you have to call gethrtime() at almost exactly the same time as tsc_tick() is called from the cyclic subsystem (once a second). If you hit the window of a few clock cycles just right, you’ll get an instantaneous bogus value, but the next call to gethrtime() will return to normal. This isn’t fatal in most situations. The only reason we found it at all is because we were accidentally calling gethrtime() thousands of times per second, resulting in a fatal microstate accounting failure.

Hopefully this has given you some insight into the debugging processes that we kernel developers (and our support engineers) go through every day, as well as demonstrating the usefulness of mdb and post-mortem debugging. After six long hours of on-and-off debugging, I was finally able to nail this bug. With a one-line change, I was able to fix a glaring (but difficult to identify) bug in the amd64 kernel, while simultaneously closing a nasty race condition that we had been hiding from for a long time.

There’s been a lot of press about Sun recently thanks to our Network Computing 04Q3 event. It’s hard to miss some of the coverage out there. Jim pointed out this article over at eWeek, which has some good suggestions, but also some gross misconceptions. I thought I’d look at some of the quotes and respond, as a Solaris kernel developer, and explain what OpenSolaris really means and why we’re doing it.

First, there is the obligatory comment about Linux vs. Solaris within Sun. John Loiacono explained some of the reasons why we’re investing in Solaris rather than jumping on the Linux bandwagon. To which Eric Raymond responded:

The claim that not controlling Linux limits one’s ability to innovate is a load of horse puckey … In open source, you always have that ability, up to and including forking the code base, if you don’t like the way things are being run.

Taken out of context, this seems like an entirely reasonable position. But when you put it in the context of OpenSolaris vs. Linux, it quickly becomes irrelevant. The main reason we can’t just jump into Linux is because Linux doesn’t align with our engineering principles, and no amount of patches will ever change that. In the Solaris kernel group, we have strong beliefs in reliability, observability, serviceability, resource management, and binary compatibility. Linus has shown time and time again that these just aren’t part of his core principles, and in the end he is in sole control of Linux’s future. Projects such as crash dumps, kernel debuggers, and tracing frameworks have been repeatedly rejected by Linus, often because they are perceived as vendor added features. Not to mention the complete lack of commitment to binary compatibility (outside of the system call interface). Kernel developers make it nearly impossible to maintain a driver outside the Linux source tree (nVidia being the rare exception), whereas the same apps (and drivers) that you wrote for Solaris 2.5.1 will continue to run on Solaris 10. Large projects like Zones, DTrace, and Predictive Self Healing could never be integrated into Linux simply because they are too large and touch too many parts of the code. Kernel maintainers have rejected patches simply because of the amount of change (SMF, for example, modified over 1,000 files). That’s not to say that Linux doesn’t have many commendable principles, not the least of which is their commitment to open source. But there’s just no way that we can shoehorn Solaris principles into the Linux kernel.

Of course, as Eric Raymond says, we could create a fork of the Linux kernel. But this idea lies somewhere between idealistic and completely ludicrous. First of all, there’s the sheer engineering effort. Even after porting all the huge Solaris 10 (and 9, and 8 …) features to a branch of the Linux kernel, we would enter into a perpetual game of “catchup” with the main branch. We’d be spending all of our time merging patches and testing rather than innovating. With features such as guaranteed binary compatibility, it may not even be possible. Forget the fact that such a fork would probably never be accepted by the Linux community at large. The real problem with creating a fork of the Linux kernel is simply that the GPL doesn’t align with our corporate principles. We want to have ISVs embedding Solaris in their set-top box without worrying about how to dance around the GPL while keeping their IP private. Even if you can tiptoe around the issue now by putting your code in a self-contained module, the Linux kernel developers could actively work against you in the future. Of course, we could still choose a GPL compatible license for OpenSolaris, at which point I’ll end up eating my words.

In the end, dumping Solaris into Linux makes no sense, either technically or philosophically. I have yet to hear a convincing argument for why ditching Solaris would be a good thing for Sun. And I can’t begin to imagine justification for forking the Linux kernel. To be clear, we’re not out to rule OpenSolaris with an iron fist. Because we own our intellectual property, we can make a licensing decision that reflects our corporate goals. And because we’ve put all the engineering effort behind that IP, we can instill similar beliefs into the community that we spawn. These beliefs may change over time: we would love to see an OpenSolaris community where we are merely a participant in a much larger game. But we’ll be able to build a foundation with ideas that are important to us, and fundamentally different from those of the Linux community.

Getting back to the article, we have another quote from Gary Barnett from Ovum:

If Sun releases only Solaris for SPARC with a peculiar open-source license that’s not compatible with the GPL, it’s not going to be a big deal. All you’ll get is the right to help Sun improve its software … If they produce a license that’s not in the spirit of the open-source community, they won’t do themselves any favors at all

It’s hard to know where to begin with statements like these. First and foremost is the idea that we will release Solaris “only for SPARC”. No matter how many times we say it, people just don’t seem to get it. Both x86 and SPARC are built from the same source! There is a very small bit of Solaris that is platform-specific, and it is scattered throughout the codebase in a way that would be impossible to separate out. Second, we’ve already stated that the license will be OSI compliant. I can’t elaborate on specifics, or whether it will be GPL compatible, but it really shouldn’t matter as long as it has OSI approval. The GPL is not the be-all, end-all of open source licenses. There are bad aspects of the GPL, and not every project in the world should use it. If we do end up being GPL-incompatible, the only downside will be that you cannot use the source code in Linux or another GPL project. But why must everything exist to contribute to Linux? I can’t take Linux code and drop it into FreeBSD, so why must the same be true with OpenSolaris? Not to mention the many benefits of being GPL-incompatible, like being able to mix OpenSolaris code with proprietary source code.

Most importantly, contributors to OpenSolaris won’t just be helping “Sun improve its software.” By nature of making it open source, it will no longer be “our software”. All the good things that come with an OSI license (the right to use, fork, and modify code) will prevent us from ever owning OpenSolaris. If you contribute, you will be helping to improve your software, which you can then use, modify, or repackage to your heart’s content. You will be part of a community. Yes, it won’t be the Linux community. But it will be a community you can choose to join, either as a developer or a user, as an alternative to Linux.

Sun is hoping that making source code available will cause a community as large, as diverse and as enthusiastic as that around Linux to gather around Solaris. Just offering source code is not enough to create such a community. Sun would need to do a great deal of work to make that happen.

Now here’s a comment (by Dan Kusnetzky of IDC) that actually makes sense. He understands that we are out to create a community. Like us, he knows that just throwing source code over the wall is not enough. He has good suggestions.

And we’re listening. We have been researching this project for a very long time. We have talked to numerous customers, ISVs, and open source leaders to try to define what makes a community successful. Clever people have already noticed that we have begun an (invite-only) pilot program; several open source advocates are already involved in helping us shape a vision for OpenSolaris. Creating a community doesn’t happen overnight. We are gradually building something meaningful, being sure to take the right approach at each step. The community will evolve over time. And I’m looking forward to it.
