Eric Schrock's Blog

Month: August 2004

In case you missed it, Matt Ahrens has a great post talking about ZFS. It’s nice to see this conversation finally beginning; we have been itching to see this out in the public for far too long. Every step we take (Solaris Express, blogs.sun.com, OpenSolaris) brings us closer to our customers and the Solaris community. I’m looking forward to more from the ZFS team (and I promise not to steal their thunder).

I’ve talked a lot in the past about kernel debugging, usually comparing the Solaris tools against what’s available in Linux and other Operating Systems. I thought that I’d walk through a concrete example of kernel debugging, first with our debugger KMDB (which replaced the venerable kadb in build 61 of Solaris 10), and then with the available Linux kernel debuggers (KDB and KGDB). I’ll also examine port-mortem analysis tools for Linux.

For today’s lesson, I’ve picked a problem where a port-mortem or live kernel debugger is absolutely necessary. Those of you who have done multithreaded programming in the past have most likely had to deal with a deadlock here or there. In your garden variety deadlock, one thread grabs lock A and then lock B, while another thread grabs lock B and then lock A. If they both get their first lock but not their second, they’ll both be blocked waiting for locks which will never be released. I’ve decided to spice it up a bit by looking at reader-writer locks. With an rwlock, you can have any number of readers, but only one writer. This makes problems much more difficult to debug. With a mutex, the current owner is stored as part of the data structure. But with a rwlock, the only indication you have is the number of active readers.

In our fantasy land, we have the following situation: commands are getting stuck in the kernel. Forget kill -9; if you’re stuck waiting on a kernel synchronization primitive (wihtout accepting interruption by signals), you’re in trouble. The timeline for our threads looks something like this:

thread 1 thread 2
rw_enter(rwlock, RW_READER)
<stuck somewhere>
rw_enter(rwlock, RW_WRITER)
<blocked>

Loading KMDB

Something’s ill on our system. We know that our ‘ps’ command is hanging for some reason. So we enter KMDB:

# mdb -K
Loaded modules: [ ufs unix krtld nca lofs genunix ip usba specfs nfs random sctp ]
[0]>

KMDB is ready to be loaded on every system running Solaris 10 (build 61 or later). You can also boot -k which will load kmdb in the background. If you do this, you can send a break (via a tip line or L1/F1-A) to get into KMDB.

Find our troubled thread

In this example, we know that ‘ps’ is hanging. Finding the troubled stack is easy:

[0]> ::pgrep ps | ::walk thread | ::findstack
rw_enter_sleep+0x144()
as_pageunlock+0x48()
default_physio+0x340()
pread+0x244()
syscall_trap+0x88()

Get the address of the rwlock

This takes a little bit of work because we don’t have the necessary debugging information in the kernel to restore arguments from arbitrary positions within a function. In this example we’re at rw_enter_sleep+0x144. Using ::findstack -v or $C we can get the frame pointer for each frame. At this point, we have to examine the disassembly around each call site to find out where our first parameter came from. We've been kicking around some ideas to do this automatically, but I'll owe Matt another nickel if I mention it out loud... At the moment, you need to do something similar to the following (on SPARC - x86 is similar):

[0] > as_pageunlock+48::dis
as_pageunlock+0x20:             mov       3, %i4
as_pageunlock+0x24:             call      -0x2318       
as_pageunlock+0x28:             restore
as_pageunlock+0x2c:             add       %l5, 0x30, %l0
as_pageunlock+0x30:             sethi     %hi(0x1069400), %i5
as_pageunlock+0x34:             sethi     %hi(0x1069400), %i1
as_pageunlock+0x38:             ldx       [%i5 + 0x1e8], %l6
as_pageunlock+0x3c:             mov       %l0, %o0
as_pageunlock+0x40:             ldx       [%i1 + 0x1e0], %i0
as_pageunlock+0x44:             add       %i2, %i3, %i3
as_pageunlock+0x48:             call      -0x143754     
as_pageunlock+0x4c:             mov       1, %o1
as_pageunlock+0x50:             mov       %l5, %o0
as_pageunlock+0x54:             mov       %i2, %o1
as_pageunlock+0x58:             call      -0x2cdc       
as_pageunlock+0x5c:             clr       %o2
as_pageunlock+0x60:             add       %i3, %i0, %l7
as_pageunlock+0x64:             and       %i2, %l6, %l3
as_pageunlock+0x68:             ldx       [%o0 + 0x38], %i2
as_pageunlock+0x6c:             and       %l7, %l6, %l4
as_pageunlock+0x70:             mov       %l3, %o1
[0]> 000002a101344ee1+7ff::print struct frame
{
fr_local = [ 0x3000eda5358, 0, 0x1, 0xfd1ff, 0x30018c3dec8, 0xb, 0x8, 0x1 ]
fr_arg = [ 0x3df8515a500, 0xb, 0x2, 0x300068202d8, 0x3000eda5358, 0x1 ]
fr_savfp = 0x2a101344f91
fr_savpc = 0x1186134
fr_argd = [ 0, 0x810, 0x102, 0, 0x3001503a1a0, 0x2a101345930 ]
fr_argx = [ 0x300021541d8 ]
}
[0]> 0x3000eda5358::rwlock
ADDR      OWNER/COUNT FLAGS          WAITERS
3000eda5358        READERS=1  B011      30018878f20 (R)
||      30018c86000 (R)
WRITE_WANTED -------+|      30018c86300 (R)
HAS_WAITERS --------+      30018c86600 (R)
30018c86900 (R)
30018c86f00 (R)
30018c87200 (R)
[0]>

Find who owns the reader lock

Here's another place where things get tough. Since we don't record readers as part of the rwlock_t (only the count), we can't just look at memory and see who owns this. We could just dump the stack of every thread and try to guess who's the owner, but on a large server this quickly becomes impractical. This is where you need ::kgrep, which will find references to a given pointer within the kernel. From there, we pipe it to ::whatis (or more recently, ::whattype), which can tell us in which stacks the value is referenced. This dramatically cuts down our list of choices. After some guess-and-check (but significantly less than looking at every thread in the system), we find the real culprit of the crime.

[0]> ::kgrep 30006820000 | ::whatis
2a100527730 is in thread 30013e04920's stack
2a100527770 is in thread 30013e04920's stack
2a100527780 is in thread 30013e04920's stack
2a100ecf730 is in thread 3000f962300's stack
2a100ecf770 is in thread 3000f962300's stack
2a100ecf780 is in thread 3000f962300's stack
2a101171730 is in thread 3001506db60's stack
2a101171770 is in thread 3001506db60's stack
[...]
[0]> 30013e05220::findstack
stack pointer for thread 30013e05220: 2a1011a09a1
[ 000002a1011a09a1 sema_p+0x140() ]
000002a1011a0a51 biowait+0x60()
000002a1011a0b01 ufs_getpage_miss+0x2f0()
000002a1011a0c01 ufs_getpage+0x6a8()
000002a1011a0d61 fop_getpage+0x48()
000002a1011a0e31 segvn_fault+0x834()
000002a1011a0fe1 as_fault+0x4a0()
000002a1011a10f1 pagefault+0xac()
000002a1011a11b1 trap+0xc14()
000002a1011a12f1 utl0+0x4c()
[0]>

Looking at this now, it makes me want a new MDB command, ::whatthread, which would find all threads with stacks containing a given pointer. With MDB's modular design and stable API, it's easy to connect ::kgrep and ::whatis into a self-contained command. But that's a RFE for another day.

Figure out what our culprit is doing

From here, you're pretty much own your own. In the case outlined about, we're stucking waiting for I/O while we have a read lock on the address space. The gory details are beyond a single blog post, but what's been covered is more than enough for a crash-course in kernel debugging. The best part about this bug is that you could never figure this out with custom instrumentation or 'classic' debugging techniques. This is a race condition of the worst kind - all the printf()s in the world wouldn't have helped you here. And if you need to reboot in order to load your custom kernel, you'll probably never see the problem again.

Post mortem debugging and Linux

On Solaris, the post-mortem case is exactly the same as the live KMDB case. Simply reboot -d (or initiate a sync from KMDB or the OK prompt if it's really hung) and then run mdb on the resulting crash dump. You can also examine the kernel while the system is still running with mdb -k, but some commands (such as ::kgrep) will fail because memory is actively changing.

In upcoming posts, I'll wade through documentation on linux kernel debuggers, trying to figure out how they would solve this same problem. Note that this could be "solved" by wading through the stack traces of every thread on the system, and picking out the troublemaker thread by eye. This requires intimate knowledge of the source code (all of it), is prone to error, and extremely tedious. It's painful on a simple desktop, and quickly becomes impossible for large servers with thousands of threads. Also, keep in mind this is just a demonstration of kernel debugging - the worst bugs are much more complicated.

In a departure from all my previous blog posts, I thought I’d try my hand at a personal entry. Yesterday, the Olympic Track and Field competitions began, with the U.S. taking Silver in the Men’s shot put. What was supposed to be a sweep ended up in disaster: Cantwell never qualified at the trials, Godina fouled his first two and didn’t make it to the finals, and Nelson fouled all but his first throw. If you’re wondering how I know all of this, it’s becuase I’m a track nut. I’ve been running track and field since sophmore year of high school, and by this point know almost all the men’s world records, and a fair number of the women’s.

Back in high school, I ran everything; most notably the 110 hurdles, 300 hurdles, and triple jump. In college I was a walk-on, and focused solely on the triple jump (with a few 4×400 legs sprinkled here and there). The Olympic triple jump preliminaries start tomorrow. Despite the craziness at the U.S. Olympic trials, I’d put my money on Christian Olsson, with Jadel Gregario and Kenta Bell rounding out the top three.

Back in college, I was a fairly mediocre Division-I athlete, managing to jump 14.61 (47’11.5″) at the Ivy League championships my sophmore year. In contrast, Olsson’s best is 17.83 (58’6″) and the world record is a whopping 18.29 (60’0″). When you have a moment, try measuring out 20 yards and imagine traversing the distance in 3 steps.

For your amusement, I’ll leave you with some track pictures, courtesy of Dan Grossman, a great friend of Brown Track and Field. Some of these are certainly less flattering than others, thanks to the speedsuit and my unique facial expressions.

Harvard 2001
Sean Thomas and myself
Possibly one of my two good jumps sophmore year
Warming up at Heps 2001
Me not getting my feet out in front of me
More jumping at Yale
Indoor Heps 2002 (with bleached hair)
More indoor heps
Ridiculously cold meet at Harvard
Heps champsionships @Navy 2002
A very bad jump at Heps

By this point, I hope you’ve enjoyed my humiliation. Next post I’ll get back to some real issues, including kernel debugging and the joys of KMDB.

I just ran across this interview with Linus. If you’ve read my blog, you know I’ve talked before about OS innovation, particularly with regards to Linux and Solaris. So I found this particular Linus quote very interesting:

There’s innovation in Linux. There are some really good technical features that I’m proud of. There are capabilities in Linux that aren’t in other operating systems. A lot of them are about performance. They’re internal ways of doing things in a very efficient manner. In a kernel, you’re trying to hide the hard work from the application, rather than exposing the complexity.

This seems to fly in the face of his previous statements, where he claimed that there’s no innovation left at the OS level. Some people have commented on my previous blog entry that Linux cannot innovate simply because of its size: with so many hardware platforms to support, it becomes impossible to integrate large projects. What you’re left with is “technical features” to improve performance, which (in general) are nothing more than algorithm refinement and clever tricks1.

I also don’t buy the “supporting lots of platforms means no chance for innovation” stance. Most of the Solaris 10 innovations (Zones, Greenline, Least Privileges) are completely hardware independent. Those projects which are hardware dependent develop a platform-neutral infrastructure, and provide modules only for those platforms which they suppot (FMA being a good example of this). Testing is hard. It’s clear that Linux testing often amounts to little more than “it compiles”. Large projects will always break things; even with our stringent testing requirements in Solaris, there are always plenty of “follow on” bugfixes to any project. If Linux (or Linus) is afraid of radical change, then there will be no chance for innnovation2.

As we look towards OpenSolaris, I’m intrigued by this comment by Linus:

I am a dictator, but it’s the right kind of dictatorship. I can’t really do anything that screws people over. The benevolence is built in. I can’t be nasty. If my baser instincts took hold, they wouldn’t trust me, and they wouldn’t work with me anymore. I’m not so much a leader, I’m more of a shepherd. Now all the kernel developers will read that and say, “He’s comparing us to sheep.” It’s more like herding cats.

Stephen has previously questioned the importance of a single leader for an open source project. In the course of development for OpenSolaris, we’ve looked at many open source models, ranging from the “benevolent dictatorship” to the community model (and perhaps briefly, the pornocracy). I, for one, question Linus’ scalability. Too many projects like crash dumps, kernel debugging, and dynamic tracing have been rejected by Linus largely because Linux debugging is “as good as it needs to be.” Crash dumps and kernel tracing are “vendor features,” because real kernel developers should understand the code well enough to debug a problem from a stack trace and register dump. I love the Solaris debugging tools, but I know for a fact that there are huge gaps in our capabilities, some of which I hope to fill in upcoming releases. I would never think of Solaris debugging as perfect, or beyond reproach.

So it’s time for Space Ghosts’s “something…. to think about….”:

Is it beneficial to have one man, or one corporation, have the final say in the development of an open source project?


1ZFS is an example of truly innovative performance improvements. Once you chuck the volume manager, all sorts of avenues open up that nobody’s ever though about before.

2To be clear, I’m not an ardently anti-Linux. But I do believe that technology should speak for itself, irregardless of philosophical or “religious” issues. Obviously, I’m part of the Solaris kernel group, and I believe that Solaris 10 has significant innovations that no other OS has. But I also have faith in open source and the open development model. If you know of Linux features that are truly innovative (and part of the core operating system), please leave a comment and I’ll gladly acknowledge them.

In the footnote a few days ago, I commented on the fact that the history of Solaris debugging could rougly be divded into three ‘eras’. As someone interested in UNIX history, I decided to dig through the Solaris archives and put together a chronology of Solaris debuggability and observability tools. For fun, I divided it into eras to parallel Earth’s history. And I swear I’m not out to make anyone feel like a dinosaur (or a prokaryote, for that matter).

I’ve only been around for one of these “dawn of a new era” arrivals, DTrace1. When one of these revolutionary tools arrive, it’s amazing to see how quickly engineers avoid their own past. Try asking Bryan to debug a performance problem on Solaris 9, and you’ll probably get some choice phrases politely explaining that while he appreciates the importance of your problem, he would rather throw himself down a slide of broken glass and into a vat of rubbing alcohol. Being the neophyte that I am, I’ve only ventured into the ‘Paleozoic era’ on one occasion. After an MDB session on a Solaris 8 crashdump (paraphrased slightly):

$ mdb 0
> ::print
mdb: invalid command '::print': unknown dcmd name
> ::help print
mdb: unknown command: print
> ::please print
mdb: invalid command'::please': unknown dcmd name
> ::ihateyou
$

I quickly ran away screaming, never to return. I think I ended up hiding in a corner of my office for two hours, cradling my DTrace answerbook and whispering “there’s no place like home” over and over. I’m still a spoiled brat, but at least I have respect and admiration for those Solaris veterans who crawled through debugging hell so that I could live a comfortable life2. It’s also made me feel sorry for the Linux (and Windows) developers out there. Not in the Nelson Muntz “Ha ha! You don’t have DTrace!” sense. More like “Poor little guy. It’s not his fault his species never evolved opposable thumbs.” There are a lot of brilliant Linux developers out there, stuck in a movement that doesn’t embrace debugging or observability as fundamental goals. But this post is supposed to be about history, not Linux. So without further ado, my brief history of Solaris (soon to be available in refrigerator magnet form):

<1989 HADEAN SunOS 4.X
adb, ptrace, crash
1990 ARCHAEAN SVr4 merge
/proc
truss(1)
1991 vtrace
vmstat(1M)
iostat(1M)
1992 SOLARIS 2.0
1993 mpstat
SOLARIS 2.2
1994 Kernel slab allocator
TNF
basic ptools
SOLARIS 2.4
1995
1996 SOLARIS 2.5.1
PROTEROZOIC Next generation /proc
Userland watchpoints
1997 lockstat(1M)
pkill and pgrep
libproc
1998 savecore on by default
SOLARIS 7
1999 libproc for corefiles
coreadm(1M)
prstat(1M)
lockstat kernel profiling
PALEOZOIC MDB(1)
::findleaks
2000 SOLARIS 8
EOL of crash(1M)
2001 live process control for MDB
EOL of adb(1)
pargs and preap
MESOZOIC kernel CTF data
trapstat(1M)
2002 SOLARIS 9
libumem(3LIB) and umem_debug(3MALLOC)
::typegraph for mdb(1)
2003 Userland CTF
coreadm(1M) content control
CENOZOIC DTrace(1M)
intrstat(1M)
2004 DTrace pid provider for x86
pfiles with pathnames
DTrace sched, proc providers
CTF for core libraries
DTrace I/O provider
KMDB(1)
DTrace MIB, fpuinfo providers
Per-thread ptools

These are my choices based on SCCS histories and putback logs. Obviously, I’ve failed to include some things. Leave a comment or email if you think something’s not getting the recognition it deserves (keeping in mind this is a blog post, not a book).


1 I actually started exactly one day before DTrace integrated. But I had some experience (albeit limited) as an intern the previous year.

2 In all seriousness, it’s not that I don’t have to ever debug anything, or that the problems we have today are somehow orders of magnitude simpler than those in the past. What these tools provide is a dramatic reduction in time to root-cause. You still need the same inquisitive and logical mind to debug hard problems, its just that good tools let you form questions and get answers faster than you could before. Really good tools (like DTrace) let you ask the previously unanswerable questions. You may have been able to debug the problem before, but you would have ended up running around in circles trying to get data that’s now immediately available thanks to DTrace.

I just saw this article the other day, and was amazed that someone could so clearly miss the point of OpenSolaris and Sun’s business strategy. In the article, the (apparently anonymous) author tries to argue that Jonathan and Adam are on two opposing sides of an internal corporate struggle between the executivess and the engineers over business strategy. He begins with a gross misinterpretation of Jonathan’s blogs:

As far as I can dissect Schwartz’s argument, it’s that IBM made a big mistake by building features on top of open source, because that removes any possible hardware lockin, and lets customers move to less expensive hardware that they can buy from any old vendor.

The point of Jonathan’s argument (whether or not you believe it is up to you) is not about hardware at all. The point is that because IBM relies completely on Linux (and doesn’t have a distribution of its own), they cannot offer the same integrated solution that others can. If you want to run Linux on IBM hardware, you’ll most likely be running RedHat. But if RedHat provides an integrated application server, why would you turn around and buy WebSphere from IBM? Jonathan also discusses software lockin when moving between Linux distributions, but this also has nothing to do with hardware. Yet the author continues to focus on “hardware lockin,” twisting Adam’s blog entry into this lovely conclusion:

If Solaris is open-sourced, then Sun is about to undermine its own hardware lockin.

For historical reasons, Sun has a reputation of big iron and proprietary UNIX, and with that comes hardware lockin. But if you look at Solaris and the rest of our software stack, you’ll quickly see this is no longer true. Whether or not you believe in Sun’s hardware prices (think x86 and Opteron), you can’t ignore the fact that Solaris runs on commodity x86 hardware. And even if you don’t want Solaris, Java Desktop System runs on Linux, and Java Enterprise System runs on Windows and HP-UX. To top it off, they all run on open standards; you can move your apps to non-Sun systems (as long as they are standards compliant) without worry. So where’s the hardware lockin? Sun’s message is that we sell systems. Maybe we’re not quite there yet, but eventually you’ll be able to get an integrated software stack (JES or JDS) on whatever hardware platform and Operating System you choose. If you want to run Solaris, great. If you want to get hardware from us, great. If not, we’ll make damn sure our software all works together as best as possible.

The author does, however, get one thing right: OpenSolaris will enable ports to new platforms. Ask a kernel engineer or ask a VP, they will both tell you this is a good thing. If our customers want to run on POWER, and OpenSolaris enables this port to happen, then we can bring the weight of our entire software portfolio onto the platform, and do it better than anyone else. Go buy your hardware from IBM, but we’ll give you a complete software solution with Java Enterprise System for less than IBM. As Adam points out, OpenSolaris can only help Sun. We love choice, and choice is good.

Recent Posts

April 21, 2013
February 28, 2013
August 14, 2012
July 28, 2012

Archives