Another interesting bug
I know it’s been a long time since I posted a blog entry. But I’ve either been too busy, out of the country, working on (not yet public) projects, or fixing relatively uninteresting bugs. But last week I finally nailed down a nasty bug that had been haunting me for several weeks, so I thought I’d share some of the experience. I apologize if this post gets a little too technical and/or incomprehensible. But I found it to be an interesting exercise, and hopefully sharing it will get me back in the blogging spirit.
First a little background. In Solaris, we have a set of kernel functions known as ‘copyops’ used to transfer data between the kernel and userland. In order to support watchpoints and SunCluster, we maintain a backup vector of functions used when one of these fails. For example, if you have a piece of data on a watched page, we keep that page entirely unmapped. If the kernel tries to read data from this page, the copyin() function will initially fail, before falling back on watch_copyin(). This goes and temporarily maps in the page, does the copy (triggering a watchpoint if necessary) and then unmapping the page. In this way, the average kernel consumer has no idea that there was a watched area on the page.
Clustering uses this facility in their pxfs (proxy filesystem) implementation. In order to support ioctl() calls that access an unspecified amount of memory, they use the copyops vector to translate any reads or writes into over-the-wire requests for the necessary data. These requests are always done from kernel threads, with no attached user space, so any attempt to access userland should fault before vectoring off to their copyops vector.
OK, on to the bug. During testing, SunCluster folks found that they were getting essentially random memory corruption during some ioctl() calls over pxfs on SPARC machines. After trying in vain to understand the crash dumps, the Clustering folks were able to reproduce the problem on DEBUG bits. In addition to getting traptrace output (a black-box style record of OS traps), the kernel failed an ASSERT() deep in the sfmmu HAT (Spitfire Memory Management Unit Hardware Address Translation) layer during a copyin() call. This smoking gun pointed straight to the copyops. We expect a kernel thread accessing userland to generate a T_DATA_EXCEPTION trap, but instead we were getting a T_DATA_MMU_MISS trap, which the HAT was not prepared to handle (nor should it have to).
I spent nearly a week enhancing my copyops test suite, and following several wrong paths deep into SPARC trap tables and the HAT layer. But no amount of testing could reproduce the problem. Finally, I noticed that we had reached the sfmmu assertion as a kernel thread, but our secondary ASI was set to INVALID_CONTEXT instead of KCONTEXT. On SPARC, all addresses are implicitly tagged with an ASI (address space identifier) that lets us refer to kernel addresses and user addresses without having to share the address psace like we do on x86. All kernel threads are supposed to use KCONTEXT (0) as their secondary ASI. INVALID_CONTEXT (1) is reserved for userland threads in various invalid states. Needless to say, this was confusing.
I knew that somehow we were setting the secondary ASI improperly, or forgetting to set it when we should. I began adding some ASSERTs to a custom kernel and quickly ruled out the former. Finally I booted a kernel with some debug code added to resume(), and panicked almost instantly. It was clear that we were coming out of resume() as a kernel thread, but with INVALID_CONTEXT as our secondary ASI. Many hours of debugging later, I finally found my culprit in resume_from_zombie(), which is used when resuming from an exited thread. When a user thread is exiting, we re-parent to p0 (the kernel ‘sched’ process) and set our secondary ASI to INVALID_CONTEXT. If, in resume(), we switch from one of these threads to another kernel thread, we see that they both belong to the same process (p0) and don’t bother to re-initialize the secondary ASI. We even have a function, hat_thread_exit(), designed to do exactly this, only it was a no-op on SPARC. I added a call to sfmmu_setctx_sec() to this function, and the problem disappeared. Technically, this has been a bug since the dawn of time, but it had no ill side effects until I changed the way the copyops were used, and SunCluster began testing on S10.
Besides the sheer amount of debugging effort, this bug was interesting for several reasons:
- It was impossible to root cause on a non-DEBUG kernel. While we try to make the normal kernel as debuggable as possible, memory corruption (especially due to corruption in the HAT layer) is one of those problems that needs to be caught as close to the source as possible. Solaris has a huge amount of debug code, as well as facilities like traptrace that can only be enabled on a debug kernel due to performance overhead.
- The cause of the problem was separated from the symptom by an arbitrary period of time. Once we switched to a kernel thread with a bad ASI, we could harmlessly switch between any number of kernel threads before we find one that actually tries to access userland.
- It was completely unobservable in constrained test scenarios. We not only needed to create kernel threads that accessed userland, but we needed to have a userland thread exit and then switch immediately to one of these threads. Needless to say, this is not easy to reproduce, especially when you don’t understand exactly what’s going on.
- This would have been nearly unsolvable on most other OSes. Without a kernel debugger, post mortem crash dump analysis, and tools like DTrace and traptrace records, I doubt I could have ever solved this problem. This is one of those situations where a stack trace and a bunch of printf() calls would never have solved the problem.
While this wasn’t the most difficult problem I’ve ever had to debug, it certainly ranks up there in recent memory.