Eric Schrock's Blog

Month: July 2004

As you may have noticed from Adam’s blog, our time at OSCON was a rousing success. Unfortunately, I don’t have enough time to write up a real post, since I’m on vacation for the next few days. Adam summed things up pretty well; the two points I’d reiterate are:

  1. We are eager to learn how to do open Solaris right.

    Sun has a lot of experience with open source projects, with varying degrees of success. Our meeting with open source leaders was extremely informative; I myself never realized how difficult it is to build a developer community that really works. We’re not just throwing source over the wall as a PR stunt or to get free labor; we’re doing it (among other reasons) to build a thriving community centered around Solaris. And we need you to help us get it right.

  2. Solaris 10 technology sells itself.

    Before our BOF, most people we met were skeptical of Solaris. Because we’re a proprietary UNIX, we’ve gained a reputation for being an old dinosaur: Linux is fast and new and evolving, Solaris is slow and old and stagnant. This couldn’t be further from the truth, and it doesn’t take a marketing campaign to convince the world otherwise. Once people see DTrace, Zones, the Service Management Facility, Predictive Self Healing, ZFS, and all the other great features in Solaris 10, there’s really no question that Solaris is alive and well. Whether you are an administrator or a developer, there will be something in Solaris that will blow you away. If you haven’t seen Solaris 10 in action, get your Solaris Express today and spread the word.

Sun is sending a contingent to the O’Reilly Open Source Convention in Portland this week. In a last minute change of schedule, I will be attending (even though my name is not in the official BOF description). But I’ll be there, along with fellow kernel engineers Bart, Andy, and Adam. We will be learning about open source development and discussing Solaris. There will be a BOF Thursday night for all to attend. Come learn about Solaris 10 and open source Solaris, or at least show up for the free food and T-shirts.

In build 60 (Beta 5 or SX 7/04), I fixed a long standing Solaris bug: the paths of mounted filesystems could not contain spaces. We would happily mount the filesystem, but then all consumers of /etc/mnttab would fail. This resulted in sad situations like:

# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0         36G    13G    22G    38%    /
/devices                 0K     0K     0K     0%    /devices
/dev/dsk/c0d0p0:boot    11M   2.3M   8.4M    22%    /boot
/proc                    0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
fd                       0K     0K     0K     0%    /dev/fd
swap                  1002M    24K  1002M     1%    /var/run
swap                  1003M   1.3M  1002M     1%    /tmp
# mount -F lofs /export/space\ dir /mnt/space\ mnt
/export/space dir       /mnt/space mnt  lofs    dev=1980000     1090718041
# df -h
df: a line in /etc/mnttab has too many fields
#

Luckily you could unmount the filesystem, but it was quite annoying to say the least. The resulting fix was really an exploration into bad interface design.

/etc/mnttab

This file has been around since the early days of Unix (at least as far back as SVR3). Each line is a whitespace-delimited set of fields, including special device, mount point, filesystem type, mount options, and mount time (see mnttab(4) for more information). Historically, this was a plain text file. This meant that the user programs mount(1M) and umount(1M) were responsible for making sure its contents were kept up to date. This could be very problematic: imagine what would happen if the program died partway through adding an entry, or root accidentally removed an entry without actually unmounting the filesystem. Once the contents were corrupted, the admin usually had to resort to rebooting, rather than trying to guess what the proper contents should be. It also made mounting filesystems from within the kernel unnecessarily complicated.

In Solaris 8, we solved part of the problem by creating the mntfs pseudo filesystem. From this point onward, /etc/mnttab was no longer a regular text file, but a mounted filesystem. The contents are generated on-the-fly from the kernel data structures. This means that the contents are always in sync with the kernel[1], and that the user can’t accidentally change the contents. However, we still had the problem that the mount points could not contain spaces, because space was a delimiter with special meaning.

getmntent() and friends

On top of this broken interface, a C API was developed that had even worse problems. Consider getmntent(3c):

int getmntent(FILE *fp, struct mnttab *mp);

There are several problems with this interface:

  1. The user is responsible for opening and closing the file

    There is only one mount state for the kernel; why should the user have to know that /etc/mnttab is the place where the entries are stored?

  2. The first parameter is a FILE *

    If you’re developing a system interface, you should not force the use of the C stdio library. Every other system API takes a normal file descriptor instead.

  3. The memory is allocated by the function on demand

    This causes all sorts of problems, including making multithreaded programming difficult, and preventing the user from controlling the size of the buffer used to read in the data.

  4. There is no relationship between the memory and the open file

    Because of this, a lazy programmer can close the file after the last call to getmntent() while still using the memory, so the memory must be kept around indefinitely.

By now, it should be obvious that this was an ill-conceived API built on top of a broken interface. Off the top of my head, if I were to re-design these interfaces I would come up with something more like:

mnttab_t *mnttab_init(void);
int mnttab_get(mnttab_t *mnttab, struct mntent *ent, void *scratch, size_t scratchlen);
void mnttab_fini(mnttab_t *mnttab);

The solution

Once /etc/mnttab became a filesystem, we could add ioctl(2) calls to do whatever we wanted. Once we’re in the kernel, we know exactly how long each field of the structure is. We create a set of NUL-terminated strings directly in user space, and simply return pointers to them. This was more complicated than it sounds for the reasons outlined above. We also had to maintain the ability to read the file directly. With this fix, all C consumers “just work”. Scripted programs will still choke on a mnttab entry with spaces, but these are a minority by far.

Note that the files /etc/vfstab and /etc/dfs/sharetab still suffer from this problem. There has been some discussion about how to resolve these issues, with the new Service Management Facility being touted as a possible solution. And ZFS (Sun’s next generation filesystem) is avoiding /etc/vfstab altogether.


[1] There is always the possibility that the mounted filesystems change between the time the file is opened and the data is read.

Just thought I’d call attention to the fact that the Service Management Facility (SMF) has successfully integrated into build 64 of Solaris 10. Stephen has posted some teasers, and will hopefully continue with more examples, as well as encouraging his fellow team members to get into the blogging mood. This is one of the most visible Solaris 10 features, and brings reliability, availability, and ease of administration to new levels. It is supposed to hit the streets as build 65, aka Beta 7, aka Solaris Express 9/04. Stay tuned…

In a departure from recent musings on the inner workings of Solaris, I thought I’d examine one of the issues that Bryan has touched on in his blog. Bryan has been looking at some of the larger issues regarding OS innovation, commoditization, and academic research. I thought I’d take a direct approach by examining our nearest competitor, Linux.

Bryan probably said it best: We believe that the operating system is a nexus of innovation.

I don’t have a lot of experience with the Linux community, but my impression is that the OS is perceived as a commodity. As a result, Linux is just another OS, albeit one with open source and a large community to back it up. I see a lot of comments like “Linux basically does everything Solaris does” and “Solaris has a lot more features, but Linux is catching up.” Very rarely do I see mention of features that blow Solaris (or other operating systems) out of the water. Linus himself has said:

A lot of the exciting work ends up being all user space crap. I mean, exciting in the sense that I wouldn’t car [sic], but if you look at the big picture, that’s actually where most of the effort and most of the innovation goes.

So Linus seems to agree with my intuition, but I’m in unfamiliar territory here. So, I pose the question:

Is the Linux operating system a source of innovation?

This is a specific question: I’m interested only in software innovation relating to the OS. Issues such as open source, ISV support, and hardware compatibility are irrelevant, as is software which is not part of the kernel or doesn’t depend on its facilities. I consider software such as the Solaris ptools as falling under the purview of the operating system, because they work hand-in-hand with the /proc filesystem, a kernel facility. Software such as GNOME, KDE, X, GNU tools, etc, are all independent of the OS and not germane to this discussion. I’m also less interested in purely academic work; one of the pitfalls of academic projects is that they rarely see the light of day in a real-world commercial setting. Of course, most innovative work must begin as research before it can be viable in the industry, but certainly proven technologies make better examples.

I can name dozens of Solaris innovations, but only a handful of Linux ones. This could simply be because I know so much about Solaris and so little about Linux; I freely acknowledge that I’m no Linux expert. So are there great Linux OS innovations out there that I’m just not aware of?

In my last post I described how watchpoints work in Solaris, or how they’re supposed to work. The reality is that there have been some small problems that have prevented a large number of watchpoints from being practical for complicated programs. I’ve made some changes in Solaris 10 so that they work in all situations, which made it onto Adam’s Top 11-20 Features in Solaris 10.

How watchpoints are used

Typically, watchpoints are used in one of two ways. First, they are used for debugging userland applications. If you know that memory is getting corrupted, or know that a variable is being modified from an unknown location, you can set a watchpoint through a debugger and be notified when the variable changes. In this case, we only have to keep track of a handful of watchpoints. But they are also used for memory allocator redzones, to prevent buffer overflows and memory corruption. For every allocation, you put a watched region on either end, so that if the program tries to access unknown territory, a SIGTRAP signal is sent so the program can be debugged. In this case, we have to deal with thousands of watchpoints (two for every allocation), and we fault on virtually every heap access[1].

Watchpoints in strange places

Watchpoints have worked for the most part since they were put into Solaris. Whenever a watchpoint is tripped, we end up in the kernel, where we have to look at the instruction we faulted on and take appropriate action. There were some instructions that we didn’t quite decode properly when there were watchpoints present. On SPARC, the cas and casx instructions (used heavily in recent C++ libraries) could cause a SEGV if they tried to access a watched page. On x86, instructions that accessed the stack (pushl and movl, for example) would cause a similar segfault if there was a watchpoint on a stack page.

Multithreaded programs

There has been a particularly nasty watchpoint problem for a while when dealing with lots of watchpoints in multithreaded programs. When one thread hits a watchpoint, we have to stop all the other threads. But in the process of stopping, those threads may trigger a watchpoint of their own, and then try to stop the original watchpoint thread. We end up spinning in the kernel, where the only solution is to reboot the system.

Scalability

In the past, watchpoints were kept in a linked list for each process. This means that every time a program added a watchpoint or accessed a watched page, it would spend a linear amount of time trying to find the watchpoint. This is fine when you only have a handful of watchpoints, but can be a real problem when you have thousands of them. These linked lists have since been replaced with AVL trees. Individual watchpoints may still be slow, but 10,000 watchpoints have nearly the same impact as 10 watchpoints. This can result in as much as a 100x improvement for a large number of watchpoints.

All of the above problems have been fixed in Solaris 10. The end result is that tools like watchmalloc(3malloc) and dbx’s memory checking features are actually practical on large programs.


[1] Remember that we have to fault on every access to a page that contains a watchpoint, even if it’s not the address we’re actually interested in.
