Eric Schrock's Blog

Dave Powell and I have arrived at FISL, an open source conference in Brazil, along with a crowd of other Sun folks. Dave and I (with an introduction from Sun VP Tom Goguen) will be hosting a four-hour OpenSolaris pre-event tomorrow, June 1st. We’ll be talking about all the cool features available in OpenSolaris, as well as how Solaris development works today and how we hope it will work in the future. If you’re attending the conference, be sure to stop by to learn about OpenSolaris, and what makes Solaris (and Solaris developers) tick. We’ll also be hanging around the Sun booth during the rest of the conference, giving mini-presentations and demos, answering questions, and chatting with anyone who will listen. We’re happy to talk about OpenSolaris, Solaris, Sun, or your favorite scenes from Monty Python and the Holy Grail. Oh yeah, there will be lots of T-shirts and Solaris 10 DVDs as well.

So it looks like my blog made it over to the front page of news.com in this article about slipping Solaris 10 features. Don’t get your hopes up – I’m not going to refute Genn’s claims; we certainly are not scheduled for a specific update at the moment. But pay attention to the details: ZFS and Janus will be available in an earlier Solaris Express release. I also find it encouraging that engineers like me have a voice that actually gets picked up by the regular press (without being blown out of proportion or slashdotted).

I would like to point out that I putback the last major chunk of the command redesign to the ZFS gate yesterday 😉 There are certainly some features left to implement, but the fact that I re-whacked all of the userland components (within six weeks, no less) should not be interpreted as any statement of schedule plans. Hopefully I can get into some of the details of what we’re doing, but I don’t want to be seen as promoting vaporware (even though we have many happy beta customers) or exposing unfinished interfaces which are subject to change.

I also happen to be involved with the ongoing Janus work, but that’s another story altogether. I swear there’s no connection between myself and slipping products (at least not one where I’m the cause).

Update: So much for not getting blown out of proportion. Leave it to the second tier news sites to turn “not scheduled for an update” into “delayed indefinitely over deficiencies”. Honestly, rewriting 5% of the code should hardly be interpreted as “delayed indefinitely” – so much for legitimate journalism. Please keep in mind that all features will hit Software Express before a S10 Update, and OpenSolaris even sooner.

In past comments, it has been pointed out that a transition guide between GDB and MDB would be useful to some developers out there. A full comparison would also cover dbx(1), but I’ll defer this to a later point. Given the number of available commands, I’ll be dividing up this post into at least two pieces.

Before diving into too much detail, it should be noted that MDB and GDB have slightly different design goals. MDB (and KMDB) replaced the aging adb(1) and crash(1M), and was designed primarily for post-mortem analysis and live kernel analysis. To this end, MDB presents the same interface when debugging a crash dump as when examining a live kernel. Solaris corefiles have been enhanced so that all the information for the process (including library text and type information) is present in the corefile. MDB can examine and run live processes, but lacks some of the features (source level debugging, STABS/DWARF support, conditional breakpoints, scripting language) that are standard for developer-centric tools like GDB (or dbx). GDB was designed for interactive process debugging. While you can use GDB on corefiles (and even LKCD crash dumps or Linux kernels – locally and remotely), you often need the original object files to take advantage of GDB’s features.
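To illustrate that "same interface" point, examining a live kernel and examining a saved crash dump differ only in how you invoke mdb (the file names shown are the conventional savecore(1M) output names); the dcmds then behave the same against either target:

# mdb -k
# mdb unix.0 vmcore.0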

Before going too far into MDB, be sure to check out Jonathan’s MDB Cheatsheet as a useful quick reference guide, with some examples of stringing together commands into pipelines. Seeing as how I’m not the most accomplished GDB user in the world, I’ll be basing this comparison on the equivalent GDB reference card.
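For a taste of that pipeline style, here is the classic example against a kernel target (mdb -k or a crash dump); the process name is just a placeholder:

> ::pgrep sshd | ::walk thread | ::findstack -v

This finds the sshd process, walks each of its threads, and prints a stack trace for each one.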

GDB | MDB | Description

Starting Up

gdb program | mdb path, or mdb -p pid | Start debugging a command or running process. GDB will treat a numeric argument as a pid, while mdb explicitly requires the -p option.
gdb program core | mdb [ program ] core | Debug a corefile associated with ‘program’. For MDB, the program is optional and is generally unnecessary given the corefile enhancements made during Solaris 10.

Exiting

quit | ::quit | Quit the debugger. Both programs also exit on Ctrl-D.

Getting Help

help, or help command | ::help, ::help dcmd, ::dcmds, ::walkers | In mdb, you can list all the available walkers or dcmds, as well as get help on a specific dcmd. Another useful trick is ::dmods -l module, which lists the walkers and dcmds provided by a specific module.

Running Programs

run arglist | ::run arglist | Run the program with the given arguments. If the target is currently running, or is a corefile, MDB will restart the program if possible.
kill | ::kill | Forcibly kill and release the target.
show env | ::getenv | Display the current environment.
set env var string | ::setenv var=string | Set an environment variable.
get env var | ::getenv var | Get a specific environment variable.

Shell Commands

shell cmd | ! cmd | Execute the given shell command.

Breakpoints and Watchpoints

break func, or break *addr | addr::bp | Set a breakpoint at the given address or function.
break file:line | (none) | Break at the given line of the file. MDB does not support source level debugging.
break ... if expr | (none) | Set a conditional breakpoint. MDB doesn’t support conditional breakpoints, though you can get a close approximation via the -c option (though it’s complicated enough to warrant its own post).
watch expr | addr::wp -rwx [-L size] | Set a watchpoint on the given region of memory.
info break, info watch | ::events | Display active watchpoints and breakpoints. MDB will show you signal events as well.
delete [n] | ::delete n | Delete the given breakpoint or watchpoint.

I think that's enough for now; hopefully the table is at least readable. More to come in a future post.

In the last few weeks, I’ve been completely redesigning the ZFS commands from the ground up[1]. When I stood back and looked at the current state of the utilities, several glaring deficiencies jumped out at me[2]. I thought I’d use this blog entry to focus on one that is near and dear to me. Having spent a great deal of time with the debugging and observability tools, I’ve invariably focused on answering the question “How do I diagnose and fix a problem when something goes wrong?”. When it comes to command line utilities, the core of this problem lies in well-designed error messages. To wit, running the following (former) ZFS command demonstrates the number one mistake when reporting error messages:

# zfs create -c pool/foo pool/bar
zfs: Can't create pool/bar: Invalid argument
#

The words “Invalid argument” should never appear as an error message. This means that at some point in the software stack, you were able to determine there was a specific problem with an argument. But in the course of passing that error up the stack, any semantic information about the exact nature of the problem has been reduced to simply EINVAL. In the above case, all we know is that one of the two arguments was invalid for some unknown reason, and we have no way of knowing how to fix it. When choosing to display an error message, you should always take the following into account:

An error message must clearly identify the source of the problem in a way that the user can understand.

An error message must suggest what the user can do to fix the problem.

If you print an error message that the administrator can’t understand or doesn’t suggest what to do, then you have failed and your design is fundamentally broken. All too often, error semantics are given a back seat during the design process. When approaching the ZFS user interface, I made sure that error semantics were a fundamental part of the design document. Every command has complete usage documentation, examples, and every possible error message that can be emitted. By making this part of the design process, I was forced to examine every possible error scenario from the perspective of an administrator.
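To make the failure mode concrete, here is a minimal sketch (hypothetical code, not the actual ZFS implementation) of how two unrelated problems collapse into the same EINVAL by the time anything is in a position to print a message:

#include <errno.h>
#include <string.h>

/*
 * Hypothetical sketch only: two completely different problems are both
 * reported as EINVAL, so the command can only say "Invalid argument".
 */
static int
parent_exists(const char *parent)
{
    (void) parent;
    return (0);                 /* stub: pretend the parent is missing */
}

static int
dataset_create(const char *parent, const char *name)
{
    if (!parent_exists(parent))
        return (EINVAL);        /* missing parent dataset */
    if (strchr(name, '@') != NULL)
        return (EINVAL);        /* illegal character in the name */
    return (0);
}

int
main(void)
{
    /* Either failure reaches the user as the same useless message. */
    return (dataset_create("pool/foo", "pool/bar") == 0 ? 0 : 1);
}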

A grand vision of proper failure analysis can be seen in the Fault Management Architecture in Solaris 10, part of Predictive Self Healing. A complete explanation of FMA and its ramifications is beyond the scope of a single blog entry, but the basic premise is to move from a series of unrelated error messages to a unified framework of fault diagnosis. Historically, when hardware errors occurred, an arbitrary error message may or may not have been sent to the system log. The error may have been transient (such as an isolated memory CE), or the result of some other fault. Administrators were forced to make costly decisions based on a vague understanding of our hardware failure semantics. When error messages did succeed in describing the problem sufficiently, they invariably failed to suggest how to fix it. With FMA, the sequence of errors is instead fed to a diagnosis engine that is intimately familiar with the characteristics of the hardware, and is able to produce a fault message that both adequately describes the real problem and explains how to fix it (when it cannot be repaired automatically by FMA).

Such a wide-ranging problem doesn’t necessarily compare to a simple set of command line utilities. A smaller-scale example can be seen with the Service Management Facility (SMF). When SMF first integrated, it was incredibly difficult to diagnose problems when they occurred[3]. The result, after a few weeks of struggle, was one of the best tools to come out of SMF: svcs -x. If you haven’t tried this command on your Solaris 10 box, you should give it a shot. It automatically gathers error information and combines it into output that is specific, intelligible, and repair-focused. During development of the final ZFS command line interface, I’ve taken a great deal of inspiration from both svcs -x and FMA. I hope that this is reflected in the final product.

So what does this mean for you? First of all, if there’s any Solaris error message that is unclear or uninformative, that is a bug. There are some rare cases where we have no other choice (because we’re relying on an arbitrary subsystem that can only communicate via errno values), but 90% of the time it’s because the system hasn’t been sufficiently designed with failure in mind.

I’ll also leave you with a few cardinal[4] rules of proper error design beyond the two principles above:

  1. Never distill multiple faults into a single error code. Any error that gets passed between functions or subsystems must be traceable back to a single specific failure.
  2. Stay away from strerror(3c) at all costs. Unless you are truly interfacing with an arbitrary UNIX system, the errno values are rarely sufficient.
  3. Design your error reporting at the same time you design the interface. Put all possible error messages in a single document and make sure they are both consistent and effective.
  4. When possible, perform automated diagnosis to reduce the amount of unimportant data or give the user more specific data to work with.
  5. Distance yourself from the implementation and make sure that any error message makes sense to the average user.
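As a minimal sketch of rules 1 and 2 (illustrative names only, not a real Solaris or ZFS interface), a subsystem-specific error enumeration with its own message table keeps each failure distinct and lets the reporting layer say something actionable:

#include <stdio.h>

/* Hypothetical subsystem-specific error codes, one per distinct failure. */
typedef enum {
    FS_ERR_OK = 0,
    FS_ERR_NO_PARENT,       /* parent dataset does not exist */
    FS_ERR_BAD_NAME,        /* illegal character in the name */
    FS_ERR_NO_SPACE         /* pool is out of space */
} fs_error_t;

static const char *
fs_errmsg(fs_error_t err)
{
    switch (err) {
    case FS_ERR_NO_PARENT:
        return ("parent dataset does not exist; create it first");
    case FS_ERR_BAD_NAME:
        return ("name contains an illegal character");
    case FS_ERR_NO_SPACE:
        return ("out of space; destroy unused datasets or add disks");
    default:
        return ("no error");
    }
}

int
main(void)
{
    (void) fprintf(stderr, "cannot create 'pool/bar': %s\n",
        fs_errmsg(FS_ERR_NO_PARENT));
    return (1);
}

Each failure keeps its identity all the way up the stack, and the message both explains the problem and suggests a repair, per the two principles above.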

[1] No, I cannot tell you when ZFS will integrate, or when it will be available. Sorry.

[2] This is not intended as a jab at the ZFS team. They have been working full steam on the (significantly more complicated) implementation. The commands have grown organically over time, and are beginning to show their age.

[3] Again, this is not meant to disparage the SMF team. There were many more factors here, and all the problems have since been fixed.

[4] “cardinal” might be a stretch here. A better phrase is probably “random list of rules I came up with on the spot”.

There are many bugs out there that are interesting, either because of an implementation detail or the debugging necessary to root cause the problem. As you may have noticed, I like to publicly expound upon the most interesting ones I’ve fixed (as long as it’s not a security vulnerability). This week turned up a rather interesting specimen:

6198523 dirfindvp() can erroneously return ENOENT

This bug was first spotted by Casper back in November last year while trying to do some builds on ZFS. The basic pathology is that at some point during the build, we’d get error messages like:

sh: cannot determine current directory

Some ideas were kicked around by the ZFS team, and after the problem seemed to go away, the team believed that a recent batch of changes had also fixed this problem. Five months later, Jonathan hit the same bug on another build machine running ZFS. Since I had written the getcwd() code, I was determined to root cause the problem this time around.

Back in build 56 of S10, I moved getcwd(3c) into the kernel, along with changes to store pathnames with vnodes (which are used by the DTrace I/O provider as well as pfiles(1)). Basically, we first try to do a forward lookup on the stored pathname; if that works, then we simply return the resolved path[1]. If this fails (vnode paths are never guaranteed to be correct), then we have to fall back to the slow path. This slow path involves looking up the parent, finding the current vnode in the parent, prepending the component name, and repeating. Once we reach the root of the filesystem, we have a complete path.
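To make the slow path concrete, here is a user-level sketch of the same algorithm (illustration only: the kernel operates on vnodes via dirfindvp() and handles zones, mount points, and permissions far more carefully; strlcpy is available in Solaris libc):

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/*
 * Walk up via "..", find our entry in the parent by comparing dev/inode
 * numbers (the dirfindvp() step), prepend that name, and repeat until we
 * reach the root.
 */
static int
slow_getcwd(char *buf, size_t buflen)
{
    char dir[PATH_MAX] = ".";
    char parent[PATH_MAX];
    struct stat cur, up;

    buf[0] = '\0';

    for (;;) {
        if (stat(dir, &cur) != 0)
            return (-1);
        (void) snprintf(parent, sizeof (parent), "%s/..", dir);
        if (stat(parent, &up) != 0)
            return (-1);

        /* The root is its own parent; we're done. */
        if (cur.st_dev == up.st_dev && cur.st_ino == up.st_ino)
            break;

        /* Scan the parent for the entry that refers to us. */
        DIR *dp = opendir(parent);
        struct dirent *de;
        char found[PATH_MAX] = "";

        if (dp == NULL)
            return (-1);
        while ((de = readdir(dp)) != NULL) {
            char full[PATH_MAX];
            struct stat st;

            (void) snprintf(full, sizeof (full), "%s/%s",
                parent, de->d_name);
            if (stat(full, &st) == 0 &&
                st.st_dev == cur.st_dev && st.st_ino == cur.st_ino) {
                (void) strlcpy(found, de->d_name, sizeof (found));
                break;
            }
        }
        (void) closedir(dp);
        if (found[0] == '\0')
            return (-1);    /* couldn't find ourselves in the parent */

        /* Prepend "/name" to the result built so far, then move up. */
        char tmp[PATH_MAX];
        (void) snprintf(tmp, sizeof (tmp), "/%s%s", found, buf);
        (void) strlcpy(buf, tmp, buflen);
        (void) strlcpy(dir, parent, sizeof (dir));
    }

    if (buf[0] == '\0')
        (void) strlcpy(buf, "/", buflen);
    return (0);
}

int
main(void)
{
    char cwd[PATH_MAX];

    if (slow_getcwd(cwd, sizeof (cwd)) == 0)
        (void) printf("%s\n", cwd);
    return (0);
}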

To debug this problem, I used this D script to track the behavior of dirtopath(), the function that performs the dirty work of the slow path. Running this for a while produced a tasty bit of information:

dirtopath       /export/ws/build/usr/src/cmd/sgs/ld
lookup(/export/ws/build/usr/src/cmd, .make.dependency.8309dfdc.234596.166) failed (2)
dirfindvp(/export/ws/build/usr/src/cmd,/export/ws/build/usr/src/cmd/sgs) failed (2)
dirtopath() returned 2

Looking at this, it was clear that dirfindvp() (which finds a given vnode in its parent) was inappropriately failing. In particular, after a failed lookup for a temporary make file, we bail out of the loop and report failure, despite the fact that “sgs” is still sitting there in the directory. A long look at the code revealed the problem. Without revealing too much of the code (OpenSolaris, where are you?), it’s essentially structured like so:

while (!err && !eof) {
    /* ... */
    while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
        /* ... */
        /*
         * We only want to bail out if there was an error other
         * than ENOENT.  Otherwise, it could be that someone
         * just removed an entry since the readdir() call, and
         * the entry we want is further on in the directory.
         */
        if (err != ENOENT) {
            break;
        }
    }
}

The code is trying to avoid exactly our situation: we fail to do a lookup of a file we just saw because the contents are rapidly changing. The bug is that the outer while loop checks !err && !eof. If we fail to look up an entry, and it’s the last entry in the chunk we just read, then we’ll prematurely bail out of the enclosing while loop, returning ENOENT when we shouldn’t. Using this test program, it’s easy to reproduce on both ZFS and UFS. There are several noteworthy aspects of this bug:

  • The bug had been in the gate for over a year, and there hadn’t been a single reported build failure.

  • It only happens when the cached vnode value is invalid, which is rare[2].

  • It is a race condition between readdir, lookup, and remove.

  • On UFS, inodes are marked as deleted but can still be looked up until the delete queue is processed at a later point. ZFS deletes entries immediately, so this was much more apparent on ZFS.

  • Because of the above, it was incredibly transient. It would have taken an order of magnitude more time to root cause if not for DTrace, which excels at solving these transient phenomena.

A three-line change fixed the bug, and the fix will make it back into S10 in time for Update 1. If it hadn’t been for those among us willing to run our builds on top of ZFS, this problem would not have been found until ZFS integrated, or until a customer escalation cost the company a whole bunch of money.
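For illustration, the shape of that fix (a sketch of the logic described above, not the actual putback) is simply to forget an ENOENT from a failed lookup before the outer loop re-tests its condition:

while (!err && !eof) {
    /* ... */
    while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {
        /* ... */
        if (err != ENOENT) {
            break;
        }
        /*
         * Sketch of the fix: the entry may simply have been removed
         * since the readdir() call, so clear the ENOENT and keep
         * scanning; only a "real" error should terminate the outer
         * loop.  If the entry is never found, ENOENT is still
         * returned after the loops complete.
         */
        err = 0;
    }
}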


[1] There are many more subtleties here relating to Zones, and verifying that the path hasn’t been changed to refer to another file. The curious among you will have to wait for OpenSolaris.

[2] I haven’t yet investigated why we ended up in the slow path in this case. First things first.

This little gem came up in conversation last night, and it was suggested that it would make a rather amusing blog entry. A Solaris project had a command line utility with the following, unspeakably horrid, piece of code:

/*
 * Use the dynamic linker to look up the function we should
 * call.
 */
(void) snprintf(func_name, sizeof (func_name), "do_%s", cmd);
func_ptr = (int (*)(int, char **))
    dlsym(RTLD_DEFAULT, func_name);
if (func_ptr == NULL) {
    fprintf(stderr, "Unrecognized command %s", cmd);
    usage();
}
return ((*func_ptr)(argc, argv));

So when you type “a.out foo”, the command would snprintf into a buffer to make “do_foo”, and rely on the dynamic linker to find the appropriate function to call. Before I get a stream of comments decrying the idiocy of Solaris programmers: the code will never reach the Solaris codebase, and the responsible party no longer works at Sun. The participants at the dinner table were equally disgusted that this piece of code came out of our organization. Suffice it to say that this is much better served by a table:

for (i = 0; i < sizeof (func_table) / sizeof (func_table[0]); i++) {
    if (strcmp(func_table[i].name, cmd) == 0)
        return (func_table[i].func(argc, argv));
}

fprintf(stderr, "Unrecognized command %s", cmd);
usage();
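For completeness, the table itself is trivial to declare next to the command functions; the command names and do_* functions here are purely hypothetical:

static const struct cmd_entry {
    const char *name;
    int (*func)(int, char **);
} func_table[] = {
    { "create",  do_create },   /* hypothetical commands */
    { "destroy", do_destroy },
    { "list",    do_list },
};

The loop above assumes a declaration along these lines is visible in the same file or a shared header.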

I still can’t imagine the original motivation for this code. It is more code, harder to understand, and likely slower (depending on the number of commands and how much you trust the dynamic linker’s hash table). We continually preach software observability and transparency – but I never thought I’d see obfuscation of this magnitude within a 500 line command line utility. This prevents us from even searching for callers of do_foo() using cscope.

This serves as a good reminder that the most clever way of doing something is usually not the right answer. Unless you have a really good reason (such as performance), being overly clever will only make your code more difficult to maintain and more prone to error.

Update – Since some people seem a little confused, I thought I’d elaborate on two points. First off, there is no loadable library: the library is linked directly to the application, so there is no need to asynchronously update the commands. Second, the proposed function table does not have to live separately from the code. It would be quite simple to put the function table in the same file as the function definitions, which would improve maintainability and understandability by an order of magnitude.
