Writing the OpenSolaris Bible

2008 was a busy year for me, since I spent most of my free
time co-authoring a book on OpenSolaris: the
OpenSolaris Bible.

Having never written a book before, this was a new experience for me.
Nick originally had
the idea for writing a book on OpenSolaris and he’d already published
Professional C++ with Wiley,
so he had an agent and a relationship with a publisher. In December 2007 he contacted
me about being a co-author and after thinking it through, I agreed. I had
always thought writing a book was something I wanted to do, so I was
excited to give this a try. Luckily,
Dave agreed to be the
third author on the book, so we had our writing team in place. After
some early discussions, Wiley decided our material fit best into their
“Bible” series, hence the title.

In early January 2008 the three of us worked on the outline and decided which chapters
each of us would write. We actually started writing in early
February of 2008. Given the publishing schedule we had with Wiley, we had
to complete each chapter in about 3 weeks, so there wasn’t a lot of time to
waste. Also, because this project was not part of our normal work for
Sun, we had to ensure that we only worked on the book on our own time, that is, evenings and
weekends. In the end it turned out that we each wrote
exactly a third of the book, based on the page counts.
Since the book came out at around 1000 pages, with approximately
950 pages of written material, not counting front matter or the index,
we each wrote over 300 pages of content. Over the course of the project we were
also fortunate that many
of our friends and colleagues who work on OpenSolaris were willing to review
our early work and provide much useful feedback.

We finished the first draft at the end of August 2008 and worked on the revisions
to each chapter through early December 2008. Of course the
OpenSolaris 2008.11
release came out right at the end of our revision process, so we had to scramble
to be sure that everything in the book was up-to-date with respect to the new
release.

From a personal perspective, this was a particularly difficult year because we
also moved to a “new” house in April of 2008. Our new house is actually about
85 years old and hadn’t been very well maintained for a while, so it needs some
work. The first week we moved in, the boiler went out, the sewer backed up
into the basement, the toilet and the shower wouldn’t stop running, the
electrical work for our office took longer than expected, our DSL wasn’t hooked
up right, and about a million other things all seemed to go wrong. Somehow we
managed to cope with all of that and keep working at our real jobs, and I was still able
to finish my chapters for the book on schedule. I’m pretty sure
Sarah
wasn’t expecting anything like this when I talked to her about working on the book
the previous December.
Needless to say, we’re looking forward to a less hectic 2009.

If you are at all interested in OpenSolaris, then I hope you’ll find something in our
book that is worthwhile, even if you already know a lot about the OS. The book is
targeted primarily at end-users and system administrators. It has
a lot of breadth and we tried to include a balanced mix of introductory material as well as advanced
techniques. Here’s the table of contents so you can get a feel for what’s in the book.

I. Introduction to OpenSolaris.
1. What Is OpenSolaris?
2. Installing OpenSolaris.
3. OpenSolaris Crash Course.
II. Using OpenSolaris
4. The Desktop.
5. Printers and Peripherals.
6. Software Management.
III. OpenSolaris File Systems, Networking, and Security.
7. Disks, Local File Systems, and the Volume Manager.
8. ZFS.
9. Networking.
10. Network File Systems and Directory Services.
11. Security.
IV. OpenSolaris Reliability, Availability, and Serviceability.
12. Fault Management.
13. Service Management.
14. Monitoring and Observability.
15. DTrace.
16. Clustering for High Availability.
V. OpenSolaris Virtualization.
17. Virtualization Overview.
18. Resource Management.
19. Zones.
20. xVM Hypervisor.
21. Logical Domains (LDoms).
22. VirtualBox.
VI. Developing and Deploying on OpenSolaris.
23. Deploying a Web Stack on OpenSolaris.
24. Developing on OpenSolaris.

If this looks interesting, you can pre-order a copy from Amazon
here. It comes out early next month, February 2009, and
I’m excited to hear people’s reactions once they’ve actually had a chance to look
it over.

Posted on January 6, 2009 at 4:35 pm by jerry · Permalink · 7 Comments
In: General

Updating zones on OpenSolaris 2008.11 using detach/attach

In my
last post
I talked a bit about the new way that software and dataset management works for
zones on the
2008.11
release.

One of the features still under development is a way to
automatically keep the non-global zones in sync with
the global zone when you do a ‘pkg image-update’. The
IPS
project still needs some additional enhancements to be
able to describe the software dependencies between the
global and non-global zones. In the meantime, you must
manually ensure that you update the non-global zones after
you do an image-update and reboot the global zone. Doing
this will create new ZFS datasets for each zone which you can
then manually update so that they match the global zone software
release.

The easiest way to update the zones is to use the new detach/attach
capabilities we added to the 2008.11 release. You can simply detach
the zone, then re-attach it. We provide some support for the zone
update on attach
option for ipkg-branded zones, so you can use ‘attach -u’ to simply update
the zone.

The following shows an example of this.

# zoneadm -z jj1 detach
# zoneadm -z jj1 attach -u
Global zone version: pkg:/entire@0.5.11,5.11-0.101:20081119T235706Z
Non-Global zone version: pkg:/entire@0.5.11,5.11-0.98:20080917T010824Z
Updating non-global zone: Output follows
Cache: Using /var/pkg/download.
PHASE                                        ITEMS
Indexing Packages                            54/54
DOWNLOAD                                  PKGS       FILES     XFER (MB)
Completed                                54/54   2491/2491   52.76/52.76
PHASE                                      ACTIONS
Removal Phase                            1253/1253
Install Phase                            1440/1440
Update Phase                             3759/3759
Reading Existing Index                         9/9
Indexing Packages                            54/54
pkg:/entire@0.5.11,5.11-0.98:20080917T010824Z

Here you can see how the zone is updated when it is re-attached to the
system. This updates the software in the currently active dataset associated with
the global zone BE. If you roll back to an earlier image, the dataset associated
with the zone and the earlier BE will be used instead of this newly updated dataset.
We’ve also enhanced the IPS code so it can use the pkg cache from the global
zone, thus the zone update is very quick.

Because the zone attach feature is implemented as a brand-specific capability,
each brand provides its own options for how zones can be attached. In addition
to the -u option, the ipkg brand supports a -a or -r option. The -a option allows
you to take an archive (cpio, bzip2, gzip, or USTAR tar) of a zone from another
system and attach it. The -r option allows you to receive the output of a ‘zfs send’
into the zone’s dataset. Either of these options can be combined with -u to
enable zone migration from one OpenSolaris system to another. An additional
option, which didn’t make it into 2008.11, but is in the development release, is
the -d option, which allows you to specify an existing dataset to be used for the
attach. The attach operation will take that dataset and add all of the properties
needed to make it usable on the current global zone BE.
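
For example, a rough sketch of migrating a zone this way might look like the following. The zone name, paths, and the choice of cpio for the archive are just assumptions for illustration, and the zone still has to be configured (with zonecfg) on the target system before you attach it; see zoneadm(1M) for the exact archive formats and options.

source# zoneadm -z web1 halt
source# zoneadm -z web1 detach
source# cd /export/zones/web1
source# find . -print | cpio -o > /tmp/web1.cpio

target# zoneadm -z web1 attach -u -a /tmp/web1.cpio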

If you used zones on 2008.11, you might have noticed that the zone’s dataset
is not mounted when the zone is halted. This is something we might change in
the future, but in the meantime, one final feature related to zone detach is that it
leaves the zone’s dataset mounted. This provides an easy way to access the zone’s
data: simply detach the zone, access the zone’s mounted file system, and then
re-attach it when you’re done.
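
For instance, using the jj1 zone from the earlier example (the zonepath /export/zones/jj1 is just an assumption here; the zone root is mounted under the zonepath at root):

# zoneadm -z jj1 detach
# ls /export/zones/jj1/root/etc
# zoneadm -z jj1 attach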

Posted on December 23, 2008 at 9:59 am by jerry · Permalink · Leave a comment
In: General

zones on OpenSolaris 2008.11

The
OpenSolaris 2008.11
release just came out and we’ve made
some significant changes in the way that zones are installed
on this release. The motivation for these changes is that we
eventually want software management operations using
IPS
to work in a non-global zone much the same way they work
in the global zone. Global zone software management
uses the
SNAP Upgrade
project along with IPS and the idea is to create a new Boot
Environment (BE) when you update the software in the global
zone. A BE is based on a ZFS snapshot and clone, so that you
can easily roll back if there are any problems with the newly
installed software. Because the software in the non-global zones
should be in sync with the global zone, when a new BE is created
each of the non-global zones must also have a new ZFS snapshot and
clone that matches up to the new BE.

We’d also eventually like to have the same software management capabilities
within a non-global zone. That is, we’d like the non-global zone
system administrator to be able to use IPS to install software in
the zone, and as part of this process, a new BE inside the zone would
be created based on a ZFS snapshot and clone. This way the
non-global zone can take advantage of the same safety features for
rolling back that are available in the global zone.

In order to provide these capabilities, we needed to make some
important changes in how zones are laid out in the file system.
To support all of this we need the actual zone root file system
to be its own delegated ZFS dataset. In this way the non-global zone
sysadmin can make their own ZFS snapshots and clones of the zone root
and the IPS software can automatically create a new BE within the zone
when a software management operation takes place in the zone.
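
As a concrete illustration, here is roughly what the dataset layout looks like for a zone named foo with a zonepath of /export/zones/foo. The pool name (rpool) and the zone BE dataset name (zbe) are assumptions based on the layout described in the spec, so check your own system for the actual names.

# zfs list -r -o name rpool/export/zones/foo
NAME
rpool/export/zones/foo
rpool/export/zones/foo/ROOT
rpool/export/zones/foo/ROOT/zbe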

The gory details of this are discussed in the spec.

Not all of the capabilities described above work yet, but we have laid
a foundation to enable them in the future. In particular, when you create
a new global zone BE, all of the non-global zones are cloned as well.
However, running image-update in the global zone still doesn’t update each
individual zone. You still need to do that manually, as Dan described
in his
blog
about zones on the 2008.05 release. In a future post I’ll talk about some other ways
to update each zone. Another feature that isn’t done yet is the full
SNAP Upgrade support from within the zone itself. That is, zone roots
are now delegated ZFS datasets, but when you run IPS inside the zone,
a new clone is not automatically created. Adding this feature should be fairly
straightforward though, now that the basic support is in the release.

With all of these changes to how zone roots use ZFS in 2008.11, here is
a summary of the important differences and limitations with using zones
on 2008.11.

1) Existing zones can’t be used. If you have zones
installed on an earlier release of OpenSolaris and image-update to 2008.11 or
later, those zones won’t be usable.

2) Your global zone BE needs a UUID. If you are running 2008.11 or later,
your global zone BE will already have one.

3) Zones are only supported on ZFS. This means that the zonepath
must be a dataset. For example, if the zonepath for your
zone is /export/zones/foo, then /export/zones must be a dataset.
The zones code will then create the foo dataset and all the
underlying datasets when you install the zone (see the short sketch after this list).

4) As I mentioned above, image-updating the global BE doesn’t update
the zones yet. After you image-update the global zone, don’t forget to
update the new BE for each zone so that it is in sync with the global zone.
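
Here is a minimal sketch of setting up a new zone under these rules, using the /export/zones/foo example from item 3. The pool name rpool is an assumption, and on a default 2008.11 install the mountpoint may already be inherited correctly, so adjust for your configuration.

# zfs create rpool/export/zones
# zonecfg -z foo "create; set zonepath=/export/zones/foo"
# zoneadm -z foo install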

Posted on December 10, 2008 at 9:14 am by jerry · Permalink · Leave a comment
In: Solaris

A busy week for zones

This is turning out to be a busy week for zones related news. First,
the newest version of Solaris 10, the 8/07 release, is now
available.
This release includes the improved resource management
integration with zones that has been available for a while now in
the OpenSolaris nevada code base and which I blogged about
here. It also includes other zones enhancements such as
brandz and IP instances. Jeff Victor has a nice description
of all of these new zone features on his
blog.

If that wasn’t enough, we have started to talk about our latest
project, code named Etude. This is a new brand for zones, building on the
brandz framework, and allows you to run a Solaris 8 environment
within a zone. We have been working on this project for a good part
of the year and it is exciting to finally be able to talk more about it.
With Etude you can quickly consolidate those dusty old
Solaris 8 SPARC systems, running on obsolete hardware, onto current generation,
energy efficient systems.
Marc Hamilton, VP of Solaris Marketing, describes
this project at a high level on his blog, but for more details, Dan Price,
our project lead, wrote up a really nice overview on his blog.
If you have old systems still running Solaris 8 and would like an
easy path to Solaris 10 and to newer hardware, then this project
might be what you need.

Posted on September 6, 2007 at 8:02 am by jerry · Permalink · Leave a comment
In: Solaris

Containers in SX build 56

The many Resource Management (RM) features in Solaris
have been developed and evolved over the course of years and several releases.
We have resource controls, resource pools, resource capping
and the Fair Share Scheduler (FSS). We have rctls, projects, tasks,
cpu-shares, processor sets and the rcapd(1M). All of these features
have different commands and syntax to configure the
feature. In some cases, particularly with resource pools, the
syntax is quite complex and long sequences of commands are needed
to configure a pool. When you first look at RM it is not immediately
clear when to use one feature vs. another or if some combination
of these features is needed to achieve the RM objectives.

In Solaris 10 we introduced Zones, a lightweight system virtualization
capability. Marketing
coined the term ‘containers’ to refer to a combination of
Zones and RM within Solaris. However, the integration
between the two was fairly weak. Within Zones we had the ‘rctl’
configuration option, which you could use to set a couple of zone specific
resource controls, and we had the ‘pool’ property which could
be used to bind the zone to an existing resource pool, but that was it.
Just setting the ‘zone.cpu-shares’ rctl wouldn’t actually
give you the right cpu shares unless you also configured the
system to use FSS. But, that was
a separate step and easily overlooked. Without the correct
configuration of these various, disparate components even a simple
test, such as a fork bomb within a zone, could disrupt the
entire system.

As users started experimenting with Zones we found that
many of them were not leveraging the RM capabilities provided
by the system. We would get dinged in evaluations
because Zones, without a correct RM configuration, didn’t provide all
of the containment users needed.
We always expected Zones and RM to be used together, but due to
the complexity of the RM features and the loose integration between
the two, we were seeing that few Zones users actually had a proper RM
configuration. In addition, our RM for memory control
was limited to rcapd running within a zone and capping RSS on projects.
This wasn’t really adequate.

About 9 months ago the Zones engineering team started a project to
try to improve this situation. We didn’t want to just paper over
the complexity with things like a GUI or wizards, so it took
us quite a bit of design before we felt like we hit upon
some key abstractions that we could use to truly simplify the
interaction between the two components. Eventually we settled upon
the idea of organizing the RM features into ‘dedicated’ and ‘capped’
configurations for the zone. We enhanced resource pools to add
the idea of a ‘temporary pool’ which we could dynamically instantiate
when a zone boots. We enhanced rcapd(1M) so that we could do physical
memory capping from the global zone. Steve Lawrence did a lot
of work to improve resident set size (RSS) accounting as well
as adding new rctls for maximum swap and locked memory.
These new features significantly improve RM of memory for Zones.
We then enhanced the Zones infrastructure to automatically do
the work to set up the various RM features that were configured
for the zone. Although the project made many smaller
improvements, the key ideas are the two new configuration options
in zonecfg(1M). When configuring a zone you can now configure
‘dedicated-cpu’ and ‘capped-memory’. Going forward, as additional
RM features are added, we anticipate this idea will evolve gracefully
to add ‘dedicated-memory’ and ‘capped-cpu’ configuration. We also
think this concept can be easily extended to support RM features for other
key parts of the system such as the network or storage subsystem.

Here is our simple diagram of how we eventually unified the RM
view within Zones.

       | dedicated  |  capped
---------------------------------
cpu    | temporary  |  cpu-cap
       | processor  |  rctl*
       | set        |
---------------------------------
memory | temporary  |  rcapd, swap
       | memory     |  and locked
       | set*       |  rctl

* memory sets and cpu caps are under development but are not yet part of Solaris.

With these enhancements, it is now almost
trivial to configure RM for a zone. For example, to configure
a resource pool with a set of up to four CPUs, all you do in zonecfg is:

zonecfg:my-zone> add dedicated-cpu
zonecfg:my-zone:dedicated-cpu> set ncpus=1-4
zonecfg:my-zone:dedicated-cpu> set importance=10
zonecfg:my-zone:dedicated-cpu> end

To configure memory caps, you would do:

zonecfg:my-zone> add capped-memory
zonecfg:my-zone:capped-memory> set physical=50m
zonecfg:my-zone:capped-memory> set swap=128m
zonecfg:my-zone:capped-memory> set locked=10m
zonecfg:my-zone:capped-memory> end

All of the complexity of configuring the associated RM capabilities
is then handled behind the scenes when the zone boots. Likewise,
when you migrate a zone to a new host, these RM settings migrate too.
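
If you want to see the results, a few existing observability commands will show what was set up after the zone boots. This is just a sketch using the my-zone example above: poolstat shows the temporary pool, prctl shows the swap cap rctl, and rcapstat -z (if present on your build) shows the physical memory cap being enforced.

# zoneadm -z my-zone boot
# poolstat
# prctl -n zone.max-swap -i zone my-zone
# rcapstat -z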

Over the course of the project we discussed these ideas within the
OpenSolaris Zones community, where we benefited from much good
input which we used in the final design and implementation.
The full details of the project are available here and here.

This work is available in Solaris Express build 56, which was just
posted. Hopefully folks using Zones will get a chance to try
out the new features and let us know what they think. All of
the core engineering team actively participates in the zones-discuss
list and we’re happy to try to answer any questions or just hear
your thoughts.

Posted on February 1, 2007 at 12:25 pm by jerry · Permalink · 2 Comments
In: Solaris

SVM root mirroring and GRUB

Although I haven’t been working on SVM for over 6 months (I am
working on Zones now), I still get
questions about SVM and x86 root mirroring from time to time. Some
of these procedures are different when using the new x86 boot
loader (GRUB) that is now part of Nevada and S10u1. I have some old
notes that I wrote up about 9 months ago that describe the
updated procedures and I think these are still valid.

Root mirroring on x86 is more complex than root mirroring
on SPARC. Specifically, there are issues with being able to boot
from the secondary side of the mirror when the primary side fails.
On x86 machines the system BIOS and fdisk partitioning are the
complicating factors.

The x86 BIOS is analogous to the PROM interpreter on SPARC.
The BIOS is responsible for finding the right device to boot
from, then loading and executing GRUB from that device.

All modern x86 BIOSes are configurable to some degree but the
discussion of how to configure them is beyond the scope of this
document. In general you can usually select the order of devices that
you want the BIOS to probe (e.g. floppy, IDE disk, SCSI disk, network)
but you may be limited in configuring at a more granular level.
For example, it may not be possible to configure the BIOS to probe
the first and second IDE disks. These limitations may be a factor
with some hardware configurations (e.g. a system with two IDE
disks that are root mirrored). You will need to understand
the capabilities of the BIOS that is on your hardware. If your
primary boot disk fails you may need to break into the BIOS
while the machine is booting and reconfigure to boot from the
second disk.

On x86 machines fdisk partitions are used and it is common to have
multiple operating systems installed. Also, there are different flavors
of master boot programs (e.g. LILO) in addition to GRUB which is the standard
Solaris master boot program. The boot(1M) man page is a good resource
for a detailed discussion of the multiple components that are used during
booting on Solaris x86.

Since SVM can only mirror Solaris slices within the Solaris fdisk
partition this discussion will focus on a configuration that only
has Solaris installed. If you have multiple fdisk partitions
then you will need to use some other approach to protect the data
outside of the Solaris fdisk partition.

Once your system is installed you create your metadbs and root
mirror using the normal procedures.
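
As a refresher, a typical sequence looks something like the following; the disk, slice, and metadevice names are just examples, and you reboot after metaroot before attaching the second submirror:

# metadb -a -f -c 3 c0t0d0s7 c0t1d0s7
# metainit -f d10 1 1 c0t0d0s0
# metainit d20 1 1 c0t1d0s0
# metainit d0 -m d10
# metaroot d0
# reboot
# metattach d0 d20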

You must ensure that both disks are bootable so that you can boot from
the secondary disk if the primary fails. You use the installgrub program
to set up the second disk as a Solaris bootable disk (see installgrub(1M)).
An example command is:

/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

Solaris x86 emulates some of the behavior of the SPARC eeprom. See
eeprom(1M). The boot device is stored in the “bootpath” property that
you can see with the eeprom command. The value should be assigned to
the device tree path of the root mirror. For example:

bootpath=/pseudo/md@0:0,10,blk
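
A quick sketch of checking and setting this with eeprom(1M), using the md device path from the example above:

# eeprom bootpath
# eeprom bootpath="/pseudo/md@0:0,10,blk"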

Next you need to modify the GRUB boot menu so that you can manually boot
from the second side of the mirror, should this ever be necessary.
Here is a quick overview of the GRUB disk naming convention.

(hd0),(hd1) — first & second BIOS disk (entire disk)

(hd0,0),(hd0,1) — first & second fdisk partition of first BIOS disk

(hd0,0,a),(hd0,0,b) — Solaris/BSD slice 0 and 1 on first fdisk
partition on the first BIOS disk

Hard disk names start with hd and a number, where 0 maps to BIOS
disk 0x80 (the first disk enumerated by the BIOS), 1 maps to 0x81, and so on.
One annoying aspect of BIOS disk numbering is that the order may change
depending on the BIOS configuration. Hence, the GRUB menu may become
invalid if you change the BIOS boot disk order or modify the disk
configuration. Knowing the disk naming convention is essential to
handling boot issues related to disk renumbering in the BIOS.
This will be a factor if the primary disk in the mirror is not seen by
the BIOS so that it renumbers and boots from the secondary disk in the
mirror. Normally this renumbering will mean that the system can
still automatically boot from the second disk, since you configured
it to boot in the previous steps, but it becomes a factor
when the first disk becomes available again, as described below.

You should edit the GRUB boot menu in /boot/grub/menu.lst and add
an entry for the second disk in the mirror.
It is important that you be able to manually boot from
the second side of the mirror due to the BIOS renumbering described
above. If the primary disk is unavailable, the boot archive on that
disk may become stale. Later, if you boot and that disk is available
again, the BIOS renumbering would cause GRUB to load that stale boot
archive which could cause problems or may even leave the system unbootable.
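
An entry for the second disk might look something like the following sketch; the title is arbitrary and the (hd1,0,a) tuple depends on how the BIOS enumerates your disks and on your fdisk/slice layout, so adjust it to match your system:

title Solaris (second mirror disk)
root (hd1,0,a)
kernel /platform/i86pc/multiboot
module /platform/i86pc/boot_archive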

If the primary disk is once again made available and then you reboot without
first resyncing the mirror back onto the primary drive, then you
should use the GRUB menu entry for the second disk to manually boot from
the correct boot archive (the one on the secondary side of the mirror).
Once the system is booted, perform normal metadevice maintenance to resync
the primary disk. This will restore the current boot archive to the
primary so that subsequent boots from that disk will work correctly.

The previous procedure is not normally necessary, since you would replace the
failed primary disk using cfgadm(1M) and resync, but it will be required
if the primary is simply not powered on, causing the BIOS to miss the disk
and renumber. Subsequently powering up this disk and rebooting would
cause the BIOS to renumber again and by default you would boot from the
stale disk.

Note that all of the usual considerations of mddb quorum apply
to x86 root mirroring, just as they do for SPARC.

Posted on February 20, 2006 at 11:42 am by jerry · Permalink · Leave a comment
In: Solaris

Moving and cloning zones

It’s been quite a long time since my last blog entry.
I have moved over from the SVM team onto the Zones team
and I have been busy coming up to speed on Zones.

So far I have just fixed a few Zones bugs but now I am
starting to work on some new features. One of the big
things people want from Zones is the ability to move
them and clone them.

I have a proposal for moving and cloning zones over on the
OpenSolaris zones discussion. This has been approved by our internal
architectural review committee and the code is basically done, so it
should be available soon.
Moving a zone is currently limited to a single system, but the
next step is migrating the zone from one machine to another.
That’s the next thing we’re going to work on.
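
To give a feel for it, here is a hypothetical sketch of how the new subcommands might be used once they are available; the zone names and path are made up, and for cloning you first configure the new zone with zonecfg (for example by exporting the source zone’s configuration and changing the zonepath). The proposal linked above has the authoritative syntax and details.

# zoneadm -z myzone move /newpool/zones/myzone
# zoneadm -z newzone clone myzone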

For cloning we currently copy the bits from one zone instance
to another and we’re seeing significant performance wins compared
to installing the zone from scratch (13x faster on one test machine).
With ZFS now available, it seems obvious that we could use ZFS clones
to quickly clone zone instances. This is something that we are actively
looking at, but for now we don’t recommend that you place your zonepath
on ZFS. This is bug 6356600
and is due to the current limitation that you won’t be able to
upgrade your system if your zonepath is on ZFS. Once the upgrade
issues have been resolved, we’ll be extending Zone cloning to be
better integrated with ZFS clones. In the meantime, while you can
use ZFS to hold your zones, you need to be aware that the system won’t
be upgradeable.

Posted on December 8, 2005 at 10:56 am by jerry · Permalink · One Comment
In: Solaris

FROSUG

I haven’t written a blog for quite a while now. I’m actually
not working on SVM right now. Instead, I am busy
with some zones-related work. It has been
a busy summer. My wife and I were in Beijing for about 10 days
talking to some customers about Solaris.
Sarah has
posted
some pictures and written a funny story
about our trip.

Last night we had the inaugural meeting of
FROSUG
(the Front Range OpenSolaris Users Group).
I gave an overview presentation introducing
OpenSolaris.
The meeting seemed to go well and it got blogged
by Stephen O’Grady, which
is pretty cool. Hopefully it will take off and we can get
an active community of OpenSolaris people in the Denver area.

Posted on August 12, 2005 at 2:15 pm by jerry · Permalink · One Comment
In: General

SVM resync cancel/resume

The latest release of Solaris Express came out the other day.
Dan has his usual excellent summary.
He mentions one cool new SVM feature, but it might be easy to overlook
it since there are so many other new things in this release. The new
feature is the ability to cancel a mirror resync that is underway.
The resync is checkpointed and you can restart it later. It will simply pick
up where it left off. This is handy
if the resync is affecting performance and you’d like to wait until
later to let it run. Another use for this is if you need to reboot.
With the old code, if a full resync was underway and you rebooted,
the resync would start over from the beginning. Now, if you cancel it
before rebooting, the checkpoint will allow the resync to pick up where
it left off.
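
In practice it is just a couple of commands. Here is a quick sketch, assuming a mirror named d0 and assuming the cancel option is the -c flag to metasync(1M); check the current man page for the exact syntax:

# metasync -c d0
# metasync d0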

This code is already in
OpenSolaris.
You can see the CLI changes in
metasync.c
and the library changes in
meta_mirror_resync_kill.
The changes are fairly small because most of the underlying support
was already implemented for multi-node disksets. All we had to do was
add the CLI option and hook into the existing ioctl. You can see some of
this resync functionality in the
resync_unit
function. There is a nice big comment there which explains some of this
code.

Technorati Tag:
OpenSolaris

Technorati Tag:
Solaris

Posted on June 23, 2005 at 7:28 am by jerry · Permalink · 16 Comments
In: Solaris

SVM and the B_FAILFAST flag

Now that
OpenSolaris
is here it is a lot easier to talk about some of the interesting
implementation details in the code. In this post I wanted to
discuss the first project I did after I started to work on the
Solaris Volume Manager (SVM). This is on my mind right now
because it also happens to be related to one of my most recent changes
to the code. This change is not even in
Solaris Express
yet, it is only available in
OpenSolaris.
Early access to these kinds of changes is just one small reason why OpenSolaris is so cool.

My first SVM project was to add support for the B_FAILFAST flag.
This flag is defined in /usr/include/sys/buf.h and it was
implemented in some of the disk drivers so that I/O requests
that were queued in the driver could be cleared out quickly when
the driver saw that a disk was malfunctioning. For SVM the
big requester for this feature was our clustering software. The
problem they were seeing was that in a production environment
there would be many concurrent I/O requests queued up down in
the sd driver. When the disk was failing the sd driver would
need to process each of these requests, wait for the timeouts
and retries and slowly drain its queue. The cluster software
could not fail over to another node until all of these pending
requests had been cleared out of the system. The B_FAILFAST flag
is the exact solution to this problem. It tells the driver
to do two things. First, it reduces the number of retries that
the driver does to a failing disk before it gives up and returns
an error. Second, when the first I/O buf that is queued up in
the driver gets an error, the driver will immediately error
out all of the other, pending bufs in its queue. Furthermore,
any new bufs sent down with the B_FAILFAST flag will immediately
return with an error.

This seemed fairly straightforward to implement in SVM. The
code had to be modified to detect if the underlying devices
supported the B_FAILFAST flag and if so, the flag should be
set in the buf that was being passed down from the md driver
to the underlying drivers that made up the metadevice. For
simplicity we decided we would only add this support to the
mirror metadevice in SVM. However, the more I looked at this,
the more complicated it seemed to be. We were worried about
creating new failure modes with B_FAILFAST and the big concern was the
possibility of a “spurious” error. That is, getting back an
error on the buf that we would not have seen if we had let the
underlying driver perform its full set of timeouts and retries.
This concern eventually drove the whole design of the initial B_FAILFAST
implementation within the mirror code. To handle this spurious
error case I implemented an algorithm within the driver so that when we
got back an errored B_FAILFAST buf we would resubmit that buf without the
B_FAILFAST flag set. During this retry, all of the other failed I/O bufs would
also immediately come back up into the md driver. I queued those
up so that I could either fail all of them after the retried buf
finally failed or I could resubmit them back down to the underlying
driver if the retried I/O succeeded. Implementing this correctly
took a lot longer than I originally expected when I took on this first
project, and it was one of those things that worked but that I was never
very happy with. The code was complex and I never felt
completely confident that there wasn’t some obscure error condition
lurking here that would come back to bite us later. In addition,
because of the retry, the failover of a component within a mirror
actually took *longer* now if there was only a single I/O
being processed.

This original algorithm was delivered in the S10 code and was also released
as a patch for S9 and SDS 4.2.1. It has been in use for a couple of years
which gave me some direct experience with how well the B_FAILFAST
option worked in real life. We actually have seen one or two
of these so-called spurious errors, but in all cases there were real,
underlying problems with the disks. The storage was marginal
and SVM would have been better off just erroring out those components
within the mirror and immediately failing over to the good side
of the mirror. By this time I was comfortable with this idea so
I rewrote the B_FAILFAST code within the mirror driver. This new
algorithm is what you’ll see today in the
OpenSolaris
code base. I basically decided to just trust the error we get
back when B_FAILFAST is set. The code will follow the normal error
path so that it puts the submirror component into the maintenance state
and just uses the other, good side of the mirror from that point onward.
I was able to remove the queue and simplify the logic almost back to
what it was before we added support for B_FAILFAST.

However, there is still one special case we have to worry about when
using B_FAILFAST. As I mentioned above, when B_FAILFAST is set, all
of the pending I/O bufs that are queued down in the underlying driver
will fail once the first buf gets an error. When we are down to the
last side of a mirror the SVM code will continue to try to do I/O to
those last submirror components, even though they are taking errors.
This is called the LAST_ERRED state within SVM and is an attempt to try to
provide access to as much of your data as possible. When using B_FAILFAST
it is probable that not all of the failed I/O bufs will have been
seen by the disk and given a chance to succeed. With the new algorithm
the code detects this state and reissues all of the I/O bufs without
B_FAILFAST set. There is no longer any queueing, we just resubmit the I/O
bufs without the flag and all future I/O to the submirror is done
without the flag. Once the LAST_ERRED state is cleared the code will
return to using the B_FAILFAST flag.

All of this is really an implementation detail of mirroring in SVM.
There is no user-visible component of this except for a change in
the behavior of how quickly the mirror will fail the errored drives
in the submirror. All of the code is contained within the mirror
portion of the SVM driver and you can see it in
mirror.c.
The function
mirror_check_failfast
is used to determine if all of the components
in a submirror support using the B_FAILFAST flag. The
mirror_done
function is called when the I/O to the underlying submirror is complete.
In this function we check if the I/O failed and if B_FAILFAST was set.
If so we call the
submirror_is_lasterred
function to check for that
condition and the
last_err_retry
function is called only when we
need to resubmit the I/O. This function is actually executed in
a helper thread since the I/O completes in a thread separately from
the thread that initiated the I/O down into the md driver.

To wrap up, the SVM md driver code lives in the source tree at
usr/src/uts/common/io/lvm.
The main md driver is in the
md
subdirectory
and each specific kind of metadevice also has its own subdirectory (
mirror,
stripe,
etc.). The SVM command line utilities live in
usr/src/cmd/lvm
and the shared library code that SVM uses lives in
usr/src/lib/lvm.
Libmeta
is the primary library. In another post
I’ll talk in more detail about some of these other components of SVM.

Technorati Tag:
OpenSolaris

Technorati Tag:
Solaris

Posted on June 14, 2005 at 8:05 am by jerry · Permalink · 3 Comments
In: Solaris