Eric Schrock's Blog

Month: November 2008

In the past, I’ve discussed the evolution of disk FMA. Much has been accomplished in the past year, but there are still several gaps when it comes to ZFS and disk faults. In Solaris today, a fault diagnosed by ZFS (a device failing to open, too many I/O errors, etc.) is reported as a pool name and 64-bit vdev GUID. This description leaves something to be desired, referring the user to run zpool status to determine exactly what went wrong. But the user still has to know how to go from a cXtYdZ Solaris device name to a physical device, and once they locate the physical device they need to manually issue a zpool replace command to initiate the replacement.

While this is annoying in the Solaris world, it’s completely untenable in an appliance environment, where everything needs to “just work”. With that in mind, I set about to plug the last few holes in the unified plan:

  • ZFS faults must be associated with a physical disk, including the human-readable label
  • A disk fault (ZFS or SMART failure) must turn on the associated fault LED
  • Removing a disk (faulted or otherwise) and replacing it with a new disk must automatically trigger a replacement

While these seem like straightforward tasks, as usual they are quite difficult to get right in a truly generic fashion. And for an appliance, there can be no Solaris commands or additional steps for the user. To start with, I needed to push the FRU information (expressed as an FMRI in the hc libtopo scheme) into the kernel (and onto the ZFS disk label) where it would be available with each vdev. While it is possible to do this correlation after the fact, it simplifies the diagnosis engine and is required for automatic device replacement. There are some edge conditions around moving and swapping disks, but this was relatively straightforward. Once the FMRI was there, I could include the FRU in the fault suspect list, and using Mike’s enhancements to libfmd_msg, dynamically insert the FRU label into the fault message. Traditional FMA libtopo labels do not include the chassis label, so in the Fishworks stack we go one step further and re-write the label on receipt of a fault event with the user-defined chassis name as well as the physical slot. This message is then used when posting alerts and on the problems page. We can also link to the physical device from the problems page, and highlight the faulty disk in the hardware view.
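As a rough illustration of the label rewrite described above, here is a minimal sketch. It assumes a simplified hc-scheme FMRI consisting only of `name=instance` path components (real FMRIs carry more detail), and both function names and the `chassis_names` mapping are hypothetical, not part of libtopo or the appliance kit.

```python
def parse_hc(fmri):
    """Split a simplified hc-scheme FMRI into (name, instance) pairs."""
    _scheme, _sep, path = fmri.partition("///")
    pairs = []
    for component in path.split("/"):
        name, _eq, instance = component.partition("=")
        pairs.append((name, int(instance)))
    return pairs


def rewrite_label(fmri, chassis_names):
    """Build a user-facing label from the chassis name and the physical slot.

    chassis_names is a hypothetical mapping from the first FMRI component
    (for example ("ses-enclosure", 1)) to the name the user assigned.
    """
    pairs = parse_hc(fmri)
    chassis_key = pairs[0]
    # Fall back to the raw component if the user never named this chassis.
    chassis = chassis_names.get(chassis_key, f"{chassis_key[0]} {chassis_key[1]}")
    bay = dict(pairs)["bay"]
    return f"{chassis}: slot {bay}"


label = rewrite_label(
    "hc:///ses-enclosure=1/bay=7/disk=0",
    {("ses-enclosure", 1): "rack2-jbod"},
)
print(label)
```

The point of the sketch is only that the label is derived from the stored FMRI plus user-supplied naming at the time the fault event arrives, rather than being fixed in the topology.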

With the FMA plumbing now straightened out, I needed a way to light the fault LED for a disk, regardless of whether it was in the system chassis or an external enclosure. Thanks to Rob’s sensor work, libtopo already presents an FMRI-centric view of indicators in a platform-agnostic manner. So I rewrote the disk-monitor module (or really, deleted everything and created a new fru-monitor module) to both poll for FRU hotplug events and manage the fault LEDs for components. When a fault is generated, the FRU monitor looks through the suspect list, and turns on the fault LED for any component that has a supported indicator. This is then turned off when the corresponding repair event is generated. This also had the side benefit of generating hotplug events phrased in terms of physical devices, which the appliance kit can use to easily present informative messages to the user.
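The fault and repair handling can be sketched as follows. This is not the fru-monitor module itself: the class, its method names, and the callback-per-FMRI representation of indicators are all assumptions made for illustration. The sketch also assumes an LED should stay lit while any other outstanding fault still names the same component, which the post does not state.

```python
class FruMonitor:
    """Toy model of lighting fault LEDs from fault and repair events."""

    def __init__(self, indicators):
        # indicators: FMRI -> callable taking True/False to set the fault LED.
        # Only components with a supported indicator appear in this mapping.
        self.indicators = indicators
        self.outstanding = {}  # fault UUID -> FMRIs whose LED we lit

    def on_fault(self, uuid, suspects):
        lit = [fmri for fmri in suspects if fmri in self.indicators]
        for fmri in lit:
            self.indicators[fmri](True)
        self.outstanding[uuid] = lit

    def on_repair(self, uuid):
        for fmri in self.outstanding.pop(uuid, []):
            # Assumption: keep the LED on if another open fault names this FRU.
            still_faulted = any(fmri in other for other in self.outstanding.values())
            if not still_faulted:
                self.indicators[fmri](False)


leds = {}
monitor = FruMonitor({"bay3": lambda on: leds.__setitem__("bay3", on)})
monitor.on_fault("fault-1", ["bay3", "cpu0"])  # cpu0 has no indicator here
monitor.on_repair("fault-1")
```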

Finally, I needed to get disk replacement to work like everyone expects it to: remove a faulted disk, put in a new one, and walk away. The genesis of this functionality was putback to ON long ago as the autoreplace pool property. In Solaris, this functionality only works with disks that have static device paths (namely SATA). In the world of multipathed SAS devices, the device path is really a scsi_vhci node identified by the device WWN. If we remove a disk and insert a new one, it will appear as a new device with no way to correlate it to the previous instance, preventing us from replacing the correct vdev. What we need is physical slot information, which happens to be provided by the FMRI we are already storing with the vdev for FMA purposes. When we receive a sysevent for a new device addition, we look at the latest libtopo snapshot and take the FMRI of the newly inserted device. By looking at the current vdev FRU information, we can then associate this with the vdev that was previously in the slot, and automatically trigger the replacement.
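The slot correlation can be sketched like this. It is a simplified model, not the actual ZFS or FMA code: FMRIs are reduced to bare `name=instance` paths, vdevs are plain dictionaries, and `bay_of` and `find_vdev_to_replace` are hypothetical names.

```python
def bay_of(fmri):
    """Return the path of the bay holding a disk, dropping the disk itself.

    Two different disks inserted into the same bay then compare equal.
    """
    path = fmri.partition("///")[2]
    parts = path.split("/")
    if parts and parts[-1].startswith("disk="):
        parts = parts[:-1]
    return "/".join(parts)


def find_vdev_to_replace(new_device_fmri, vdevs):
    """Find the vdev whose stored FRU sits in the same bay as the new device."""
    slot = bay_of(new_device_fmri)
    for vdev in vdevs:
        if bay_of(vdev["fru"]) == slot:
            return vdev["guid"]
    return None  # nothing in the pool previously lived in this slot


vdevs = [
    {"guid": 111, "fru": "hc:///chassis=0/bay=2/disk=0"},
    {"guid": 222, "fru": "hc:///chassis=0/bay=3/disk=0"},
]
target = find_vdev_to_replace("hc:///chassis=0/bay=3/disk=0", vdevs)
```

The key design point from the post is that the match is made on physical location, so it does not matter that the new disk shows up under a different WWN-based device path.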

This process took a lot longer than I would have hoped, and has many more subtleties too boring even for a technical blog entry, but it is nice to sit back and see a user experience that is intuitive, informative, and straightforward – the hallmarks of an integrated appliance solution.

Since our initial product was going to be a NAS appliance, we knew early on that
storage configuration would be a critical part of the initial Fishworks experience. Thanks to the power
of ZFS storage pools, we have the ability to present a radically simplified interface,
where the storage “just works” and the administrator doesn’t need to worry about
choosing RAID stripe widths or statically provisioning volumes. The first decision
was to create a single storage pool (or really one per head in a cluster)[1],
which means that the administrator only needs to make this decision once, and
doesn’t have to worry about it every time they create a filesystem or LUN.

Within a storage pool, we didn’t want the user to be in charge of making
decisions about RAID stripe widths, hot spares, or allocation of devices. This
was primarily to avoid this complexity, but also represents the fact that we
(as designers of the system) know more about its characteristics than you.
RAID stripe width affects performance in ways that are not immediately
obvious. Allowing for JBOD failure requires careful selection of stripe widths.
Allocation of devices can take into account environmental factors (balancing
HBAs, fan groups, backplane distribution) that are unknown to the user.
To make this easy for the user, we pick several different profiles that
define parameters that are then applied to the current configuration to figure
out how the ZFS pool should be laid out.

Before selecting a profile, we ask the user to verify the storage that
they want to configure. On a standalone system, this is just a check
to make sure nothing is broken. If there is a broken or missing disk, we
don’t let you proceed without explicit confirmation. The reason we do
this is that once the storage pool is configured, there is no way to add those
disks to the pool without changing the RAS and performance characteristics
you specified during configuration. On a 7410 with multiple JBODs, this verification step is slightly
more complicated, as we allow adding of whole or half JBODs. This step is
where you can choose to allocate half or all of
the JBOD to a pool, allowing you to split storage in a cluster or reserve
unused storage for future clustering options.

Fundamentally, the choice of redundancy is a business decision. There is
a set of tradeoffs that express your tolerance of risk and relative cost. As
Jarod told us very early on in the project: “fast, cheap, or reliable – pick two.”
We took this to heart, and our profiles are displayed in a table with
qualitative ratings on performance, capacity, and availability. To further
help make a decision, we provide a human-readable description of the
layout, as well as a pie chart showing the way raw storage will be used
(data, parity, spares, or reserved). The last profile parameter is called
“NSPF,” for “no single point of failure.” If you are on a 7410 with multiple
JBODs, some profiles can be applied across JBODs such that the loss
of any one JBOD cannot cause data loss[2]. This often forces arbitrary stripe
widths (with 6 JBODs your only choice is 10+2) and can result in
less capacity, but with superior RAS characteristics.
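One way to see where a figure like 10+2 comes from: if a stripe must survive the loss of a whole JBOD, each JBOD can contribute no more disks to that stripe than the stripe has parity devices. The sketch below computes the widest stripe under that single assumption. The function name is hypothetical, and the real profile selection logic (which considers more than this bound) is not described in the post.

```python
def widest_nspf_stripe(jbods, parity):
    """Widest stripe that tolerates losing one JBOD, as (data, parity).

    Assumes each JBOD contributes at most `parity` disks to any one stripe,
    so that losing a JBOD never removes more disks than parity can rebuild.
    """
    total_width = jbods * parity
    return total_width - parity, parity


data_disks, parity_disks = widest_nspf_stripe(jbods=6, parity=2)
print(f"{data_disks}+{parity_disks}")  # prints "10+2"
```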

This configuration takes just two quick steps, and for the common case
(where all the hardware is working and the user wants double parity RAID),
it just requires clicking on the “DONE” button twice. We also support adding
additional storage (on the 7410), as well as unconfiguring and importing
storage. I’ll leave a complete description of the storage configuration
screen for a future entry.


[1] A common question we get is “why allow only one storage pool?” The actual
implementation clearly allows it (as in the failed over active-active cluster), so it’s
purely an issue of complexity. There is never a reason to create multiple
pools that share the same redundancy profile – this provides no additional value
at the cost of significant complexity. We do acknowledge that mirroring
and RAID-Z provide different performance characteristics, but we hope that with the
ability to turn on and off readzilla and (eventually) logzilla usage on a per-share basis,
this will be less of an issue. In the future, you may see support for multiple pools, but
only in a limited fashion (i.e. enforcing different redundancy profiles).

[2] It’s worth noting that all supported configurations of the 7410 have
multiple paths to all JBODs across multiple HBAs. So even without NSPF, we
have the ability to survive HBA, cable, and JBOD controller failure.

With any product, there is always some talk from the enthusiasts about how they
could do it faster, cheaper, or simpler. Inevitably, there’s a little bit of truth to
both sides. Enthusiasts have been doing homebrew NAS for as long as free software
has been around, but it takes far more
work to put together a complete, polished solution that stands up under the stress
of an enterprise environment.

One of the amusing things I like to do is to look
back at the total amount of source code we wrote. Lines of source code by itself
is obviously not a measure of complexity – it’s possible to write complex software
with very few lines of source, or simple software that’s over engineered – but it’s
an interesting measure nonetheless. Below is the current output of a little script
I wrote to count lines of code[1] in our fish-gate. This does not
include the approximately 40,000 lines of change made to
the ON (core Solaris) gate, most of which we’ll be putting back gradually over
the next few months.

C (libak)                 185386        # The core of the appliance kit
C (lib)                    12550        # Other libraries
C (fcc)                    11167        # A compiler adapted from dtrace
C (cmd)                    12856        # Miscellaneous utilities
C (uts)                     4320        # clustron driver
-----------------------   ------
Total C                   226279
JavaScript (web)           69329        # Web UI
JavaScript (shell)         24227        # CLI shell
JavaScript (common)         9354        # Shared javascript
JavaScript (crazyolait)     2714        # Web transport layer (adapted from jsolait)
JavaScript (tst)           40991        # Automated test code
-----------------------   ------
Total JavaScript          146615
Shell (lib)                 4179        # Support scripts (primarily SMF methods)
Shell (cmd)                 5295        # Utilities
Shell (tools)               6112        # Build tools
Shell (tst)                 6428        # Automated test code
-----------------------   ------
Total Shell                22014
Python (tst)               34106        # Automated test code
XML (metadata)             16975        # Internal metadata
CSS                         6124        # Stylesheets

[1] This is a raw line count. It includes blank lines and comments, so
interpret it as you see fit.
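For readers who want to produce a similar raw count on their own tree, here is a minimal sketch. It is not the script used above: the function name, the extension-to-language mapping, and the choice to group by extension rather than by directory are all assumptions. Like the numbers above, it counts blank lines and comments.

```python
import os
from collections import Counter


def count_lines(root, languages):
    """Count raw lines per language for files under root.

    languages maps a file extension (for example ".c") to a display name.
    """
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            extension = os.path.splitext(filename)[1]
            if extension not in languages:
                continue
            # Read as bytes so unusual encodings do not break the count.
            with open(os.path.join(dirpath, filename), "rb") as source:
                totals[languages[extension]] += sum(1 for _line in source)
    return totals


if __name__ == "__main__":
    counts = count_lines(".", {".c": "C", ".js": "JavaScript", ".sh": "Shell"})
    for language, lines in counts.most_common():
        print(f"{language:<24}{lines:>8}")
```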

It’s hard to believe that this day has finally come. After more
than two and a half years, our first Fishworks-based product has been
released. You can keep up to date with the latest info at the
Fishworks blog.

For my first technical post, I thought I’d give an introduction to
the chassis subsystem at the heart of our hardware integration strategy. This
subsystem is responsible for gathering, cataloging, and presenting a unified
view of the hardware topology. It
underwent two major rewrites (one by myself and one by Keith) but the
fundamental design has remained the same. While it
may not be the most glamorous feature (no one’s going to purchase a box
because they can get model information on their DIMMs), I found it an
interesting cross-section of disparate technologies and awash in subtle
complexity. You can find a video of me talking about and
demonstrating this feature
here.

libtopo discovery

At the heart of the chassis subsystem is the FMA topology as
exported by
libtopo.
This library is already
capable of enumerating hardware in a physically meaningful manner, and
FMRIs (fault managed resource identifiers) form the basis of
FMA fault diagnosis. This alone provides us the following basic
capabilities:

  • Discover external storage enclosures
  • Identify bays and disks
  • Identify CPUs
  • Identify power supplies and fans
  • Manage LEDs
  • Identify PCI functions beneath a particular slot

Much of this requires platform-specific XML files, or leverages IPMI
behind the scenes, but this minimal integration work is common to Solaris. Any
platform supported by Solaris is supported by the Fishworks software
stack.

Additional metadata

Unfortunately, this falls short of a complete picture:

  • No way to identify absent CPUs, DIMMs, or empty PCI slots
  • DIMM enumeration not supported on all platforms
  • Human-readable labels often wrong or missing
  • No way to identify complete PCI cards
  • No integration with visual images of the chassis

To address these limitations (most of which lie outside the purview of
libtopo), we leverage additional metadata for each supported
chassis. This metadata identifies all physical slots (even those that
may not be occupied), cleans up various labels, and includes visual
information about the chassis and its components. And
we can identify physical cards based on devinfo properties extracted
from firmware and/or the pattern of PCI functions and their attributes
(a process worthy of its own blog entry). Combined with libtopo, we have
images that we can assemble into a
complete view based on the current physical layout, highlight
components within the image, and respond to user mouse clicks.

Supplemental information

However, we are still missing many of
the component details. Our goal is to be able to provide complete
information for every FRU on the system. With just libtopo, we can get
this for disks but not much else. We need to look to alternate
sources of information.


kstat

For CPUs, there is a rather rich set of information available via
traditional kstat interfaces. While we use libtopo to identify CPUs
(it lets us correlate physical CPUs), the
bulk of the information comes from kstats. This is used to get model,
speed, and the number of cores.

libdevinfo

The device tree snapshot provides additional information for PCI
devices that can only be retrieved by private driver interfaces.
Despite the existence of a VPD (Vital Product Data)
standard, effectively no vendors implement it. Instead, it is read by some firmware-specific
mechanism private to the driver. By exporting these as properties in
the devinfo snapshot, we can transparently pull in dynamic FRU
information for PCI cards. This is used to get model, part, and
revision information for HBAs and 10G NICs.

IPMI

IPMI (Intelligent Platform Management Interface) is used to
communicate with the service processor on most enterprise class
systems. It is used within libtopo for power supply and fan
enumeration as well as LED management. But IPMI
also supports FRU data, which includes a lot of juicy tidbits
that only the SP knows. We reference this FRU information directly to
get model and part information for power supplies and DIMMs.

SMBIOS

Even with IPMI, there are bits of information that exist only in SMBIOS,
a standard that is supposed to provide information about the physical
resources on the system. Sadly, it does not provide enough information
to correlate OS-visible abstractions with their underlying physical
counterparts. With metadata, however, we can use SMBIOS to make this
correlation. This is used to enumerate DIMMs on platforms not
supported by libtopo, and to supplement DIMM information with data
available only via SMBIOS.

Metadata

Last but not least, there is chassis-specific metadata. Some
components simply don’t have FRUID information, either because they are
too simple (fans) or there exists no mechanism to get the information
(most PCI cards). In this situation, we use metadata to provide
vendor, model, and part information as that is generally static for a
particular component within the system. We cannot get information
specific to the component (such as a serial number), but at least the
user will be able to know what it is and know how to order another
one.

Putting it all together

With all of this information tied together under one subsystem, we
can finally present the user complete information about their hardware,
including images showing the physical layout of the system. In addition,
this also forms the basis for reporting problems and analytics (using
labels from metadata), manipulating chassis state (toggling LEDs, setting
chassis identifiers), and making programmatic distinctions about the
hardware (such as whether external HBAs are present). Over the
next few weeks I hope to expound on some of these details in further
blog posts.
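To make the idea of tying these sources together concrete, here is a small sketch of merging per-source details for a single FRU. The function name, the dictionary layout, and the precedence order are assumptions for illustration. The only ordering the text above implies is that static chassis metadata acts as a fallback when nothing more specific is available.

```python
# Assumed precedence: dynamic sources first, static metadata last.
SOURCE_ORDER = ["libtopo", "kstat", "devinfo", "ipmi", "smbios", "metadata"]


def merge_fru_details(per_source):
    """Combine details for one FRU, keeping the first value seen per field."""
    merged = {}
    for source in SOURCE_ORDER:
        for field, value in per_source.get(source, {}).items():
            merged.setdefault(field, value)
    return merged


power_supply = merge_fru_details({
    "ipmi": {"model": "PSU-A", "part": "123-4567"},
    "metadata": {"vendor": "ExampleCo", "model": "generic PSU"},
})
# The IPMI model wins; the vendor comes from metadata because nothing
# else supplied it.
```

All values in the example (model, part, vendor) are made up.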
