Eric Schrock's Blog


When ZFS was first developed, the engineering team had the notion that pooled storage would make filesystems cheap and plentiful, and we’d move away from the days of /export1, /export2, ad infinitum. From the ZFS perspective, they are cheap. It’s very easy to create dozens or hundreds of filesystems, each of which functions as an administrative control point for various properties. However, we quickly found that other parts of the system start to break down once you get past 1,000 or 10,000 filesystems. Mounting and sharing filesystems takes longer, browsing datasets takes longer, and managing automount maps (for those without NFSv4 mirror mounts) quickly spirals out of control.

For most users this isn’t a problem – a few hundred filesystems is more than enough to manage disparate projects and groups on a single machine. There was one class of users, however, where a few hundred filesystems wasn’t enough. These users were university or other home directory environments with 20,000 or more users, each of whom needed a quota to guarantee that they couldn’t run amok on the system. The traditional ZFS solution, creating a filesystem for each user and assigning a quota, didn’t scale. After thinking about it for a while, Matt developed a fairly simple architecture to provide this functionality without introducing pathological complexity into the bowels of ZFS. In build 114 of Solaris Nevada, he pushed the following:

PSARC 2009/204 ZFS user/group quotas & space accounting

This provides full support for user and group quotas on ZFS, as well as the ability to track usage on a per-user or per-group basis within a dataset.

This was later integrated into the 2009.Q3 software release, with an additional UI layer. From the ‘general’ tab of a share, you can query usage and set quotas for individual users or groups quickly. The CLI allows for automated batch operations. Requesting a single user or group is significantly faster than requesting all the current usage, but you can also get a list of the current usage for a project or share. With integrated identity management, users and groups can be specified either by UNIX username or Windows name.

There are some significant differences between user and group quotas and traditional ZFS quotas. The following is an excerpt from the on-line documentation on the subject:

  • User and group quotas can only be applied to filesystems.
  • User and group quotas are implemented using delayed enforcement. This means that users will be able to exceed their quota for a short period of time before data is written to disk. Once the data has been pushed to disk, the user will receive an error on new writes, just as with the filesystem-level quota case.
  • User and group quotas are always enforced against referenced data. This means that snapshots do not affect any quotas, and a clone of a snapshot will consume the same amount of effective quota, even though the underlying blocks are shared.
  • User and group reservations are not supported.
  • User and group quotas, unlike data quotas, are stored with the regular filesystem data. This means that if the filesystem is out of space, you will not be able to make changes to user and group quotas. You must first make additional space available before modifying user and group quotas.
  • User and group quotas are sent as part of any remote replication. It is up to the administrator to ensure that the name service environments are identical on the source and destination.
  • NDMP backup and restore of an entire share will include any user or group quotas. Restores into an existing share will not affect any current quotas. (There is currently a bug preventing this from working in the initial release, which will be fixed in a subsequent minor release.)
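
The delayed enforcement described in the list above can be illustrated with a minimal sketch in Python. This is a toy model, not the ZFS implementation: the DelayedQuota class and its write() and sync() methods are hypothetical names, and it assumes that usage is only accounted when pending data is pushed to disk.

```python
class DelayedQuota:
    """Toy model of delayed (per-sync) user quota enforcement."""

    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.on_disk = 0   # usage accounted as of the last sync
        self.pending = 0   # bytes written but not yet pushed to disk

    def write(self, nbytes):
        # Enforcement only looks at usage already pushed to disk,
        # so a user can overshoot the quota until the next sync.
        if self.on_disk >= self.quota:
            raise OSError("quota exceeded")
        self.pending += nbytes

    def sync(self):
        # Pushing data to disk is what updates the accounted usage.
        self.on_disk += self.pending
        self.pending = 0


q = DelayedQuota(quota_bytes=100)
q.write(80)
q.write(80)   # accepted: nothing has been synced yet
q.sync()      # accounted usage now exceeds the quota
try:
    q.write(1)
    rejected = False
except OSError:
    rejected = True   # new writes fail once the overshoot is on disk
```

The point of the sketch is the ordering: the check in write() never consults pending data, which is why the overshoot is possible for a short period.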

This feature will hopefully allow the Sun Storage 7000 series to function in environments where it was previously impractical to do so. Of course, the real person to thank is Matt and the ZFS team – it was a very small amount of work to provide an interface on top of the underlying ZFS infrastructure.

In the Sun Storage 7000 2009.Q3 software release, one of the major new features I worked on was the addition of what we termed “shadow migration.” When we launched the product, there was no integrated way to migrate data from existing systems to the new systems. This resulted in customers rolling it by hand (rsync, tar, etc), or paying for professional services to do the work for them. We felt that we could present a superior model that would provide for a more integrated experience as well as let the customer leverage the investment in the system even before the migration was complete.

The idea in and of itself is not new, and various prototypes of this have been kicking around inside of Sun under various monikers (“brain slug”, “hoover”, etc) without ever becoming a complete product. When Adam and I sat down shortly before the initial launch of the product, we decided we could do this without too much work by integrating the functionality directly in the kernel. The basic design requirements we had were:

  • We must be able to migrate over standard data protocols (NFS) from arbitrary data sources without the need to have special software running on the source system.
  • Migrated data must be available before the entire migration is complete, and must be accessible with native performance.
  • All the data to migrate the filesystem must be stored within the filesystem itself, and must not rely on an external database to ensure consistency.

With these requirements in hand, our key insight was that we could create a “shadow” filesystem that could pull data from the original source if necessary, but then fall through to the native filesystem for reads and writes once the file has been migrated. What’s more, we could leverage the NFS client on Solaris and do this entirely at the VFS (virtual filesystem) layer, allowing us to migrate data between shares locally or (eventually) over other protocols as well without changing the interpositioning layer. The other nice thing about this architecture is that the kernel remains ignorant of the larger migration process. Both synchronous requests (from clients) and background requests (from the management daemon) appear the same. This allows us to control policy within the userland software stack, without pushing that complexity into the kernel. It also allows us to write a very comprehensive automated test suite that runs entirely on local filesystems without needing a complex multi-system environment.
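
The interposition idea above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: dictionaries stand in for the source and native filesystems, and the ShadowFS class and its method names are hypothetical, not the kernel interfaces. What it tries to show is that a file is pulled from the source on first touch, that later accesses fall through to local data, and that client and background requests share one code path.

```python
class ShadowFS:
    """Toy model: dicts stand in for the source and native filesystems."""

    def __init__(self, source):
        self.source = source             # original data (path -> bytes)
        self.local = {}                  # native filesystem contents
        self.unmigrated = set(source)    # migration state kept with the filesystem

    def _migrate(self, path):
        # Pull the file from the source only on first touch.
        if path in self.unmigrated:
            self.local[path] = self.source[path]
            self.unmigrated.discard(path)

    def read(self, path):
        self._migrate(path)
        return self.local[path]          # falls through to local data

    def write(self, path, data):
        self._migrate(path)
        self.local[path] = data

    def migrate_one(self):
        # Background request: uses the same path as a client request.
        if self.unmigrated:
            self._migrate(next(iter(self.unmigrated)))
        return len(self.unmigrated)      # files still to migrate


fs = ShadowFS({"a": b"alpha", "b": b"beta"})
first = fs.read("a")        # migrated on demand, then served locally
fs.write("b", b"new")       # migrated first, then overwritten locally
```

Because the model keeps all of its state in the filesystem object itself, a test can run entirely against local data, which mirrors the testing benefit described above.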

So what’s better (and worse) about shadow migration compared to other migration strategies? For that, I’ll defer to the documentation, which I’ve reproduced here for those who don’t have a (virtual) system available to run the 2009.Q3 release:


Migration via synchronization

This method works by taking an active host X and migrating data to the new host Y while X remains active. Clients still read and write to the original host while this migration is underway. Once the data is initially migrated, incremental changes are repeatedly sent until the delta is small enough to be sent within a single downtime window. At this point the original share is made read-only, the final delta is sent to the new host, and all clients are updated to point to the new location. The most common way of accomplishing this is through the rsync tool, though other integrated tools exist. This mechanism has several drawbacks:

  • The anticipated downtime, while small, is not easily quantified. If a user commits a large amount of change immediately before the scheduled downtime, this can increase the downtime window.
  • During migration, the new server is idle. Since new servers typically come with new features or performance improvements, this represents a waste of resources during a potentially long migration period.
  • Coordinating across multiple filesystems is burdensome. When migrating dozens or hundreds of filesystems, each migration will take a different amount of time, and downtime will have to be scheduled across the union of all filesystems.

Migration via external interposition

This method works by taking an active host X and inserting a new appliance M that migrates data to a new host Y. All clients are updated at once to point to M, and data is automatically migrated in the background. This provides more flexibility in migration options (for example, being able to migrate to a new server in the future without downtime), and leverages the new server for already migrated data, but also has significant drawbacks:

  • The migration appliance represents a new physical machine, with associated costs (initial investment, support costs, power and cooling) and additional management overhead.
  • The migration appliance represents a new point of failure within the system.
  • The migration appliance interposes on already migrated data, incurring extra latency, often permanently. These appliances are typically left in place, though it would be possible to schedule another downtime window and decommission the migration appliance.

Shadow migration

Shadow migration uses interposition, but is integrated into the appliance and doesn’t require a separate physical machine. When shares are created, they can optionally “shadow” an existing directory, either locally (see below) or over NFS. In this scenario, downtime is scheduled once where the source appliance X is placed into read-only mode, a share is created with the shadow property set, and clients are updated to point to the new share on the Sun Storage 7000 appliance. Clients can then access the appliance in read-write mode.

Once the shadow property is set, data is transparently migrated in the background from the source appliance locally. If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate this file to the local server before responding to the request. This may incur some initial latency for some client requests, but once a file has been migrated all accesses are local to the appliance and have native performance. It is often the case that the current working set for a filesystem is much smaller than the total size, so once this working set has been migrated, regardless of the total native size on the source, there will be no perceived impact on performance.

The downside to shadow migration is that it requires a commitment before the data has finished migrating, though this is the case with any interposition method. During the migration, portions of the data exist in two locations, which means that backups are more complicated, and snapshots may be incomplete and/or exist only on one host. Because of this, it is extremely important that any migration between two hosts first be tested thoroughly to make sure that identity management and access controls are set up correctly. This need not test the entire data migration, but it should be verified that files or directories that are not world readable are migrated correctly, ACLs (if any) are preserved, and identities are properly represented on the new system.

Shadow migration is implemented using on-disk data within the filesystem, so there is no external database and no data stored locally outside the storage pool. If a pool is failed over in a cluster, or both system disks fail and a new head node is required, all data necessary to continue shadow migration without interruption will be kept with the storage pool.


In a subsequent post, I’ll discuss some of the thorny implementation details we had to solve, as well as provide some screenshots of migration in progress. In the meantime, I suggest folks download the simulator and upgrade to the latest software to give it a try.

Last week, we announced the Sun Storage 7310 system. At the same time, a less significant but still notable change was made to the Sun Storage product line. The Sun Storage 7210 system is now expandable via J4500 JBODs. These JBODs have the same form factor as the 7210 – 48 drives in a top-loading 4U form factor. Up to two J4500s can be added to a 7210 system via a single HBA, resulting in up to 142 TB of storage in 12 RU of space. JBODs with 500 GB or 1 TB drives are supported.

For customers looking for maximum density without high availability, the combination of 7210 with J4500 provides a perfect solution.

In the past, I’ve discussed the evolution of disk FMA. Much has been accomplished in the past year, but there are still several gaps when it comes to ZFS and disk faults. In Solaris today, a fault diagnosed by ZFS (a device failing to open, too many I/O errors, etc) is reported as a pool name and 64-bit vdev GUID. This description leaves something to be desired, referring the user to run zpool status to determine exactly what went wrong. But the user still has to know how to go from a cXtYdZ Solaris device name to a physical device, and when they do locate the physical device they need to manually issue a zpool replace command to initiate the replacement.

While this is annoying in the Solaris world, it’s completely untenable in an appliance environment, where everything needs to “just work”. With that in mind, I set about to plug the last few holes in the unified plan:

  • ZFS faults must be associated with a physical disk, including the human-readable label
  • A disk fault (ZFS or SMART failure) must turn on the associated fault LED
  • Removing a disk (faulted or otherwise) and replacing it with a new disk must automatically trigger a replacement

While these seem like straightforward tasks, as usual they are quite difficult to get right in a truly generic fashion. And for an appliance, there can be no Solaris commands or additional steps for the user. To start with, I needed to push the FRU information (expressed as an FMRI in the hc libtopo scheme) into the kernel (and onto the ZFS disk label) where it would be available with each vdev. While it is possible to do this correlation after the fact, it simplifies the diagnosis engine and is required for automatic device replacement. There are some edge conditions around moving and swapping disks, but this was relatively straightforward. Once the FMRI was there, I could include the FRU in the fault suspect list, and using Mike’s enhancements to libfmd_msg, dynamically insert the FRU label into the fault message. Traditional FMA libtopo labels do not include the chassis label, so in the Fishworks stack we go one step further and re-write the label on receipt of a fault event with the user-defined chassis name as well as the physical slot. This message is then used when posting alerts and on the problems page. We can also link to the physical device from the problems page, and highlight the faulty disk in the hardware view.

With the FMA plumbing now straightened out, I needed a way to light the fault LED for a disk, regardless of whether it was in the system chassis or an external enclosure. Thanks to Rob’s sensor work, libtopo already presents an FMRI-centric view of indicators in a platform agnostic manner. So I rewrote the disk-monitor module (or really, deleted everything and created a new fru-monitor module) so that it both polls for FRU hotplug events and manages the fault LEDs for components. When a fault is generated, the FRU monitor looks through the suspect list, and turns on the fault LED for any component that has a supported indicator. This is then turned off when the corresponding repair event is generated. This also had the side benefit of generating hotplug events phrased in terms of physical devices, which the appliance kit can use to easily present informative messages to the user.
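
The LED handling described above can be modelled roughly as follows. This is a sketch in Python under stated assumptions, not the fru-monitor module: the FruMonitor class is hypothetical, the FRU strings are short stand-ins rather than real hc-scheme FMRIs, and tracking a set of open faults per FRU (so that one repair does not clear an LED another fault still needs) is a design choice of the sketch rather than something the text specifies.

```python
class FruMonitor:
    """Toy model: tracks which FRUs should have their fault LED lit."""

    def __init__(self, frus_with_indicator):
        self.supported = set(frus_with_indicator)
        self.open_faults = {}   # fru -> set of unrepaired fault ids

    def fault(self, fault_id, suspects):
        # Walk the suspect list; light only FRUs with a supported indicator.
        for fru in suspects:
            if fru in self.supported:
                self.open_faults.setdefault(fru, set()).add(fault_id)

    def repair(self, fault_id):
        # Turn the LED off once no open fault refers to the FRU.
        for fru in list(self.open_faults):
            self.open_faults[fru].discard(fault_id)
            if not self.open_faults[fru]:
                del self.open_faults[fru]

    def led_on(self, fru):
        return fru in self.open_faults


monitor = FruMonitor({"chassis=0/bay=3"})
monitor.fault("fault-1", ["chassis=0/bay=3", "chassis=0/hba=0"])
lit_after_fault = monitor.led_on("chassis=0/bay=3")
monitor.repair("fault-1")
lit_after_repair = monitor.led_on("chassis=0/bay=3")
```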

Finally, I needed to get disk replacement to work like everyone expects it to: remove a faulted disk, put in a new one, and walk away. The genesis of this functionality was putback to ON long ago as the autoreplace pool property. In Solaris, this functionality only works with disks that have static device paths (namely SATA). In the world of multipathed SAS devices, the device path is really a scsi_vhci node identified by the device WWN. If we remove a disk and insert a new one, it will appear as a new device with no way to correlate it to the previous instance, preventing us from replacing the correct vdev. What we need is physical slot information, which happens to be provided by the FMRI we are already storing with the vdev for FMA purposes. When we receive a sysevent for a new device addition, we look at the latest libtopo snapshot and take the FMRI of the newly inserted device. By looking at the current vdev FRU information, we can then associate this with the vdev that was previously in the slot, and automatically trigger the replacement.
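
The slot correlation above boils down to a lookup keyed on physical location rather than device path. A minimal Python sketch follows, with every name an assumption: on_device_added, the vdev_slots and topo_snapshot dictionaries, and the slot and path strings are all hypothetical stand-ins for the FRU data stored with each vdev and the libtopo snapshot.

```python
def on_device_added(vdev_slots, topo_snapshot, devpath):
    """Decide which vdev, if any, a newly inserted device should replace.

    vdev_slots:    vdev GUID -> slot recorded with the vdev
    topo_snapshot: device path -> slot the device currently occupies
    """
    slot = topo_snapshot.get(devpath)
    if slot is None:
        return None                      # device not found in the snapshot
    for guid, recorded_slot in vdev_slots.items():
        if recorded_slot == slot:
            # Same physical slot as an existing vdev: replace that vdev.
            return ("replace", guid, devpath)
    return None                          # slot is not part of the pool


vdev_slots = {1001: "chassis=0/bay=3", 1002: "chassis=0/bay=4"}
topo_snapshot = {"/scsi_vhci/disk@new": "chassis=0/bay=4"}
action = on_device_added(vdev_slots, topo_snapshot, "/scsi_vhci/disk@new")
```

Note that the device path of the new disk never has to match the old one; only the slot does, which is what makes this work for multipathed devices.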

This process took a lot longer than I would have hoped, and has many more subtleties too boring even for a technical blog entry, but it is nice to sit back and see a user experience that is intuitive, informative, and straightforward – the hallmarks of an integrated appliance solution.

Since our initial product was going to be a NAS appliance, we knew early on that
storage configuration would be a critical part of the initial Fishworks experience. Thanks to the power
of ZFS storage pools, we have the ability to present a radically simplified interface,
where the storage “just works” and the administrator doesn’t need to worry about
choosing RAID stripe widths or statically provisioning volumes. The first decision
was to create a single storage pool (or really one per head in a cluster)[1],
which means that the administrator only needs to make this decision once, and
doesn’t have to worry about it every time they create a filesystem or LUN.

Within a storage pool, we didn’t want the user to be in charge of making
decisions about RAID stripe widths, hot spares, or allocation of devices. This
was primarily to avoid this complexity, but also represents the fact that we
(as designers of the system) know more about its characteristics than you.
RAID stripe width affects performance in ways that are not immediately
obvious. Allowing for JBOD failure requires careful selection of stripe widths.
Allocation of devices can take into account environmental factors (balancing
HBAs, fan groups, backplane distribution) that are unknown to the user.
To make this easy for the user, we pick several different profiles that
define parameters that are then applied to the current configuration to figure
out how the ZFS pool should be laid out.

Before selecting a profile, we ask the user to verify the storage that
they want to configure. On a standalone system, this is just a check
to make sure nothing is broken. If there is a broken or missing disk, we
don’t let you proceed without explicit confirmation. The reason we do
this is that once the storage pool is configured, there is no way to add those
disks to the pool without changing the RAS and performance characteristics
you specified during configuration. On a 7410 with multiple JBODs, this verification step is slightly
more complicated, as we allow adding of whole or half JBODs. This step is
where you can choose to allocate half or all of
the JBOD to a pool, allowing you to split storage in a cluster or reserve
unused storage for future clustering options.

Fundamentally, the choice of redundancy is a business decision. There is
a set of tradeoffs that express your tolerance of risk and relative cost. As
Jarod told us very early on in the project: “fast, cheap, or reliable – pick two.”
We took this to heart, and our profiles are displayed in a table with
qualitative ratings on performance, capacity, and availability. To further
help make a decision, we provide a human-readable description of the
layout, as well as a pie chart showing the way raw storage will be used
(data, parity, spares, or reserved). The last profile parameter is called
“NSPF,” for “no single point of failure.” If you are on a 7410 with multiple
JBODs, some profiles can be applied across JBODs such that the loss
of any one JBOD cannot cause data loss[2]. This often forces arbitrary stripe
widths (with 6 JBODs your only choice is 10+2) and can result in
less capacity, but with superior RAS characteristics.
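
To see why NSPF constrains stripe width, consider a minimal check, sketched here in Python. It assumes a stripe can be described simply as a list of JBOD identifiers, one per disk, and that the stripe survives the loss of a JBOD as long as no single JBOD holds more disks of that stripe than there are parity disks. The function name and data layout are hypothetical.

```python
from collections import Counter


def survives_jbod_loss(stripe, parity):
    """stripe: list of JBOD ids, one entry per disk in the stripe."""
    # Losing a JBOD removes every disk of the stripe housed in it, so the
    # worst case is the JBOD holding the most disks of this stripe.
    return max(Counter(stripe).values()) <= parity


# A 10+2 double-parity stripe spread over 6 JBODs: two disks per JBOD.
nspf_stripe = [jbod for jbod in range(6) for _ in range(2)]

# The same 12-disk stripe packed into 3 JBODs: four disks per JBOD.
packed_stripe = [jbod for jbod in range(3) for _ in range(4)]

nspf_ok = survives_jbod_loss(nspf_stripe, parity=2)
packed_ok = survives_jbod_loss(packed_stripe, parity=2)
```

With six JBODs and double parity, a 12-wide stripe places exactly two disks in each JBOD, so losing any one JBOD costs no more than the parity can absorb.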

This configuration takes just two quick steps, and for the common case
(where all the hardware is working and the user wants double parity RAID),
it just requires clicking on the “DONE” button twice. We also support adding
additional storage (on the 7410), as well as unconfiguring and importing
storage. I’ll leave a complete description of the storage configuration
screen for a future entry.


[1] A common question we get is “why allow only one storage pool?” The actual
implementation clearly allows it (as in the failed over active-active cluster), so it’s
purely an issue of complexity. There is never a reason to create multiple
pools that share the same redundancy profile – this provides no additional value
at the cost of significant complexity. We do acknowledge that mirroring
and RAID-Z provide different performance characteristics, but we hope that with the
ability to turn on and off readzilla and (eventually) logzilla usage on a per-share basis,
this will be less of an issue. In the future, you may see support for multiple pools, but
only in a limited fashion (i.e. enforcing different redundancy profiles).

[2] It’s worth noting that all supported configurations of the 7410 have
multiple paths to all JBODs across multiple HBAs. So even without NSPF, we
have the ability to survive HBA, cable, and JBOD controller failure.

With any product, there is always some talk from the enthusiasts about how they
could do it faster, cheaper, or simpler. Inevitably, there’s a little bit of truth to
both sides. Enthusiasts have been doing homebrew NAS for as long as free software
has been around, but it takes far more
work to put together a complete, polished solution that stands up under the stress
of an enterprise environment.

One of the amusing things I like to do is to look
back at the total amount of source code we wrote. Lines of source code by itself
is obviously not a measure of complexity – it’s possible to write complex software
with very few lines of source, or simple software that’s over engineered – but it’s
an interesting measure nonetheless. Below is the current output of a little script
I wrote to count lines of code[1] in our fish-gate. This does not
include the approximately 40,000 lines of change made to
the ON (core Solaris) gate, most of which we’ll be putting back gradually over
the next few months.

C (libak)                 185386        # The core of the appliance kit
C (lib)                    12550        # Other libraries
C (fcc)                    11167        # A compiler adapted from dtrace
C (cmd)                    12856        # Miscellaneous utilities
C (uts)                     4320        # clustron driver
-----------------------   ------
Total C                   226279
JavaScript (web)           69329        # Web UI
JavaScript (shell)         24227        # CLI shell
JavaScript (common)         9354        # Shared javascript
JavaScript (crazyolait)     2714        # Web transport layer (adapted from jsolait)
JavaScript (tst)           40991        # Automated test code
-----------------------   ------
Total JavaScript          146615
Shell (lib)                 4179        # Support scripts (primarily SMF methods)
Shell (cmd)                 5295        # Utilities
Shell (tools)               6112        # Build tools
Shell (tst)                 6428        # Automated test code
-----------------------   ------
Total Shell                22014
Python (tst)               34106        # Automated test code
XML (metadata)             16975        # Internal metadata
CSS                         6124        # Stylesheets

[1] This is a raw line count. It includes blank lines and comments, so
interpret it as you see fit.
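
For readers who want to produce a similar raw count on their own source tree, here is a minimal sketch in Python. It is not the script used to generate the numbers above; the count_lines function and its arguments are assumptions made for illustration. Like the footnote says, it counts every line, blank lines and comments included.

```python
import os


def count_lines(root, extensions):
    """Raw line count for files under root whose names end in extensions.

    extensions is a tuple of suffixes such as (".c", ".h").
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                # Read as bytes so unusual encodings cannot break the count.
                with open(os.path.join(dirpath, name), "rb") as f:
                    total += sum(1 for _ in f)
    return total
```

A call such as count_lines("usr/src/lib", (".c", ".h")) would give one row of a table like the one above; the path here is only an example.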
