Data Replication: Building a better NDMP
In my previous post I outlined some of the challenges faced when building a data replication solution, how the first Delphix implementation missed the mark, and how we set out to build something better the second time around.
The first thing that became clear after starting on the new replication subsystem was that we needed a better NDMP implementation. A binary-only separate daemon with poor error semantics that routinely left the system in an inconsistent state was not going to cut it. NDMP is a protocol built for a singular purpose: backing up files using a file-specific format (dump or tar) over arbitrary topologies (direct attached tape, 3-way restore, etc). By being both simultaneously so specific in the data semantics but so general in the control protocol, we end up with the worst of both worlds: baked-in concepts (such as file history, complete with inode numbers) that prevent us from adequately expressing Delphix concepts, and a limited control protocol (lacking multiple streams or resumable streams) with terrible error semantics. While we will ultimately replace NDMP for replication, we knew that we still needed it for backup, and that we didn’t have the time to replace both the implementation and the data protocol for the current release.
Illumos, the open source operating system our distribution is based on, provides an NDMP implementation, one that I had previously dealt with while at Fishworks (though Dave Pacheco was the one who did the actual NDMP integration). I spent some time looking at the implementation and came to the conclusion that it suffered from a number of fatal flaws:
- Poor error semantics – The strategy was “log everything and worry about it later”. For an implementation shipped with a roll-your-own OS this was not a terrible strategy, but it was a deal breaker for an appliance implementation. We needed clear, concise failure modes that appeared integrated with our UI.
- Embedded data semantics – The notion of tar as a backup format (or raw zfs send) was built very deeply into the architecture. We needed our own data protocol, but replacing the data operations without major surgery was out of the question. While raw ZFS send seems appealing, it still assumes ownership and control of the filesystem namespace, something that wouldn’t fly in the Delphix world.
- Unused code – There was tons of dead code, ranging from protocol varieties that were unnecessary (NDMPv2) to swaths of device handling code that did nothing.
- Standalone daemon – A standalone daemon makes it difficult to exchange data across the process boundary, and introduces new complex failure modes.
With this in mind I looked at the ndmp.org SDK implementation, and found it to suffer from the same pathologies (and a much worse implementation to boot). It was clear that the Solaris implementation was derived from the SDK, and that there was no mythical “great NDMP implementation” waiting to be found. I was going to have to suck it up and get back to my Solaris roots to eviscerate this beast.
The first thing I did was recast the daemon as a library, eliminating any code that dealt with daemonizing, running a door server to report statistics, and existing Solaris commands that communicated with the server. This allowed me to add a client-provided callback vector and a set of configuration options to control state. With this library in place, we could use JNA to easily call into C code from our Java management stack without having to worry about marshaling data to and from an external daemon.
The next step was to rip out all the data-handling functionality, instead creating a set of callback vectors in the library registration mechanism to start and stop backup. This left the actual implementation of the over-the-wire format up to the consumer. The sheer amount of code used to support tar and zfs send was staggering, and it had its tendrils all across the implementation. As I started to pull on the thread, more and more started to unravel. Data-specific operations would call into the “tape library management” code (which had very little to do with tape library management) that would then call back into common NDMP code, that would then do nothing.
With the data operations gone, I then had to finally address the hard part: making the code usable. The old error semantics were terrible. I had to go through every log call and non-zero return value, analyze its purpose, and restructure it to use the consumer-provided vector so that we could log such messages natively in the Delphix stack. While doing generic code cleanup, this led me to rip out huge swaths of unused code, from buffer management to NDMPv2 support (v3 has been in common use for more than a decade). This was rather painful, but the result has been quite a usable product. While the old Delphix implementation would have reported “NDMP server error CRITICAL: consult server log for details” (of course, there was no way for the customer to get to the “server log”), we would now get much more helpful messages like “NDMP client reported an error during data operation: out of space”.
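To make the callback-vector idea concrete, here is a minimal sketch in Python rather than C. It is not the libndmp API: the names `NdmpCallbacks`, `NdmpSession`, and `handle_data_start_backup` are hypothetical, and the only point is that the protocol engine reports failures and starts data operations through functions supplied by the consumer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class NdmpCallbacks:
    """Hypothetical consumer-provided callback vector (not the real libndmp API)."""
    log: Callable[[str, str], None]        # (severity, message)
    start_backup: Callable[[dict], None]   # the consumer owns the wire format
    stop_backup: Callable[[], None]


class NdmpSession:
    """Toy protocol engine: it knows the control flow, not the data format."""

    def __init__(self, callbacks: NdmpCallbacks):
        self.cb = callbacks
        self.active = False

    def handle_data_start_backup(self, env: dict) -> bool:
        try:
            self.cb.start_backup(env)
        except OSError as err:
            # Report the failure through the consumer instead of a private log.
            self.cb.log("error", f"error during data operation: {err.strerror}")
            return False
        self.active = True
        return True

    def handle_data_stop(self) -> None:
        if self.active:
            self.cb.stop_backup()
            self.active = False


messages = []


def failing_backup(env):
    # Simulate the consumer running out of space while starting a backup.
    raise OSError(28, "out of space")


session = NdmpSession(NdmpCallbacks(
    log=lambda severity, message: messages.append((severity, message)),
    start_backup=failing_backup,
    stop_backup=lambda: None,
))
ok = session.handle_data_start_backup({"TYPE": "example"})
```

Because the message lands in a list owned by the caller, the consumer decides how to surface it, which is the property the old daemon lacked.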
The final piece of the puzzle was something that surprised me. By choosing NDMP as the replication protocol (again, a temporary choice), we needed a way to drive the 3-way restore operation from within the Delphix stack. This meant that we wanted to act as a DMA. As I looked at the unbelievably awful ‘ndmpcopy’ implementation shipped with the NDMP SDK, I noticed a lot of similarity between what we needed on the client and what we had on the server (processing requests was identical, even if the set of expected requests was quite different). Rather than build an entirely separate implementation, I converted libndmp such that it could act as a server or a client. This allowed us to build an NDMP copy operation in Java, as well as simulate a remote DMA (an invaluable testing tool).
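One way to picture a library that serves both roles is a single message loop with a different handler table per role. The sketch below is illustrative only: the `Connection` class and the factory functions are hypothetical, and the message and error names are only loosely modeled on NDMP.

```python
class Connection:
    """Toy message loop shared by both roles; only the handler table differs."""

    def __init__(self, handlers):
        self.handlers = handlers   # message name -> callable(body) -> reply

    def process(self, name, body):
        handler = self.handlers.get(name)
        if handler is None:
            return {"error": "NOT_SUPPORTED", "message": name}
        return {"error": "NO_ERR", "reply": handler(body)}


def make_server():
    # A data server expects requests that drive a backup or restore.
    return Connection({
        "DATA_START_BACKUP": lambda body: "started",
        "DATA_STOP": lambda body: "stopped",
    })


def make_client(events):
    # A DMA (client) mostly expects notifications sent by the servers.
    def on_halted(body):
        events.append(body)
        return "noted"

    return Connection({"NOTIFY_DATA_HALTED": on_halted})


events = []
server = make_server()
client = make_client(events)
server_reply = server.process("DATA_STOP", {})
client_reply = client.process("DATA_STOP", {})
notify_reply = client.process("NOTIFY_DATA_HALTED", {"reason": "SUCCESSFUL"})
```

The same `process` method runs in both cases, so a simulated DMA for testing needs nothing beyond another handler table.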
It took more than a month of solid hard work and several more months of cleanup here and there, but the result was worth it. The new implementation clocks in at just over 11,000 lines of code, while the original was a staggering 43,000 lines of code. Our implementation doesn’t include any actual data handling, so it’s perhaps an unfair comparison. But we also include the ability to act as a full-featured DMA client, something the illumos implementation lacks.
The results of this effort will be available on github as soon as we release the next Delphix version (within a few weeks). While interesting, it’s unlikely to be useful to the general masses, and certainly not something that we’ll try to push upstream. I encourage others looking for an open-source embedded NDMP implementation to fork and improve what we have in Delphix – it’s a very flexible NDMP implementation that can be adapted for a variety of non-traditional NDMP scenarios. But with no built-in data processing, and no standalone daemon implementation, it’s a long way from replacing what can be found in illumos. If someone were so inspired, they could build a daemon on top of the current library – one that provides support for tar, dump, ZFS, and whatever other formats are supported by the current illumos implementation. It would not be a small amount of work, but I am happy to lend advice (if not code) to anyone interested.
Next up will be a post whose working title is “Data Replication: Metadata + Data = Crazy Pain in My Ass”.
Data Replication: Approaching the Problem
With our next Delphix release just around the corner, I wanted to spend some time discussing the engineering process behind one of the major new features: data replication between servers. The current Delphix version already has a replication solution, so how does this constitute a “new feature”? The reason is that it’s an entirely new system, the result of an endeavor to create a more reliable, maintainable, and extensible system. How we got here makes for an interesting tale of business analysis, architecture, and implementation.
Where did we come from?
Before we begin looking at the current implementation, we need to understand why we started with a blank sheet of paper when we already had a shipping solution. The short answer is that what we had was unusable: it was unreliable, undebuggable, and unmaintainable. And when you’re in charge of data consistency for disaster recovery, “unreliable” is not an acceptable state. While I had not written any of the replication infrastructure at Fishworks (my colleagues Adam Leventhal and Dave Pacheco deserve the credit for that), I had spent a lot of time in discussions with them, as well as thinking about how to build a distributed data architecture at Fishworks. So it seemed natural for me to take on this project at Delphix. As I started to unwind our current state, I found a series of decisions that, in hindsight, led to the untenable state we were in today.
- Analysis of the business problem – At the core of the current replication system was the notion that its purpose was for disaster recovery. This is indeed a major use case of replication, but it’s not the only one (geographical distribution of data being another strong contender). While picking one major problem to tackle first is a reasonable approach to constrain scope, by not correctly identifying future opportunities we ended up with a solution that could only be used for active/passive disaster recovery.
- Data protocol choice – There is another problem that is very similar to replication: offline backup/restore. Clearly, we want to leverage the same data format and serialization process, but do we want to use the same protocol? NDMP is the industry standard for backups, but it’s tailored to a very specific use case (files and filesystems). By choosing to use NDMP for replication, we sacrificed features (resumable operations, multiple streams) and usability (poor error semantics) and maintainability (unnecessarily complicated operation).
- Outsourcing of work – At the time this architecture was created, it was decided that NDMP was not part of the company’s core competency, and we should contract with a third party to provide the NDMP solution. I’m a firm believer that engineering work should never be outsourced unless it’s known ahead of time that the result will be thrown away. Otherwise, you’re inevitably saddled with a part of your product that you have limited ability to change, debug, and support. In our case, this was compounded by the fact that the deliverable was binary objects – we didn’t even have source available.
- Architectural design – By having a separate NDMP daemon we were forced to have an arcane communication mechanism (local HTTP) that lost information with each transition, resulting in a non-trivial amount of application logic resting in a binary we didn’t control. This made it difficult to adapt to core changes in the underlying abstractions.
- Algorithmic design – There was a very early decision made that replication would be done on a per-group basis (Delphix arranges databases into logical groups). This was divorced from the reality of the underlying ZFS data dependencies, resulting in numerous oddities such as being unable to replicate non self-contained groups or cyclic dependencies between groups. This abstraction was deeply baked into the architecture such that it was impossible to fix in the original architecture.
- Implementation – The implementation itself was built to be “isolated” from any other code in the system. When one is replicating the core representation of system metadata, this results in an unmaintainable and brittle mess. We had a completely separate copy of our object model that had to be maintained and updated along with the core model, and changes elsewhere in the system (such as deleting objects while replication was ongoing) could lead to obscure errors. The most egregious problems led to unrecoverable state – the target and source could get out of sync such that the only resolution was a new full replication from scratch.
- Test infrastructure – There was no unit test infrastructure, no automated functional test infrastructure, and no way to test the majority of functionality without manually setting up multi-machine replication or working with a remote DMA. As a result only the most basic functionality worked, and even then it was unreliable most of the time.
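The per-group problem in the list above is easier to see with a small model. The sketch below assumes a simplified picture in which each dataset may have a single origin it was cloned from; the function names and the sample data are hypothetical and are not taken from Delphix. It shows how groups can end up depending on each other, including cyclically, once clone relationships cross group boundaries.

```python
def external_dependencies(groups, origin):
    """Map each group to the other groups it depends on.

    groups: group name -> set of dataset names
    origin: dataset name -> dataset it was cloned from (simplified,
            hypothetical model of ZFS clone/origin relationships)
    """
    owner = {ds: g for g, members in groups.items() for ds in members}
    deps = {g: set() for g in groups}
    for ds, parent in origin.items():
        src, dst = owner[ds], owner[parent]
        if src != dst:
            deps[src].add(dst)
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the group dependency graph."""
    state = {}  # group -> "visiting" or "done"

    def visit(group):
        if state.get(group) == "done":
            return False
        if state.get(group) == "visiting":
            return True
        state[group] = "visiting"
        found = any(visit(nxt) for nxt in deps[group])
        state[group] = "done"
        return found

    return any(visit(g) for g in deps)


# db2 (dev) was cloned from db1 (prod); db3 (prod) was cloned from db2 (dev).
groups = {"prod": {"db1", "db3"}, "dev": {"db2"}}
origin = {"db2": "db1", "db3": "db2"}
deps = external_dependencies(groups, origin)
cyclic = has_cycle(deps)
```

In this example neither group is self-contained, and no ordering of whole groups satisfies both dependencies, which is the kind of oddity a per-group design runs into.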
Ideals for a new system
Given this list of limitations, I (later joined by Matt) sat down with a fresh sheet of paper. The following were some of the core ideals we set forth as we built this new system:
- Separation of mechanism from protocol – Whatever choices we make in terms of protocol and replication topologies, we want the core serialization infrastructure to be entirely divorced from the protocol used to transfer the data.
- Support for arbitrary topologies – We should be able to replicate from a host to any number of other hosts and vice versa, as well as provision from replicated objects.
- Robust test infrastructure – We should be able to run protocol-level tests, simulate failures, and perform full replication within a single-system unit test framework.
- Integrated with core object model – There should be one place where object definitions are maintained, such that the replication system can’t get out of sync with the primary source code.
- Resilient to failure – No matter what, the system must maintain consistent state in the face of failure. This includes both catastrophic system failure, as well as ongoing changes to the system (i.e. objects being created and deleted). At any point, we must be able to resume replication from a previously known good state without user intervention.
- Clear error messages – Failures, when they do occur, must present a clear indication of the nature of the problem and what actions must be taken by the user, if any, to fix the underlying problem.
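Several of these ideals reinforce each other. As a rough illustration (not the Delphix design), the sketch below keeps serialization independent of the transport, swaps in an in-memory transport that can simulate a failure inside a single process, and resumes from the last known good position. `MemoryTransport` and `replicate` are hypothetical names.

```python
import json


class MemoryTransport:
    """Test double: records payloads instead of sending them over a network."""

    def __init__(self, fail_after=None):
        self.received = []
        self.fail_after = fail_after

    def send(self, payload: bytes) -> None:
        if self.fail_after is not None and len(self.received) >= self.fail_after:
            raise ConnectionError("simulated link failure")
        self.received.append(payload)


def replicate(objects, transport, start=0):
    """Serialize objects in order; return the index to resume from."""
    for index in range(start, len(objects)):
        payload = json.dumps(objects[index], sort_keys=True).encode()
        try:
            transport.send(payload)
        except ConnectionError:
            return index        # last known good position
    return len(objects)


objects = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
link = MemoryTransport(fail_after=2)
resume_at = replicate(objects, link)            # fails on the third object
link.fail_after = None                          # the link comes back
finished_at = replicate(objects, link, start=resume_at)
```

Nothing in `replicate` knows which protocol carries the bytes, so the same code path can be exercised in a unit test and over a real connection.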
At the same time, we were forced to limit the scope of the project so we could deliver something in an appropriate timeframe. We stuck with NDMP as a protocol despite its inherent problems, as we needed to fix our backup/restore implementation as well. And we kept the active/passive deployment model so that we did not require any significant changes to the GUI.
Next, I’ll discuss the first major piece of work: building a better NDMP implementation.
Your MDB fell into my DTrace!
Yesterday, several of us from Delphix, Nexenta, Joyent, and elsewhere, convened before the OpenStorage summit as part of an illumos hackathon. The idea was to get a bunch of illumos coders in a room, brainstorm a bunch of small project ideas, and then break off to go implement them over the course of the day. That was the idea, at least – in reality we didn’t know what to expect or how it would turn out. Suffice to say that the hackathon was an amazing success. There were a lot of cool ideas, and a lot of great mentors in the room that could lead people through unfamiliar territory.
For my mini-project (suggested by ahl), I implemented MDB’s ::print functionality in DTrace via a new print() action. Today, we have the trace() action, but the result is somewhat less than useful when dealing with structs, as it degenerates into tracemem():
# dtrace -qn 'BEGIN{trace(`p0); exit(0)}'
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 00 00 00 00 00 00 00 00 60 02 c3 fb ff ff ff ff ........`.......
10: c8 c9 c6 fb ff ff ff ff 00 00 00 00 00 00 00 00 ................
20: b0 ad 14 c6 00 ff ff ff 00 00 00 00 02 00 00 00 ................
...
The results aren’t pretty, and we end up throwing away all that useful proc_t type information. With a few tweaks to dtrace, and some cribbing from mdb_print.c, we can do much better:
# dtrace -qn 'BEGIN{print(`p0); exit(0)}'
proc_t {
struct vnode *p_exec = 0
struct as *p_as = 0xfffffffffbc30260
struct plock *p_lockp = 0xfffffffffbc6c9c8
kmutex_t p_crlock = {
void *[1] _opaque = [ 0 ]
}
struct cred *p_cred = 0xffffff00c614adb0
int p_swapcnt = 0
char p_stat = '02'
....
Much better! Now, how did we get there from here? The answer was an interesting journey through libdtrace, the kernel dtrace implementation, CTF, and the horrors of bitfields.
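The difference between the two outputs comes down to whether the consumer still has a type description when it looks at the raw bytes. The sketch below illustrates that idea in Python with a hand-written type table standing in for CTF data. The member offsets and sizes are made up for the example and do not reflect the real proc_t layout, and this is not how libdtrace is implemented.

```python
import struct

# Hypothetical type description standing in for CTF data:
# (type name, member name, byte offset, struct format or nested member list)
PROC_MEMBERS = [
    ("struct as *", "p_as", 0, "<Q"),
    ("kmutex_t", "p_crlock", 8, [
        ("uint64_t", "_opaque", 0, "<Q"),
    ]),
    ("int", "p_swapcnt", 16, "<i"),
]


def render(buf, base, members, indent=1):
    """Walk the type description and decode each member from the raw buffer."""
    pad = "    " * indent
    lines = []
    for type_name, member, offset, layout in members:
        if isinstance(layout, list):
            # Nested structure: recurse with an adjusted base offset.
            lines.append(f"{pad}{type_name} {member} = {{")
            lines.extend(render(buf, base + offset, layout, indent + 1))
            lines.append(f"{pad}}}")
        else:
            (value,) = struct.unpack_from(layout, buf, base + offset)
            lines.append(f"{pad}{type_name} {member} = {value:#x}")
    return lines


raw = struct.pack("<QQi", 0xfffffffffbc30260, 0, 3)
lines = ["proc_t {"] + render(raw, 0, PROC_MEMBERS) + ["}"]
print("\n".join(lines))
```

Without `PROC_MEMBERS`, all that can be done with `raw` is a hex dump, which is exactly the situation trace() is in.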
To action or not to action?
The first question I set out to answer was what the user-visible interface should be. It seemed clear that this should be an operation on the same level as trace(), allowing arbitrary D expressions, but simply preserving the type of the result and pretty-printing it later. After briefly considering printt() (for “print type”), I decided upon just print(), since this seemed like a logical choice. My first inclination was to create a new DTRACEACT_PRINT, but after some discussion with Adam, we decided this was extraneous – the behavior was identical to DTRACEACT_DIFEXPR (the internal name for trace), but just with type information.
Through the looking glass with types and formats
The real issue is that what we compile (dtrace statements) and what we consume (dtrace epids and records) are two very different things, and never the twain shall meet. At the time we go to generate the DIFEXPR statement in dt_cc.c, we have the CTF data in hand. We don’t want to change the DIF we generate, simply do post-processing on the other side, so we just need some way to get back to that type information in dt_consume_cpu(). We can’t simply hang it off our dtrace statement, as that would break anonymous tracing (and violate the rest of the DTrace architecture to boot).
Thankfully, this problem had already been solved for printf() (and related actions) because we need to preserve the original format string for the exact same reason. To do this, we take the action-specific integer argument, and use it to point into the DOF string table, where we stash the original format string. I simply had to hijack dtrace_dof_create() and have it do the same thing for the type information, right?
If only it could be so simple. There were two complications here. First, there is a lot of code that explicitly treats these as printf strings, and parses them into internal argv-style representations. Pretending our types were just format strings would cause all kinds of problems in this code. So I had to modify libdtrace to treat this more explicitly as raw ‘string data’ that is (optionally) used with the DIFEXPR action. Second, even with that in place, the formats I was sending down were not making it back out of the kernel. Because the argument is action-specific, the kernel needed to be modified to recognize this new argument in dtrace_ecb_action_add(). With that change in place, I was able to get the format string back in userland when consuming the CPU buffers.
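The mechanism described here, an integer argument that points into a string table, can be illustrated with a toy model. The `StringTable` class and the action dictionary below are hypothetical and much simpler than the real DOF format. The sketch only shows why an integer is enough to recover a string on the consuming side.

```python
class StringTable:
    """Toy string table: NUL-terminated strings addressed by byte offset."""

    def __init__(self):
        self.data = bytearray(b"\0")   # offset 0 is the empty string

    def add(self, text: str) -> int:
        offset = len(self.data)
        self.data += text.encode() + b"\0"
        return offset

    def get(self, offset: int) -> str:
        end = self.data.index(b"\0", offset)
        return self.data[offset:end].decode()


# Compile side: stash the strings and keep only their offsets in the actions.
table = StringTable()
printf_action = {"kind": "PRINTF", "arg": table.add("%s: %d\n")}
print_action = {"kind": "DIFEXPR", "arg": table.add("proc_t")}

# Consume side: the integer argument is all that is needed to get the string back.
recovered_format = table.get(printf_action["arg"])
recovered_type = table.get(print_action["arg"])
```

In this model an argument of 0 naturally means "no string attached", which fits the idea of the data being optional for a plain trace().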
Bitfields, or why the D compiler cost me an hour of my life
With the trace data and type string in hand, I then proceeded to copy the mdb ::print code, first from apptrace (which turned out to be complete garbage) and then fixing it up bit by bit. Finally, after tweaking the code for an hour or two, I had it looking pretty much identical to ::print output. But when I fed it a klwp_t structure, I found that the user_desc_t structure bitfields weren’t being printed correctly:
# dtrace -n 'BEGIN{print(*((user_desc_t*)0xffffff00cb0a4d90)); exit(0)}'
dtrace: description 'BEGIN' matched 1 probe
CPU ID FUNCTION:NAME
0 1 :BEGIN user_desc_t {
unsigned long usd_lolimit = 0xcff3000000ffff
unsigned long usd_lobase = 0xcff3000000
unsigned long usd_midbase = 0xcff300
unsigned long usd_type = 0xcff3
unsigned long usd_dpl :64 = 0xcff3
unsigned long usd_p :64 = 0xcff3
unsigned long usd_hilimit = 0xcf
unsigned long usd_avl :64 = 0xcf
unsigned long usd_long :64 = 0xcf
unsigned long usd_def32 :64 = 0xcf
unsigned long usd_gran :64 = 0xcf
unsigned long usd_hibase = 0
}
I spent an hour trying to debug this, only to find that the CTF IDs weren’t matching what I expected from the underlying object. I finally tracked it down to the fact that the D compiler, by virtue of processing the /usr/lib/dtrace files, pulls in its own version of klwp_t from the system header files. But it botches the bitfields, leaving the user with subtly incorrect data. Switching the type to be genunix`user_desc_t fixed the problem.
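For reference, printing a bitfield correctly requires both a shift and a mask. The values in the broken output look as though many fields were shifted but never masked, though that is only an inference from the numbers. The sketch below decodes the same 64-bit word using field widths that are an assumption based on the usual amd64 segment descriptor layout; verify them against the real user_desc_t definition before relying on them.

```python
def bitfield(word: int, offset: int, width: int) -> int:
    """Extract `width` bits starting at bit `offset` (counting from bit 0)."""
    return (word >> offset) & ((1 << width) - 1)


# Assumed field widths, in declaration order; they sum to 64 bits.
LAYOUT = [
    ("usd_lolimit", 16), ("usd_lobase", 16), ("usd_midbase", 8),
    ("usd_type", 5), ("usd_dpl", 2), ("usd_p", 1),
    ("usd_hilimit", 4), ("usd_avl", 1), ("usd_long", 1),
    ("usd_def32", 1), ("usd_gran", 1), ("usd_hibase", 8),
]


def decode(word: int) -> dict:
    """Walk the layout, advancing the bit offset by each field's width."""
    fields, offset = {}, 0
    for name, width in LAYOUT:
        fields[name] = bitfield(word, offset, width)
        offset += width
    return fields


fields = decode(0xcff3000000ffff)   # the first value from the output above
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

With masking in place, single-bit fields such as usd_p can only ever be 0 or 1, which is a quick sanity check that the 0xcff3 values fail.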
What’s next
Given the usefulness of this feature, the next steps are to clean up the code, get it reviewed, and push to the illumos gate. It should hopefully be finding its way to an illumos distribution near you soon. Here’s a final print() invocation to leave you with:
# dtrace -n 'zio_done:entry{print(*args[0]); exit(0)}'
dtrace: description 'zio_done:entry' matched 1 probe
CPU ID FUNCTION:NAME
0 42594 zio_done:entry zio_t {
zbookmark_t io_bookmark = {
uint64_t zb_objset = 0
uint64_t zb_object = 0
int64_t zb_level = 0
uint64_t zb_blkid = 0
}
zio_prop_t io_prop = {
enum zio_checksum zp_checksum = ZIO_CHECKSUM_INHERIT
enum zio_compress zp_compress = ZIO_COMPRESS_INHERIT
dmu_object_type_t zp_type = DMU_OT_NONE
uint8_t zp_level = 0
uint8_t zp_copies = 0
uint8_t zp_dedup = 0
uint8_t zp_dedup_verify = 0
}
zio_type_t io_type = ZIO_TYPE_NULL
enum zio_child io_child_type = ZIO_CHILD_VDEV
int io_cmd = 0
uint8_t io_priority = 0
uint8_t io_reexecute = 0
uint8_t [2] io_state = [ 0x1, 0 ]
uint64_t io_txg = 0
spa_t *io_spa = 0xffffff00c6806580
blkptr_t *io_bp = 0
blkptr_t *io_bp_override = 0
blkptr_t io_bp_copy = {
dva_t [3] blk_dva = [
dva_t {
uint64_t [2] dva_word = [ 0, 0 ]
},
...
Delphix illumos sources posted to github
With our first illumos-based distribution (2.6) out the door, we’ve posted the illumos-derived sources to github:
https://github.com/delphix/delphix-os-2.6
This repository contains the following types of changes from the illumos gate:
- Changes that are complete and generally useful to the illumos community. We have been (and will continue to be) proactive about pushing these changes to the illumos trunk ourselves. We missed a few this time around, so we’ll be going back through to pick up anything we missed.
- Changes that are sufficient to meet the needs of our product, but are not complete or generally useful for the larger community. Our hope is that by pushing these changes to github, others can pick up such pieces of work and integrate them in a form that is acceptable to the illumos community at large.
- Changes that represent distro-specific changes unique to our product. It is unlikely that these will be of interest to anyone except the morbidly curious.
We will post updates with each release of the software. This allows us to make sure the code is fully baked and tested, while still allowing us to proactively push complete pieces of work more frequently.
If you have questions about any particular change, feel free to email the author for more information. You can also find us on the illumos developer mailing list and the #illumos IRC channel on freenode.
Beyond Oracle
It’s been a little over six months since I left Oracle to join Delphix. I’m not here to dwell on the reasons for my departure, as I think the results speak for themselves.
It is with a sad heart, however, that I look at the work so many put into making OpenSolaris what it was, only to see it turned into the next HP-UX – a commercially relevant but ultimately technologically uninteresting operating system. This is not to denigrate the work of those at Oracle working on Solaris 11, but I personally believe that a truly innovative OS requires an engaged developer base interacting with the source code, and unique technologies that are adopted across multiple platforms. With no one inside or outside of Oracle believing the unofficial pinky swear to release source code at some abstract future date, many may wonder what will happen to the bevy of cutting edge technologies that made up OpenSolaris.
The good news is that those technologies are alive and well in the illumos project, and many of us who left Oracle have joined companies such as Delphix, Joyent, and Nexenta that are building innovative solutions on top of the enterprise-quality OS at the core of illumos. Combined with those dedicated souls who continue to tirelessly work on the source in their free time, the community continues to grow and evolve. We are here today because we stand on the shoulders of giants, and we will continue to improve the source and help the community make the platform successful in whatever form it may take in the future.
And the contributions continue to pour in. There are fixes for nasty DTrace bugs and new DTrace features, new COMSTAR protocol support, TCP/IP stability and scalability fixes, ZFS data corruption fixes and performance improvements, and much more. And there is a ZFS working group spanning multiple platforms and companies with a diverse set of interests helping to coordinate future ZFS development.
Suffice to say that OpenSolaris is alive and well outside the walls of Oracle, so give it a spin and get involved!
Moving On
In my seven years at Sun and Oracle, I’ve had the opportunity to work with some incredible people on some truly amazing technology. But the time has come for me to move on, and today will be my last day at Oracle.
When I started in the Solaris kernel group in 2003, I had no idea that I was entering a truly unique environment – a culture of innovation and creativity that is difficult to find anywhere, let alone working on a system as old and complex as Solaris in a company as large as Sun. While there, I worked with others to reshape the operating system landscape through technologies like Zones, SMF, DTrace, and FMA, and was fortunate to be part of the team that created ZFS, one of the most innovative filesystems ever. From there I became a member of the Fishworks team that created the Sun Storage 7000 series; working with such a close-knit talented team to create a groundbreaking integrated product like that was an experience that I will never forget.
I learned so much and grew in so many ways thanks to the people I had the chance to meet and work with over the past seven years. I would not be the person I am today without your help and guidance. While I am leaving Oracle, I will always be part of the community, and I look forward to our paths crossing in the future.
Despite not updating this blog as much as I’d like, I do hope to blog in the future at my new home: http://dtrace.org/blogs/eschrock.

Thanks for all the memories.
Multiple pools in 2010.Q1
When the Sun Storage 7000 was first introduced, a key design decision was to allow only a single ZFS storage pool per host. This forces users to fully take advantage of the ZFS pool storage model, and prevents them from adopting ludicrous schemes such as “one pool per filesystem.” While RAID-Z has non-trivial performance implications for IOPs-bound workloads, the hope was that by allowing logzilla and readzilla devices to be configured per-filesystem, users could adjust relative performance and implement different qualities of service on a single pool.
While this works for the majority of workloads, there are still some that benefit from mirrored performance even in the presence of cache and log devices. As the maximum size of Sun Storage 7000 systems increased, it became apparent that we needed a way to allow pools with different RAS and performance characteristics in the same system. With this in mind, we relaxed the “one pool per system” rule [1] with the 2010.Q1 release.

The storage configuration user experience is relatively unchanged. Instead of having a single pool (or two pools in a cluster), and being able to configure one or the other, you can simply click the ‘+’ button and add pools as needed. When creating a pool, you can now specify a name for the pool. When importing a pool, you can either accept the existing name or give it a new one at the time you select the pool. Ownership of pools in a cluster is now managed exclusively through the Configuration -> Cluster screen, as with other shared resources.

When managing shares, there is a new dropdown menu at the top left of the navigation bar. This controls which shares are shown in the UI. In the CLI, the equivalent setting is the ‘pool’ property at the ‘shares’ node.
While this gives some flexibility in storage configuration, it also allows users to create poorly constructed storage topologies. The intent is to allow the user to create pools with different RAS and performance characteristics, not to create dozens of different pools with the same properties. If you attempt to do this, the UI will present a warning summarizing the drawbacks if you were to continue:
- Wastes system resources that could be shared in a single pool.
- Decreases overall performance.
- Increases administrative complexity.
- Log and cache devices can be enabled on a per-share basis.
You can still commit the operation, but such configurations are discouraged. The exception is when configuring a second pool on one head in a cluster.
We hope this feature will allow users to continue to consolidate storage and expand use of the Sun Storage 7000 series in more complicated environments.
[1] Clever users figured out that this mechanism could be circumvented in a cluster to have two pools active on the same host in an active/passive configuration.
Shadow Migration Internals
In my previous entry, I described the overall architecture of shadow migration. This post will dive into the details of how it’s actually implemented, and the motivation behind some of the original design decisions.
VFS interposition
A very early desire was that we wanted something that could migrate data from many different sources. And while ZFS is the primary filesystem for Solaris, we also wanted to allow for arbitrary local targets. For this reason, the shadow migration infrastructure is implemented entirely at the VFS (Virtual FileSystem) layer. At the kernel level, there is a new ‘shadow’ mountpoint option, which is the path to another filesystem on the system. The kernel has no notion of whether a source filesystem is local or remote, and doesn’t differentiate between synchronous access and background migration. Any filesystem access, whether it is local or over some other protocol (CIFS, NFS, etc) will use the VFS interfaces and therefore be fed through our interposition layer.
The only other work the kernel does when mounting a shadow filesystem is check to see if the root directory is empty. If it is empty, we create a blank SUNWshadow extended attribute on the root directory. Once set, this will trigger all subsequent migration as long as the filesystem is always mounted with the ‘shadow’ attribute. Each VFS operation first checks to see if the filesystem is shadowing another (a quick check), and then whether the file or directory has the SUNWshadow attribute set (slightly more expensive, but cached with the vnode). If the attribute isn’t present, then we fall through to the local filesystem. Otherwise, we migrate the file or directory and then fall through to the local filesystem.
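The order of checks described above can be sketched as a small wrapper around a local operation. This is a toy model in Python, not kernel code. `Node`, `ShadowFS`, and the `migrate` hook are hypothetical, and clearing the marker immediately is a simplification of the real behavior.

```python
class Node:
    def __init__(self, name, shadow=None):
        self.name = name
        self.shadow = shadow    # stands in for the SUNWshadow attribute


class ShadowFS:
    def __init__(self, shadow_mounted, migrate):
        self.shadow_mounted = shadow_mounted   # mounted with the 'shadow' option?
        self.migrate = migrate                 # callable(node); hypothetical hook

    def access(self, node, local_op):
        # Cheap check first (is this filesystem shadowing anything?), then the
        # slightly more expensive per-node attribute check.
        if self.shadow_mounted and node.shadow is not None:
            self.migrate(node)
            node.shadow = None      # simplification: treat the node as migrated
        # In every case the request ends up at the local filesystem.
        return local_op(node)


migrated = []
fs = ShadowFS(shadow_mounted=True, migrate=lambda node: migrated.append(node.name))
report = Node("report.txt", shadow="/report.txt")
first = fs.access(report, lambda node: f"read {node.name}")
second = fs.access(report, lambda node: f"read {node.name}")
```

The second access skips migration entirely, which matches the goal of paying the cost only once per file or directory.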
Migration of directories
In order to migrate a directory, we have to migrate all the entries. When migrating an entry for a file, we don’t want to migrate the complete contents until the file is accessed, but we do need to migrate enough metadata such that access control can be enforced. We start by opening the directory on the remote side whose relative path is indicated by the SUNWshadow attribute. For each directory entry, we create a sparse file with the appropriate ownership, ACLs, system attributes and extended attributes.
Once the entry has been migrated, we then set a SUNWshadow attribute that is the same as the parent but with “/<name>” appended, where “name” is the directory entry name. This attribute always represents the relative path of the unmigrated entity on the source. This allows files and directories to be arbitrarily renamed without losing track of where they are located on the source. It also allows the source to change (e.g. if it is restored to a different host) if needed. Note that there are also types of files (symlinks, devices, etc) that do not have contents, in which case we simply migrate the entire object at once.
Once the directory has been completely migrated, we remove the SUNWshadow attribute so that future accesses all use the native filesystem. If the process is interrupted (system reset, etc), then the attribute will still be on the parent directory so we will migrate it again when the user (or background process) tries to access it.
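The directory steps above can be sketched with an in-memory model. This is a simplification under stated assumptions: the source is a nested dict (a dict is a directory, anything else is a file), `Node`, `lookup_source` and `migrate_directory` are hypothetical names, and metadata such as ownership and ACLs is left out. What it does show is how each child's attribute is derived from the parent's, and why a local rename doesn't lose the source location.

```python
SHADOW_ATTR = "SUNWshadow"

class Node:
    """A local file or directory; 'shadow' is its relative path on the source."""
    def __init__(self, is_dir, shadow=None):
        self.is_dir = is_dir
        self.children = {}   # name -> Node (directories only)
        self.data = None     # contents once migrated (files only)
        self.xattrs = {} if shadow is None else {SHADOW_ATTR: shadow}

def lookup_source(source_root, rel_path):
    # Walk the nested-dict source; a blank path means the source root.
    node = source_root
    for part in filter(None, rel_path.split("/")):
        node = node[part]
    return node

def migrate_directory(source_root, local_dir):
    rel = local_dir.xattrs.get(SHADOW_ATTR)
    if rel is None:
        return  # already migrated; nothing to do
    for name, entry in lookup_source(source_root, rel).items():
        # Skip entries created by an earlier, interrupted attempt.
        if name not in local_dir.children:
            child = Node(isinstance(entry, dict), shadow=rel + "/" + name)
            local_dir.children[name] = child
    # Only once every entry exists do we drop the attribute on the parent.
    del local_dir.xattrs[SHADOW_ATTR]
```

For example, after migrating a root created as `Node(True, shadow="")`, a child directory `docs` carries the attribute `/docs`. Renaming it locally changes its key in `children` but not the attribute, so a later `migrate_directory` on the renamed node still finds the right place on the source.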
Migration of files
Migrating a plain file is much simpler. We use the SUNWshadow attribute to locate the file on the source, and then read the source file and write the corresponding data to the local filesystem. In the current software version, this happens all at once, meaning the first access of a large file will have to wait for the entire file to be migrated. Future software versions will remove this limitation and migrate only enough data to satisfy the request, as well as allowing concurrent accesses to the file. Once all the data is migrated, we remove the SUNWshadow attribute and future accesses will go straight to the local filesystem.
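As a rough illustration of the all-at-once behavior, the following Python sketch copies a whole file before clearing the attribute. It is an approximation, not the real implementation: `migrate_file` is a made-up name, and the `xattrs` dict (local path to attribute dict) stands in for real extended attributes.

```python
import os
import shutil

SHADOW_ATTR = "SUNWshadow"

def migrate_file(source_root, local_path, xattrs):
    """Copy the whole source file locally, then clear the shadow attribute.

    Returns True if a migration happened, False if the file was already local.
    """
    rel = xattrs.get(local_path, {}).get(SHADOW_ATTR)
    if rel is None:
        return False
    src = os.path.join(source_root, rel.lstrip("/"))
    # The entire file is copied before the caller's access can proceed.
    with open(src, "rb") as fin, open(local_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    # Removing the attribute last means an interrupted copy is simply retried.
    del xattrs[local_path][SHADOW_ATTR]
    return True
```

The ordering is the point of the sketch: the attribute is removed only after all the data is in place, so a crash mid-copy leaves the file marked as unmigrated.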
Dealing with hard links
One issue we knew we’d have to deal with is hard links. Migrating a hard link requires that the same file reference appear in multiple locations within the filesystem. Obviously, we do not know every reference to a file in the source filesystem, so we need to migrate these links as we discover them. To do this, we have a special directory in the root of the filesystem where we can create files named by their source FID. A FID (file ID) is a unique identifier for the file within the filesystem. We create the file in this hard link directory with a name derived from its FID. Each time we encounter a file with a link count greater than 1, we look up the source FID in our special directory. If it exists, we create a link to that directory entry instead of migrating a new instance of the file. This way, it doesn’t matter if files are moved around, removed from the local filesystem, or additional links created. We can always recreate a link to the original file. The one wrinkle is that we can migrate from nested source filesystems, so we also need to track the FSID (filesystem ID) which, while not persistent, can be stored in a table and reconstructed using source path information.
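Here is a small Python sketch of the hard link scheme using ordinary files. The assumptions are mine: the directory name `.shadow_links` is invented, `st_dev` and `st_ino` stand in for the source FSID and FID, and the data is copied eagerly rather than lazily. It needs a filesystem that supports `os.link`.

```python
import os
import shutil

LINK_DIR = ".shadow_links"  # hypothetical name for the special directory

def migrate_entry(src_path, local_root, local_rel):
    """Migrate one source file, preserving hard links discovered so far."""
    st = os.stat(src_path)
    dst = os.path.join(local_root, local_rel)
    if st.st_nlink <= 1:
        # Ordinary file: no link bookkeeping required.
        shutil.copyfile(src_path, dst)
        return dst
    link_dir = os.path.join(local_root, LINK_DIR)
    os.makedirs(link_dir, exist_ok=True)
    # Name derived from the (FSID, FID) stand-ins of the source file.
    anchor = os.path.join(link_dir, f"{st.st_dev:x}.{st.st_ino:x}")
    if not os.path.exists(anchor):
        # First sighting of this file: migrate a single instance.
        shutil.copyfile(src_path, anchor)
    # Every sighting, including the first, becomes a link to that instance.
    os.link(anchor, dst)
    return dst
```

Because the entry in the link directory is itself a link, it survives even if every visible name is later renamed or removed locally, which is what lets subsequently discovered links attach to the same file.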
Completing migration
A key feature of the shadow migration design is that it treats all accesses the same, and allows background migration of data to be driven from userland, where it’s easier to control policy. The downside is that we need the ability to know when we have finished migrating every single file and directory on the source. Because the local filesystem is actively being modified while we are traversing, it’s impossible to know whether you’ve visited every object based only on walking the directory hierarchy. To address this, we keep track of a “pending” list of files and directories with shadow attributes. Every object with a shadow attribute must be present in this list, though this list can contain objects without shadow attributes, or nonexistent objects. This allows us to be synchronous when it counts (appending entries) and lazy when it doesn’t (rewriting the file with entries removed). Most of the time, we’ll find all the objects during traversal, and the pending list will contain no entries at all. In the case we missed an object, we can issue an ioctl() to the kernel to do the work for us. When that list is empty we know that we are finished and can remove the shadow setting.
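The pending list invariant is easy to get backwards, so here is a toy Python model of it. Everything here is illustrative: `PendingList` keeps its entries in memory rather than on disk, `has_shadow_attr` is a caller-supplied predicate that returns False for missing objects, and `migrate` stands in for the ioctl() that asks the kernel to migrate a leftover object.

```python
class PendingList:
    """Superset of all objects that still carry a shadow attribute."""

    def __init__(self):
        self.entries = []  # stand-in for the on-disk list

    def add(self, path):
        # Synchronous side: an object must be listed for as long as it has
        # the attribute, so entries are appended eagerly.
        self.entries.append(path)

    def compact(self, has_shadow_attr):
        # Lazy side: stale entries (already migrated, or deleted) are only
        # dropped when the list is rewritten.
        self.entries = [p for p in self.entries if has_shadow_attr(p)]
        return self.entries

def finish_migration(pending, has_shadow_attr, migrate):
    # Assumes migrate() clears the attribute; otherwise this would not end.
    while pending.compact(has_shadow_attr):
        for path in list(pending.entries):
            migrate(path)
    # An empty list implies no object has the attribute any longer.
    return True
```

The list is allowed to over-report but never to under-report, which is why an empty list after compaction is sufficient proof that the migration is complete.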
ZFS integration
The ability to specify the shadow mount option for arbitrary filesystems is useful, but is also difficult to manage. It must be specified as an absolute path, meaning that the remote source of the mount must be tracked elsewhere, and has to be mounted before the filesystem itself. To make this easier, a new ‘shadow’ property was added for ZFS filesystems. This can be set using an abstract URI syntax (“nfs://host/path”), and libzfs will take care of automatically mounting the shadow filesystem and passing the correct absolute path to the kernel. This way, the user can manage a semantically meaningful relationship without worrying about how the internal mechanisms are connected. It also allows us to expand the set of possible sources in the future in a compatible fashion.
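To give a feel for the translation from URI to absolute path, here is a hedged Python sketch. It is not libzfs: `resolve_shadow_source` and `MOUNT_BASE` are invented for illustration, the `file://` form for local sources is an assumption on my part, and the function only builds the mount command rather than running it.

```python
from urllib.parse import urlsplit

MOUNT_BASE = "/var/run/shadow"  # hypothetical location, not the real one

def resolve_shadow_source(uri, dataset):
    """Return (mount_command_or_None, absolute_path_to_pass_to_the_kernel)."""
    parts = urlsplit(uri)
    if parts.scheme == "file":
        # Assumed form for a local source: already an absolute path.
        return None, parts.path
    if parts.scheme == "nfs":
        # One private mountpoint per dataset, derived from the dataset name.
        mountpoint = f"{MOUNT_BASE}/{dataset.replace('/', '_')}"
        remote = f"{parts.hostname}:{parts.path}"
        return ["mount", "-F", "nfs", remote, mountpoint], mountpoint
    raise ValueError(f"unsupported shadow source: {uri}")
```

With this shape, `resolve_shadow_source("nfs://host/path", "pool/home")` yields a command that mounts `host:/path` and the mountpoint the kernel would be given. Adding a new kind of source later means adding a branch here, without the kernel ever seeing anything but an absolute path.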
Hopefully this provides a reasonable view into how exactly shadow migration works, and the design decisions behind it. The goal is to eventually have this available in Solaris, at which point all the gritty details should be available to the curious.
User and Group Quotas in the 7000 Series

When ZFS was first developed, the engineering team had the notion that pooled storage would make filesystems cheap and plentiful, and we’d move away from the days of /export1, /export2, ad infinitum. From the ZFS perspective, they are cheap. It’s very easy to create dozens or hundreds of filesystems, each of which functions as an administrative control point for various properties. However, we quickly found that other parts of the system start to break down once you get past 1,000 or 10,000 filesystems. Mounting and sharing filesystems takes longer, browsing datasets takes longer, and managing automount maps (for those without NFSv4 mirror mounts) quickly spirals out of control.
For most users this isn’t a problem – a few hundred filesystems is more than enough to manage disparate projects and groups on a single machine. There was one class of users, however, where a few hundred filesystems wasn’t enough. These users were university or other home directory environments with 20,000 or more users, each of whom needed to have a quota to guarantee that they couldn’t run amok on the system. The traditional ZFS solution, creating a filesystem for each user and assigning a quota, didn’t scale. After thinking about it for a while, Matt developed a fairly simple architecture to provide this functionality without introducing pathological complexity into the bowels of ZFS. In build 114 of Solaris Nevada, he pushed the following:
PSARC 2009/204 ZFS user/group quotas & space accounting

This provides full support for user and group quotas on ZFS, as well as the ability to track usage on a per-user or per-group basis within a dataset.
This was later integrated into the 2009.Q3 software release, with an additional UI layer. From the ‘general’ tab of a share, you can query usage and set quotas for individual users or groups quickly. The CLI allows for automated batch operations. Requesting a single user or group is significantly faster than requesting all the current usage, but you can also get a list of the current usage for a project or share. With integrated identity management, users and groups can be specified either by UNIX username or Windows name.
There are some significant differences between user and group quotas and traditional ZFS quotas. The following is an excerpt from the on-line documentation on the subject:
- User and group quotas can only be applied to filesystems.
- User and group quotas are implemented using delayed enforcement. This means that users will be able to exceed their quota for a short period of time before data is written to disk. Once the data has been pushed to disk, the user will receive an error on new writes, just as with the filesystem-level quota case.
- User and group quotas are always enforced against referenced data. This means that snapshots do not affect any quotas, and a clone of a snapshot will consume the same amount of effective quota, even though the underlying blocks are shared.
- User and group reservations are not supported.
- User and group quotas, unlike data quotas, are stored with the regular filesystem data. This means that if the filesystem is out of space, you will not be able to make changes to user and group quotas. You must first make additional space available before modifying user and group quotas.
- User and group quotas are sent as part of any remote replication. It is up to the administrator to ensure that the name service environments are identical on the source and destination.
- NDMP backup and restore of an entire share will include any user or group quotas. Restores into an existing share will not affect any current quotas. (There is currently a bug preventing this from working in the initial release, which will be fixed in a subsequent minor release.)
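The delayed enforcement described in the excerpt above can be pictured with a toy model. This is not how ZFS is implemented internally; `Dataset`, `QuotaExceeded`, `write` and `sync` are names made up for the sketch. The only behavior it captures is that the quota check looks at usage already pushed to disk, so writes still in flight can overshoot the limit.

```python
class QuotaExceeded(Exception):
    pass

class Dataset:
    def __init__(self):
        self.quota = {}    # user -> byte limit
        self.on_disk = {}  # user -> bytes accounted at the last sync
        self.pending = {}  # user -> bytes written but not yet on disk

    def write(self, user, nbytes):
        limit = self.quota.get(user)
        # Enforcement only sees usage that has already reached disk.
        if limit is not None and self.on_disk.get(user, 0) >= limit:
            raise QuotaExceeded(user)
        self.pending[user] = self.pending.get(user, 0) + nbytes

    def sync(self):
        # Pushing data to disk is when per-user usage is actually updated.
        for user, nbytes in self.pending.items():
            self.on_disk[user] = self.on_disk.get(user, 0) + nbytes
        self.pending.clear()
```

With a quota of 100 bytes, two 80-byte writes before a sync both succeed, leaving the user at 160 bytes. Only the next write after the sync fails, which matches the "exceed their quota for a short period of time" wording.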
This feature will hopefully allow the Sun Storage 7000 series to function in environments where it was previously impractical to do so. Of course, the real person to thank is Matt and the ZFS team – it was a very small amount of work to provide an interface on top of the underlying ZFS infrastructure.
What is Shadow Migration?
In the Sun Storage 7000 2009.Q3 software release, one of the major new features I worked on was the addition of what we termed “shadow migration.” When we launched the product, there was no integrated way to migrate data from existing systems to the new systems. This resulted in customers rolling it by hand (rsync, tar, etc), or paying for professional services to do the work for them. We felt that we could present a superior model that would provide for a more integrated experience as well as let the customer leverage the investment in the system even before the migration was complete.
The idea in and of itself is not new, and various prototypes of this have been kicking around inside of Sun under various monikers (“brain slug”, “hoover”, etc) without ever becoming a complete product. When Adam and I sat down shortly before the initial launch of the product, we decided we could do this without too much work by integrating the functionality directly in the kernel. The basic design requirements we had were:
- We must be able to migrate over standard data protocols (NFS) from arbitrary data sources without the need to have special software running on the source system.
- Migrated data must be available before the entire migration is complete, and must be accessible with native performance.
- All the data to migrate the filesystem must be stored within the filesystem itself, and must not rely on an external database to ensure consistency.

With these requirements in hand, our key insight was that we could create a “shadow” filesystem that could pull data from the original source if necessary, but then fall through to the native filesystem for reads and writes once the file has been migrated. What’s more, we could leverage the NFS client on Solaris and do this entirely at the VFS (virtual filesystem) layer, allowing us to migrate data between shares locally or (eventually) over other protocols as well without changing the interpositioning layer. The other nice thing about this architecture is that the kernel remains ignorant of the larger migration process. Both synchronous requests (from clients) and background requests (from the management daemon) appear the same. This allows us to control policy within the userland software stack, without pushing that complexity into the kernel. It also allows us to write a very comprehensive automated test suite that runs entirely on local filesystems without needing a complex multi-system environment.
So what’s better (and worse) about shadow migration compared to other migration strategies? For that, I’ll defer to the documentation, which I’ve reproduced here for those who don’t have a (virtual) system available to run the 2009.Q3 release:
Migration via synchronization
This method works by taking an active host X and migrating data to the new host Y while X remains active. Clients still read and write to the original host while this migration is underway. Once the data is initially migrated, incremental changes are repeatedly sent until the delta is small enough to be sent within a single downtime window. At this point the original share is made read-only, the final delta is sent to the new host, and all clients are updated to point to the new location. The most common way of accomplishing this is through the rsync tool, though other integrated tools exist. This mechanism has several drawbacks:
- The anticipated downtime, while small, is not easily quantified. If a user commits a large amount of change immediately before the scheduled downtime, this can increase the downtime window.
- During migration, the new server is idle. Since new servers typically come with new features or performance improvements, this represents a waste of resources during a potentially long migration period.
- Coordinating across multiple filesystems is burdensome. When migrating dozens or hundreds of filesystems, each migration will take a different amount of time, and downtime will have to be scheduled across the union of all filesystems.
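The "repeat until the delta fits" loop can be made concrete with a little arithmetic. The model below is my own simplification, not something from the documentation: it assumes a constant transfer rate and a constant rate of change on the source, so each pass has to send whatever changed while the previous pass was running. `sync_passes` and its parameters are hypothetical.

```python
def sync_passes(total_bytes, transfer_rate, change_rate, window_seconds):
    """Count passes needed before the remaining delta fits the downtime window.

    Rates are in bytes per second. Returns (passes_before_cutover,
    seconds_needed_for_the_final_delta).
    """
    if change_rate >= transfer_rate:
        # Data changes at least as fast as it is sent: the delta never shrinks.
        raise ValueError("delta never shrinks")
    delta = total_bytes
    passes = 0
    while delta / transfer_rate > window_seconds:
        duration = delta / transfer_rate
        # While this pass ran, the source kept changing.
        delta = duration * change_rate
        passes += 1
    return passes, delta / transfer_rate
```

Under these assumptions each pass shrinks the delta by the ratio of change rate to transfer rate, so the loop converges quickly when changes are slow. It also shows the first drawback in the list: a burst of changes just before cutover raises the effective change rate and pushes the final delta back over the window.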
Migration via external interposition
This method works by taking an active host X and inserting a new appliance M that migrates data to a new host Y. All clients are updated at once to point to M, and data is automatically migrated in the background. This provides more flexibility in migration options (for example, being able to migrate to a new server in the future without downtime), and leverages the new server for already migrated data, but also has significant drawbacks:
- The migration appliance represents a new physical machine, with associated costs (initial investment, support costs, power and cooling) and additional management overhead.
- The migration appliance represents a new point of failure within the system.
- The migration appliance interposes on already migrated data, incurring extra latency, often permanently. These appliances are typically left in place, though it would be possible to schedule another downtime window and decommission the migration appliance.
Shadow migration
Shadow migration uses interposition, but is integrated into the appliance and doesn’t require a separate physical machine. When shares are created, they can optionally “shadow” an existing directory, either locally (see below) or over NFS. In this scenario, downtime is scheduled once where the source appliance X is placed into read-only mode, a share is created with the shadow property set, and clients are updated to point to the new share on the Sun Storage 7000 appliance. Clients can then access the appliance in read-write mode.
Once the shadow property is set, data is transparently migrated in the background from the source to the local appliance. If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate this file to the local server before responding to the request. This may incur some initial latency for some client requests, but once a file has been migrated all accesses are local to the appliance and have native performance. It is often the case that the current working set for a filesystem is much smaller than the total size, so once this working set has been migrated, regardless of the total native size on the source, there will be no perceived impact on performance.
The downside to shadow migration is that it requires a commitment before the data has finished migrating, though this is the case with any interposition method. During the migration, portions of the data exist in two locations, which means that backups are more complicated, and snapshots may be incomplete and/or exist only on one host. Because of this, it is extremely important that any migration between two hosts first be tested thoroughly to make sure that identity management and access controls are set up correctly. This need not test the entire data migration, but it should be verified that files or directories that are not world readable are migrated correctly, ACLs (if any) are preserved, and identities are properly represented on the new system.
Shadow migration is implemented using on-disk data within the filesystem, so there is no external database and no data stored locally outside the storage pool. If a pool is failed over in a cluster, or both system disks fail and a new head node is required, all data necessary to continue shadow migration without interruption will be kept with the storage pool.
In a subsequent post, I’ll discuss some of the thorny implementation details we had to solve, as well as provide some screenshots of migration in progress. In the meantime, I suggest folks download the simulator and upgrade to the latest software to give it a try.
