What is Shadow Migration?

In the Sun Storage 7000 2009.Q3 software release, one of the major new features I worked on was the addition of what we termed “shadow migration.” When we launched the product, there was no integrated way to migrate data from existing systems to the new systems. This resulted in customers rolling their own migration by hand (rsync, tar, etc.), or paying for professional services to do the work for them. We felt that we could present a superior model that would provide a more integrated experience and let the customer leverage the investment in the system even before the migration was complete.

The idea in and of itself is not new, and various prototypes of this have been kicking around inside of Sun under various monikers (“brain slug”, “hoover”, etc.) without ever becoming a complete product. When Adam and I sat down shortly before the initial launch of the product, we decided we could do this without too much work by integrating the functionality directly in the kernel. The basic design requirements we had were:

With these requirements in hand, our key insight was that we could create a “shadow” filesystem that could pull data from the original source if necessary, but then fall through to the native filesystem for reads and writes once the file has been migrated. What’s more, we could leverage the NFS client on Solaris and do this entirely at the VFS (virtual filesystem) layer, allowing us to migrate data between shares locally or (eventually) over other protocols as well without changing the interpositioning layer. The other nice thing about this architecture is that the kernel remains ignorant of the larger migration process. Both synchronous requests (from clients) and background requests (from the management daemon) appear the same. This allows us to control policy within the userland software stack, without pushing that complexity into the kernel. It also allows us to write a very comprehensive automated test suite that runs entirely on local filesystems without needing a complex multi-system environment.
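To make the fall-through idea concrete, here is a minimal userland sketch in Python. It is not the actual implementation, which lives in the kernel at the VFS layer; the class name ShadowShare and its methods are hypothetical, the "source" is just a local directory, and only flat files (no subdirectories) are handled. It shows a read pulling a file from the source on first access, and later reads and writes going straight to the local copy.

```python
import os
import shutil
import tempfile


class ShadowShare:
    """Toy model of a share that shadows a read-only source directory.

    Hypothetical illustration only: the real feature interposes in the kernel.
    """

    def __init__(self, source, local):
        self.source = source  # original data, treated as read-only
        self.local = local    # native filesystem that takes over once migrated

    def is_migrated(self, name):
        return os.path.exists(os.path.join(self.local, name))

    def migrate(self, name):
        """Pull one file from the source if it is not local yet.

        Client requests and a background walker would both call this,
        so the copy logic has no idea which kind of caller it serves.
        """
        src = os.path.join(self.source, name)
        if self.is_migrated(name) or not os.path.exists(src):
            return False
        dst = os.path.join(self.local, name)
        tmp = dst + ".partial"
        shutil.copy2(src, tmp)  # copy under a temporary name first
        os.replace(tmp, dst)    # rename so a half-copied file never looks migrated
        return True

    def read(self, name):
        self.migrate(name)
        with open(os.path.join(self.local, name), "rb") as f:
            return f.read()

    def write(self, name, data):
        self.migrate(name)  # writes always land on the local filesystem
        with open(os.path.join(self.local, name), "wb") as f:
            f.write(data)


def demo():
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "source")
        local = os.path.join(root, "local")
        os.makedirs(source)
        os.makedirs(local)
        with open(os.path.join(source, "a.txt"), "wb") as f:
            f.write(b"old data")

        share = ShadowShare(source, local)
        first = share.read("a.txt")        # migrated on demand
        share.write("a.txt", b"new data")  # served locally from now on
        second = share.read("a.txt")
        with open(os.path.join(source, "a.txt"), "rb") as f:
            untouched = f.read()           # the source is never modified
        return first, second, untouched
```

Because everything here runs against local directories, a test of the sketch needs no second machine, which mirrors the testing benefit described above.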

So what’s better (and worse) about shadow migration compared to other migration strategies? For that, I’ll defer to the documentation, which I’ve reproduced here for those who don’t have a (virtual) system available to run the 2009.Q3 release:

Migration via synchronization

This method works by taking an active host X and migrating data to the new host Y while X remains active. Clients still read and write to the original host while this migration is underway. Once the data is initially migrated, incremental changes are repeatedly sent until the delta is small enough to be sent within a single downtime window. At this point the original share is made read-only, the final delta is sent to the new host, and all clients are updated to point to the new location. The most common way of accomplishing this is through the rsync tool, though other integrated tools exist. This mechanism has several drawbacks:

Migration via external interposition

This method works by taking an active host X and inserting a new appliance M that migrates data to a new host Y. All clients are updated at once to point to M, and data is automatically migrated in the background. This provides more flexibility in migration options (for example, being able to migrate to a new server in the future without downtime), and leverages the new server for already migrated data, but also has significant drawbacks:

Shadow migration

Shadow migration uses interposition, but is integrated into the appliance and doesn’t require a separate physical machine. When shares are created, they can optionally “shadow” an existing directory, either locally (see below) or over NFS. In this scenario, downtime is scheduled once where the source appliance X is placed into read-only mode, a share is created with the shadow property set, and clients are updated to point to the new share on the Sun Storage 7000 appliance. Clients can then access the appliance in read-write mode.

Once the shadow property is set, data is transparently migrated in the background from the source appliance locally. If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate this file to the local server before responding to the request. This may incur some initial latency for some client requests, but once a file has been migrated all accesses are local to the appliance and have native performance. It is often the case that the current working set for a filesystem is much smaller than the total size, so once this working set has been migrated, regardless of the total native size on the source, there will be no perceived impact on performance.
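The interaction between background migration and on-demand client requests can be sketched as follows. This is a simplified model under stated assumptions, not the appliance's code: the Migrator class is hypothetical, in-memory dictionaries stand in for the source and local filesystems, and a thread pool stands in for the background process. It shows a client read migrating a file ahead of the background pass, and each file being copied exactly once even when both paths ask for it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Migrator:
    """Toy model: dicts stand in for the source share and the local share."""

    def __init__(self, source, threads=2):
        self.source = dict(source)  # name -> bytes on the old system
        self.local = {}             # name -> bytes already migrated
        self.threads = threads      # number of background worker threads
        self.copies = 0             # how many files were actually copied
        self._lock = threading.Lock()
        self._in_flight = {}        # name -> Event for copies in progress

    def migrate(self, name):
        """Copy one file unless it is already local or being copied."""
        with self._lock:
            if name in self.local:
                return
            done = self._in_flight.get(name)
            owner = done is None
            if owner:
                done = self._in_flight[name] = threading.Event()
        if not owner:
            done.wait()  # another caller is copying this file; wait for it
            return
        try:
            data = self.source[name]  # the slow copy happens outside the lock
            with self._lock:
                self.local[name] = data
                self.copies += 1
        finally:
            with self._lock:
                del self._in_flight[name]
            done.set()

    def read(self, name):
        """Client request: migrate first if needed, then serve locally."""
        self.migrate(name)
        return self.local[name]

    def run_background(self):
        """Background pass: walk every source file with a fixed thread pool."""
        with ThreadPoolExecutor(max_workers=self.threads) as pool:
            list(pool.map(self.migrate, sorted(self.source)))


m = Migrator({"a": b"1", "b": b"2", "c": b"3"}, threads=2)
early = m.read("b")   # a client touches "b" before the background pass reaches it
m.run_background()    # the remaining files are migrated; "b" is skipped
```

In this model, files in the working set are pulled across by the first read, so later reads never wait on the source.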

The downside to shadow migration is that it requires a commitment before the data has finished migrating, though this is the case with any interposition method. During the migration, portions of the data exist in two locations, which means that backups are more complicated, and snapshots may be incomplete and/or exist only on one host. Because of this, it is extremely important that any migration between two hosts first be tested thoroughly to make sure that identity management and access controls are set up correctly. This need not test the entire data migration, but it should be verified that files or directories that are not world readable are migrated correctly, ACLs (if any) are preserved, and identities are properly represented on the new system.

Shadow migration is implemented using on-disk data within the filesystem, so there is no external database and no data stored locally outside the storage pool. If a pool is failed over in a cluster, or both system disks fail and a new head node is required, all data necessary to continue shadow migration without interruption will be kept with the storage pool.

In a subsequent post, I’ll discuss some of the thorny implementation details we had to solve, as well as provide some screenshots of migration in progress. In the meantime, I suggest folks download the simulator and upgrade to the latest software to give it a try.

Posted on September 16, 2009 at 5:06 pm by eschrock · Permalink
In: Fishworks

13 Responses

  1. Written by Jason
    on September 16, 2009 at 8:40 pm

    Do this with FC and you have a lot of interesting possibilities… I will note that certain competitors sell Linux boxes that are mainly interposers for FC for a _very_ hefty markup (for storage ‘virtualization’ and migration). That alone could be a separate product (as well as part of an array).

  2. Written by Joerg M.
    on September 16, 2009 at 11:02 pm

    I wonder if Solaris Engineering could use this work for a CacheFS follow-on …. at the end this would be not much more than a perpetual shadow migration plus the LRU-stuff.

  3. Written by Anonymous (for now..)
    on September 17, 2009 at 8:15 am

    This release absolutely rocks.
    Live migration came just in time, as we have a large migration ahead.
    BTW: Did you also improve the gzip algorithm in this release? Seems much faster now…

  4. Written by Eric Schrock
    on September 17, 2009 at 9:21 am

    @Jason – The current migration only works with filesystems, but definitely on my list of things for a future release is migration of LUNs. While the basic idea is the same, the mechanism is quite different. We don’t have the same ability to store metadata with devices, so it will have to be baked into ZFS, as opposed to the generic VFS level.
    @Joerg – We do have some crazy ideas about possible future directions for the technology. We’ll see where it leads.
    @Anonymous – There were no specific changes to the gzip algorithm, but depending on what release you’re coming from, you may be noticing the effects of 6586537, which dedicates more threads to the task on larger systems.

  5. Written by Eli Kleinman
    on September 17, 2009 at 1:49 pm

    Will this also preserve the UFS extended ACL permission converting them to ZFS compatible permission (clients are solaris10 hosts)?

  6. Written by Eric Schrock
    on September 17, 2009 at 1:53 pm

    @Eli – No, it does not do any ACL conversion. It will preserve basic UNIX permissions, as well as NFSv4 ACLs. One strategy might be to mount the shares over NFSv4 and have the server do the conversion before it goes over the wire. I don’t know if the Solaris 10 server supports this.

  7. Written by Charles Soto
    on September 18, 2009 at 6:29 am

    This is starting to look a lot like an archiving filesystem (with the "old" server acting as a backing store), albeit only archiving for the duration of a migration. Very interesting…

  8. Written by Don MacAskill
    on September 19, 2009 at 6:40 pm

    This sounds like a dream come true for us, with one possible gotcha: When you say "If a request comes from a client for a file that has not yet been migrated, the appliance will automatically migrate this file to the local server before responding to the request." do you mean "whole file" ?
    I ask because our typical S7410 usage is with databases, where the files are hundreds of gigabytes apiece. So "some initial latency", in our case, is actually a lot.
    Also, can the shadow migration be throttled? We have issues with rsync’ing from one S7410 to another and it saturating the disks such that other clients get wrecked.

  9. Written by Eric Schrock
    on September 20, 2009 at 7:41 am

    @Don -
    Currently, files are migrated all at once. I am working on adding partial migration in the 2009.Q4 release for exactly the reasons you describe.
    The background migration can be controlled by specifying the number of threads devoted to the task. Any given file (synchronous or background) is always migrated at the maximum speed of a single thread. More aggressive throttling will most likely be done through IP QoS controls in a future software release.

  10. Written by John
    on September 23, 2009 at 5:06 pm

    Any chance of turning this around? It could make a nice archiving solution.
    I’m thinking along these lines:
    Specify the size of the front end FS (quota)
    Specify the back end FS
    Specify the migration rule (lru, etc)
    The main advantage is that backups of the full front end will be much faster since it is smaller and there are fewer files. The front end could also exist on higher end hardware like a 7410 since it is dealing with the ‘hot’ data while a 7210 could serve up the static content.
    Of course, the devil is in the details :-)

  11. Written by Alessandro Gervaso
    on September 30, 2009 at 10:54 am

    Hi Eric, is there any particular requirement for the source nfs server in order to have this working?
    I’ve tried several times with my linux nfs servers (where the data to migrate resides) but the process always fails.
    I’ve also created a post in the forums http://forums.sun.com/thread.jspa?threadID=5409602&tstart=0
    Great work btw, i’m also in the process of building a new storage server based on opensolaris/zfs.

  12. Written by Eric Schrock
    on September 30, 2009 at 11:03 am

    @Alessandro -
    Do you have a support case open? This is almost certainly because you have ‘.zfs/snapshot’ set to ‘visible’ at the project level, a bug which is fixed in the upcoming minor release. If you set it to ‘hidden’ it should work. If not, please work the issue through the support channels.

  13. Written by Alessandro Gervaso
    on September 30, 2009 at 11:11 am

    Thank you, that was it. I’ve set .zfs to "hidden" and it started to copy the data, great! Keep up with the good work on this innovative storage approach :-)
