Eric Schrock's Blog

Month: February 2012

In my previous post I outlined some of the challenges faced when building a data replication solution, how the first Delphix implementation missed the mark, and how we set out to build something better the second time around.

The first thing that became clear after starting on the new replication subsystem was that we needed a better NDMP implementation. A binary-only separate daemon with poor error semantics that routinely left the system in an inconsistent state was not going to cut it. NDMP is a protocol built for a singular purpose: backing up files using a file-specific format (dump or tar) over arbitrary topologies (direct attached tape, 3-way restore, etc). By being both simultaneously so specific in the data semantics but so general in the control protocol, we end up with the worst of both worlds: baked-in concepts (such as file history, complete with inode numbers) that prevent us from adequately expressing Delphix concepts, and a limited control protocol (lacking multiple streams or resumable streams) with terrible error semantics. While we will ultimately replace NDMP for replication, we knew that we still needed it for backup, and that we didn’t have the time to replace both the implementation and the data protocol for the current release.

Illumos, the open source operating system our distribution is based on, provides an NDMP implementation, one that I had previously dealt with while at Fishworks (though Dave Pacheco was the one who did the actual NDMP integration). I spent some time looking at the implementation and came to the conclusion that it suffered from a number of fatal flaws:

  • Poor error semantics – The strategy was “log everything and worry about it later”. For an implementation shipped with a roll-your-own OS this was not a terrible strategy, but it was a deal breaker for an appliance implementation. We needed clear, concise failure modes that appeared integrated with our UI.
  • Embedded data semantics – The notion of tar as a backup format (or raw zfs send) was built very deeply into the architecture. We needed our own data protocol, but replacing the data operations without major surgery was out of the question. While raw ZFS send seems appealing, it still assumes ownership and control of the filesystem namespace, something that wouldn’t fly in the Delphix world.
  • Unused code – There was tons of dead code, ranging from protocol varieties that were unnecessary (NDMPv2) to swaths of device handling code that did nothing.
  • Standalone daemon – A standalone daemon makes it difficult to exchange data across the process boundary, and introduces new complex failure modes.

With this in mind I looked at the ndmp.org SDK implementation, and found it to suffer from the same pathologies (and a much worse implementation to boot). It was clear that the Solaris implementation was derived from the SDK, and that there was no mythical “great NDMP implementation” waiting to be found. I was going to have to suck it up and get back to my Solaris roots to eviscerate this beast.

The first thing I did was recast the daemon as a library, eliminating any code that dealt with daemonizing, running a door server to report statistics, and supporting the existing Solaris commands that communicated with the server. This allowed me to add a client-provided callback vector and a set of configuration options to control state. With this library in place, we could use JNA to easily call into C code from our Java management stack without having to worry about marshaling data to and from an external daemon.

The next step was to rip out all the data-handling functionality, instead creating a set of callback vectors in the library registration mechanism to start and stop backup. This left the actual implementation of the over-the-wire format up to the consumer. The sheer amount of code used to support tar and zfs send was staggering, and it had its tendrils all across the implementation. As I started to pull on the thread, more and more started to unravel. Data-specific operations would call into the “tape library management” code (which had very little to do with tape library management) that would then call back into common NDMP code, that would then do nothing.
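To make the shape of that registration mechanism concrete, here is a minimal sketch in C. It is an illustration under assumptions, not the library's actual API: every name in it (ndmp_ops_t, ndmp_lib_create, the demo_* consumer, the "delphix" type string) is hypothetical. The library keeps the protocol state and delegates the question of what goes over the wire to the consumer's start and stop hooks.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical callback vector supplied by the consumer at registration. */
typedef struct ndmp_ops {
	int (*op_backup_start)(void *arg, const char *type);
	int (*op_backup_stop)(void *arg);
} ndmp_ops_t;

/* Hypothetical library handle: remembers the vector and its argument. */
typedef struct ndmp_lib {
	ndmp_ops_t nl_ops;
	void *nl_arg;
	int nl_active;	/* nonzero while a backup is in progress */
} ndmp_lib_t;

/* Register a consumer; returns NULL if the vector is incomplete. */
ndmp_lib_t *ndmp_lib_create(const ndmp_ops_t *ops, void *arg)
{
	ndmp_lib_t *lib;

	if (ops == NULL || ops->op_backup_start == NULL ||
	    ops->op_backup_stop == NULL)
		return (NULL);
	if ((lib = calloc(1, sizeof (*lib))) == NULL)
		return (NULL);
	lib->nl_ops = *ops;
	lib->nl_arg = arg;
	return (lib);
}

void ndmp_lib_destroy(ndmp_lib_t *lib)
{
	free(lib);
}

/* The library owns the protocol state; the consumer owns the data format. */
int ndmp_lib_handle_start_backup(ndmp_lib_t *lib, const char *type)
{
	int err;

	if (lib->nl_active)
		return (-1);	/* only one data operation at a time */
	if ((err = lib->nl_ops.op_backup_start(lib->nl_arg, type)) != 0)
		return (err);
	lib->nl_active = 1;
	return (0);
}

int ndmp_lib_handle_stop_backup(ndmp_lib_t *lib)
{
	if (!lib->nl_active)
		return (-1);
	lib->nl_active = 0;
	return (lib->nl_ops.op_backup_stop(lib->nl_arg));
}

/* Example consumer (hypothetical): records what it was asked to do. */
typedef struct demo_consumer {
	char dc_type[32];
	int dc_starts;
	int dc_stops;
} demo_consumer_t;

int demo_backup_start(void *arg, const char *type)
{
	demo_consumer_t *dc = arg;

	(void) snprintf(dc->dc_type, sizeof (dc->dc_type), "%s", type);
	dc->dc_starts++;
	return (0);
}

int demo_backup_stop(void *arg)
{
	demo_consumer_t *dc = arg;

	dc->dc_stops++;
	return (0);
}
```

A consumer in another language would fill in the same vector through its foreign-function layer; the sketch leaves that binding out.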

With the data operations gone, I then had to finally address the hard part: making the code usable. The old error semantics were terrible. I had to go through every log call and non-zero return value, analyze its purpose, and restructure it to use the consumer-provided vector so that we could log such messages natively in the Delphix stack. While doing generic code cleanup, this led me to rip out huge swaths of unused code, from buffer management to NDMPv2 support (v3 has been in common use for more than a decade). This was rather painful, but the result has been quite a usable product. While the old Delphix implementation would have reported “NDMP server error CRITICAL: consult server log for details” (of course, there was no way for the customer to get to the “server log”), we would now get much more helpful messages like “NDMP client reported an error during data operation: out of space”.
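The restructuring described here, replacing log calls and bare non-zero returns with reports through a consumer-provided vector, might look roughly like the following C sketch. The error classes, the ndmp_fail helper, and the demo_* consumer are assumptions made for illustration; only the message text is taken from the example above.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error classes; the library's real codes are not shown here. */
typedef enum ndmp_err {
	NDMP_ERR_NONE = 0,
	NDMP_ERR_PROTOCOL,
	NDMP_ERR_CLIENT_DATA
} ndmp_err_t;

/* Consumer-provided reporting hook (hypothetical signature). */
typedef void (*ndmp_report_f)(void *arg, ndmp_err_t err, const char *msg);

typedef struct ndmp_reporter {
	ndmp_report_f nr_func;
	void *nr_arg;
} ndmp_reporter_t;

/*
 * Format a message and hand it to the consumer instead of writing it to a
 * private log. Always returns -1 so that callers can write
 * "return (ndmp_fail(...))" at the point of failure.
 */
int ndmp_fail(const ndmp_reporter_t *nr, ndmp_err_t err, const char *fmt, ...)
{
	char buf[256];
	va_list ap;

	va_start(ap, fmt);
	(void) vsnprintf(buf, sizeof (buf), fmt, ap);
	va_end(ap);

	if (nr != NULL && nr->nr_func != NULL)
		nr->nr_func(nr->nr_arg, err, buf);
	return (-1);
}

/* Example consumer (hypothetical): keeps only the most recent error. */
typedef struct demo_sink {
	ndmp_err_t ds_err;
	char ds_msg[256];
} demo_sink_t;

void demo_report(void *arg, ndmp_err_t err, const char *msg)
{
	demo_sink_t *ds = arg;

	ds->ds_err = err;
	(void) snprintf(ds->ds_msg, sizeof (ds->ds_msg), "%s", msg);
}

/* A data-path failure reported with its cause attached. */
int demo_data_write_failed(const ndmp_reporter_t *nr, const char *reason)
{
	return (ndmp_fail(nr, NDMP_ERR_CLIENT_DATA,
	    "NDMP client reported an error during data operation: %s",
	    reason));
}
```

The design point is that the library never decides where a message ends up; the consumer receives both a class it can act on and text it can show to the user.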

The final piece of the puzzle was something that surprised me. By choosing NDMP as the replication protocol (again, a temporary choice), we needed a way to drive the 3-way restore operation from within the Delphix stack. This meant that we wanted to act as a DMA. As I looked at the unbelievably awful ‘ndmpcopy’ implementation shipped with the NDMP SDK, I noticed a lot of similarity between what we needed on the client and what we had on the server (processing requests was identical, even if the set of expected requests was quite different). Rather than build an entirely separate implementation, I converted libndmp such that it could act as a server or a client. This allowed us to build an NDMP copy operation in Java, as well as simulate a remote DMA (an invaluable testing tool).
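One way to picture "identical request processing, different set of expected requests" is a single dispatch routine driven by a per-role table. The C sketch below is an assumption about structure, not a description of libndmp: the type names, handler names, and numeric message codes are all hypothetical (the codes are not the NDMP wire values), although the direction of each message follows the protocol, with the DMA opening connections and starting backups and the server notifying the DMA when a data operation halts.

```c
#include <stddef.h>

typedef enum ndmp_role { NDMP_ROLE_SERVER, NDMP_ROLE_CLIENT } ndmp_role_t;

/* Hypothetical message codes for illustration only. */
enum { MSG_CONNECT_OPEN = 1, MSG_DATA_START_BACKUP, MSG_NOTIFY_DATA_HALTED };

typedef int (*msg_handler_f)(void *arg);

typedef struct msg_entry {
	int me_code;
	msg_handler_f me_handler;
} msg_entry_t;

typedef struct session {
	ndmp_role_t s_role;
	const msg_entry_t *s_table;
	size_t s_count;
	void *s_arg;
} session_t;

/* Demo state so the handlers have something observable to do. */
typedef struct demo_counts {
	int dc_opens;
	int dc_backups;
	int dc_halts;
} demo_counts_t;

int on_connect_open(void *arg)
{
	((demo_counts_t *)arg)->dc_opens++;
	return (0);
}

int on_start_backup(void *arg)
{
	((demo_counts_t *)arg)->dc_backups++;
	return (0);
}

int on_data_halted(void *arg)
{
	((demo_counts_t *)arg)->dc_halts++;
	return (0);
}

/* Requests a server expects to receive from a DMA. */
const msg_entry_t server_table[] = {
	{ MSG_CONNECT_OPEN, on_connect_open },
	{ MSG_DATA_START_BACKUP, on_start_backup },
};

/* Requests a client acting as DMA expects to receive from a server. */
const msg_entry_t client_table[] = {
	{ MSG_NOTIFY_DATA_HALTED, on_data_halted },
};

void session_init(session_t *s, ndmp_role_t role, void *arg)
{
	s->s_role = role;
	s->s_arg = arg;
	if (role == NDMP_ROLE_SERVER) {
		s->s_table = server_table;
		s->s_count = sizeof (server_table) / sizeof (server_table[0]);
	} else {
		s->s_table = client_table;
		s->s_count = sizeof (client_table) / sizeof (client_table[0]);
	}
}

/* Shared by both roles; returns -1 for a request this role does not expect. */
int session_dispatch(const session_t *s, int code)
{
	size_t i;

	for (i = 0; i < s->s_count; i++) {
		if (s->s_table[i].me_code == code)
			return (s->s_table[i].me_handler(s->s_arg));
	}
	return (-1);
}
```

With this layout, a simulated remote DMA for testing is just another session created in the client role within the same process.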

It took more than a month of solid hard work and several more months of cleanup here and there, but the result was worth it. The new implementation clocks in at just over 11,000 lines of code, while the original was a staggering 43,000 lines of code. Our implementation doesn’t include any actual data handling, so it’s perhaps an unfair comparison. But we also include the ability to act as a full-featured DMA client, something the illumos implementation lacks.

The results of this effort will be available on github as soon as we release the next Delphix version (within a few weeks). While interesting, it’s unlikely to be useful to the general masses, and certainly not something that we’ll try to push upstream. I encourage others looking for an open-source embedded NDMP implementation to fork and improve what we have in Delphix – it’s a very flexible NDMP implementation that can be adapted for a variety of non-traditional NDMP scenarios. But with no built-in data processing, and no standalone daemon implementation, it’s a long way from replacing what can be found in illumos. Anyone so inspired could build a daemon on top of the current library – one that provides support for tar, dump, ZFS, and whatever other formats are supported by the current illumos implementation. It would not be a small amount of work, but I am happy to lend advice (if not code) to anyone interested.

Next up will be a post whose working title is “Data Replication: Metadata + Data = Crazy Pain in My Ass”.

With our next Delphix release just around the corner, I wanted to spend some time discussing the engineering process behind one of the major new features: data replication between servers. The current Delphix version already has a replication solution, so how does this constitute a “new feature”? The reason is that it’s an entirely new system, the result of an endeavor to create a more reliable, maintainable, and extensible system. How we got here makes for an interesting tale of business analysis, architecture, and implementation.

Where did we come from?

Before we begin looking at the current implementation, we need to understand why we started with a blank sheet of paper when we already had a shipping solution. The short answer is that what we had was unusable: it was unreliable, undebuggable, and unmaintainable. And when you’re in charge of data consistency for disaster recovery, “unreliable” is not an acceptable state. While I had not written any of the replication infrastructure at Fishworks (my colleagues Adam Leventhal and Dave Pacheco deserve the credit for that), I had spent a lot of time in discussions with them, as well as thinking about how to build a distributed data architecture at Fishworks. So it seemed natural for me to take on this project at Delphix. As I started to unwind our current state, I found a series of decisions that, in hindsight, led to the untenable state we found ourselves in.

  • Analysis of the business problem – At the core of the current replication system was the notion that its purpose was for disaster recovery. This is indeed a major use case of replication, but it’s not the only one (geographical distribution of data being another strong contender). While picking one major problem to tackle first is a reasonable approach to constrain scope, by not correctly identifying future opportunities we ended up with a solution that could only be used for active/passive disaster recovery.
  • Data protocol choice – There is another problem that is very similar to replication: offline backup/restore. Clearly, we want to leverage the same data format and serialization process, but do we want to use the same protocol? NDMP is the industry standard for backups, but it’s tailored to a very specific use case (files and filesystems). By choosing to use NDMP for replication, we sacrificed features (resumable operations, multiple streams), usability (poor error semantics), and maintainability (unnecessarily complicated operation).
  • Outsourcing of work – At the time this architecture was created, it was decided that NDMP was not part of the company’s core competency, and we should contract with a third party to provide the NDMP solution. I’m a firm believer that engineering work should never be outsourced unless it’s known ahead of time that the result will be thrown away. Otherwise, you’re inevitably saddled with a part of your product that you have limited ability to change, debug, and support. In our case, this was compounded by the fact that the deliverable was binary objects – we didn’t even have source available.
  • Architectural design – By having a separate NDMP daemon we were forced to have an arcane communication mechanism (local HTTP) that lost information with each transition, resulting in a non-trivial amount of application logic resting in a binary we didn’t control. This made it difficult to adapt to core changes in the underlying abstractions.
  • Algorithmic design – There was a very early decision made that replication would be done on a per-group basis (Delphix arranges databases into logical groups). This was divorced from the reality of the underlying ZFS data dependencies, resulting in numerous oddities such as being unable to replicate non self-contained groups or cyclic dependencies between groups. This abstraction was deeply baked into the architecture such that it was impossible to fix in the original architecture.
  • Implementation – The implementation itself was built to be “isolated” from any other code in the system. When one is replicating the core representation of system metadata, this results in an unmaintainable and brittle mess. We had a completely separate copy of our object model that had to be maintained and updated along with the core model, and changes elsewhere in the system (such as deleting objects while replication was ongoing) could lead to obscure errors. The most egregious problems led to unrecoverable state – the target and source could get out of sync such that the only resolution was a new full replication from scratch.
  • Test infrastructure – There was no unit test infrastructure, no automated functional test infrastructure, and no way to test the majority of functionality without manually setting up multi-machine replication or working with a remote DMA. As a result only the most basic functionality worked, and even then it was unreliable most of the time.

Ideals for a new system

Given this list of limitations, I (later joined by Matt) sat down with a fresh sheet of paper. The following were some of the core ideals we set forth as we built this new system:

  • Separation of mechanism from protocol – Whatever choices we make in terms of protocol and replication topologies, we want the core serialization infrastructure to be entirely divorced from the protocol used to transfer the data.
  • Support for arbitrary topologies – We should be able to replicate from a host to any number of other hosts and vice versa, as well as provision from replicated objects.
  • Robust test infrastructure – We should be able to run protocol-level tests, simulate failures, and perform full replication within a single-system unit test framework.
  • Integrated with core object model – There should be one place where object definitions are maintained, such that the replication system can’t get out of sync with the primary source code.
  • Resilient to failure – No matter what, the system must maintain consistent state in the face of failure. This includes both catastrophic system failure, as well as ongoing changes to the system (i.e. objects being created and deleted). At any point, we must be able to resume replication from a previously known good state without user intervention.
  • Clear error messages – Failures, when they do occur, must present a clear indication of the nature of the problem and what actions must be taken by the user, if any, to fix the underlying problem.
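The first and third ideals fit together naturally, and a small C sketch can show why. If the serializer only ever sees an abstract transport, then an in-memory transport with failure injection is enough to exercise it inside a single-system unit test. Everything below is hypothetical, including the transport interface, the two-field object record, and its big-endian layout; the actual Delphix object model and wire format are not described in this post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transport interface: NDMP, a socket, or a test buffer. */
typedef struct transport {
	int (*t_write)(void *arg, const void *buf, size_t len);
	void *t_arg;
} transport_t;

/* Hypothetical object record, standing in for real metadata. */
typedef struct object_rec {
	uint32_t or_id;
	uint32_t or_version;
} object_rec_t;

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Serialize one record in a fixed layout, knowing nothing about the protocol. */
int serialize_object(const transport_t *t, const object_rec_t *rec)
{
	uint8_t buf[8];

	put_be32(buf, rec->or_id);
	put_be32(buf + 4, rec->or_version);
	return (t->t_write(t->t_arg, buf, sizeof (buf)));
}

/* In-memory transport for single-system tests, with failure injection. */
typedef struct mem_transport {
	uint8_t mt_buf[64];
	size_t mt_len;
	int mt_fail;	/* when nonzero, every write fails */
} mem_transport_t;

int mem_write(void *arg, const void *buf, size_t len)
{
	mem_transport_t *mt = arg;

	if (mt->mt_fail || len > sizeof (mt->mt_buf) - mt->mt_len)
		return (-1);
	(void) memcpy(mt->mt_buf + mt->mt_len, buf, len);
	mt->mt_len += len;
	return (0);
}
```

Swapping NDMP for another protocol later would then mean writing a new transport, leaving the serialization code untouched.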

At the same time, we were forced to limit the scope of the project so we could deliver something in an appropriate timeframe. We stuck with NDMP as a protocol despite its inherent problems, as we needed to fix our backup/restore implementation as well. And we kept the active/passive deployment model so that we did not require any significant changes to the GUI.

Next, I’ll discuss the first major piece of work: building a better NDMP implementation.
