Eric Schrock's Blog

Over the past several months, one of the new features I’ve been working on for the next release is a new CLI for our appliance. While the CLI is the epitome of a checkbox item to most users, as a completely different client-side consumer of web APIs it can have a staggering maintenance cost. Our upcoming release introduces a number of new concepts that required us to gut our web services and build them from the ground up – retrofitting the old CLI was simply not an option.

What we ended up building was a local node.js CLI that takes programmatically defined web service API definitions and dynamically constructs a structured environment for manipulating state through those APIs. Users log in with their Delphix credentials over SSH or the console and are presented with this CLI via custom PAM integration. Whenever I describe this to people, however, I get more than a few confused looks and questions:

  • Isn’t node.js for writing massively scalable cloud applications?
  • Does anyone care about a CLI?
  • Are you high?

Yes, node.js is so hot right now. Yes, a bunch of Joyeurs (the epicenter of node.js) are former Fishworks colleagues. But the path here, as with all engineering at Delphix, is ultimately data-driven and based on real problems, and this was simply the best solution to those problems. I hope to have more blog posts about the architecture in the future, but as I was writing up a README for our own developers, I thought the content would make a reasonable first post. What follows is an engineering FAQ. From here I hope to describe our schemas and some of the basic structure of the CLI, so stay tuned.

Why a local CLI?

The previous Delphix CLI was a client-side java application written in groovy. Running the CLI on the client both incurs a cost to users (they need to download and manage additional software) and makes it a nightmare to manage different versions across different servers. Nearly every appliance shipped today has an SSH interface; doing something different just increases customer confusion. The purported benefit (there is no native Windows SSH client) has proven insignificant in practice, and there are other, more scalable solutions to this problem (such as distributing a java SSH client).

Why Javascript?

We knew that we would need to manipulate a lot of dynamic state, and that the scope of the CLI would remain relatively small. A dynamic scripting language makes for a far more compelling environment for rapid development, at the cost of needing a more robust unit test framework to catch what would otherwise be compile-time errors in a strongly typed, statically compiled language. We explicitly chose javascript because our new GUI will be built in javascript; this both keeps the number of languages and environments used in production to a minimum and allows these clients to share code where applicable.

Why node.js?

We knew v8 was the best-of-breed javascript runtime, and we actually started with a custom v8 wrapper. As a single-threaded environment, this was pretty straightforward. But once we started considering background tasks, we knew we’d need to move to an asynchronous model. Between the cost of building infrastructure already provided by node (HTTP request handling, file manipulation, etc.) and the desire to support async activity, node.js was the clear choice of runtime.
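
To make the async motivation concrete, here is a minimal sketch (not actual Delphix CLI code; the host, port, and endpoint path are invented for illustration) of the kind of non-blocking request handling node gives us for free:

var http = require('http');

// Issue a GET against a hypothetical appliance web service endpoint
// without blocking the rest of the CLI.
function getObject(path, done) {
    http.get({ host: 'localhost', port: 80, path: path }, function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () { done(null, JSON.parse(body)); });
    }).on('error', done);
}

getObject('/resources/json/group', function (err, groups) {
    if (err) {
        console.error('request failed: ' + err.message);
        return;
    }
    console.log('fetched ' + groups.length + ' groups');
});

// While the request is outstanding, the event loop remains free to service
// other activity, such as polling the progress of background jobs.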

Why auto-generated?

Historically, the cost of maintaining the CLI at Delphix (and elsewhere) has been very high. CLI features lag behind the GUI, and developers face an additional burden to port their APIs to multiple clients. We wanted to build a CLI that would be almost completely generated from a shared schema. When developers change the schema in one place, we auto-generate backend infrastructure (java objects and constants), GUI data model bindings, and the complete CLI experience.
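
As a rough illustration of what “generated from a shared schema” means (the schema fields, endpoint, and helper below are invented for this sketch, not our actual format), the CLI can derive its commands at runtime from a declarative description of each API type rather than hand-coding them:

// Hypothetical schema for one API type.
var groupSchema = {
    name: 'Group',
    root: '/resources/json/group',   // illustrative endpoint
    properties: {
        name:        { type: 'string', required: true },
        description: { type: 'string' }
    },
    operations: [ 'list', 'select', 'create', 'update', 'delete' ]
};

// Build a command table from the schema; in a real implementation each
// operation would map onto the corresponding web service call against root.
function buildCommands(schema) {
    var commands = {};
    schema.operations.forEach(function (op) {
        commands[op] = function (args) {
            console.log(op + ' ' + schema.name + ' ' + JSON.stringify(args || {}));
        };
    });
    return commands;
}

var group = buildCommands(groupSchema);
group.list();                          // list Group {}
group.create({ name: 'analytics' });   // create Group {"name":"analytics"}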

Why a modal hierarchy?

The look and feel of the Delphix CLI is in many ways inspired by the Fishworks CLI. As engineers and users of many (bad) CLIs, our experience has led to the belief that a CLI with integrated help, tab completion, and a filesystem-like hierarchy promotes exploration and is more natural than a CLI with dozens of commands each with dozens of options. It also makes for a better representation of the actual web service APIs (and hence easier auto-generation), with user operations (list, select, update) that mirror the underlying CRUD API operations.
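
A minimal sketch of the modal idea (the names and helpers here are invented for illustration, not the actual CLI internals): the user descends a stack of contexts, much like changing directories, and the commands available in each context mirror the CRUD operations of the API beneath it.

var contextStack = [];

function current() { return contextStack[contextStack.length - 1]; }

function prompt() {
    var path = contextStack.map(function (c) {
        return c.selection ? c.name + ' ' + c.selection : c.name;
    }).join(' ');
    console.log('delphix ' + path + '> ');
}

// Entering a context is like cd'ing into a directory; "api" is whatever
// object wraps the web service calls for that part of the hierarchy.
function enter(name, api) {
    contextStack.push({ name: name, api: api, selection: null });
    prompt();
}

// "list" reads the collection, "select" narrows the context to one object,
// and "update" issues the corresponding update operation.
var commands = {
    list:   function ()      { return current().api.list(); },
    select: function (name)  { current().selection = name; prompt(); },
    update: function (props) { return current().api.update(current().selection, props); },
    back:   function ()      { contextStack.pop(); prompt(); }
};

// Example: descend into a hypothetical "group" context and select one entry.
enter('group', {
    list:   function () { console.log('GET /resources/json/group'); },
    update: function (name, props) { console.log('update ' + name); }
});
commands.list();
commands.select('analytics');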

In my previous post I outlined some of the challenges faced when building a data replication solution, how the first Delphix implementation missed the mark, and how we set out to build something better the second time around.

The first thing that became clear after starting on the new replication subsystem was that we needed a better NDMP implementation. A separate, binary-only daemon with poor error semantics that routinely left the system in an inconsistent state was not going to cut it. NDMP is a protocol built for a singular purpose: backing up files using a file-specific format (dump or tar) over arbitrary topologies (direct-attached tape, 3-way restore, etc). By being simultaneously so specific in its data semantics and so general in its control protocol, it gives us the worst of both worlds: baked-in concepts (such as file history, complete with inode numbers) that prevent us from adequately expressing Delphix concepts, and a limited control protocol (lacking multiple streams or resumable streams) with terrible error semantics. While we will ultimately replace NDMP for replication, we knew that we still needed it for backup, and that we didn’t have the time to replace both the implementation and the data protocol for the current release.

Illumos, the open source operating system our distribution is based on, provides an NDMP implementation, one that I had previously dealt with while at Fishworks (though Dave Pacheco was the one who did the actual NDMP integration). I spent some time looking at the implementation and came to the conclusion that it suffered from a number of fatal flaws:

  • Poor error semantics – The strategy was “log everything and worry about it later”. For an implementation shipped with a roll-your-own OS this was not a terrible strategy, but it was a deal breaker for an appliance implementation. We needed clear, concise failure modes that appeared integrated with our UI.
  • Embedded data semantics – The notion of tar as a backup format (or raw zfs send) was built very deeply into the architecture. We needed our own data protocol, but replacing the data operations without major surgery was out of the question. While raw ZFS send seems appealing, it still assumes ownership and control of the filesystem namespace, something that wouldn’t fly in the Delphix world.
  • Unused code – There was tons of dead code, ranging from protocol varieties that were unnecessary (NDMPv2) to swaths of device handling code that did nothing.
  • Standalone daemon – A standalone daemon makes it difficult to exchange data across the process boundary, and introduces new complex failure modes.

With this in mind I looked at the ndmp.org SDK implementation, and found it to suffer from the same pathologies (and to be a much worse implementation to boot). It was clear that the Solaris implementation was derived from the SDK, and that there was no mythical “great NDMP implementation” waiting to be found. I was going to have to suck it up and get back to my Solaris roots to eviscerate this beast.

The first thing I did was recast the daemon as a library, eliminating any code that dealt with daemonizing or running a door server to report statistics, as well as the existing Solaris commands that communicated with the server. This allowed me to add a set of client-provided callback vectors and configuration options to control state. With this library in place, we could use JNA to easily call into C code from our java management stack without having to worry about marshaling data to and from an external daemon.

The next step was to rip out all the data-handling functionality, instead creating a set of callback vectors in the library registration mechanism to start and stop backups. This left the actual implementation of the over-the-wire format up to the consumer. The sheer amount of code used to support tar and zfs send was staggering, and it had its tendrils all across the implementation. As I started to pull on the thread, more and more started to unravel. Data-specific operations would call into the “tape library management” code (which had very little to do with tape library management), which would then call back into common NDMP code, which would then do nothing.

With the data operations gone, I finally had to address the hard part: making the code usable. The old error semantics were terrible. I had to go through every log call and non-zero return value, analyze its purpose, and restructure it to use the consumer-provided vector so that we could log such messages natively in the Delphix stack. Along the way, this generic cleanup led me to rip out huge swaths of unused code, from buffer management to NDMPv2 support (v3 has been in common use for more than a decade). This was rather painful, but the result has been quite a usable product. While the old Delphix implementation would have reported “NDMP server error CRITICAL: consult server log for details” (of course, there was no way for the customer to get to the “server log”), we now get much more helpful messages like “NDMP client reported an error during data operation: out of space”.

The final piece of the puzzle was something that surprised me. By choosing NDMP as the replication protocol (again, a temporary choice), we needed a way to drive the 3-way restore operation from within the Delphix stack. This meant that we wanted to act as a DMA. As I looked at the unbelievably awful ‘ndmpcopy’ implementation shipped with the NDMP SDK, I noticed a lot of similarity between what we needed on the client and what we had on the server (processing requests was identical, even if the set of expected requests was quite different). Rather than build an entirely separate implementation, I converted libndmp such that it could act as either a server or a client. This allowed us to build an NDMP copy operation in Java, as well as simulate a remote DMA (an invaluable testing tool).

It took more than a month of solid hard work and several more months of cleanup here and there, but the result was worth it. The new implementation clocks in at just over 11,000 lines of code, while the original was a staggering 43,000 lines of code. Our implementation doesn’t include any actual data handling, so it’s perhaps an unfair comparison. But we also include the ability to act as a full-featured DMA client, something the illumos implementation lacks.

The results of this effort will be available on github as soon as we release the next Delphix version (within a few weeks). While interesting, it’s unlikely to be useful to the general masses, and certainly not something that we’ll try to push upstream. I encourage others looking for an open-source embedded NDMP implementation to fork and improve what we have in Delphix – it’s a very flexible NDMP implementation that can be adapted for a variety of non-traditional NDMP scenarios. But with no built-in data processing, and no standalone daemon implementation, it’s a long way from replacing what can be found in illumos. If someone were so inspired, they could build a daemon on top of the current library – one that provides support for tar, dump, ZFS, and whatever other formats are supported by the current illumos implementation. It would not be a small amount of work, but I am happy to lend advice (if not code) to anyone interested.

Next up will be a post whose working title is “Data Replication: Metadata + Data = Crazy Pain in My Ass”.

With our next Delphix release just around the corner, I wanted to spend some time discussing the engineering process behind one of the major new features: data replication between servers. The current Delphix version already has a replication solution, so how does this constitute a “new feature”? The reason is that it’s an entirely new system, the result of an endeavor to create a more reliable, maintainable, and extensible system. How we got here makes for an interesting tale of business analysis, architecture, and implementation.

Where did we come from?

Before we begin looking at the current implementation, we need to understand why we started with a blank sheet of paper when we already had a shipping solution. The short answer is that what we had was unusable: it was unreliable, undebuggable, and unmaintainable. And when you’re in charge of data consistency for disaster recovery, “unreliable” is not an acceptable state. While I had not written any of the replication infrastructure at Fishworks (my colleagues Adam Leventhal and Dave Pacheco deserve the credit for that), I had spent a lot of time in discussions with them, as well as thinking about how to build a distributed data architecture. So it seemed natural for me to take on this project at Delphix. As I started to unwind our current state, I found a series of decisions that, in hindsight, led to the untenable state we found ourselves in today.

  • Analysis of the business problem – At the core of the current replication system was the notion that its purpose was for disaster recovery. This is indeed a major use case of replication, but it’s not the only one (geographical distribution of data being another strong contender). While picking one major problem to tackle first is a reasonable approach to constrain scope, by not correctly identifying future opportunities we ended up with a solution that could only be used for active/passive disaster recovery.
  • Data protocol choice – There is another problem that is very similar to replication: offline backup/restore. Clearly, we want to leverage the same data format and serialization process, but do we want to use the same protocol? NDMP is the industry standard for backups, but it’s tailored to a very specific use case (files and filesystems). By choosing to use NDMP for replication, we sacrificed features (resumable operations, multiple streams), usability (poor error semantics), and maintainability (unnecessarily complicated operation).
  • Outsourcing of work – At the time this architecture was created, it was decided that NDMP was not part of the company’s core competency, and we should contract with a third party to provide the NDMP solution. I’m a firm believer that engineering work should never be outsourced unless it’s known ahead of time that the result will be thrown away. Otherwise, you’re inevitably saddled with a part of your product that you have limited ability to change, debug, and support. In our case, this was compounded by the fact that the deliverable was binary objects – we didn’t even have source available.
  • Architectural design – By having a separate NDMP daemon we were forced to have an arcane communication mechanism (local HTTP) that lost information with each transition, resulting in a non-trivial amount of application logic resting in a binary we didn’t control. This made it difficult to adapt to core changes in the underlying abstractions.
  • Algorithmic design – There was a very early decision made that replication would be done on a per-group basis (Delphix arranges databases into logical groups). This was divorced from the reality of the underlying ZFS data dependencies, resulting in numerous oddities such as being unable to replicate non-self-contained groups or cyclic dependencies between groups. This abstraction was baked so deeply into the original architecture that it was impossible to fix.
  • Implementation – The implementation itself was built to be “isolated” from any other code in the system. When one is replicating the core representation of system metadata, this results in an unmaintainable and brittle mess. We had a completely separate copy of our object model that had to be maintained and updated along with the core model, and changes elsewhere in the system (such as deleting objects while replication was ongoing) could lead to obscure errors. The most egregious problems led to unrecoverable state – the target and source could get out of sync such that the only resolution was a new full replication from scratch.
  • Test infrastructure – There was no unit test infrastructure, no automated functional test infrastructure, and no way to test the majority of functionality without manually setting up multi-machine replication or working with a remote DMA. As a result only the most basic functionality worked, and even then it was unreliable most of the time.

Ideals for a new system

Given this list of limitations, I (later joined by Matt) sat down with a fresh sheet of paper. The following were some of the core ideals we set forth as we built this new system:

  • Separation of mechanism from protocol – Whatever choices we make in terms of protocol and replication topologies, we want the core serialization infrastructure to be entirely divorced from the protocol used to transfer the data.
  • Support for arbitrary topologies – We should be able to replicate from a host to any number of other hosts and vice versa, as well as provision from replicated objects.
  • Robust test infrastructure – We should be able to run protocol-level tests, simulate failures, and perform full replication within a single-system unit test framework.
  • Integrated with core object model – There should be one place where object definitions are maintained, such that the replication system can’t get out of sync with the primary source code.
  • Resilient to failure – No matter what, the system must maintain consistent state in the face of failure. This includes both catastrophic system failure and ongoing changes to the system (i.e. objects being created and deleted). At any point, we must be able to resume replication from a previously known good state without user intervention.
  • Clear error messages – Failures, when they do occur, must present a clear indication of the nature of the problem and what actions must be taken by the user, if any, to fix the underlying problem.

At the same time, we were forced to limit the scope of the project so we could deliver something in an appropriate timeframe. We stuck with NDMP as a protocol despite its inherent problems, as we needed to fix our backup/restore implementation as well. And we kept the active/passive deployment model so that we did not require any significant changes to the GUI.

Next, I’ll discuss the first major piece of work: building a better NDMP implementation.

Yesterday, several of us from Delphix, Nexenta, Joyent, and elsewhere, convened before the OpenStorage summit as part of an illumos hackathon.  The idea was to get a bunch of illumos coders in a room, brainstorm a bunch of small project ideas, and then break off to go implement them over the course of the day.  That was the idea, at least – in reality we didn’t know what to expect or how it would turn out.  Suffice to say that the hackathon was an amazing success.  There were a lot of cool ideas, and a lot of great mentors in the room that could lead people through unfamiliar territory.

For my mini-project (suggested by ahl), I implemented MDB’s ::print functionality in DTrace via a new print() action. Today, we have the trace() action, but the result is somewhat less than useful when dealing with structs, as it degenerates into tracemem():

# dtrace -qn 'BEGIN{trace(`p0); exit(0)}'
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 00 00 00 00 00 00 00 00 60 02 c3 fb ff ff ff ff  ........`.......
        10: c8 c9 c6 fb ff ff ff ff 00 00 00 00 00 00 00 00  ................
        20: b0 ad 14 c6 00 ff ff ff 00 00 00 00 02 00 00 00  ................
        ...

The results aren’t pretty, and we end up throwing away all that useful proc_t type information. With a few tweaks to dtrace, and some cribbing from mdb_print.c, we can do much better:

# dtrace -qn 'BEGIN{print(`p0); exit(0)}'
proc_t {
    struct vnode *p_exec = 0
    struct as *p_as = 0xfffffffffbc30260
    struct plock *p_lockp = 0xfffffffffbc6c9c8
    kmutex_t p_crlock = {
        void *[1] _opaque = [ 0 ]
    }
    struct cred *p_cred = 0xffffff00c614adb0
    int p_swapcnt = 0
    char p_stat = '02'
    ....

Much better! Now, how did we get there from here? The answer was an interesting journey through libdtrace, the kernel dtrace implementation, CTF, and the horrors of bitfields.

To action or not to action?

The first question I set out to answer is what the user-visible interface should be. It seemed clear that this should be an operation on the same level as trace(), allowing arbitrary D expressions, but simply preserving the type of the result and pretty-printing it later. After briefly considering printt() (for “print type”), I decided upon just print(), since this seemed like the more natural name. My first inclination was to create a new DTRACEACT_PRINT, but after some discussion with Adam, we decided this was extraneous – the behavior was identical to DTRACEACT_DIFEXPR (the internal name for trace), but just with type information.

Through the looking glass with types and formats

The real issue is that what we compile (dtrace statements) and what we consume (dtrace epids and records) are two very different things, and never the twain shall meet. At the time we go to generate the DIFEXPR statement in dt_cc.c, we have the CTF data in hand. We don’t want to change the DIF we generate, but simply do post-processing on the other side, so we just need some way to get back to that type information in dt_consume_cpu(). We can’t simply hang it off our dtrace statement, as that would break anonymous tracing (and violate the rest of the DTrace architecture to boot).

Thankfully, this problem had already been solved for printf() (and related actions) because we need to preserve the original format string for the exact same reason. To do this, we take the action-specific integer argument, and use it to point into the DOF string table, where we stash the original format string. I simply had to hijack dtrace_dof_create() and have it do the same thing for the type information, right?

If only it were so simple. There were two complications. First, there is a lot of code that explicitly treats these as printf strings and parses them into internal argv-style representations; pretending our types were just format strings would cause all kinds of problems in this code, so I had to modify libdtrace to treat this more explicitly as raw ‘string data’ that is (optionally) used with the DIFEXPR action. Second, even with that in place, the formats I was sending down were not making it back out of the kernel. Because the argument is action-specific, the kernel needed to be modified to recognize this new argument in dtrace_ecb_action_add. With that change in place, I was able to get the format string back in userland when consuming the CPU buffers.

Bitfields, or why the D compiler cost me an hour of my life

With the trace data and type string in hand, I then proceeded to copy the mdb ::print code, first from apptrace (which turned out to be complete garbage) and then fixing it up bit by bit. Finally, after tweaking the code for an hour or two, I had output that looked pretty much identical to ::print. But when I fed it a klwp_t structure, I found that the user_desc_t structure bitfields weren’t being printed correctly:

# dtrace -n 'BEGIN{print(*((user_desc_t*)0xffffff00cb0a4d90)); exit(0)}'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      1                           :BEGIN user_desc_t {
    unsigned long usd_lolimit = 0xcff3000000ffff
    unsigned long usd_lobase = 0xcff3000000
    unsigned long usd_midbase = 0xcff300
    unsigned long usd_type = 0xcff3
    unsigned long usd_dpl :64 = 0xcff3
    unsigned long usd_p :64 = 0xcff3
    unsigned long usd_hilimit = 0xcf
    unsigned long usd_avl :64 = 0xcf
    unsigned long usd_long :64 = 0xcf
    unsigned long usd_def32 :64 = 0xcf
    unsigned long usd_gran :64 = 0xcf
    unsigned long usd_hibase = 0
}

I spent an hour trying to debug this, only to find that the CTF IDs weren’t matching what I expected from the underlying object. I finally tracked it down to the fact that the D compiler, by virtue of processing the /usr/lib/dtrace files, pulls in its own version of klwp_t from the system header files. But it botches the bitfields, leaving the user with subtly incorrect data. Switching the type to be genunix`user_desc_t fixed the problem.

What’s next

Given the usefulness of this feature, the next steps are to clean up the code, get it reviewed, and push to the illumos gate. It should hopefully be finding its way to an illumos distribution near you soon. Here’s a final print() invocation to leave you with:

# dtrace -n 'zio_done:entry{print(*args[0]); exit(0)}'
dtrace: description 'zio_done:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  42594                   zio_done:entry zio_t {
    zbookmark_t io_bookmark = {
        uint64_t zb_objset = 0
        uint64_t zb_object = 0
        int64_t zb_level = 0
        uint64_t zb_blkid = 0
    }
    zio_prop_t io_prop = {
        enum zio_checksum zp_checksum = ZIO_CHECKSUM_INHERIT
        enum zio_compress zp_compress = ZIO_COMPRESS_INHERIT
        dmu_object_type_t zp_type = DMU_OT_NONE
        uint8_t zp_level = 0
        uint8_t zp_copies = 0
        uint8_t zp_dedup = 0
        uint8_t zp_dedup_verify = 0
    }
    zio_type_t io_type = ZIO_TYPE_NULL
    enum zio_child io_child_type = ZIO_CHILD_VDEV
    int io_cmd = 0
    uint8_t io_priority = 0
    uint8_t io_reexecute = 0
    uint8_t [2] io_state = [ 0x1, 0 ]
    uint64_t io_txg = 0
    spa_t *io_spa = 0xffffff00c6806580
    blkptr_t *io_bp = 0
    blkptr_t *io_bp_override = 0
    blkptr_t io_bp_copy = {
        dva_t [3] blk_dva = [ 
            dva_t {
                uint64_t [2] dva_word = [ 0, 0 ]
            }, 
    ...

With our first illumos-based distribution (2.6) out the door, we’ve posted the illumos-derived sources to github:

https://github.com/delphix/delphix-os-2.6

This repository contains the following types of changes from the illumos gate:

  • Changes that are complete and generally useful to the illumos community.  We have been (and will continue to be) proactive about pushing these changes to the illumos trunk ourselves.  We missed a few this time around, so we’ll be going back through to pick up anything we missed.
  • Changes that are sufficient to meet the needs of our product, but are not complete or generally useful for the larger community.  Our hope is that by pushing these changes to github, others can pick up such pieces of work and integrate them in a form that is acceptable to the illumos community at large.
  • Changes that represent distro-specific changes unique to our product.  It is unlikely that these will be of interest to anyone except the morbidly curious.

We will post updates with each release of the software.  This allows us to make sure the code is fully baked and tested, while still allowing us to proactively push complete pieces of work more frequently.

If you have questions about any particular change, feel free to email the author for more information.  You can also find us on the illumos developer mailing list and the #illumos IRC channel on freenode.
