Inside Manta: Distributing the Unix shell

Today, Joyent has launched Manta: our internet-facing object store with compute as a first-class operation. This is the culmination of over a year’s effort on the part of the whole engineering team, and I’m personally really excited to be able to share this with the world. There’s plenty of documentation on how to use Manta, so in this post I want to talk about the guts of my favorite part: the compute engine.

The super-quick crash course on Manta: it’s an object store, which means you can use HTTP PUT/GET/DELETE to store arbitrary byte streams called objects. This is similar to other HTTP-based object stores, with a few notable additions: Unix-like directory semantics, strong read-after-write consistency, and (most significantly) a Unixy compute engine.

Computation in Manta

There’s a terrific Getting Started tutorial already, so I’m going to jump straight to a non-trivial job and explain how it runs under the hood.

At the most basic level, Manta’s compute engine runs arbitrary shell commands on objects in the object store. Here’s my example job:

$ mfind -t o /dap/stor/snpp | mjob create -qom 'grep poochy'

This job enumerates all the objects under /dap/stor/snpp (using the mfind client tool, analogous to Unix find(1)), then creates a job that runs “grep poochy” on each one, waits for the job to finish, and prints the outputs.

I can run this one-liner from my laptop to search thousands of objects for the word “poochy”. Instead of downloading each file from the object store, running “grep” on it, and saving the result back, Manta just runs “grep poochy” inside the object store. The data never gets copied.

Notice that our Manta invocation of “grep” didn’t specify a filename at all. This works because Manta redirects stdin from an object, and grep reads input from stdin if you don’t give it a filename. (There’s no funny business about tacking the filename on to the end of the shell script, as though you’d run ‘grep poochy FILENAME’, though you can do that if you want using environment variables.) This model extends naturally to cover “reduce” operations, where you may want to aggregate over enormous data streams that don’t fit on a single system’s disks.
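You can see the stdin convention with any local grep. This is a minimal sketch (the file name episode.txt is just an illustrative example):

```shell
# grep needs no filename when its input is redirected -- the same
# convention Manta relies on when it feeds each task its object.
printf 'itchy\npoochy\nscratchy\n' > episode.txt

grep poochy < episode.txt    # prints: poochy
```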

One command, many tasks

What does it actually mean to run grep on 100 objects? Do you get one output or 100? What if some of these commands succeed, but others fail?

In keeping with the Unix tradition, Manta aims for simple abstractions that can be combined to support more sophisticated use cases. In the example above, Manta does the obvious thing: if the directory has 100 objects, it runs 100 separate invocations of “grep”, each producing its own output object, and each with its own success or failure status. Unlike with a single shell command, a one-phase map job can have any number of inputs, outputs, and errors. You can build more sophisticated pipelines that combine output from multiple phases, but that’s beyond the scope of this post.1
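A local sketch makes those semantics concrete. This loop (file names invented for illustration) does sequentially what Manta does in parallel: one grep invocation per object, each with its own output file and its own exit status:

```shell
# Two "objects", one matching and one not.
mkdir -p objects outputs
printf 'poochy rules\n'  > objects/a.txt
printf 'no match here\n' > objects/b.txt

# One invocation per object; each gets its own output and status.
for obj in objects/*.txt; do
    name=$(basename "$obj")
    status=0
    grep poochy < "$obj" > "outputs/$name" || status=$?
    echo "$name: exit status $status"   # a.txt: 0 (match), b.txt: 1 (no match)
done
```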

How does it work?

Manta’s compute engine hinges on three SmartOS (illumos) technologies:

  • Zones: OS-based virtualization, which allows us to run thousands of these user tasks concurrently in lightweight, strongly isolated environments. Each user’s program runs as root in its own zone, and can do whatever it wants there, but processes in the zone have no visibility into other zones or the rest of the system.
  • ZFS: ZFS’s copy-on-write semantics and built-in snapshots allow us to completely reset zones between users. Your program can scribble all over the filesystem, and when it’s done we roll it back to its pristine state for the next user. (We also leverage ZFS clones for the filesystem image: we have one image with tens of gigabytes of software installed, and each zone’s filesystem is a clone of that single image, for almost zero disk space overhead per zone.)
  • hyprlofs: a filesystem we developed specifically for Manta, which allows us to mount read-only copies of files from one filesystem into another. The difference between hyprlofs and traditional lofs is that hyprlofs supports commands to map and unmap files on demand, and those files can be backed by arbitrary other filesystems. More on this below.

In a nutshell: each copy of a Manta object is stored as a flat file in ZFS. On the same physical servers where these files are stored, there are a bunch of compute zones for running jobs.

As you submit the names of input objects, Manta locates the storage servers containing a copy of each object and dispatches tasks to one server for each object. That server uses hyprlofs to map a read-only copy of the object into one of the compute zones. Then it runs the user’s program in that zone and uses zfs rollback to reset the zone for the next tenant. (There’s obviously a lot more to making this system scale and survive component failures, but that’s the basic idea.)
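A loose local analogy of that lifecycle, with plain directories standing in for zones and hard links standing in for hyprlofs mappings (no ZFS or zones involved; this is purely illustrative):

```shell
mkdir -p store zone
printf 'poochy\n' > store/obj1               # the stored copy of the object

ln store/obj1 zone/input                     # "map" the object into the zone
( cd zone && grep poochy < input > output )  # run the user's program there
cat zone/output                              # the task's output: poochy

rm -rf zone && mkdir zone                    # "rollback": reset for the next tenant
```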

What’s next?

In this post I’ve explained the basics of how Manta’s compute engine works under the hood, but this is a very simple example. Manta supports more sophisticated distributed computation, including reduce phases (with one or more reducers) and multi-phase jobs (e.g., map-map-map).

Because Manta uses the Unix shell as the primitive abstraction for computation, it’s often trivial to turn common shell invocations that you’d normally run sequentially on a few files at a time into Manta jobs that run in parallel on thousands of objects. For tasks beyond the complexity of a shell one-liner, you can always execute a program in some other language; that, after all, is what the shell does best. We’ve used it for applications ranging from converting image files to generating aggregated reports on activity logs. (In fact, we use the jobs facility internally to implement metering, garbage collection of unreferenced objects, and our own reports.) My colleague Bill has already used it to analyze transit times on SF Muni. Be on the lookout for a rewrite of kartlytics based on Manta.
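For instance, a sequential loop like the one below translates directly into a job. The local half runs anywhere; the Manta half is shown as a comment, since it assumes a Manta account and an invented /dap/stor/logs path:

```shell
mkdir -p logs
printf 'ERROR disk\nok\n' > logs/a.log
printf 'ok\nok\n'         > logs/b.log

# Sequential, local idiom: count ERROR lines per file.
for f in logs/*.log; do
    grep -c ERROR < "$f" || true    # prints 1, then 0
done

# The parallel Manta equivalent, one task per object:
#   mfind -t o /dap/stor/logs | mjob create -qom 'grep -c ERROR'
```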

We’re really excited about Manta, and we’re looking forward to seeing what people do with it!

1 Manta’s “map” is like the traditional functional programming primitive that performs a transformation on each of its inputs. This is similar to, but not the same as, the Map in MapReduce environments, which specifically operates on key-value pairs. You can do MapReduce with Manta by having your program parse key-value pairs from objects and emit key-value pairs as output, but you can also do other transformations that aren’t particularly well-suited to key-value representation (like video transcoding, for example).

7 thoughts on “Inside Manta: Distributing the Unix shell”

  1. Pingback: The Observation Deck » Manta: From revelation to product

  2. Pingback: Brendan's blog » Manta: Unix Meets Map Reduce

  3. “In a nutshell: each copy of a Manta object is stored as a flat file in ZFS. On the same physical servers where these files are stored, there are a bunch of compute zones for running jobs.”

    What happens when the underlying hardware kicks the bucket?

    “As you submit the names of input objects, Manta locates the storage servers containing a copy of each object and dispatches tasks to one server for each object.”

    How is the replication of data between servers achieved? Is it realtime? Is it synchronous? Is it multimaster?

  4. @UX-admin:

    “What happens when the underlying hardware kicks the bucket?”

    That’s a simple question with a complicated answer.

    Hardware failure falls into one of three buckets: component failures that can be fixed without service interruption (even of a single node), like single disk failure or power supply failure; component failures that require a short period of removing the box from service, like a motherboard failure; or multi-component failures resulting in loss of the storage pool, which would require rebuilding the storage pool on a new storage node from data stored on the other nodes. (The ZFS storage configuration includes both redundant disks and hot spares, so this last failure mode is extremely unlikely.) Removing a single storage node from service while we address any of these failures has no availability impact on the rest of the system unless you’ve explicitly stored objects with durability level 1, in which case the object is unavailable while its storage node is out of service.

    Failures in the metadata tier are a little different. As above, the failure modes are very likely to be transient, and the system survives single failures with only a brief impact to availability. (This is a little vague, but that’s more complicated and worthy of a follow-up post!)

    “How is the replication of data between servers achieved? Is it realtime? Is it synchronous? Is it multimaster?”

    Data servers don’t replicate to each other. When you make a PUT request, we select N storage nodes (N being the durability level, which defaults to 2) and funnel the contents of the object from the frontend to each of those storage nodes. The request doesn’t complete until the data is safely on disk on all of them. There’s no additional replication in the storage tier.
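    In spirit, that write path resembles tee(1): one incoming stream, N copies written in a single pass. A local sketch with plain files standing in for the storage nodes (durability level 2):

```shell
mkdir -p node1 node2

# One stream in, two durable copies out, written in a single pass.
printf 'object contents\n' | tee node1/obj > node2/obj

cmp node1/obj node2/obj && echo "both copies identical"
```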

    The metadata tier uses synchronous master-slave replication with at least one asynchronous standby.

  5. @ux-admin:

    from reading the docs, data duplication is most likely avoided because of hyprlofs…

    am guessing it is a little like creating a zone with lofs filesystems, except that the zfs lofs datasets in conventional zone configs are created at zoneadm time, and the zones need to be rebooted to “import” any new lofs datasets.

    in manta, most likely the compute zones are created ahead of time, and hyprlofs is used to “bring in” the required object store (dataset) dynamically, for in-situ processing.

    the tutorial mentioned:

    > the code you want to run on objects is brought to the physical server
    > that holds the object(s), rather than transferring data to a processing
    > host

    bringing the mountain to mohammed ;-)

  6. Thanks Dave, for the reference.

    indeed, am curious about the map/unmap ops you mentioned; my friends and i will be doing a meetup to discuss Manta ;-)

    Thanks to the Joyent team, for yet another interesting technology!

    Even more interesting than what was described in the early MapReduce paper, where MapReduce result sets need to move from node to node in Google’s compute farm!

Comments are closed.