Inside Manta: Distributing the Unix shell

Today, Joyent has launched Manta: our internet-facing object store with compute as a first class operation. This is the culmination of over a year’s effort on the part of the whole engineering team, and I’m personally really excited to be able to share this with the world. There’s plenty of documentation on how to use Manta, so in this post I want to talk about the guts of my favorite part: the compute engine.

The super-quick crash course on Manta: it’s an object store, which means you can use HTTP PUT/GET/DELETE to store arbitrary byte streams called objects. This is similar to other HTTP-based object stores, with a few notable additions: Unix-like directory semantics, strong read-after-write consistency, and (most significantly) a Unixy compute engine.

Computation in Manta

There’s a terrific Getting Started tutorial already, so I’m going to jump straight to a non-trivial job and explain how it runs under the hood.

At the most basic level, Manta’s compute engine runs arbitrary shell commands on objects in the object store. Here’s my example job:

$ mfind -t o /dap/stor/snpp | mjob create -qom 'grep poochy'

This job enumerates all the objects under /dap/stor/snpp (using the mfind client tool, analogous to Unix find(1)), then creates a job that runs “grep poochy” on each one, waits for the job to finish, and prints the outputs.

I can run this one-liner from my laptop to search thousands of objects for the word “poochy”. Instead of downloading each file from the object store, running “grep” on it, and saving the result back, Manta just runs “grep poochy” inside the object store. The data never gets copied.

Notice that our Manta invocation of “grep” didn’t specify a filename at all. This works because Manta redirects stdin from an object, and grep reads input from stdin if you don’t give it a filename. (There’s no funny business about tacking the filename on to the end of the shell script, as though you’d run ‘grep poochy FILENAME’, though you can do that if you want using environment variables.) This model extends naturally to cover “reduce” operations, where you may want to aggregate over enormous data streams that don’t fit on a single system’s disks.

One command, many tasks

What does it actually mean to run grep on 100 objects? Do you get one output or 100? What if some of these commands succeed, but others fail?

In keeping with the Unix tradition, Manta aims for simple abstractions that can be combined to support more sophisticated use cases. In the example above, Manta does the obvious thing: if the directory has 100 objects, it runs 100 separate invocations of “grep”, each producing its own output object, and each with its own success or failure status. Unlike with a single shell command, a one-phase map job can have any number of inputs, outputs, and errors. You can build more sophisticated pipelines that combine output from multiple phases, but that’s beyond the scope of this post.1

How does it work?

Manta’s compute engine hinges on three SmartOS (illumos) technologies:

  • Zones: OS-based virtualization, which allows us to run thousands of these user tasks concurrently in lightweight, strongly isolated environments. Each user’s program runs as root in its own zone, and can do whatever it wants there, but processes in the zone have no visibility into other zones or the rest of the system.
  • ZFS: ZFS’s copy-on-write semantics and built-in snapshots allow us to completely reset zones between users. Your program can scribble all over the filesystem, and when it’s done we roll it back to its pristine state for the next user. (We also leverage ZFS clones for the filesystem image: we have one image with tens of gigabytes of software installed, and each zone’s filesystem is a clone of that single image, for almost zero disk space overhead per zone.)
  • hyprlofs: a filesystem we developed specifically for Manta, hyprlofs allows us to mount read-only copies of files from one filesystem into another. The difference between hyprlofs and traditional lofs is that hyprlofs supports commands to map and unmap files on-demand, and those files can be backed by arbitrary other filesystems. More on this below.

In a nutshell: each copy of a Manta object is stored as a flat file in ZFS. On the same physical servers where these files are stored, there are a bunch of compute zones for running jobs.

As you submit the names of input objects, Manta locates the storage servers containing a copy of each object and dispatches tasks to one server for each object. That server uses hyprlofs to map a read-only copy of the object into one of the compute zones. Then it runs the user’s program in that zone and uses zfs rollback to reset the zone for the next tenant. (There’s obviously a lot more to making this system scale and survive component failures, but that’s the basic idea.)

What’s next?

In this post I’ve explained the basics of how Manta’s compute engine works under the hood, but this is a very simple example. Manta supports more sophisticated distributed computation, including reducers (including multiple reducers) and multi-phase jobs (e.g., map-map-map).

Because Manta uses the Unix shell as the primitive abstraction for computation, it’s very often trivial to turn common shell invocations that you usually run sequentially on a few files at a time into Manta jobs that run in parallel on thousands of objects. For tasks beyond the complexity a shell script, you can always execute a program in some other language — that’s, after all, what the shell does best. We’ve used it for applications ranging from converting image files to generating aggregated reports on activity logs. (In fact, we use the jobs facility internally to implement metering, garbage collection of unreferenced objects, and our own reports.) My colleague Bill has already used it to analyze transit times on SF Muni. Be on the lookout for a rewrite of kartlytics based on Manta.

We’re really excited about Manta, and we’re looking forward to seeing what people do with it!

1 Manta’s “map” is like the traditional functioning programming primitive that performs a transformation on each of its inputs. This is similar but not the same as the Map in MapReduce environments, which specifically operates on key-value pairs. You can do MapReduce with Manta by having your program parse key-value pairs from objects and emit key-value pairs as output, but you can also do other transformations that aren’t particularly well-suited to key-value representation (like video transcoding, for example).