Stopping a broken program in its tracks

Last week I debugged a Node issue where under some conditions, fork() failure could result in some file descriptors being closed, leading shortly to bedlam. In the best cases, the program would immediately crash, but often it would just do the wrong thing. This happens a lot, of course, and in many ways it’s worse than a crash because you have no core dump nor even a stack trace to go on.

In my write-up of the bug, I glossed over a key step to debugging the problem:

"Using DTrace and reproducing on SmartOS, I stopped the
process at the moment close() returned EBADF, and caught
this stack trace ..."

I want to explain this technique in more detail because it’s very useful for getting a stack trace or core dump when a program starts going south, even if the program isn’t actually going to crash. (This technique isn’t specific to Node.js or SmartOS, either, though it’s easier with Node.js and on SmartOS.)

In this case, I was debugging a production problem that I reproduced in development, which made things much easier. I was free to kill the process to create a core dump without worrying about disrupting any users. Since the Node program was blowing up on a “close, EBADF” error, I used this DTrace one-liner to make the program dump core precisely when that happened (instead of barreling on with the wrong file descriptor):

# dtrace -w -n 'syscall::close:return
  /execname == "node" && arg1 == -1 && errno == EBADF/
  { trace(pid); raise(SIGABRT); exit(0); }'

Taking apart the script:

  • dtrace -w: The “-w” flag to “dtrace” enables destructive actions. Without “-w”, DTrace won’t let you do anything that can significantly change the behavior of running programs (like raising a signal, in this case).
  • syscall::close:return: We’re tracing returns from the “close” system call. We’re tracing the kernel, not a particular process, so this will catch all closes in all processes.
  • /execname == "node" && arg1 == -1 && errno == EBADF/: We’re only going to take action for processes called “node”, where the return value from “close” is -1, and the kernel errno is EBADF.
  • { trace(pid); raise(SIGABRT); exit(0); }: The operative piece here is raise(SIGABRT), which immediately sends SIGABRT to the current process. (The reason for trace(pid) is so that you know what process was killed, and the reason for exit(0) is to end the D script as soon as this happens. If you don’t exit and you get the predicate wrong, you can accidentally do a lot of damage!)

The default behavior on receiving SIGABRT is to dump core immediately (see signal.h). It’s possible to alter this behavior, but very few programs do, so this technique almost always works.

Once I had a core file, I fired up MDB to get a JavaScript stack trace, and from there it was easy to figure out what had gone wrong. For details, see the Node issue.

What about production?

I mentioned that in this case, I’d reproduced the problem in development, so I was free to kill the process. What if (as so often happens) the first time we saw this problem was in production, and we wanted to get a core dump at exactly the right time, but without causing the process to crash? That’s not actually much harder, at least on systems with a program like gcore(1):

# dtrace -w -n 'syscall::close:return
  /execname == "node" && arg1 == -1 && errno == EBADF/
  { trace(pid); stop(); system("gcore %d; prun %d", pid, pid); exit(0); }'

This one is very similar to the previous script, except that instead of raising SIGABRT, we use stop() to stop the process in its tracks, and asynchronously run gcore PID (to save the core dump) followed by prun PID (to start the process running again). This is still technically destructive, because we’re stopping the process for however long it takes to save the core dump, but in most cases this isn’t too long and the disruption is insignificant.

To recap: you can use DTrace for more than just data collection, including debugging non-fatal buggy program behavior. DTrace can instrument very specific events on the system (like “the close syscall in a Node process returned -1 with EBADF” or any application-level event), and you can use that to take action like stopping the process to take a core dump and then resuming it again. Putting these pieces together makes it possible to gather all kinds of debugging information from a program that’s behaving badly.

Node.js in production: runtime log snooping

This post also appeared on the Joyeur blog.


This post is the one of several about how we operate Node.js in production at Joyent. Most of my experience comes from working on our Manta Storage Service, which currently comprises a few thousand Node processes across three datacenters.

Logging is one of the most primitive but most valuable forms of postmortem debugging. Logs let you figure out why a program crashed (or did the wrong thing) after it’s already done so. In this post I’ll describe a critical feature of the Bunyan logging library that we use daily to help debug failures in production: runtime log snooping.

The problem

In many production systems, there’s a nasty tradeoff: the more verbose your logs, the more information you have to debug issues, but the greater the overhead of constructing, recording, transmitting, storing, and analyzing these logs as well. Log frameworks typically manage this with log levels: you use “debug” in development, which spews all kinds of information about what the program is doing, but you use “info” or some higher level in production, which filters only the “important” messages.

But it’s often in production that you really need the debug messages that are too expensive to emit in production. It’s when you’ve been paged in the middle of the night because some service’s error count has exceeded some threshold. If you’re lucky, the failure is simple enough that the program can tell you exactly what’s happened, but more often, particularly in distributed systems, where failure modes aren’t so crisp, the error will be somewhat vague. You’ll need more information about what the program was doing in order to figure out what went wrong.

In many of these cases, restarting the service will bring it back to life, but it will also eliminate any hope of root causing the problem. You can already see yourself getting paged again a few nights later for the same issue…


Trent Mick at Joyent wrote a Node module called Bunyan that we use in all of our services to log everything from free-form text intended for humans to programmatically consumable events that trip alarms or feed into our billing system. Bunyan logs are just newline-separated JSON records. A few fields have special meaning, like “time” and “msg”, but programs can log whatever other JSON structures they want for interpretation by humans or other programs. (In Manta, we use Bunyan log records for billing. These log entries include structured payloads that describe the billable operation, the user, the result, and so on.1) The “bunyan” command-line tool facilitates easy filtering, colorizing output, and so on, while programmatic consumers can parse each line as a JSON object.

There are lots of logging libraries out there, but Bunyan’s killer feature is the ability to snoop debug-level logs in production, without restarting (or otherwise impacting) the service. When an application calls log.debug("something went wrong"), and the log level is “info” (i.e., higher than “debug”), the bunyan library drops the message, as you’d expect. When you want to get at these debug messages, you fire up “bunyan -p *”, which is just sugar for:

$ dtrace -Z -x strsize=4k -x switchrate=10hz \ -q -n 'bunyan*:::log-*{ printf("%s\n", copyinstr(arg0)); }' | bunyan

When you run this, you’re enabling DTrace probes built into Bunyan. Enabling these probes actually modifies your Node program’s behavior to fire the probe just before dropping the log message. Firing the probe traps into the kernel, where the log message is buffered and eventually funneled up to the “dtrace” process and printed out to your terminal.

This is a subtle point, but it’s very important: enabling these probes doesn’t cause the debug messages to appear in the regular log, but rather it causes these messages to be collected, buffered by the kernel, and then made available to your “dtrace” process. That means your service is still running at near full speed, not going through the code paths that write entries to the log. Importantly, if the volume of messages gets too high, DTrace will drop them rather than slow down the process. (You can tune the “switchrate” and buffer sizes to control that.)

When you don’t need the extra logging any more, CTRL-C the “bunyan” (or “dtrace”) process. The program modifications are undone, and your program goes on dropping debug log entries.

Voilá! By using Bunyan as a logging library, you get the ability to walk up to a production service and enable debug logging without restarting it. This has proven critical for a number of cases where it’s too expensive to log debug messages all the time, but we had a live service behaving badly, and restarting the service would likely have cleared the issue without giving us a chance to root-cause it.

Try it yourself

I ran these examples on OS X because it’s convenient, but all of our production services run on SmartOS (an illumos distribution). This facility works on any system that supports DTrace USDT.

If you’ve got Node installed already, it’s ridiculously easy to try this out. First, create a temporary directory and install bunyan locally:

dap@sharptooth $ npm install bunyan

then start a Node REPL, load up bunyan, and create a logger at level “info”:

dap@sharptooth $ node
> var r = require('bunyan')
> var l = new r({ 'name': 'test', 'level': 'info' })

At this point, you can log “info” messages, but “debug” messages will be hidden:

>'test info message')
{"name":"test","hostname":"sharptooth.local","pid":1076,"level":30,"msg":"test info message","time":"2013-09-27T18:13:00.820Z","v":0}
> l.debug('something went wrong')

Now in another shell, use “bunyan -p”:

# In another shell:
dap@sharptooth $ sudo bunyan -p '*'   # sudo required for DTrace

That will block waiting for something to happen. Back in the first shell, log that debug message again:

> l.debug('something went wrong')

and over in the second shell, you should see something like this:

[2013-09-27T18:16:19.679Z] DEBUG: test/1076 on sharptooth.local: something went wrong

Success! Because we used bunyan -p '*', we’ll see log messages from all Node processes on the system. You can also specify a specific pid instead of ‘*’ to see only messages from that process.

Dynamic probes for your application

Bunyan’s dynamic probes are implemented using Chris Andrews’s node-dtrace-provider module, which makes it very easy to add custom probes to any application. (The whole Bunyan integration with DTrace is about 20 lines of code.)

Another great example that we also use heavily is node-restify, which creates DTrace probes for the beginning and end of each HTTP request, as well as each of the phases of request processing (auth, parsing, logging, and so on). You can enable these probes in production to get a fine-grained look at request latency, including what parts of each request are taking a long time.


In this post, I described Bunyan, the Node.js logging library we use at Joyent. I explained how we use DTrace probes built into Bunyan to extract more debug information from a production service than we can normally extract without disrupting the service itself. We focus on tools for production because the hardest issues often show up only in production deployments, and the production environment is far more constrained than development. By building tools designed for production use, we’re able to debug failures both in the field and in development, in many cases root-causing them the first time we see them.

1 Manta service logs are a good example of both the value of structured logs and Manta’s ability to process these logs. Manta service logs include both debug messages and more structured messages used to generate user’s usage reports and bills. These logs are uploaded periodically to Manta. To generate each user’s report, we run a map-reduce Manta job over these logs that parses each record and tallies up reports per-user. Similarly, when we want to see how often some particular error message has been seen, we run a Manta job to grep over the logs (literally just “grep PATTERN”). For more on using Manta for log analysis, check out Wanelo’s post on business data analytics using Manta.

Kartlytics: Applying Big Data Analytics to Mario Kart 64

This post also appears on the Joyeur blog.

If you missed it, Joyent recently launched Manta, a web-facing object store with compute as a first-class operation. Manta makes it easy to crunch on Big Data in the cloud, and we’ve seen it used by both ourselves and others and others to solve real business problems involving Big Data. But it’s not just for user behavior and crash dump analysis: Manta has profoundly changed the way we operate here at Joyent’s SF office.

Mario Kart 64

On a typical Friday afternoon at Joyent’s San Francisco office, it’s been a long week and the engineers are getting restless. I glance over at Bill, who mimes working a video game controller with both hands and nods towards the projector. The war dance has begun, and the trash talking will soon follow. Factions form, and soon everyone’s agreed on only one thing: it’s time to play Kart.

Playing Kart
(photo: Joshua Clulow)

Sound familiar? I know we’re not the only office that brings the same intensity to office video games that we bring to our work. Kart is the perfect game for serious office competition because it’s dead simple to learn but takes a long time to master. It’s a standard arcade-style racing game with cartoonish weapons, encouraging the best of friends to abuse and taunt each other mercilessly.

Kart clip

But it’s not just about the competition. In our years playing Kart, we’ve seen a lot of quirky game behavior, and we’ve always wondered: is it true that the game handicaps successful players by giving them less powerful weapons? Given that, how much does the first lap really matter? And which characters are more likely to win: heavyweights or lightweights?

We’ve also had lots of time to discuss strategy: how much does power sliding matter? How much time do you lose for each banana peel slip, shell hit, bomb impact, or fall off the track?

As serious intellectuals often do, we spent hours discussing these questions, what data we would want to collect to answer them, and even how we might go about collecting it. It sounded like a fun project, so I wrote a program that takes video captures of our Mario Kart 64 sessions and picks out when each race starts, which character is in each box on the screen, the rank of each player as the race progresses, and finally when the race finishes. Then I built a web client that lets us upload videos, record who played which character in each race, and browse the aggregated stats. The result is called Kartlytics, and now contains videos of over 230 races from over the last year and change.

This worked pretty well, but as we collected more videos, the time to process them grew prohibitive. Often I’d want to tweak the program to fix detection of, say, Yoshi Valley, but I couldn’t be sure that the change wouldn’t cause it to do the wrong thing on some other track. To test that, I’d have to rerun the process on all the videos, but that took nearly a full day on my laptop.

Enter Manta

This changed significantly about a month ago when Joyent launched Manta. We like to say that we built Manta to solve all kinds of problems from log analysis to video transcoding, but the reality is now clear: Manta exists to compute analytics on Mario Kart 64 sessions. Immediately after we stood up our production Manta service, I loaded all of the videos we’ve recorded to date. If you’ve set up the toolkit, you can list these videos with:

$ mls -l /dap/public/kartlytics/videos | grep -v json
-rwxr-xr-x 1 dap      71246499 Jun 3 18:35
-rwxr-xr-x 1 dap      66418215 Jun 3 17:25
-rwxr-xr-x 1 dap      67641347 Jun 3 18:37

In recent weeks, I’ve converted kartlytics to use Manta for all of the video processing. Here’s the job to process each video, producing a JSON file for each one describing what happened in all races in the video:

$ export KARTLYTICS_OUTDIR=/dap/public/kartlytics/generated
$ mfind -t o -n '.*.mov$' /dap/public/kartlytics/videos |
    mjob create -w
    -s /dap/public/kartlytics/kartlytics.tgz
    -s /dap/public/kartlytics/bin/video-transcribe
    --init "cd /var/tmp && tar xzf /assets/dap/public/kartlytics/kartlytics.tgz"
    -m "/assets/dap/public/kartlytics/bin/video-transcribe /var/tmp/kartlytics $KARTLYTICS_OUTDIR "'$MANTA_INPUT_FILE'

This job uses a common pattern for Manta jobs, which is to throw an existing program (kartlytics) into a tarball in Manta, specify it as an asset for a job, and run the bundled program inside the job. In this case, the kartlytics.tgz tarball contains a built copy of the kartlytics github repo, which contains the C program and support files for doing the video processing. There’s a separate job for creating this tarball, which is just:

$ export KARTLYTICS_TARBALL=/dap/public/kartlytics/kartlytics.tgz
$ mjob create -w
    -s /dap/public/kartlytics/bin/make-tarball
    -r "/assets/dap/public/kartlytics/bin/make-tarball $KARTLYTICS_TARBALL" < /dev/null

The “make-tarball” script just clones the repo, builds it, and saves the tarball back to Manta.

Processing all these videos used to take nearly a day on my laptop. It takes just 5 minutes on Manta, and I don’t have to think about spinning up compute instances and then spinning them down when it’s done. This makes a serious difference because I can get feedback on algorithm tweaks in minutes rather than overnight.

There are two other Manta jobs involved: one that takes the race transcripts produced by the first job and aggregates them with the human-entered data about who played which characters in each race, producing a single aggregated JSON file; and one that takes the same race transcripts and produces smaller web-quality videos for each race that you can play directly on the website. For details, see the the run_all script in the repo.

Try it: Since these videos are public, you can run kartlytics yourself on our collection of videos. You can run these two jobs exactly as written, replacing KARTLYTICS_OUTDIR and KARTLYTICS_TARBALL with paths in your own storage areas (e.g., /$MANTA_USER/public/kartlytics). The above “run_all” script is also parametrized so that anyone can run it on our videos.


The front page of shows some summary information, including the tracks we play a lot:


We start every session with Luigi Raceway, which is why that track has the most plays. After the first race, the winner chooses the next track, so the distribution diverges after that.

Next on the page we show races from the latest session, then the “wildest races”. This is a proxy for “slugfests”, which are races with just the right circumstances (wide open track, a tight pack, lucky weapons, etc.) for a weapons-based free-for-all. The “wildest” ones listed here are those with the most number of player position changes per minute.

Wildest races

Below that, we have an example of a hotly-debated question: how often does a player go from 1st to 4th in less than 5 seconds? We call that a “Keithing”, and the front page shows all of them. You can sort by person to see how many times it’s happened to each player (but keep in mind that not all players have played the same number of races).

You can click on the date/time of any race to get to the race details page, which summarizes the results, shows a playable video of just that race, and then the all important race transcript, showing not only everything that happened in the race, but with screenshots to prove it:

Event summary

That’s really important, both for debugging and to silence criticism about kartlytics “making stuff up”.

You can also click any player to see that player’s races, how many times they’ve played each track, and which characters (and character classes — lightweight, middleweight, or heavyweight) they tend to play:

Player details


We’ve really just scratched the surface. We’ve only answered a few of the questions I mentioned above. We’ve discussed tracking lots more data, like power slides (to see how much they matter), weapons (to see how the distribution changes with rank), and various mistakes (to prove [or disprove] claims like “I keep running into banana peels today”).

We’d also love to add videos from you! It’s surprisingly easy. To record video, I purchased an iGrabber and a small power amplifier (so we could split the video signal without losing too much quality — you may not need this if you’re playing on a TV instead of a projector). The iGrabber comes with software to record the video, and kartlytics does the rest. As I showed above, you can run the software on your own jobs.

The algorithm behind Kartlytics is really pretty simplistic (see the README in the repo for details), but it can probably be applied to other cult classics that have relatively simple on-screen game status. I know our Seattle office is big on Street Fighter II; surely Street Fighter-lytics can’t be far away?

Fault tolerance in Manta

Since launching Manta last week, we’ve seen a lot of excitement. Thoughtful readers quickly got to burning questions about the system’s fault tolerance: what happens when backend Manta components go down? In this post, I’ll discuss fault tolerance in Manta in more detail. If anything seems left out, please ask: it’s either an oversight or just seemed too arcane to be interesting.

This is an engineering internals post for people interested in learning how the system is built. If you just want to understand the availability and durability of objects stored in Manta, the simpler answer is that the system is highly available (i.e., service survives system and availability zone outages) and durable (i.e., data survives multiple system and component failures).

First principles

First, Manta is strongly consistent. If you PUT an object, a subsequent GET for the same path will immediately return the object you just PUT. In terms of CAP, that means Manta sacrifices availability for consistency in the face of a complete network partition. We feel this is the right choice for an object store: while other object stores remain available in a technical sense, they can emit 404s or stale data both when partitioned and in steady-state. The client code to deal with eventual consistency is at least as complex as dealing with unavailability, except that there’s no way to distinguish client error from server inconsistency — both show up as 404 or stale data. We feel transparency about the state of the system is more valuable here. If you get a 404, that’s because the object’s not there. If the system’s partitioned and we cannot be sure of the current state, you’ll get a 503, indicating clearly that the service is not available and you should probably retry your request later. Moreover, if desired, it’s possible to build an eventually consistent caching layer on top of Manta’s semantics for increased availability.

While CAP tells us that the system cannot be both available and consistent in the face of an extreme network event, that doesn’t mean the system fails in the face of minor failures. Manta is currently deployed in three availability zones in a single region (us-east), and it’s designed to survive any single inter-AZ partition or a complete AZ loss. As expected, availability zones in the region share no physical components (including power) except for being physically close to one another and connected by a high-bandwidth, low-latency interconnect.

Like most complex distributed systems, Manta is itself made up of several different internal services. The only public-facing services are the loadbalancers, which proxy directly to the API services. The API services are clients of other internal services, many of which make use of still other internal services, and so on.

Stateless services

Most of these services are easy to reason about because they’re stateless. These include the frontend loadbalancers, the API servers, authentication caches, job supervisors, and several others.

For each stateless service, we deploy multiple instances in each AZ, each instance on a different physical server. Incoming requests are distributed across the healthy instances using internal DNS with aggressive TTLs. The most common failure here is a software crash. SMF restarts the service, which picks up where it left off.

For the internal services, more significant failures (like a local network partition, power loss, or kernel panic) result in the DNS record expiring and the instance being taken out of service.

Stateful services

Statelessness just pushes the problem around: there must be some service somewhere that ultimately stores the state of the system. In Manta, that lives in two tiers:

  • the storage tier, which stores the contents of users’ objects
  • the metadata tier, which maps user object names to the servers where the data is stored

The storage tier

The availability and durability characteristics of an object are determined in part by its “durability level“. From an API perspective, this indicates the number of independent copies of the object that you want Manta to store. You pay for each copy, and the default value is 2. Copies are always stored on separate physical servers, and the first 3 copies are stored in separate availability zones.

Durability of a single copy: Objects in the storage tier are stored on raidz2 storage pools with two hot spares. The machine has to sustain at least three concurrent disk failures before losing any data, and could survive as many as eight. The use of hot spares ensures that the system can begin resilvering data from failed disks onto healthy ones immediately, in order to reduce the likelihood of a second or third concurrent failure. Keith discusses our hardware choices in more depth on his blog.

Object durability: Because of the above, it’s very unlikely for even a single copy of an object to be lost as a result of storage node failure. If the durability level is greater than 1 (recall that it’s 2 by default), all copies would have to be lost for the object’s data to be lost.

Object availability: When making a request for an object, Manta selects one of the copies and tries to fetch the object from the corresponding storage node. If that node is unavailable, Manta tries another node that has a copy of the object, and it continues doing this until it either finds an available copy or runs out of copies to try. As a result, the object is available as long as the frontend can reach any storage node hosting a copy of the object. As described above, any storage node failure (transient or otherwise) or AZ loss would not affect object availability for objects with at least two copies, though such failures may increase latency as Manta tries to find available copies. Similarly, in the event of an AZ partition, the partitioned AZ’s loadbalancers would be removed from DNS, and the other AZs would be able to service requests for all objects with at least two copies.

Since it’s much more likely for a single storage node to be temporarily unavailable than for data to be lost, it may be more useful to think of “durability level” as “availability level”. (This property also impacts the amount of concurrency you can get for an object — see Compute Jobs below.)

Metadata tier

The metadata tier records the physical storage nodes where each object is stored. The object namespace is partitioned into several completely independent shards, each of which is designed to survive the usual failure modes (individual component failure, AZ loss, and single-AZ partition).

Each shard is backed by a postgres database using postgres-based replication from the master to both a synchronous slave and an asynchronous standby. Each database instance within the shard (master, sync slave, and async slave) is located in a different AZ, and we use Zookeeper for election of the master.

The shard requires only one peer to be available for read availability, and requires both master and synchronous slave for write availability. Individual failures (or partitions) of the master or synchronous slave can result in transient outages as the system elects a new leader.

The mechanics of this component are complex and interesting (as in, we learned a lot of lessons in building this). Look for a subsequent blog post from the team about the metadata tier.

Compute Jobs

Manta’s compute engine is built on top of the same metadata and storage tiers. Like the other supporting services, the services are effectively stateless and the real state lives in the metadata tier. It’s subject to the availability characteristics of the metadata tier, but it retries internal operations as needed to survive the transient outages described above.

If a given storage node becomes unavailable when there are tasks running on it, those tasks will be retried on a node storing another copy of the object. (That’s one source of the “retries” counter you see in “mjob get”.) Manta makes sure that the results of only one of these instances of the task are actually used.

The durability level of an object affects not only its availability for compute (for the same reasons as it affects availability over HTTP as described above), but also the amount of concurrency you can get on an object. That’s because Manta schedules compute tasks to operate on a random copy of an object. All things being equal, if you have two copies of an object instead of one, you can have twice as many tasks operating on the object concurrently (on twice as many physical systems).

Final notes

You’ll notice that I’ve mainly been talking about transient failures, either of software, hardware, or the network. The only non-transient failure in the system is loss of a ZFS storage pool; any other failure mode is recoverable by replacing the affected components. Objects with durability of at least two would be recoverable in the event of pool loss from the other copies, while objects with durability of one that were stored on that pool would be lost. (But remember: storage pool loss as a result of any normal hardware failure, even multiple failures, is unlikely.)

I also didn’t mention anything about replication in the storage tier. That’s because there is none. When you PUT a new object, we dynamically select the storage nodes that will store that object and then funnel the incoming data stream to both nodes synchronously (or more, if durability level is greater than 2). If we lose a ZFS storage pool, we would have to replicate objects to other pools, but that’s not something that’s triggered automatically in response to failure since it’s not appropriate for most failures.

Whether in the physical world or the cloud, infrastructure services have to be highly available. We’re very up front about how Manta works, the design tradeoffs we made, and how it’s designed to survive software failure, hardware component failure, physical server failure, AZ loss, and network partitions. With a three AZ model, if all three AZs became partitioned, the system chooses strong consistency over availability, which we believe provides a significantly more useful programming model than the alternatives.

For more detail on how we think about building distributed systems, see Mark Cavage’s ACM Queue article “There’s Just No Getting Around It: You’re Building a Distributed System.”

Inside Manta: Distributing the Unix shell

Today, Joyent has launched Manta: our internet-facing object store with compute as a first class operation. This is the culmination of over a year’s effort on the part of the whole engineering team, and I’m personally really excited to be able to share this with the world. There’s plenty of documentation on how to use Manta, so in this post I want to talk about the guts of my favorite part: the compute engine.

The super-quick crash course on Manta: it’s an object store, which means you can use HTTP PUT/GET/DELETE to store arbitrary byte streams called objects. This is similar to other HTTP-based object stores, with a few notable additions: Unix-like directory semantics, strong read-after-write consistency, and (most significantly) a Unixy compute engine.

Computation in Manta

There’s a terrific Getting Started tutorial already, so I’m going to jump straight to a non-trivial job and explain how it runs under the hood.

At the most basic level, Manta’s compute engine runs arbitrary shell commands on objects in the object store. Here’s my example job:

$ mfind -t o /dap/stor/snpp | mjob create -qom 'grep poochy'

This job enumerates all the objects under /dap/stor/snpp (using the mfind client tool, analogous to Unix find(1)), then creates a job that runs “grep poochy” on each one, waits for the job to finish, and prints the outputs.

I can run this one-liner from my laptop to search thousands of objects for the word “poochy”. Instead of downloading each file from the object store, running “grep” on it, and saving the result back, Manta just runs “grep poochy” inside the object store. The data never gets copied.

Notice that our Manta invocation of “grep” didn’t specify a filename at all. This works because Manta redirects stdin from an object, and grep reads input from stdin if you don’t give it a filename. (There’s no funny business about tacking the filename on to the end of the shell script, as though you’d run ‘grep poochy FILENAME’, though you can do that if you want using environment variables.) This model extends naturally to cover “reduce” operations, where you may want to aggregate over enormous data streams that don’t fit on a single system’s disks.

One command, many tasks

What does it actually mean to run grep on 100 objects? Do you get one output or 100? What if some of these commands succeed, but others fail?

In keeping with the Unix tradition, Manta aims for simple abstractions that can be combined to support more sophisticated use cases. In the example above, Manta does the obvious thing: if the directory has 100 objects, it runs 100 separate invocations of “grep”, each producing its own output object, and each with its own success or failure status. Unlike with a single shell command, a one-phase map job can have any number of inputs, outputs, and errors. You can build more sophisticated pipelines that combine output from multiple phases, but that’s beyond the scope of this post.1

How does it work?

Manta’s compute engine hinges on three SmartOS (illumos) technologies:

  • Zones: OS-based virtualization, which allows us to run thousands of these user tasks concurrently in lightweight, strongly isolated environments. Each user’s program runs as root in its own zone, and can do whatever it wants there, but processes in the zone have no visibility into other zones or the rest of the system.
  • ZFS: ZFS’s copy-on-write semantics and built-in snapshots allow us to completely reset zones between users. Your program can scribble all over the filesystem, and when it’s done we roll it back to its pristine state for the next user. (We also leverage ZFS clones for the filesystem image: we have one image with tens of gigabytes of software installed, and each zone’s filesystem is a clone of that single image, for almost zero disk space overhead per zone.)
  • hyprlofs: a filesystem we developed specifically for Manta, hyprlofs allows us to mount read-only copies of files from one filesystem into another. The difference between hyprlofs and traditional lofs is that hyprlofs supports commands to map and unmap files on-demand, and those files can be backed by arbitrary other filesystems. More on this below.

In a nutshell: each copy of a Manta object is stored as a flat file in ZFS. On the same physical servers where these files are stored, there are a bunch of compute zones for running jobs.

As you submit the names of input objects, Manta locates the storage servers containing a copy of each object and dispatches tasks to one server for each object. That server uses hyprlofs to map a read-only copy of the object into one of the compute zones. Then it runs the user’s program in that zone and uses zfs rollback to reset the zone for the next tenant. (There’s obviously a lot more to making this system scale and survive component failures, but that’s the basic idea.)

What’s next?

In this post I’ve explained the basics of how Manta’s compute engine works under the hood, but this is a very simple example. Manta supports more sophisticated distributed computation, including reducers (including multiple reducers) and multi-phase jobs (e.g., map-map-map).

Because Manta uses the Unix shell as the primitive abstraction for computation, it’s very often trivial to turn common shell invocations that you usually run sequentially on a few files at a time into Manta jobs that run in parallel on thousands of objects. For tasks beyond the complexity a shell script, you can always execute a program in some other language — that’s, after all, what the shell does best. We’ve used it for applications ranging from converting image files to generating aggregated reports on activity logs. (In fact, we use the jobs facility internally to implement metering, garbage collection of unreferenced objects, and our own reports.) My colleague Bill has already used it to analyze transit times on SF Muni. Be on the lookout for a rewrite of kartlytics based on Manta.

We’re really excited about Manta, and we’re looking forward to seeing what people do with it!

1 Manta’s “map” is like the traditional functioning programming primitive that performs a transformation on each of its inputs. This is similar but not the same as the Map in MapReduce environments, which specifically operates on key-value pairs. You can do MapReduce with Manta by having your program parse key-value pairs from objects and emit key-value pairs as output, but you can also do other transformations that aren’t particularly well-suited to key-value representation (like video transcoding, for example).