Manta: Unix Meets Map Reduce

Today Joyent launched Manta: an object store built on ZFS, Zones, and the SmartOS platform, with a familiar Unix interface as its API. Manta supports compute jobs such as map reduce, and achieves high performance by co-locating compute zones with the object storage. We also have extensive DTrace instrumentation throughout the product, which we’ve been using during development to tune performance and to respond to performance issues.

As a quick demonstration of Manta, I have over 40 Gbytes of performance trace data, captured using a DTrace script across 204 production servers. This trace data records ZFS I/O latency, including high-resolution latency measurements. It is traced at the VFS level, which reflects the performance actually felt by the applications. This also makes for verbose trace files, however, since it includes cache hits (which I could filter out, but haven’t in this case).

To start with, I’d like to know how many lines my trace files contain in total. I’ve put them in a directory called “zilt01”, and the following Manta job will count the lines:

$ mfind /brendan/stor/zilt01 | \
    mjob create -o -m 'wc -l' -r 'awk "{ s += \$1 } END { print s }"'
added 204 inputs to 17f7b3e9-6264-4839-ad8d-50f9a7b44ead

mfind lists files (objects) from the directory, like the find command. This is piped to mjob, which creates the compute job with that input (204 inputs, one for each server trace file).

The output, 725421604, shows that I have over 725 million lines. The file format is one line per I/O, so this means I have over 725 million I/O captured for study.

The ‘m’ commands, mfind and mjob, are from Manta. The rest, ‘|’, wc, and awk, are Unix.

This Manta job consists of two stages:

- A map stage (-m), which runs ‘wc -l’ on each input object in parallel, emitting a line count per trace file.
- A reduce stage (-r), which runs awk over the map outputs, summing the per-file counts into a single total.

Other Unix commands can be used as well, although awk is very capable if your objects are text files.
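The same two-stage pattern can be sketched locally with ordinary Unix tools (a toy illustration, not Manta itself; the filenames here are made up):

```shell
# Toy local version of the two-stage job above (hypothetical filenames).
# Map: one "wc -l" per trace file; in Manta each runs in parallel, one
# task per object. Reduce: awk sums the per-file counts into one total.
for f in trace.*.log; do
    wc -l < "$f"                               # map: line count per file
done | awk '{ s += $1 } END { print s }'       # reduce: grand total
```

The difference in Manta is that each map task runs on the server already storing that object, so no data needs to move for the map phase.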

More awk

Now I’ll determine the highest latency event from my pool of data. The format of the files is:

8085 64 write 8192 2b8a2c4f6915 postgres
8133 22 write 8192 2b8a2c4f6915 postgres
8165 22 write 8192 2b8a2c4f6915 postgres
8191 19 write 8192 2b8a2c4f6915 postgres
40563 19 getattr 0 2b8a2c4f6915 postgres
40607 5 getattr 0 2b8a2c4f6915 postgres

The latency is the second column, which can be examined using another Manta job and more awk:

$ mfind /brendan/stor/zilt01 | mjob create -o \
    -m 'awk "NR > 1 && \$2 > max { max = \$2 } END { print max }"' \
    -r 'awk "\$1 > max { max = \$1 } END { print max }"'
added 204 inputs to 7223d241-6a53-40fc-bac3-ab52e1adff8c

The highest latency I/O in my dataset was: 3.04 seconds.
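The map-phase awk one-liner can be tried locally against the sample records shown earlier. (The header line below is invented for illustration; the NR > 1 condition in the job suggests each trace file begins with one.)

```shell
# Local run of the map-phase awk from the job above, against sample
# records. NR > 1 skips the first line of each file (assumed here to be
# a header); column 2 is the latency, and awk tracks the running maximum.
printf '%s\n' \
    'TIME LAT OP BYTES ZONE EXEC' \
    '8085 64 write 8192 2b8a2c4f6915 postgres' \
    '8133 22 write 8192 2b8a2c4f6915 postgres' \
    '40563 19 getattr 0 2b8a2c4f6915 postgres' |
awk 'NR > 1 && $2 > max { max = $2 } END { print max }'
# prints: 64
```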

You don’t need to code everything in awk (although I like to try), as there are many other utilities available. These include a maggr command, which simplifies common aggregation steps such as sum and max.

Scripting, Too

More sophisticated compute can be performed by writing programs, including shell scripts. I’ve done this for generating frequency trail waterfall plots, which I showed in a previous post.

These involve R to generate a frequency trail per-server on a transparent background, and then ImageMagick to crop, tidy, and then merge the final images. Both R and ImageMagick are already available in Manta.
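As a rough sketch of the image-merging step (assumed filenames and a simplified pipeline, not my actual script), ImageMagick can trim each per-server image and then stack them vertically:

```shell
# Hypothetical sketch of the image-merging reduce step (filenames are
# invented; not the actual script): trim each per-server PNG, then
# stack them into a single waterfall image.
for f in server-*.png; do
    convert "$f" -trim "trim-$f"    # crop away surrounding empty borders
done
# -append stacks images vertically; -background none keeps transparency
convert -background none trim-server-*.png -append waterfall.png
```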

The image at the top of this post is an example, generated from a subset of my 40 Gbyte data pool, and taking less than a minute to render on Manta. The x-axis scale is from 0.5 to 15 ms, and the y-axis spans 100 of those 204 servers.

The command I used looked like:

$ mfind /brendan/stor/zilt01 | mjob create -o -s /brendan/stor/ \
added 204 inputs to b54778ee-6117-4696-84a8-bfd51e5966ab

I put the R program and ImageMagick logic in a shell script (I’ll need to tidy up the script before sharing it). The point here is that we can move on to scripting, just as in Unix, when things get more complex.

There’s much more to Manta. Hear from the lead architect Mark Cavage and other engineers at Joyent: Bryan Cantrill, Keith Wesolowski, Dave Pacheco, and Josh Clulow. Also see the Manta documentation, and the 3.5 minute screencast demonstrating usage.

I’m looking forward to new discoveries now that I can store and process large amounts of performance data. The ZFS probes I used here are just a handful: there are millions of probes to trace, and even larger datasets to process.

Posted on June 25, 2013 at 6:32 am by Brendan Gregg · Permalink