Adam Leventhal's blog


Tag: ZFS

Since Noms dropped last week the dev community has seemed into it. “Git for data” — it simultaneously evokes something very familiar and yet unconstrained. Something that hasn’t been well-noted is how much care the team has taken to make Noms fun to build with, and it is.

nomsfs in action (video for non-readers): https://www.youtube.com/watch?v=3R_4Pdb7ev4

Noms is a content-addressable, decentralized, append-only database. It borrows concepts from a variety of interesting data systems. Obviously databases are represented: Noms is a persistent, transactional data repository. You can also see the fundamentals of git and other decentralized source code management tools. Noms builds up a chain of commits; those chains can be extended, forked, and shared, while historical data are preserved. Noms shares much in common with modern filesystems such as ZFS, btrfs, or Apple's forthcoming APFS. Like those filesystems, Noms uses copy-on-write, never modifying data in situ; it forms a self-validating hierarchy of data; and it (intrinsically) supports snapshots and efficient movement of data between snapshots.

After learning a bit about Noms I thought it would be interesting to use it as the foundation for a filesystem. I wanted to learn about Noms, and contribute a different sort of example that might push the project in new and valuable ways. The Noms founders, Aaron and Raf, were fired up so I dove in…

What’s Modern?

When people talk about a filesystem being “modern” there’s a list of features that they often have in mind. Let’s look at how the Noms database stacks up:

Snapshots

A filesystem snapshot preserves the state of the filesystem for some future use — typically data recovery or fast cloning. Since Noms is append-only, every version is preserved. Snapshots are, therefore, a natural side effect. You can make a Noms “snapshot” — any commit in a dataset’s history — writeable by syncing it to a new dataset. Easy.

Dedup

The essence of dedup is that unique data should be stored exactly once. If you duplicate a file, a folder, or an entire filesystem, the storage consumption should be close to zero. Noms is content addressable, so unique data is only ever stored once. Every Noms dataset intrinsically removes duplicated data.

Consistency

A feature of a filesystem — arguably the feature of a filesystem — is that it shouldn't ever lose or corrupt your data. One common technique to ensure consistency is to write new data to a new location rather than overwriting old data — so-called copy-on-write (COW). Noms is append-only: it doesn't throw out (or overwrite) old data, so copying modified data is both required and explicit. Noms also recursively checksums all data — a feature of ZFS and btrfs, notably absent from APFS.

Backup

The ability to back up your data from a filesystem is almost as important as keeping it intact in the first place. ZFS, for example, lets you efficiently serialize and send the latest changes between systems. When pulling or pushing changes, git also efficiently serializes just the changed data. Noms does something similar with its structured data. Data differences are efficiently computed to optimize for minimal data transfer.

Noms has all the core components of a modern filesystem. My goal was to write the translation layer to expose filesystem semantics on top of the Noms interfaces.

Designing a Schema

Initially, Badly

It's in the name: Noms eats all the data. Feed it whatever data you like, and let Noms infer a schema as you go. For a filesystem, though, I wanted to define a fixed structure. I started with a schema modeled on a simplified ZFS. Filesystems keep track of files and directories with a structure called an "inode", each of which has a unique integer identifier, the "inode number". ZFS keeps track of files and directories with DMU objects named by their integer ID. The schema would use a Map<Number, Inode> to serve the same function (spoiler: read on and don't copy this!):

struct Filesystem {
     inodes: Map<Number, struct Inode {
          attr: struct Attr { /* e.g. permissions, modification time, etc. */ }
          contents: Union {
               struct File { data: Ref /* Noms pile of bytes */ } |
               struct Directory { contents: Map<String, Number> }
          }
     }>
     rootInode: Number
     maxInode: Number
}

Nice and simple. Files are just Noms Blobs represented by a sequence of bytes. Directories are a Map of strings (the name of the directory entry) to the inode number; the inode number can be used to find the actual content by looking in the top-level map.
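
As a concrete sketch of that lookup, here's how a path would resolve against this schema, with plain Go types standing in for the Noms values (the names here are made up for illustration; this is not the nomsfs code):

package fssketch

// Plain Go stand-ins for the first schema: a directory maps names to inode
// numbers, and a top-level map resolves inode numbers to inodes.
type inode struct {
	data []byte            // set for files
	dir  map[string]uint64 // name -> inode number, set for directories
}

// lookup resolves a path by alternating between a directory's entries and
// the top-level inodes map. Note that nothing in the types guarantees that
// every referenced inode number actually exists; that's left to the
// application, which becomes a problem below.
func lookup(inodes map[uint64]inode, root uint64, path []string) (inode, bool) {
	cur, ok := inodes[root]
	for _, name := range path {
		if !ok || cur.dir == nil {
			return inode{}, false
		}
		num, found := cur.dir[name]
		if !found {
			return inode{}, false
		}
		cur, ok = inodes[num]
	}
	return cur, ok
}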

Schema philosophy

This made sense for a filesystem. Did it make sense for Noms? I wasn't trying to put the APFS team out of work; rather, I was creating a portal from the shell or Finder into Noms. To evaluate the schema, I had the benefit of direct access to the Noms team (as do all developers at http://slack.noms.io/). I learned two guiding principles for data in Noms:

Noms data should be self-validating. As much as possible, the application should rely on Noms to ensure consistency. My schema failed this test because the relationship between inode numbers contained in directories and the entries of the inodes map was something my code alone could maintain and validate.

Noms data should be deterministic. A given collection of data should have a single representation; the Noms structures should be path and history independent. Two apparently identical datasets should be identical in the eyes of Noms to support efficient storage and transfer, and identical data should produce an identical hash at the root of the dataset. Here, again, my schema fell short because the inode number assigned to a given file or directory depended on how other objects were created. Two different users with two identical filesystems wouldn’t be able to simply sync the way they would with two identical git branches.

A Better Schema

My first try made for a fine filesystem, just not a Noms filesystem. With a better understanding of the principles, and with help from the Noms team, I built this schema:

struct Filesystem {
     root: struct Inode {
          attr: struct Attr { /* e.g. permissions, modification time, etc. */ }
          contents: Union {
               struct File { data: Ref<Blob> /* Noms pile of bytes */ } |
               struct Directory { contents: Map<String, Cycle<1>> }
          }
     }
}

Obviously simpler; the thing that bears explanation is the use of so-called “Cycle” types. A Cycle is a means of expressing a recursive relationship within Noms types. The integer parameter specifies the ancestor struct to which the cycle refers. Consider a linked list type:

struct LinkedList {
    data: Blob
    next: Cycle<0>
}

The "next" field refers to the immediately containing struct, LinkedList. In our filesystem schema, a Directory's contents are represented by a map of strings (directory entry names) to Inodes, with Cycle<1> referring to the struct "above" or "containing" the Directory type. (Read on for a visualization of this.)

Writing It

To build the filesystem I picked a FUSE binding for Go, dug into the Noms APIs, and wrestled my way through some Go heartache.

Working with Noms requires a slightly different mindset than other data stores. Recall in particular that Noms data is immutable. Adding a new entry into a Map creates a new Map. Setting a member of a Struct creates a new Struct. Changing a nested structure such as the one defined by our schema requires unzipping it and then zipping it back together. Here's a Go snippet that demonstrates that methodology for creating a new directory:
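
This is a self-contained sketch of that shape: plain Go maps stand in for the actual Noms types, and the names are invented for illustration. The real nomsfs code drives the Noms Go API, but the unzip-and-zip pattern is the same.

package main

import "fmt"

// Inode is an immutable stand-in for the Noms Inode struct: a directory
// holds a map of names to child inodes, and "modifying" a value always
// produces a new one.
type Inode struct {
	entries map[string]Inode // nil for a file
}

// withEntry returns a copy of dir with name mapped to child; dir itself is
// never modified.
func (dir Inode) withEntry(name string, child Inode) Inode {
	entries := make(map[string]Inode, len(dir.entries)+1)
	for k, v := range dir.entries {
		entries[k] = v
	}
	entries[name] = child
	return Inode{entries: entries}
}

// mkdir creates an empty directory at the given path, rebuilding the spine
// of the tree from the new leaf back up to the root: unzip down the path,
// then zip it back together with new values at every level.
func mkdir(root Inode, path []string) Inode {
	if len(path) == 0 {
		return root
	}
	child, ok := root.entries[path[0]]
	if !ok {
		child = Inode{entries: map[string]Inode{}}
	}
	return root.withEntry(path[0], mkdir(child, path[1:]))
}

func main() {
	before := Inode{entries: map[string]Inode{}}
	after := mkdir(before, []string{"docs", "papers"})
	// The old root is untouched; the new root shares any subtree that was
	// not on the modified path.
	fmt.Println(len(before.entries), len(after.entries)) // prints: 0 1
}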

Demo

Showing it off has all the normal glory of a systems demo! Check out the documentation for requirements.

Create and mount a filesystem from a new local Noms dataset:

$ go build
$ mkdir /var/tmp/mnt
$ go run nomsfs.go /var/tmp/nomsfs::fs /var/tmp/mnt
running...

You can open the folder and drop data into it.

Your database fell into my filesystem!

Now let’s take a look at the underlying Noms dataset:

$ noms show http://demo.noms.io/ahl_blog::fs
struct Commit {
  meta: struct {},
  parents: Set<Ref<Cycle<0>>>,
  value: struct Filesystem {
    root: struct Inode {
      attr: struct Attr {
        ctime: Number,
        gid: Number,
        mode: Number,
        mtime: Number,
        uid: Number,
        xattr: Map<String, Blob>,
      },
      contents: struct Directory {
        entries: Map<String, Cycle<1>>,
      } | struct Symlink {
        targetPath: String,
      } | struct File {
        data: Ref<Blob>,
      },
    },
  },
}({
  meta:  {},
  parents: {
    5v82rie0be68915n1q7pmcdi54i9tmgs,
  },
  value: Filesystem {
    root: Inode {
      attr: Attr {
        ctime: 1.4705227450393803e+09,
        gid: 502,
        mode: 511,
        mtime: 1.4705227450393803e+09,
        uid: 110853,
        xattr: {},
      },
      contents: Directory {
        entries: {
          "usenix_winter91_faulkner.pdf": Inode {
            attr: Attr {
              ctime: 1.4705228859273868e+09,
              gid: 502,
              mode: 420,
              mtime: 1.468425783e+09,
              uid: 110853,
              xattr: {
                "com.apple.FinderInfo": 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // 32 B
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00,
                "com.apple.quarantine": 30 30 30 31 3b 35 37 38 36 36 36 33 37 3b 53 61  // 21 B
                66 61 72 69 3b,
              },
            },
            contents: File {
              data: dmc45152ie46mn3ls92vvhnm41ianehn,
            },
          },
        },
      },
    },
  },
})

You can see the type at the top and then the actual filesystem contents. Let's look at a more complicated example where I've taken part of the Noms source tree and copied it to nomsfs. You can navigate around its structure courtesy of the Splore utility (take particular note of nested directories that show the recursive data definition described above):

Embedded ‘Splore! http://splore.noms.io/?db=https://demo.noms.io/ahl_blog&hash=2nhi5utm4s38hka22vt9ilv5i3l8r2ol

You can see all of the various states that the filesystem has been through — each state change — using noms log http://demo.noms.io/ahl_blog::fsnoms. You can sync it to your local computer with noms sync http://demo.noms.io/ahl_blog::fsnoms /var/tmp/fs or check out some previous state from the log (just like a filesystem snapshot). Diff two states from the log, or make your own changes and diff them against the original using noms diff.

Nom Nom Nom

It took less than 1000 lines of Go code to make Noms appear as a window in the Finder, eating data as quickly as I could drag and drop (try it!). Imagine what Noms might look like behind other known data interfaces; it could bring git semantics to existing islands of data. Noms could form the basis of a new type of data lake — maybe one that's simple and powerful enough to bring real results. Beyond the marquee features, Noms is fun to build with. I'm already working on my next Noms application.

In previous posts I discussed the problems with the legacy ZFS write throttle that cause degraded performance and wildly variable latencies. I then presented the new OpenZFS write throttle and I/O scheduler that Matt Ahrens and I designed. In addition to solving several problems in ZFS, the new approach was designed to be easy to reason about, measure, and adjust. In this post I’ll cover performance analysis and tuning — using DTrace of course. These details are intended for those using OpenZFS and trying to optimize performance — if you have only a casual interest in ZFS consider yourself warned!

Buffering dirty data

OpenZFS limits the amount of dirty data on the system according to the tunable zfs_dirty_data_max. Its default value is 10% of memory, up to 4GB. The tradeoffs are pretty simple:

Lower                                                     | Higher
Less memory reserved for use by OpenZFS                   | More memory reserved for use by OpenZFS
Able to absorb less workload variation before throttling  | Able to absorb more workload variation before throttling
Less data in each transaction group                       | More data in each transaction group
Less time spent syncing out each transaction group        | More time spent syncing out each transaction group
More metadata written due to less amortization            | Less metadata written due to more amortization

Most workloads contain variability. Think of the dirty data as a buffer for that variability. Let’s say the LUNs assigned to your OpenZFS storage pool are able to sustain 100MB/s in aggregate. If a workload consistently writes at 100MB/s then only a very small buffer would be required. If instead the workload oscillates between 200MB/s and 0MB/s for 10 seconds each, then a small buffer would limit performance. A buffer of 800MB would be large enough to absorb the full 20 second cycle over which the average is 100MB/s. A buffer of only 200MB would cause OpenZFS to start to throttle writes — inserting artificial delays — after less than 2 seconds during which the LUNs could flush 200MB of dirty data while the client tried to generate 400MB.

Track the amount of outstanding dirty data within your storage pool to know which way to adjust zfs_dirty_data_max:

txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool
dtrace: script 'dirty.d' matched 2 probes
CPU ID FUNCTION:NAME
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used

The write throttle kicks in once the amount of dirty data exceeds zfs_delay_min_dirty_percent of the limit (60% by default). If the amount of dirty data fluctuates above and below that threshold, it might be possible to avoid throttling by increasing the size of the buffer. If the metric stays low, you may reduce zfs_dirty_data_max. Weigh this tuning against other uses of memory on the system (a larger value means that there's less memory for applications or the OpenZFS ARC, for example).

A larger buffer also means that flushing a transaction group will take longer. This is relevant for certain OpenZFS administrative operations (sync tasks) that occur when a transaction group is committed to stable storage such as creating or cloning a new dataset. If the interactive latency of these commands is important, consider how long it would take to flush zfs_dirty_data_max bytes to disk. You can measure the time to sync transaction groups (recall, there are up to three active at any given time) like this:

txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

# dtrace -s duration.d pool
dtrace: script 'duration.d' matched 2 probes
CPU ID FUNCTION:NAME
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG). zfs_dirty_data_max puts a hard limit on the amount of data in memory that has not yet been written to the main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast devices such as the DDRDrive make for great log devices. As an aside, consider the ostensible upgrade that Oracle brought to the ZFS Storage Appliance a few years ago, replacing the 18GB "Logzilla" with a 73GB model.

I/O scheduler

Where ZFS had a single IO queue for all IO types, OpenZFS has five IO queues, one for each IO type: sync reads (for normal, demand reads), async reads (issued from the prefetcher), sync writes (to the intent log), async writes (bulk writes of dirty data), and scrub (scrub and resilver operations). Note that the bulk writes of dirty data described above are scheduled in the async write queue. See vdev_queue.c for the related tunables:

uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

Each of these queues has tunable values for the min and max number of outstanding operations of the given type that can be issued to a leaf vdev (LUN). The tunable zfs_vdev_max_active limits the total number of IOs issued to a single vdev. If its value is less than the sum of the zfs_vdev_*_max_active tunables, then the minimums come into play: the minimum number from each queue will be scheduled, and the remainder of zfs_vdev_max_active is filled from the queues in priority order.
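
As a conceptual sketch (not the actual vdev_queue.c logic, and with names invented for illustration), that selection works roughly like this, with the queues ordered from highest to lowest priority:

package vdevsketch

// Conceptual sketch of how one vdev's zfs_vdev_max_active slots get filled;
// the real logic lives in vdev_queue.c and differs in detail. The queues
// are assumed to be ordered from highest to lowest priority (sync read,
// sync write, async read, async write, scrub).
type ioQueue struct {
	name                 string
	minActive, maxActive int // zfs_vdev_<type>_{min,max}_active
	pending, active      int // ops waiting vs. ops already issued to the LUN
}

// nextQueue returns the queue to issue the next operation from, or nil if
// nothing more should be issued to this vdev right now.
func nextQueue(queues []ioQueue, vdevMaxActive int) *ioQueue {
	totalActive := 0
	for i := range queues {
		totalActive += queues[i].active
	}
	// Never exceed the vdev-wide cap.
	if totalActive >= vdevMaxActive {
		return nil
	}
	// First, guarantee every queue its minimum, regardless of priority.
	for i := range queues {
		if q := &queues[i]; q.pending > 0 && q.active < q.minActive {
			return q
		}
	}
	// Then hand out the remaining slots in priority order, honoring each
	// queue's own maximum.
	for i := range queues {
		if q := &queues[i]; q.pending > 0 && q.active < q.maxActive {
			return q
		}
	}
	return nil
}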

At a high level, the appropriate values for these tunables will be specific to your LUNs. Higher maximums lead to higher throughput with potentially higher latency. On some devices such as storage arrays with distinct hardware for reads and writes, some of the queues can be thought of as independent; on other devices such as traditional HDDs, reads and writes will likely impact each other.

A simple way to tune these values is to monitor I/O throughput and latency under load. Increase values by 20-100% until you find a point where throughput no longer increases, while latency remains acceptable.

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

io:::start
{
        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;
}

io:::done
/ts[args[0]->b_edev, args[0]->b_lblkno]/
{
        this->delta = (timestamp - ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000;
        this->name = (args[0]->b_flags & (B_READ | B_WRITE)) == B_READ ?
            "read " : "write ";

        @q[this->name] = quantize(this->delta);
        @a[this->name] = avg(this->delta);
        @v[this->name] = stddev(this->delta);
        @i[this->name] = count();
        @b[this->name] = sum(args[0]->b_bcount);

        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;
}

END
{
        printa(@q);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@b, (timestamp - start) / 1000000000 * 1024);

        printf("%-30s %11s %11s %11s %11s\n", "", "avg latency", "stddev",
            "iops", "throughput");
        printa("%-30s %@9uus %@9uus %@9u/s %@8uk/s\n", @a, @v, @i, @b);
}

# dtrace -s rw.d -c 'sleep 60'

  read
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         23
             128 |@                                        655
             256 |@@@@                                     1638
             512 |@@                                       743
            1024 |@                                        380
            2048 |@@@                                      1341
            4096 |@@@@@@@@@@@@                             5295
            8192 |@@@@@@@@@@@                              5033
           16384 |@@@                                      1297
           32768 |@@                                       684
           65536 |@                                        400
          131072 |                                         225
          262144 |                                         206
          524288 |                                         127
         1048576 |                                         19
         2097152 |                                         0

  write
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         47
             128 |                                         469
             256 |                                         591
             512 |                                         327
            1024 |                                         924
            2048 |@                                        6734
            4096 |@@@@@@@                                  43416
            8192 |@@@@@@@@@@@@@@@@@                        102013
           16384 |@@@@@@@@@@                               60992
           32768 |@@@                                      20312
           65536 |@                                        6789
          131072 |                                         860
          262144 |                                         208
          524288 |                                         153
         1048576 |                                         36
         2097152 |                                         0

                               avg latency      stddev        iops  throughput
write                              19442us     32468us      4064/s   261889k/s
read                               23733us     88206us       301/s    13113k/s

Async writes

Dirty data governed by zfs_dirty_data_max is written to disk via async writes. The I/O scheduler treats async writes a little differently than other operations. The number of concurrent async writes scheduled depends on the amount of dirty data on the system. Recall that there is a fixed (but tunable) limit of dirty data in memory. With a small amount of dirty data, the scheduler will only schedule a single operation (zfs_vdev_async_write_min_active); the idea is to preserve low latency of synchronous operations when there isn't much write load on the system. As the amount of dirty data increases, the scheduler will push the LUNs harder to flush it out by issuing more concurrent operations.

The old behavior was to schedule a fixed number of operations regardless of the load. This meant that the latency of synchronous operations could fluctuate significantly. While writing out dirty data ZFS would slam the LUNs with writes, contending with synchronous operations and increasing their latency. After the syncing transaction group had completed, there would be a period of relatively low async write activity during which synchronous operations would complete more quickly. This phenomenon was known as “picket fencing” due to the square wave pattern of latency over time. The new OpenZFS I/O scheduler is optimized for consistency.

In addition to tuning the minimum and maximum number of concurrent operations sent to the device, there are two other tunables related to asynchronous writes: zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent. Along with the min and max operation counts (zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active), these four tunables define a piece-wise linear function that determines the number of operations scheduled, as depicted in this lovely ASCII art graph excerpted from the comments:

 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent

In a relatively steady state we’d like to see the amount of outstanding dirty data stay in a narrow band between the min and max percentages, by default 30% and 60% respectively.
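
The curve itself fits in a few lines of Go. This sketch approximates what vdev_queue_max_async_writes computes with the default tunable values; the kernel version differs in its details, and the function name here is just for illustration.

package vdevsketch

// Rough sketch of the piece-wise linear curve above, using the default
// tunable values.
func maxAsyncWrites(dirty, dirtyMax uint64) uint64 {
	const (
		minActive       = 1  // zfs_vdev_async_write_min_active
		maxActive       = 10 // zfs_vdev_async_write_max_active
		minDirtyPercent = 30 // zfs_vdev_async_write_active_min_dirty_percent
		maxDirtyPercent = 60 // zfs_vdev_async_write_active_max_dirty_percent
	)
	lo := dirtyMax * minDirtyPercent / 100
	hi := dirtyMax * maxDirtyPercent / 100
	switch {
	case dirty <= lo:
		return minActive
	case dirty >= hi:
		return maxActive
	default:
		// Interpolate linearly between the two break points.
		return minActive + (dirty-lo)*(maxActive-minActive)/(hi-lo)
	}
}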

Tune zfs_vdev_async_write_max_active as described above to maximize throughput without hurting latency. The only reason to increase zfs_vdev_async_write_min_active is if additional writes have little to no impact on latency. While this could be used to make sure data reaches disk sooner, an alternative approach is to decrease zfs_vdev_async_write_active_min_dirty_percent, thereby starting to flush data before as much dirty data has accumulated.

To tune the min and max percentages, watch both latency and the number of scheduled async write operations. If the operation count fluctuates wildly and impacts latency, you may want to flatten the slope by decreasing the min and/or increasing the max (note that you will likely want to increase zfs_delay_min_dirty_percent if you increase zfs_vdev_async_write_active_max_dirty_percent — see below).

#pragma D option aggpack
#pragma D option quiet

fbt::vdev_queue_max_async_writes:entry
{
        self->spa = args[0];
}
fbt::vdev_queue_max_async_writes:return
/self->spa && self->spa->spa_name == $$1/
{
        @ = lquantize(args[1], 0, 30, 1);
}

tick-1s
{
        printa(@);
        clear(@);
}

fbt::vdev_queue_max_async_writes:return
/self->spa/
{
        self->spa = 0;
}

# dtrace -s q.d dcenter

min .--------------------------------. max | count
< 0 : ▃▆ : >= 30 | 23279

min .--------------------------------. max | count
< 0 : █ : >= 30 | 18453

min .--------------------------------. max | count
< 0 : █ : >= 30 | 27741

min .--------------------------------. max | count
< 0 : █ : >= 30 | 3455

min .--------------------------------. max | count
< 0 : : >= 30 | 0

Write delay

In situations where LUNs cannot keep up with the incoming write rate, OpenZFS artificially delays writes to ensure consistent latency (see the previous post in this series). Until a certain amount of dirty data accumulates there is no delay. When enough dirty data accumulates OpenZFS gradually increases the delay. By delaying writes OpenZFS effectively pushes back on the client to limit the rate of writes by forcing artificially higher latency. There are two tunables that pertain to delay: how much dirty data there needs to be before the delay kicks in, and the factor by which that delay increases as the amount of outstanding dirty data increases.

The tunable zfs_delay_min_dirty_percent determines when OpenZFS starts delaying writes. The default is 60%; note that we don’t start delaying client writes until the IO scheduler is pushing out data as fast as it can (zfs_vdev_async_write_active_max_dirty_percent also defaults to 60%).

The other relevant tunable, zfs_delay_scale, is really the only magic number here. It roughly corresponds to the inverse of the maximum number of operations per second (denominated in nanoseconds), and is used as a scaling factor.
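
The resulting delay is zero until the threshold and then rises steeply as dirty data approaches zfs_dirty_data_max, capped at zfs_delay_max_ns. A rough sketch of that shape follows; it approximates the curve rather than reproducing the exact OpenZFS arithmetic, and the function name is invented for illustration.

package delaysketch

// Approximate shape of the artificial write delay as dirty data grows.
// delayScale stands in for zfs_delay_scale (nanoseconds).
func writeDelayNanos(dirty, dirtyMax, delayScale uint64) uint64 {
	const (
		minDirtyPercent = 60                // zfs_delay_min_dirty_percent
		delayMaxNanos   = 100 * 1000 * 1000 // zfs_delay_max_ns, 100ms by default
	)
	minDirty := dirtyMax * minDirtyPercent / 100
	if dirty <= minDirty {
		return 0 // below the threshold no delay is inserted
	}
	if dirty >= dirtyMax {
		return delayMaxNanos // at the limit, the delay is pegged at its maximum
	}
	// The delay grows slowly just past the threshold and steeply as dirty
	// data approaches zfs_dirty_data_max.
	delay := delayScale * (dirty - minDirty) / (dirtyMax - dirty)
	if delay > delayMaxNanos {
		delay = delayMaxNanos
	}
	return delay
}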

Delaying writes is an aggressive step to ensure consistent latency. It is required if the client really is pushing more data than the system can handle, but unnecessarily delaying writes degrades overall throughput. There are two goals to tuning delay: reduce or remove unnecessary delay, and ensure consistent delays when needed.

First check to see how often writes are delayed. This simple DTrace one-liner does the trick:

# dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'

If a relatively small percentage of writes are delayed, consider increasing the amount of dirty data allowed (zfs_dirty_data_max) or even pushing out the point at which delays start (zfs_delay_min_dirty_percent). When increasing zfs_dirty_data_max, consider the other users of DRAM on the system, and also note that a small number of small delays does not impact performance significantly.

If many writes are being delayed, the client really is trying to push data faster than the LUNs can handle. In that case, check for consistent latency, again, with a DTrace one-liner:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'

If variance is high, or if many write operations are being delayed for the maximum zfs_delay_max_ns (100ms by default), then try increasing zfs_delay_scale by a factor of 2 or more, or try delaying earlier by reducing zfs_delay_min_dirty_percent (remember to also reduce zfs_vdev_async_write_active_max_dirty_percent).

Summing up

Our experience at Delphix tuning the new write throttle has been so much better than in the old ZFS world: each tunable has a clear and comprehensible purpose, their relationships are well-defined, and the issues in tension pulling values up or down are both easy to understand and — most importantly — easy to measure. I hope that this tuning guide helps others trying to get the most out of their OpenZFS systems, whether on Linux, FreeBSD, Mac OS X, or illumos — not to mention the support engineers for the many products that incorporate OpenZFS into a larger solution.

It’s no small feat to build a stable, modern filesystem. The more I work with ZFS, the more impressed I am with how much it got right, and how malleable it’s proved. It has evolved to fix shortcomings and accommodate underlying technological shifts. It’s not surprising though that even while its underpinnings have withstood the test of production use, ZFS occasionally still shows the immaturity of the tween that it is.

Even before the ZFS storage appliance launched in 2008, ZFS was heavily used and discussed in the Solaris and OpenSolaris communities, the frequent subject of praise and criticism. A common grievance was that write-heavy workloads would consume massive amounts of system memory… and then render the system unusable as ZFS dutifully deposited the new data onto the often anemic storage (often a single spindle for OpenSolaris users).

For workloads whose ability to generate new data far outstripped the throughput of persistent storage, it became clear that ZFS needed to impose some limits. ZFS should have effective limits on the amount of system memory devoted to “dirty” (modified) data. Transaction groups should be bounded to prevent high latency IO and administrative operations. At a high level, ZFS transaction groups are just collections of writes (transactions), and there can be three transaction groups active at any given time; for a more thorough treatment, check out last year’s installment of ZFS knowledge.

Write Throttle 1.0 (2008)

The proposed solution appealed to an intuitive understanding of the system. At the highest level, don't let transaction groups grow indefinitely. When a transaction group reached a prescribed size, ZFS would create a new transaction group; if three already existed, it would block waiting for the syncing transaction group to complete. Limiting the size of each transaction group yielded a number of benefits. ZFS would no longer consume vast amounts of system memory (quelling outcry from the user community). Administrative actions that execute at transaction group boundaries would be more responsive. And synchronous, latency-sensitive operations wouldn't have to contend with a deluge of writes from the syncing transaction group.

So how big should transaction groups be? The solution included a target duration for writing out a transaction group (5 seconds). The size of each transaction group would be based on that time target and an inferred write bandwidth. Duration times bandwidth equals target size. The inferred bandwidth would be recomputed after each transaction group.

When the size limit for a transaction group was reached, new writes would wait for the next transaction group to open. This could be nearly instantaneous if there weren’t already three transaction groups active, or it could incur a significant delay. To ameliorate this, the write throttle would insert a 10ms delay for all new writes once 7/8th of the size had been consumed.

See the gory details in the git commit.

Commentary

That initial write throttle made a comprehensible, earnest effort to address some critical problems in ZFS. And, to a degree, it succeeded. The lack of rigorous ZFS performance testing at that time, though, is reflected in the glaring deficiencies of that initial write throttle. A simple logic bug lingered for over two months, causing all writes to be delayed by 10ms, not just those executed after the transaction group had reached 7/8ths of its target capacity — trivial, yes, but debilitating and telling. The computation of the write throttle resulted in values that varied rapidly; eventually a slapdash effort at hysteresis was added.

Stepping back, the magic constants arouse concern. Why should transaction groups last 5 seconds? Yes, they should be large enough to amortize metadata updates within a transaction group, and they should not be so large that they cause administrative unresponsiveness. For the ZFS storage appliance we experimented with lower values in an effort to smooth out the periodic bursts of writes — an effect we refer to as “picket-fencing” for its appearance in our IO visualization interface. Even more glaring, where did the 7/8ths cutoff come from or the 10ms worth of delay? Even if the computed throughput was dead accurate, the algorithm would lead to ZFS unnecessarily delaying writes. At first blush, this scheme was not fatally flawed, but surely arbitrary, disconnected from real results, and nearly impossible to reason about on a complex system.

Problems

The write throttle demonstrated problems more severe than the widely observed picket-fencing. While ZFS attempted to build a stable estimate of write throughput capacity, the computed number would, in practice, swing wildly. As a result, ZFS would variously over-throttle and under-throttle. It would often insert the 10ms delay, but that delay was intended merely as a softer landing than the hard limit. Once reached, the hard limit — still the primary throttling mechanism — could impose delays well in excess of a second.

The graph below shows the frequency (count) and total contribution (time) for power-of-two IO latencies from a production system.

The latency frequencies clearly show a tri-modal distribution: writes that happen at the speed of software (much less than 1ms), writes that are delayed by the write throttle (tens of milliseconds), and writes that bump up against the transaction group size (hundreds of milliseconds up to multiple seconds).

The total accumulated time for each latency bucket highlights the dramatic impact of outliers. The 110 operations taking a second or longer contribute more to the overall elapsed time than the time of the remaining 16,000+ operations.

A new focus

The first attempt at the write throttle addressed a critical need, but was guided by the need to patch a hole rather than an understanding of the fundamental problem. The rate at which ZFS can move data to persistent storage will vary for a variety of reasons: synchronous operations will consume bandwidth; not all writes impact storage in the same way — scattered writes to areas of high fragmentation may be slower than sequential writes. Regardless of the real, instantaneous throughput capacity, ZFS needs to pass on the effective cost — as measured in write latency — to the client. Write throttle 1.0 carved this cost into three tranches: writes early in a transaction group that pay nothing, those late in a transaction group that pay 10ms each, and those at the end that pick up the remainder of the bill.

If the rate of incoming data was less than the throughput capacity of persistent storage the client should be charged nothing — no delay should be inserted. The write throttle failed by that standard as well, delaying 10ms in situations that warranted no surcharge.

Ideally ZFS should throttle writes in a way that optimizes for minimized and consistent latency. As we developed a new write throttle, our objectives were low variance for write latency, and steady and consistent (rather than bursty) writes to persistent storage. In my next post I’ll describe the solution that Matt Ahrens and I designed for OpenZFS.

I’ve been watching ZFS from moments after its inception at the hands of Matt Ahrens and Jeff Bonwick, so I’m excited to see it enter its newest phase of development in OpenZFS. While ZFS has long been regarded as the hottest filesystem on 128 bits, and has shipped in many different products, what’s been most impressive to me about ZFS development has been the constant iteration and reinvention.

Before shipping in Solaris 10 update 2, major components of ZFS had already advanced to "2.0" and "3.0". I've been involved with several ZFS-related products: Solaris 10, the ZFS Storage Appliance (nee Sun Storage 7000), and the Delphix Engine. Each new product and each new use has stressed ZFS in new ways, but also brought renewed focus to development. I've come to realize that ZFS will never be completed. I thought I'd use this post to cover the many ways that ZFS had failed in the products I've worked on over the years — and it has failed spectacularly at times — but that distracted from the most important aspect of ZFS. For each new failure in each new product, with each new use and each new workload, ZFS has adapted and improved.

OpenZFS doesn't need a caretaker community for a finished project; if that were the case, porting OpenZFS to Linux, FreeBSD, and Mac OS X would have been the end. Instead, it was the beginning. The need for the OpenZFS community grew out of the porting efforts of those who wanted the world's most advanced filesystem on their platforms and in their products. I wouldn't trust my customers' data to a filesystem that hadn't been through those trials and triumphs over more than a decade. I can't wait to see the next phase of evolution that OpenZFS brings.

If you're at LinuxCon today, stop by the talk by Matt Ahrens and Brian Behlendorf for more on OpenZFS; follow @OpenZFS for all OpenZFS news.

I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A particular point of interest has been the ZFS write throttle, the mechanism ZFS uses to avoid filling all of system memory with modified data. I’m eager to write about the strides we’re making in that regard at Delphix, but it’s hard to appreciate without an understanding of how ZFS batches data. Unfortunately that explanation is literally nowhere to be found. Back in 2001 I had not yet started working on DTrace, and was talking to Matt and Jeff, the authors of ZFS, about joining them. They had only been at it for a few months; I was fortunate to be in a conference with them as the ideas around transaction groups formulated. Transaction groups are how ZFS batches up chunks of data to be written to disk (“groups” of “transactions”). Jeff stood at the whiteboard and drew the progression of states for transaction groups, from open, accepting new transactions, to quiescing, allowing transactions to complete, to syncing, writing data out to disk. As far as I can tell, that was both the first time that picture had been drawn and the last. If you search for information on ZFS transaction groups you’ll find mention of those states… and not much else. The header comment in usr/src/uts/common/fs/zfs/txg.c isn’t particularly helpful:

/*
 * Pool-wide transaction groups.
 */

I set out to write a proper description of ZFS transaction groups. I’m posting it here first, and I’ll be offering it as a submission to illumos. Many thanks to Matt Ahrens, George Wilson, and Max Bruning for their feedback.

ZFS Transaction Groups

ZFS transaction groups are, as the name implies, groups of transactions that act on persistent state. ZFS asserts consistency at the granularity of these transaction groups. Each successive transaction group (txg) is assigned a 64-bit consecutive identifier. There are three active transaction group states: open, quiescing, or syncing. At any given time, there may be an active txg associated with each state; each active txg may either be processing, or blocked waiting to enter the next state. There may be up to three active txgs, and there is always a txg in the open state (though it may be blocked waiting to enter the quiescing state). In broad strokes, transactions — operations that change in-memory structures — are accepted into the txg in the open state, and are completed while the txg is in the open or quiescing states. The accumulated changes are written to disk in the syncing state.

Open

When a new txg becomes active, it first enters the open state. New transactions — updates to in-memory structures — are assigned to the currently open txg. There is always a txg in the open state so that ZFS can accept new changes (though the txg may refuse new changes if it has hit some limit). ZFS advances the open txg to the next state for a variety of reasons such as it hitting a time or size threshold, or the execution of an administrative action that must be completed in the syncing state.

Quiescing

After a txg exits the open state, it enters the quiescing state. The quiescing state is intended to provide a buffer between accepting new transactions in the open state and writing them out to stable storage in the syncing state. While quiescing, transactions can continue their operation without delaying either of the other states. Typically, a txg is in the quiescing state very briefly since the operations are bounded by software latencies rather than, say, slower I/O latencies. After all transactions complete, the txg is ready to enter the next state.

Syncing

In the syncing state, the in-memory state built up during the open and (to a lesser degree) the quiescing states is written to stable storage. The process of writing out modified data can, in turn, modify more data. For example, when we write new blocks, we need to allocate space for them; those allocations modify metadata (space maps)… which themselves must be written to stable storage. During the sync state, ZFS iterates, writing out data until it converges and all in-memory changes have been written out. The first such pass is the largest as it encompasses all the modified user data (as opposed to filesystem metadata). Subsequent passes typically have far less data to write as they consist exclusively of filesystem metadata.

To ensure convergence, after a certain number of passes ZFS begins overwriting locations on stable storage that had been allocated earlier in the syncing state (and subsequently freed). ZFS usually allocates new blocks to optimize for large, continuous writes. For the syncing state to converge, however, it must complete a pass where no new blocks are allocated, since each allocation requires a modification of persistent metadata. Further, to hasten convergence, after a prescribed number of passes ZFS also defers frees and stops compressing.

In addition to writing out user data, we must also execute synctasks during the syncing context. A synctask is the mechanism by which some administrative activities work such as creating and destroying snapshots or datasets. Note that when a synctask is initiated it enters the open txg, and ZFS then pushes that txg as quickly as possible to completion of the syncing state in order to reduce the latency of the administrative activity. To complete the syncing state, ZFS writes out a new uberblock, the root of the tree of blocks that comprise all state stored on the ZFS pool. Finally, if there is a quiesced txg waiting, we signal that it can now transition to the syncing state.

What else?

Please let me know if you have suggestions for how to improve the descriptions above. There's more to be written on the specifics of the implementation, transactions, the DMU, and, well, ZFS in general. One thing that I'd note is that Matt mentioned to me recently that were he starting from scratch, he might eliminate the quiescing state. I didn't understand fully until I researched the subsystem. Typically transactions take a very brief amount of time to "complete", time measured by CPU latency as opposed, say, to I/O latency. Had the quiescing phase been merged into the syncing phase, the design would be slightly simpler, and it would eliminate the mostly idle intermediate phase where a bunch of dirty data can sit in memory untouched.

Next I'll write about the ZFS write throttle, its various brokenness, and our ideas for how to fix it.

Back in October I was pleased to attend — and my employer, Delphix, was pleased to sponsor — illumos day and ZFS day, run concurrently with Oracle Open World. Inspired by the success of dtrace.conf(12) in the Spring, the goal was to assemble developers, practitioners, and users of ZFS and illumos-derived distributions to educate, share information, and discuss the future.

illumos day

The week started with the developer-centric illumos day. While illumos picked up the torch when Oracle re-closed OpenSolaris, each project began with a very different focus. Sun and the OpenSolaris community obsessed over inclusion and developer adoption — often counterproductively. The illumos community is led by those building products based on the unique technologies in illumos — DTrace, ZFS, Zones, COMSTAR, etc. While all are welcome, it's those who contribute the most whose voices are most relevant.

I was asked to give a talk about technologies unique to illumos that are unlikely to appear in Oracle Solaris. It was only when I started to prepare the talk that the different priorities of illumos and Oracle Solaris came into sharp focus. In the illumos community, we've advanced technologies such as ZFS in ways that would benefit Oracle Solaris greatly, but Oracle has made it clear that open source is anathema for its version of Solaris. For example, at Delphix we've recently been fixing bugs, asking ourselves, "how has Oracle never seen this?".

Yet the differences between illumos and Oracle Solaris are far deeper. In illumos we’re building products that rely on innovation and differentiation in the operating system, and it’s those higher-level products that our various customers use. At Oracle, the priorities are more traditional: support for proprietary SPARC platforms, packaging and updating for administrators, and ease-of-use. In my talk, rather than focusing on the sundry contributions to illumos, I picked a few of my favorites. The slides are more or less incomprehensible on their own; many thanks to Deirdre Straughan for posting the video (and for putting together the event!) — check out 40:30 for a photo of Jean-Luc Picard attending the DTrace talk at OOW.

Video: https://www.youtube.com/watch?v=7YN6_eRIWWc&t=0m19s

ZFS day

While illumos day was for developers, ZFS day was for users of ZFS to learn from each other's experiences, and hear from ZFS developers. I had the ignominious task of presenting an update on the Hybrid Storage Pool (HSP). We developed the HSP at Fishworks as the first enterprise storage system to add flash memory into the storage hierarchy to accelerate reads and writes. Since then, economics and physics have thrown up some obstacles: DRAM has gotten cheaper, and flash memory is getting harder and harder to turn into a viable enterprise solution. In addition, the L2ARC, which adds flash as a ZFS read cache, has languished; it has serious problems that no one has been motivated or proficient enough to address.

I’ll warn you that after the explanation of the HSP, it’s mostly doom and gloom (also I was sick as a dog when I prepared and gave the talk), but check out the slides and video for more on the promise and shortcomings of the HSP.

Video: http://www.youtube.com/watch?v=P77HEEgdnqE&feature=youtu.be

Community

For both illumos day and ZFS day, it was a mostly full house. Reuniting with the folks I already knew was fun as always, but even better was to meet so many who I had no idea were building on illumos or using ZFS. The events highlighted that we need to facilitate more collaboration — especially around ZFS — between the illumos distros, FreeBSD, and Linux — hell, even Oracle Solaris would be welcome.
