Bill Pijewski's Blog

Our ZFS I/O Throttle

March 1, 2011

When designing a cloud computing platform, a provider should take care to mitigate performance vagaries due to multi-tenant effects. One physical machine runs many virtual machines, and since the load on each virtual machine is neither consistent nor uniformly distributed, a burst of activity in one virtual machine will affect the performance of the others. One way to avoid these multi-tenant effects is to overprovision the system to handle all spikes in activity, but that approach leaves machines underutilized and undermines the economics of cloud computing.

Here at Joyent, we use Solaris zones to host a cloud platform. The platform uses OS-level virtualization: Solaris zones are lightweight containers built into the underlying operating system. We provision a zone (also known as a SmartMachine) for each customer, and this architecture grants us additional flexibility when allocating resources to zones. The global zone can observe the activity of all customer zones and can coordinate with the kernel to optimize resource management between them. Jerry has already covered some of this architecture in his earlier post.

Of the four basic computing resources (CPU, memory, I/O, and network bandwidth), we have reasonable solutions for managing CPU and memory. For almost all customer workloads, network bandwidth has not been a bottleneck, though that may change as applications become more and more distributed. Until now, I/O contention has been a major pain point for customers. On one machine, a single zone can issue a stream of I/O operations, usually synchronous writes, that disrupts I/O performance for all other zones. This problem is further exacerbated by ZFS, which buffers all asynchronous writes in a single TXG (transaction group), a set of blocks which are atomically flushed to disk. The process of flushing a TXG can occupy all of a device's I/O bandwidth, thereby starving out any pending read operations.
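To make "synchronous writes" concrete: these are writes the application has asked to be committed to stable storage before the call returns, either by following buffered writes with fsync(3C) or by opening the file with O_SYNC/O_DSYNC. Here is a minimal sketch of both patterns; the function names and arguments are placeholders for illustration, not code from this post:

#include <fcntl.h>
#include <unistd.h>

/*
 * Buffered writes land in the current ZFS transaction group and reach disk
 * at the filesystem's discretion; fsync(3C) forces them to stable storage.
 */
void
write_then_fsync(int fd, const char *buf, size_t len)
{
	(void) write(fd, buf, len);   /* buffered, not yet on stable storage */
	(void) fsync(fd);             /* returns only once the data is on disk */
}

/*
 * Alternatively, opening with O_DSYNC (or O_SYNC) makes every write
 * synchronous: each write(2) returns only after the data is stable.
 */
int
open_for_sync_writes(const char *path)
{
	return (open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644));
}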

The Solution

Jerry Jelinek and I set out to solve this problem several months ago. Jerry has framed the problem well in an earlier blog post, so I'll explain what we've done and how it works. First, when we sat down to solve this problem, we brainstormed the requirements we'd like in our eventual solution:

  • We want to ensure consistent and predictable I/O latency across all zones.
  • Sequential and random workloads have very different characteristics, so it is a non-starter to track IOPS or throughput.
  • A zone should be able to use the full disk bandwidth if no other zone is actively using the disk.

Our ZFS I/O throttle has two components: one to track and account for each zone’s I/O requests, and another to throttle each zone’s operations when it exceeds its fair share of disk I/O. When the throttle detects that a zone is consuming more than is appropriate, each read or write system call is delayed by up to 100 microseconds, which we’ve found is sufficient to allow other zones to interleave I/O requests during those delays.

The throttle calculates an I/O utilization metric for each zone using the following formula:

(# of read syscalls) x (Average read latency) + (# of write syscalls) x (Average write latency)

Yes, the mapping between system calls and physical I/Os isn't 1:1 due to I/O aggregation and prefetching, but we're only trying to detect gross inequities between zones, not small deltas. Once each zone has its utilization metric, the I/O throttle compares I/O utilization across all zones, and if a zone has a higher-than-average I/O utilization, system calls from that zone are throttled. That is, each system call is delayed by up to 100 microseconds, depending on the severity of the inequity between the various zones.
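As an illustration of the mechanism (not the actual kernel code), here is a minimal sketch in C of how a per-zone utilization metric and a proportional delay could be derived from the formula above; the structure, field names, and the proportional scaling are assumptions made for this example.

#include <stdint.h>

#define	MAX_DELAY_US	100	/* maximum per-syscall delay, per the design above */

/* Hypothetical per-zone accounting; names are illustrative only. */
typedef struct zone_io {
	uint64_t nread;		/* read syscalls issued by the zone     */
	uint64_t nwrite;	/* write syscalls issued by the zone    */
	uint64_t avg_read_us;	/* average read latency, microseconds   */
	uint64_t avg_write_us;	/* average write latency, microseconds  */
} zone_io_t;

/* Utilization = (# reads) x (avg read latency) + (# writes) x (avg write latency). */
static uint64_t
zone_io_util(const zone_io_t *z)
{
	return (z->nread * z->avg_read_us + z->nwrite * z->avg_write_us);
}

/*
 * Compare one zone's utilization against the average across all zones and
 * return a delay, in microseconds, for its next read/write syscall.  Zones
 * at or below the average are not throttled; zones above it are delayed in
 * proportion to the inequity, capped at MAX_DELAY_US.
 */
static uint64_t
zone_io_delay_us(const zone_io_t *zones, int nzones, int which)
{
	uint64_t total = 0, avg, mine, delay;
	int i;

	for (i = 0; i < nzones; i++)
		total += zone_io_util(&zones[i]);
	avg = (nzones > 0) ? total / nzones : 0;

	mine = zone_io_util(&zones[which]);
	if (mine == 0 || mine <= avg)
		return (0);

	delay = (mine - avg) * MAX_DELAY_US / mine;
	return (delay > MAX_DELAY_US ? MAX_DELAY_US : delay);
}

In the real throttle, this accounting is maintained by the kernel for each zone, and the delay is applied to the zone's read and write system calls as described above.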

Performance Results

The following performance results were gathered on a test machine in our development lab. The vfsstat tool is an iostat-like tool I’ve written which reports VFS operations, throughput, and latency on a per-zone basis. I’ll explain more about vfsstat in a later blog entry, but for this example, focus on r/s (VFS reads per second), w/s (VFS writes per second), read_t (Average VFS read latency), and writ_t (Average VFS write latency).

In zone z02, I've started three threads issuing random reads to a 500GB file. That file is sufficiently large that no significant portion of it can be cached in the ARC, so 90%+ of reads will have to go to disk. In zone z03, I'm running a benchmarking tool I wrote called fsyncbomb. This tool is designed to emulate a workload which has caused trouble for us in the past: a streaming write workload with periodic calls to fsync(3C) to flush that data to disk. In the results below, fsyncbomb writes 1GB of data per file, then fsync(3C)'s that file before moving on to the next file. It round-robins between 100 files, but since it truncates each file before rewriting it, this workload is essentially an append-only workload.
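For a sense of what fsyncbomb is doing, here is a rough sketch of the loop described above. fsyncbomb itself isn't published with this post, so the file names, chunk size, and structure below are assumptions, not its actual implementation.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	NFILES		100
#define	FILE_SIZE	(1024LL * 1024 * 1024)	/* 1GB per file */
#define	CHUNK		(128 * 1024)		/* write in 128KB chunks */

int
main(void)
{
	char buf[CHUNK];
	char path[64];
	int i;

	memset(buf, 'x', sizeof (buf));

	for (;;) {
		/* Round-robin between NFILES files, truncating each before rewriting it. */
		for (i = 0; i < NFILES; i++) {
			(void) snprintf(path, sizeof (path), "file.%03d", i);
			int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
			if (fd < 0) {
				perror("open");
				exit(1);
			}

			/* Stream 1GB of data, then force it to disk with fsync(3C). */
			for (long long off = 0; off < FILE_SIZE; off += CHUNK) {
				if (write(fd, buf, CHUNK) != CHUNK) {
					perror("write");
					exit(1);
				}
			}
			if (fsync(fd) != 0)
				perror("fsync");
			(void) close(fd);
		}
	}
	return (0);
}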

As a baseline, let’s look at the performance of each workload when run separately:

[root@z02 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 149.8    0.0    1.2    0.0    0.0   3.0   0.0   20.0    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 153.3    0.0    1.2    0.0    0.0   3.0   0.0   19.6    0.0 100   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 156.8    0.0    1.2    0.0    0.0   3.0   0.0   19.1    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 162.3    0.0    1.3    0.0    0.0   3.0   0.0   18.5    0.0 100   0 z02
[root@z03 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 72846.2    0.0  569.1    0.0   0.0   0.9    0.0    0.0   0  86 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 75292.2    0.0  588.2    0.0   0.0   0.9    0.0    0.0   0  89 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 76149.6    0.0  594.9    0.0   0.0   0.9    0.0    0.0   0  86 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 81295.6    0.0  635.1    0.0   0.0   0.9    0.0    0.0   0  86 z03

The random reader zone is averaging about 150 IOPS with an average latency around 20ms, and the fsyncbomb zone achieves a throughput around 600 MB/s.

Now, when the two workloads are run together without I/O throttling:

[root@z02 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  26.8    0.0    0.2    0.0    0.0   3.0   0.0  111.9    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  30.8    0.0    0.2    0.0    0.0   3.0   0.0   97.4    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  27.4    0.0    0.2    0.0    0.0   3.0   0.0  109.5    0.0 100   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  28.6    0.0    0.2    0.0    0.0   3.0   0.0  104.9    0.0  99   0 z02
[root@z03 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 66662.1    0.0  520.8    0.0   0.0   0.9    0.0    0.0   0  89 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 51410.3    0.0  401.6    0.0   0.0   0.9    0.0    0.0   0  90 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 56404.6    0.0  440.7    0.0   0.0   0.9    0.0    0.0   0  93 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 60773.1    0.0  474.8    0.0   0.0   0.9    0.0    0.0   0  92 z03

The streaming write performance has suffered a bit, but the read latency is exceeding 100ms! This zone is going to suffer terrible application performance; as Brendan has explained, I/O latency of this magnitude has a direct impact on applications. This is an unacceptable result: one zone should not be able to induce such pathological I/O latency in another zone.

What’s even worse, the average read latency is over 100ms, but there are certain operations which exceed one second! A DTrace script shows the exact latency distribution:

  read (us)
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@                                        3
               2 |@                                        5
               4 |                                         2
               8 |                                         1
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |@                                        3
            8192 |                                         1
           16384 |@                                        7
           32768 |@@@@@                                    32
           65536 |@@@@@@@@                                 48
          131072 |@@@@@@@@@@@@@@@@@                        98
          262144 |                                         1
          524288 |@                                        7
         1048576 |@                                        7
         2097152 |@                                        4
         4194304 |                                         0

Over a similar interval, eighteen read I/Os took over half a second, and eleven took over a second! During those delays, the application running in z02 was idling and couldn’t perform any useful work.

Now, let’s look at the same workloads with the I/O throttle enabled:

[root@z02 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 134.8    0.0    1.1    0.0    0.0   3.0   0.0   22.3    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 152.8    0.0    1.2    0.0    0.0   3.0   0.0   19.6    0.0 100   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 144.9    0.0    1.1    0.0    0.0   3.0   0.0   20.7    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 147.9    0.0    1.2    0.0    0.0   3.0   0.0   20.3    0.0 100   0 z02
[root@z03 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 8788.7    0.0   68.7    0.0   0.0   1.0    0.0    0.1   0  95 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 9154.9    0.0   71.5    0.0   0.0   1.0    0.0    0.1   0  99 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 9164.7    0.0   71.6    0.0   0.0   1.0    0.0    0.1   0  99 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 9122.8    0.0   71.3    0.0   0.0   1.0    0.0    0.1   0  98 z03

This result is much better. The throughput of the fsyncbomb zone has suffered, but the random read latency is basically unaffected by the streaming write workload; it remains around 20ms. The difference between a consistent 20ms latency for random read I/O and latency spikes above one second is a huge win for that application, and the I/O throttle has allowed these two zones to coexist peacefully on the same machine.

Now, the throughput drop for the streaming write workload is substantial, but keep in mind that this test is designed to exercise the worst-case performance of both zones. These benchmarks read and write data as quickly as possible without even examining it, whereas more realistic workloads will perform some processing on the data as part of the I/O. If the I/O throttle can provide acceptable and equitable performance even in this pessimistic case, it will perform even better in the presence of a more realistic I/O profile.

Jerry and I will continue to tune the I/O throttle as we gain experience deploying it on Joyent's infrastructure. Look for future blog entries on ZFS performance and the larger study of multi-tenant I/O performance.

12 Responses

  1. Hi Bill, very interesting results. This reminds me of a question I’ve been wanting to ask: I observe that when I do something that makes zfs busy, applications freeze for seconds at a time (i.e. the X server, browsers, etc.). Maybe that’s not surprising, but when the activity making zfs busy stops (i.e. I kill the find, copy or whatever) I’m surprised to see that zfs keeps the disk busy for quite a long time after the requesting process goes away. This leads me to wonder, is there any mechanism to limit the amount of asynchronous I/O requests a process is allowed to queue up in zfs before the process is forced to wait?

    1. Hi Gordon, thanks for the question. For almost all writes (i.e., those not explicitly requested to be synchronous), ZFS buffers the data in a transaction group (TXG) and periodically flushes those TXGs to disk, thereby committing the data to stable storage. An application can force its data to stable storage sooner by opening a file with the O_SYNC/O_DSYNC flag or by calling fsync(3C). A TXG flush may occur seconds after the application writes data, and on a system with lots of write activity, a TXG can contain hundreds of MBs. Flushing hundreds of MBs to the disk(s) takes some time, during which any other I/O operations, specifically reads, will have trouble completing in a timely manner. You're likely seeing the lingering effect of TXGs being flushed to disk. There's some pending work to see how tuning the TXG flush affects I/O latency across the rest of the system. Of course, you can use DTrace to look at the TXG flush on your own system; trace the spa_sync() function to watch that activity.

  2. Hey, your solution looks awesome. Would this work for volumes exported over iSCSI, for example, or is it limited to locally attached volumes through the Solaris OS?
