Bill Pijewski's Blog

Next Tuesday, October 2nd, I’ll be talking at ZFS Day about how Joyent deploys its cloud services on top of ZFS.

One of the main design principles of ZFS is merging the management of physical volumes with individual filesystems. Instead of relying on an underlying volume manager, ZFS manages disks directly and aggregates them into pools from which individual filesystems are allocated. Storage servers using ZFS typically configure two pools: one pool onto which the system’s root filesystem is installed, and a second for the data to be managed by that system.

At Joyent we’ve taken a different approach and discarded the root pool in favor of a single system-wide pool. Not only does this approach free up an additional two drives for main storage, it also gives us flexibility in upgrading system software, higher customer multitenancy, and ease of deploying new machines. In this talk, I’ll describe our overall architecture, talk about challenges we faced in constructing it, and characterize our experiences having deployed this model in production over the last 18 months.

The event is free to attend and will be streamed live. Hope to see you there!

When designing a cloud computing platform, a cloud provider should take care to mitigate any performance vagaries due to multi-tenant effects. One physical machine will be running many virtual machines, and since the load on each virtual machine is neither consistent nor uniformly distributed, bursts of activity in one virtual machine will affect the performance of the others. One way to avoid these multi-tenant effects is to overprovision the system to handle all of the spikes in activity, but that approach leaves machines underutilized and undermines the economics of cloud computing.

Here at Joyent, we use Solaris zones to host a cloud platform. This platform relies on OS-level virtualization: Solaris zones are lightweight containers built into the underlying operating system. We provision a zone (also known as a SmartMachine) for each customer, and this architecture grants us additional flexibility when allocating resources to zones. The global zone can observe the activity of all customer zones, and can coordinate with the kernel to optimize resource management between zones. Jerry has already covered some of this architecture in his earlier post.

Of the four basic computing resources (CPU, memory, I/O, and network bandwidth), we have reasonable solutions for managing CPU and memory. For almost all customer workloads, network bandwidth has not been a bottleneck, but that may change as applications become more and more distributed. Until now, I/O contention has been a major pain point for customers. On one machine, a single zone can issue a stream of I/O operations, usually synchronous writes, which disrupts I/O performance for all other zones. This problem is further exacerbated by ZFS, which buffers all asynchronous writes in a single TXG (transaction group), a set of blocks which are atomically flushed to disk. The process of flushing a TXG can occupy all of a device’s I/O bandwidth, thereby starving out any pending read operations.
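
To make that concrete, here is a minimal C sketch (illustrative only, not Joyent code) of the write pattern in question: the write(2) calls return quickly because ZFS buffers the dirty data in memory as part of the current TXG, while the fsync(3C) forces that data out to stable storage and contends with every other zone’s I/O to the pool.

  /*
   * Minimal illustration of buffered writes followed by a synchronous
   * flush.  The write(2) calls complete quickly because ZFS buffers the
   * dirty data in the current transaction group; fsync(3C) forces that
   * data out to disk and competes with other zones for the device.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int
  main(void)
  {
      char buf[8192];
      int fd, i;

      (void) memset(buf, 'x', sizeof (buf));
      if ((fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
          perror("open");
          return (1);
      }

      for (i = 0; i < 1024; i++) {
          /* Asynchronous: dirty data accumulates in memory. */
          if (write(fd, buf, sizeof (buf)) < 0) {
              perror("write");
              return (1);
          }
      }

      /* Synchronous: force the dirty data to disk before returning. */
      if (fsync(fd) < 0) {
          perror("fsync");
          return (1);
      }

      (void) close(fd);
      return (0);
  }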

The Solution

Jerry Jelinek and I set out to solve this problem several months ago. Jerry has framed the problem well in an earlier blog post, so I’ll explain what we’ve done and how it works. First, when we sat down to solve this problem, we brainstormed some requirements we’d like in our eventual solution:

  • We want to ensure consistent and predictable I/O latency across all zones.
  • Sequential and random workloads have very different characteristics, so it is a non-starter to track IOPS or throughput.
  • A zone should be able to use the full disk bandwidth if no other zone is actively using the disk.

Our ZFS I/O throttle has two components: one to track and account for each zone’s I/O requests, and another to throttle each zone’s operations when it exceeds its fair share of disk I/O. When the throttle detects that a zone is consuming more than is appropriate, each read or write system call is delayed by up to 100 microseconds, which we’ve found is sufficient to allow other zones to interleave I/O requests during those delays.

The throttle calculates an I/O utilization metric for each zone using the following formula:

(# of read syscalls) x (Average read latency) + (# of write syscalls) x (Average write latency)

Yes, the mapping between system calls and physical I/Os isn’t 1:1 due to I/O aggregation and prefetching, but we’re only trying to detect gross inequities between zones, not small deltas. Once each zone has its utilization metric, the I/O throttle compares I/O utilization across all zones, and if a zone has a higher-than-average I/O utilization, system calls from that zone are throttled. That is, each system call is delayed by up to 100 microseconds, depending on the severity of the inequity between the various zones.
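
To illustrate the accounting, here is a minimal userland sketch of the idea. The real throttle lives in the kernel and differs in detail; the names and figures below (zone_stats_t, io_delay_usec, the per-zone numbers) are hypothetical. A zone above the average utilization has its next system call delayed in proportion to how far above the mean it is, capped at 100 microseconds.

  /*
   * Userland sketch of the throttle's accounting (illustrative only; the
   * actual implementation is in the kernel and differs in detail).
   */
  #include <stdio.h>

  #define MAX_DELAY_USEC 100      /* maximum per-syscall delay */

  typedef struct zone_stats {
      const char *zs_name;
      unsigned long zs_reads;         /* read syscalls this interval */
      unsigned long zs_writes;        /* write syscalls this interval */
      double zs_avg_read_usec;        /* average read latency */
      double zs_avg_write_usec;       /* average write latency */
  } zone_stats_t;

  /* Utilization: reads x avg read latency + writes x avg write latency. */
  static double
  zone_util(const zone_stats_t *zs)
  {
      return (zs->zs_reads * zs->zs_avg_read_usec +
          zs->zs_writes * zs->zs_avg_write_usec);
  }

  /*
   * Delay (in microseconds) to apply to a zone's next read or write
   * syscall: zero if the zone is at or below the mean utilization,
   * otherwise proportional to how far above the mean it is, capped at
   * MAX_DELAY_USEC.
   */
  static int
  io_delay_usec(const zone_stats_t *zones, int nzones, int target)
  {
      double total = 0.0, mean, util;
      int i;

      for (i = 0; i < nzones; i++)
          total += zone_util(&zones[i]);
      mean = total / nzones;

      util = zone_util(&zones[target]);
      if (mean == 0.0 || util <= mean)
          return (0);

      return ((int)(MAX_DELAY_USEC * (util - mean) / util));
  }

  int
  main(void)
  {
      zone_stats_t zones[] = {
          { "z02", 150, 0, 20000.0, 0.0 },    /* random reader */
          { "z03", 0, 75000, 0.0, 900.0 }     /* heavy writer */
      };

      printf("%s delay: %d us\n", zones[0].zs_name,
          io_delay_usec(zones, 2, 0));
      printf("%s delay: %d us\n", zones[1].zs_name,
          io_delay_usec(zones, 2, 1));
      return (0);
  }

With these illustrative numbers, the zone below the mean is not delayed at all, while the heavy writer has each read or write call delayed by roughly half of the maximum.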

Performance Results

The following performance results were gathered on a test machine in our development lab. vfsstat is an iostat-like tool I’ve written that reports VFS operations, throughput, and latency on a per-zone basis. I’ll explain more about vfsstat in a later blog entry, but for this example, focus on r/s (VFS reads per second), w/s (VFS writes per second), read_t (average VFS read latency), and writ_t (average VFS write latency).

In zone z02, I’ve started three threads issuing random reads to a 500GB file. That file is sufficiently large that no significant portion of it can be cached in the ARC, so 90%+ of the reads will have to go to disk. In zone z03, I’m running a benchmarking tool I wrote called fsyncbomb. This tool is designed to emulate a workload which has caused trouble for us in the past: a streaming write workload followed by periodic calls to fsync(3C) to flush that data to disk. In the results below, fsyncbomb writes 1GB of data per file, then fsync(3C)’s that file before moving on to the next file. It round-robins between 100 files, but since it truncates each file before rewriting it, this is essentially an append-only workload.
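
As a rough approximation, a sketch like the following captures the write pattern described above (this is not fsyncbomb itself; the file names and chunk size are arbitrary):

  /*
   * Rough sketch of an fsyncbomb-like workload (not the actual tool):
   * write about 1GB to a file, fsync(3C) it, then move on to the next
   * file, round-robin over NFILES files, truncating each file before
   * rewriting it.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  #define NFILES    100
  #define CHUNKSIZE (128 * 1024)              /* 128KB per write(2) */
  #define FILESIZE  (1024LL * 1024 * 1024)    /* ~1GB per file */

  int
  main(void)
  {
      static char buf[CHUNKSIZE];
      char path[64];
      long long written;
      int fd, i;

      (void) memset(buf, 'x', sizeof (buf));

      for (;;) {
          for (i = 0; i < NFILES; i++) {
              (void) snprintf(path, sizeof (path), "file.%03d", i);

              /* Truncate before rewriting: an append-only style workload. */
              fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
              if (fd < 0) {
                  perror("open");
                  return (1);
              }

              for (written = 0; written < FILESIZE; written += CHUNKSIZE) {
                  if (write(fd, buf, CHUNKSIZE) < 0) {
                      perror("write");
                      return (1);
                  }
              }

              /* Flush the dirty data to disk before the next file. */
              if (fsync(fd) < 0) {
                  perror("fsync");
                  return (1);
              }
              (void) close(fd);
          }
      }
      /* NOTREACHED */
  }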

As a baseline, let’s look at the performance of each workload when run separately:

[root@z02 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 149.8    0.0    1.2    0.0    0.0   3.0   0.0   20.0    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 153.3    0.0    1.2    0.0    0.0   3.0   0.0   19.6    0.0 100   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 156.8    0.0    1.2    0.0    0.0   3.0   0.0   19.1    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 162.3    0.0    1.3    0.0    0.0   3.0   0.0   18.5    0.0 100   0 z02
[root@z03 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 72846.2    0.0  569.1    0.0   0.0   0.9    0.0    0.0   0  86 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 75292.2    0.0  588.2    0.0   0.0   0.9    0.0    0.0   0  89 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 76149.6    0.0  594.9    0.0   0.0   0.9    0.0    0.0   0  86 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 81295.6    0.0  635.1    0.0   0.0   0.9    0.0    0.0   0  86 z03

The random reader zone is averaging about 150 IOPS with an average latency around 20ms, and the fsyncbomb zone achieves a throughput around 600 MB/s.

Now, when the two workloads are run together without I/O throttling:

[root@z02 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  26.8    0.0    0.2    0.0    0.0   3.0   0.0  111.9    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  30.8    0.0    0.2    0.0    0.0   3.0   0.0   97.4    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  27.4    0.0    0.2    0.0    0.0   3.0   0.0  109.5    0.0 100   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
  28.6    0.0    0.2    0.0    0.0   3.0   0.0  104.9    0.0  99   0 z02
[root@z03 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 66662.1    0.0  520.8    0.0   0.0   0.9    0.0    0.0   0  89 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 51410.3    0.0  401.6    0.0   0.0   0.9    0.0    0.0   0  90 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 56404.6    0.0  440.7    0.0   0.0   0.9    0.0    0.0   0  93 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 60773.1    0.0  474.8    0.0   0.0   0.9    0.0    0.0   0  92 z03

The streaming write performance has suffered a bit, but the read latency is exceeding 100ms! This zone is going to see terrible application performance; as Brendan has explained, I/O latency of this magnitude has a direct impact on applications. This is an unacceptable result: one zone should not be able to induce such pathological I/O latency in another zone.

Even worse, although the average read latency is over 100ms, certain operations exceed one second! A DTrace script shows the exact latency distribution (in microseconds):

  read (us)
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@                                        3
               2 |@                                        5
               4 |                                         2
               8 |                                         1
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |@                                        3
            8192 |                                         1
           16384 |@                                        7
           32768 |@@@@@                                    32
           65536 |@@@@@@@@                                 48
          131072 |@@@@@@@@@@@@@@@@@                        98
          262144 |                                         1
          524288 |@                                        7
         1048576 |@                                        7
         2097152 |@                                        4
         4194304 |                                         0

Over a similar interval, eighteen read I/Os took over half a second, and eleven took over a second! During those delays, the application running in z02 was idling and couldn’t perform any useful work.

Now, let’s look at the same workloads with the I/O throttle enabled:

[root@z02 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 134.8    0.0    1.1    0.0    0.0   3.0   0.0   22.3    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 152.8    0.0    1.2    0.0    0.0   3.0   0.0   19.6    0.0 100   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 144.9    0.0    1.1    0.0    0.0   3.0   0.0   20.7    0.0  99   0 z02
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
 147.9    0.0    1.2    0.0    0.0   3.0   0.0   20.3    0.0 100   0 z02
[root@z03 ~]# vfsstat -M 5
[...]
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 8788.7    0.0   68.7    0.0   0.0   1.0    0.0    0.1   0  95 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 9154.9    0.0   71.5    0.0   0.0   1.0    0.0    0.1   0  99 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 9164.7    0.0   71.6    0.0   0.0   1.0    0.0    0.1   0  99 z03
   r/s    w/s   Mr/s   Mw/s wait_t ractv wactv read_t writ_t  %r  %w zone
   0.0 9122.8    0.0   71.3    0.0   0.0   1.0    0.0    0.1   0  98 z03

This result is much better. The throughput of the fsyncbomb zone has suffered, but the random read latency is basically unaffected by the streaming write workload; it remains around 20ms. The difference between a consistent 20ms latency for random read I/O and latency spikes above one second is a huge win for that application, and the I/O throttle has allowed these two zones to coexist peacefully on the same machine.

Now, the throughput drop for the streaming write workload is substantial, but keep in mind that this test is designed to test the worst-case performance of both zones. These benchmarks read and write data as quickly as possible without even examining it, whereas more realistic workloads will perform some processing on the data as part of the I/O. If the I/O throttle can provide acceptable and equitable performance even in this pessimistic case, it will perform even better in the presence of a more realistic I/O profile.

Jerry and I will continue to tune the I/O throttle as we gain experience deploying it on Joyent’s infrastructure. Look for future blog entries on ZFS performance and the larger study of multi-tenant I/O performance.
