File System Latency: part 1

When considering I/O performance from the perspective of applications, file systems can prove to be a better target for analysis than disks. With modern file systems using more DRAM-based cache and performing more asynchronous disk I/O, what the application experiences can differ vastly from what the disks are doing. I’ll demonstrate this by examining the I/O performance of a MySQL database at both the disk and file system level.

In this post I’ll discuss the commonly used approach, disk I/O analysis using iostat(1M), and in the following post I’ll show file system analysis using DTrace.

Disk I/O

Historically there has been a focus on the performance of the storage-level devices – the disks themselves – when trying to understand how disk I/O affects application performance. This includes the use of tools such as iostat(1M), which prints various I/O statistics for disk devices. System administrators either run iostat(1M) directly at the command line or use it via another interface: some monitoring software uses iostat(1M) to fetch the disk statistics, which it then archives and plots (eg, munin).

Here is iostat(1M) output from a Solaris-based system which is running a MySQL database as well as other applications (with some extraneous output lines trimmed):

# iostat -xnz 1 10
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.1   33.8   78.8 1208.1  0.0  1.0    0.0   27.8   0   4 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  175.4    0.0 22449.9    0.0  0.0  1.1    0.0    6.1   0  82 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.1  379.2 13576.5 22036.7  0.0  2.4    0.0    4.9   0  85 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  139.9    0.0 17912.6    0.0  0.0  1.0    0.0    6.8   0  82 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  176.1    0.0 22538.0    0.0  0.0  1.0    0.0    5.7   0  85 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26619.9    0.0  0.0  1.9    0.0    9.2   0  99 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26624.4    0.0  0.0  1.7    0.0    8.2   0  95 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.0  368.9 13566.1 26881.3  0.0  2.7    0.0    5.7   0  93 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  146.0    0.0 18691.3    0.0  0.0  0.9    0.0    6.4   0  88 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   84.2    0.0 10779.2    0.0  0.0  0.5    0.0    6.1   0  42 c0t1d0

These statistics show an average I/O service time (“asvc_t”) between 4.9 and 9.2 milliseconds, and a high percent busy (“%b”) rate, reaching 99% busy in one interval. The MySQL database on this server is suffering a number of slow queries (queries longer than 1 second), and based on the iostat(1M) output you may be guessing why – the disks. For an application, this looks like an AWFUL system to be running on.

iostat(1M) shows disk level performance.

And I do love iostat(1M), especially the “-x” form of output. I’ll save going into details for a later post. Here, I’ll explain why iostat(1M) can become pretty confusing to interpret when your perspective is an application.

Other Processes

I’ll explain a simple issue to start with. I said earlier that the system above was running MySQL and other applications. So, the heavy disk I/O could be caused by – and largely affecting – some other application. What if MySQL was caching very well in DRAM and hardly using the disks, while a nightly backup process walked the entire file system, rattling the disks? You might see output like the above, with the slow MySQL queries I mentioned caused by something else entirely.

iostat(1M)’s disk I/O is from all processes, which may not be the one you’re looking at.

I’ve worked on this issue before, creating psio and later iosnoop and iotop to identify disk I/O by process (and filename). But they can’t always identify the process and file, as is the case with the ZFS file system. Fixing this isn’t easy, and it leads to the question: is looking at the disks the right place to start? We can get a handle on this by considering the I/O stack.
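As a quick sketch of the by-process approach (a DTrace one-liner I’m adding here for illustration, not one of the tools above), the io provider can count disk I/O by the name of the process on-CPU when the I/O was issued:

# dtrace -n 'io:::start { @[execname] = count(); }'

On ZFS, much of this I/O is issued from kernel context (for example during transaction group syncs), so it may be attributed to “sched” rather than the application, which is exactly the identification problem mentioned above.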

I/O Stack

Typically, applications are not performing I/O to the disks directly, rather they do so via a file system. And file systems work hard to prevent applications from suffering disk I/O latency directly, such as by using DRAM to buffer writes and to cache and prefetch reads.

File System in the I/O Stack

Here is an example I/O stack showing key components of a file system, based loosely on ZFS:

                   +---------------+     +---------------+
                   |  Application  | ... |  Application  |
                   |    mysqld     |     |    (others)   |
                   +---------------+     +---------------+
                    reads ^ | writes     //
                          | |           //
                          | |          //      user-land
       -------------system calls---------------------------
                          | |        //        kernel
          +--------------------------------+
          |  File System  | |      //      |
          |               | V     //       |
          |        [ access locks ]        |
          |               | |              |
          |      +------------------+      |
          |      |    DRAM cache    |      |
          |      +------------------+      |
          |          ^    ^ |    |         |
          |          |    | |    V         |
          | +----------+  | |  +---------+ |
          | | prefetch |  | |  | flusher | |
          | +------^---+  | |  +----|----+ |
          |      +-|----------------V-+    |
          |      |   on-disk layout   |    |
          |      |           +--------|    |
          |      |           |metadata|    |
          |      +--------------------+    |
          |        |      | |   |^  |      |
          +--------------------------------+
                   |      | |   ||  |
                   |      | V   V|  V
              +-------------------------+
              |   Disk Device Drivers   |      iostat(1M)
              +-------------------------+

This diagram shows that there are sources of disk I/O other than what the application is directly (synchronously) requesting. For instance, on the write side, the application may “dirty” buffers in the file system cache and consider the I/O completed; however, the file system doesn’t perform the disk I/O until much later – seconds later – by batching together dirty data and writing them in bulk. This was evident in the previous iostat(1M) output as bursts of writes (see the “kw/s” column), which do not reflect how the application is actually performing writes.
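As a rough way to watch this buffering happen, here is a sketch (a DTrace one-liner I’m adding for illustration; “mysqld” is assumed to be the application of interest, and the application-side total counts every write(2), including writes to sockets and pipes) that compares application write() bytes with disk-level write bytes every 10 seconds:

# dtrace -n '
    syscall::write:entry /execname == "mysqld"/ {
        @["app write() bytes"] = sum(arg2);
    }
    io:::start /!(args[0]->b_flags & B_READ)/ {
        @["disk write bytes"] = sum(args[0]->b_bcount);
    }
    tick-10s { printa(@); clear(@); }'

With write buffering in play, the application total will typically tick along steadily while the disk total arrives later in bursts, much like the “kw/s” pattern above.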

I/O Inflation

Apart from other sources of disk I/O adding to the confusion, there is also what happens to the direct I/O itself, particularly at the on-disk layout layer. This is a big topic I won’t go into much here, but I’ll step through a single example of I/O inflation to consider:

  1. An application performs a 1 byte write to an existing file.
  2. The file system identifies the location as part of a 128 Kbyte file system “record”, which is not cached (but the metadata to reference it is).
  3. The file system requests the record be loaded from disk.
  4. The disk device layer breaks the 128 Kbyte read into smaller reads suitable for the device.
  5. The disks perform multiple smaller reads, totaling 128 Kbytes.
  6. The file system now replaces the 1 byte in the record with the new byte.
  7. Sometime later, the file system requests the 128 Kbyte “dirty” record be written back to disk.
  8. The disks write the 128 Kbyte record (broken up if needed).
  9. The file system writes new metadata; eg, references (for Copy-On-Write), or atime (access time).
  10. The disks perform more writes.

So, while the application performed a single 1 byte write, the disks performed multiple reads (128 Kbytes worth) and even more writes (over 128 Kbytes worth).
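One rough way to look for this kind of inflation (again, a sketch I’m adding for illustration; “mysqld” is the assumed process name, and the write() sizes here are requested sizes for all write(2) calls, not just those to the file system) is to compare size distributions at the two levels:

# dtrace -n '
    syscall::write:entry /execname == "mysqld"/ {
        @["write() size (bytes)"] = quantize(arg2);
    }
    io:::start {
        @["disk I/O size (bytes)"] = quantize(args[0]->b_bcount);
    }'

If the application distribution is dominated by small writes while the disk distribution clusters around the file system record size, inflation of the kind described above is likely in play.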

Application I/O to the file system != Disk I/O

It can be even worse than that example: for instance, if the metadata required to reference the location had to be read from disk in the first place, or if the file system (or volume manager) employed RAID with a stripe size larger than the record size.

I/O Deflation

Having mentioned inflation, I should mention deflation possibilities in case it isn’t obvious. These include caching in DRAM to satisfy reads, and cancellation of buffered writes (data rewritten before it has been flushed to disk).

I’ve recently been watching production ZFS file systems running with over a 99.9% cache hit rate, meaning that only a trickle of reads are reaching disk.
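If you’d like to check this on your own system, one rough way on Solaris-based systems is to read the ZFS ARC kstats (assuming they are available as zfs:0:arcstats; the hit rate over an interval is hits / (hits + misses)):

# kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses

Run it twice and compare the deltas, since these counters are totals since boot.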

Considering Disk I/O for understanding Application I/O

To summarize so far: looking at how hard the disks are rattling, as we did above using iostat(1M), is far removed from what the target application is actually experiencing. Application I/O can be inflated and deflated by the file system by the time it reaches the disks, making it difficult to draw a direct connection between disk I/O and application I/O. Disk I/O also includes requests from other file system components, such as prefetch, the background flusher, and on-disk layout metadata, as well as from other users of the file system (other applications). If you identify an issue at the disk level, it’s hard to tell how much it matters to the application of interest.

iostat(1M) includes other file system I/O, which may not directly affect the performance of the target application.

Summarizing the previous issues:

iostat(1M) shows disk level performance, not file system performance.

In the next post, I’ll show how file system performance can be analyzed.

Posted on May 11, 2011 by Brendan Gregg · In: Performance
