File System Latency: part 3

This is part 3 on file system latency, a series on storage I/O performance from the application perspective (see part 1 and part 2). Here I’ll show how to actually measure file system I/O latency – the time spent waiting on the file system to complete I/O. Examining this can save a lot of time when considering disk I/O as a source of performance issues.

As an example of an application to study I picked a busy MySQL server, and I’ll focus on using the DTrace pid provider to examine storage I/O. For an introduction to MySQL analysis with DTrace, see my earlier post on MySQL Query Latency. Here I’ll take that topic further, measuring the file system component of query latency.

File System Latency Distribution

I’ll start by showing some results of measuring this, and then tool used to do it.

This is file system latency within a busy MySQL database, for a 10 second interval:

# ./mysqld_pid_fslatency.d -n 'tick-10s { exit(0); }' -p 7357
Tracing PID 7357... Hit Ctrl-C to end.
MySQL filesystem I/O: 55824; latency (ns):

  read
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@@@@@@@@@@                               9053
            4096 |@@@@@@@@@@@@@@@@@                        15490
            8192 |@@@@@@@@@@@                              9525
           16384 |@@                                       1982
           32768 |                                         121
           65536 |                                         28
          131072 |                                         6
          262144 |                                         0        

  write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         1
            8192 |@@@@@@                                   3003
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             13532
           32768 |@@@@@                                    2590
           65536 |@                                        370
          131072 |                                         58
          262144 |                                         27
          524288 |                                         12
         1048576 |                                         1
         2097152 |                                         0
         4194304 |                                         10
         8388608 |                                         14
        16777216 |                                         1
        33554432 |                                         0

This shows the distribution of file system I/O latency in nanoseconds in the left column (“value”), with the number of I/O in that latency range shown in the right column (“count”). Most of the I/O (where the ASCII distribution plot has its spikes) was between 2 and 16 microseconds for the reads, and 8 and 65 microseconds for the writes. That’s fast – and is a strong indication that these reads and writes were to the DRAM cache and not to disk.

The slower time for writes vs reads is probably due to the time to acquire write locks and the buffers to write data to, and to manage the new file system metadata to reference it. I can answer this with more DTrace if needed.

A small handful of the writes, 25 in total, fell in the 4 to 33 millisecond range – the expected time for disk I/O, including a degree of queueing. (If it’s not clear in the above output – 4194304 nanoseconds == 4 milliseconds.) Compared to all the other I/O during tracing shows that the file system cache was running with a hit rate of over 99.9%.

It’s neat to be able to see these system components from the latency distribution, with annotations:

  write
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |                                         1
            8192 |@@@@@@                                   3003
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@            13532   <--- DRAM cache I/O
           32768 |@@@@@                                    2590
           65536 |@                                        370
          131072 |                                         58
          262144 |                                         27
          524288 |                                         12
         1048576 |                                         1
         2097152 |                                         0
         4194304 |                                         10     <--- disk I/O
         8388608 |                                         14
        16777216 |                                         1
        33554432 |                                         0

This is just based on the times; to be certain that the 4+ ms I/O was disk based, more DTrace can confirm.

In summary, this shows that file system I/O is usually lightening fast here, hitting out of DRAM. For an application, this looks like a GREAT system to be running on.

Comparing to iostat(1M)

This is actually the same system examined in part 1 using iostat(1M), and was traced at the same time. I had two shells running, and collected the iostat(1M) output at the exact same time as this DTrace output. As a reminder, the disks were doing this according to iostat(1M):

# iostat -xnz 1 10
                    extended device statistics
[...]
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26619.9    0.0  0.0  1.9    0.0    9.2   0  99 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  208.0    0.0 26624.4    0.0  0.0  1.7    0.0    8.2   0  95 c0t1d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  106.0  368.9 13566.1 26881.3  0.0  2.7    0.0    5.7   0  93 c0t1d0
[...]

Which looks awful. But at the file system level, it is great.

During the 10 seconds that both tools were running, this MySQL database experienced multiple slow queries (queries longer than one second). Based on the iostat(1M) output, one may spend a while investigating disk I/O issues, but this would be starting out in the wrong direction. The issue isn’t slow disk I/O. The file system latency distribution show only a trickle reaching disk, and the vast majority of I/O returning at microsecond speeds.

iostat(1M) pointed in the wrong direction for this application issue.

So what are the disks doing? In this example, the reads are mostly from other applications that are running on the same system as this MySQL database. The bursts of writes seen are ZFS transaction group flushes, which are batching writes from MySQL and the other applications for sending to disk later as a group. Some of the disk I/O are other file system types as described in part 1 as well. All of these details were confirmed using more DTrace (which could make for another blog post).

What it isn’t

So what are causing the slow queries? It may seem like we haven’t learned anything about the slow queries yet, but we have – it’s not the disks. We can move onto searching other areas – the database itself, as well as where the time is spent during the slow query (on-CPU or off-CPU, as measured by mysqld_pid_slow.d). The time could be spent waiting on database locks, for example.

Quickly identifying what an issue isn’t helps narrow the search to what it is.

Before you say…

Presentation

The above latency distributions were a neat way of presenting the data, but not the only way. As just mentioned, a different presentation of this data would be needed to really confirm that slow queries were caused by the file system: specifically, a sum of file system latency per query.

It was getting the latency data in the first place that was previously difficult – which DTrace now lets us do. The presentation of that data can be what we need to effectively answer questions, and DTrace lets us present it as totals, averages, min, max, and event-by-event data as well, if needed.

Distribution Script

The previous example traced I/O inside the application using the DTrace pid provider, and printed out a distribution plot of the file system latency. The tool that was executed, mysqld_pid_fslatency.d, is a DTrace script that is about 50 lines long (including comments). I’ve included it here, enumerated:

mysqld_pid_fslatency.d

     1	#!/usr/sbin/dtrace -s
     2	/*
     3	 * mysqld_pid_fslatency.d  Print file system latency distribution.
     4	 *
     5	 * USAGE: ./mysqld_pid_fslatency.d -p mysqld_PID
     6	 *
     7	 * TESTED: these pid-provider probes may only work on some mysqld versions.
     8	 *	5.0.51a: ok
     9	 *
    10	 * 27-Mar-2011	brendan.gregg@joyent.com
    11	 */
    12
    13	#pragma D option quiet
    14
    15	dtrace:::BEGIN
    16	{
    17		printf("Tracing PID %d... Hit Ctrl-C to end.\n", $target);
    18	}
    19
    20	pid$target::os_file_read:entry,
    21	pid$target::os_file_write:entry,
    22	pid$target::my_read:entry,
    23	pid$target::my_write:entry
    24	{
    25		self->start = timestamp;
    26	}
    27
    28	pid$target::os_file_read:return  { this->dir = "read"; }
    29	pid$target::os_file_write:return { this->dir = "write"; }
    30	pid$target::my_read:return       { this->dir = "read"; }
    31	pid$target::my_write:return      { this->dir = "write"; }
    32
    33	pid$target::os_file_read:return,
    34	pid$target::os_file_write:return,
    35	pid$target::my_read:return,
    36	pid$target::my_write:return
    37	/self->start/
    38	{
    39		@time[this->dir] = quantize(timestamp - self->start);
    40		@num = count();
    41		self->start = 0;
    42	}
    43
    44	dtrace:::END
    45	{
    46		printa("MySQL filesystem I/O: %@d; latency (ns):\n", @num);
    47		printa(@time);
    48		clear(@time); clear(@num);
    49	}

This script traces functions in the mysql and innodb source that perform reads and writes to the file system: os_file_read(), os_file_write(), my_read() and my_write(). A little poking around this version of MySQL (5.0.51a) showed that these were causing all the file system I/O, which along with their function names and use in the code made them attractive locations to trace.

Script Caveats

Since this is pid provider-based, these functions are not considered a stable interface – and this script may not work on other versions of MySQL. This would happen if the function names were changed, removed, or other file system functions added. It shouldn’t be too hard to locate the new locations if that has happened, and fix the script.

Besides the stability of this script, other caveats are:

I’ll explain that last note in more detail:

CPU Latency

These MySQL functions use system calls to perform the I/O, which will block and voluntarily context switch the thread off-CPU and put it to sleep until the I/O completes. When the I/O completes, the thread is woken up but it may need to wait its turn on-CPU if there are threads with a higher priority already running (it will probably have had a priority boost due to the scheduler to help its chances at preemption). Waiting its turn is the “CPU dispatcher queue latency”, and if the CPUs are heavily saturated with work that could be milliseconds. This is included in the time that mysqld_pid_fslatency.d prints out.

Showing the time components with CPUs at saturation:

                +--------------------------------+
                |           Application          |
                |             mysqld             |
                +--------------------------------+
                     |     FS I/O latency     ^
                     | <--------------------> |
                     |                        |   user-land
     -----------------------system calls ----------------------
                     |                        |   kernel
        +-------------------------------------------+
        | File       |                        |     |
        | System     |  I/O wait     CPU wait |     |
        |            V <--------->   <------> |     |
        |            X.............+----------+     |
        |            |             ^                |
        | thread is: | off-CPU     | runnable       |
        |            |             |                |
        +-------------------------------------------+
            disk I/O |             | disk I/O
             request |             | completion
                     V             |
                    .---------------.
                    `---------------'
                    |     disk      |
                    `---------------'

While this could make interpreting the measured file system I/O latency confusing, you don’t want to be running the system in this state to start with. If the CPUs are at saturation, then the application could be affected at random times with involuntary context switches, apart from the additional dispatcher queue latency at the end of I/O.

Identifying CPU saturation is usually straightforward with standard operating system tools (or at least, the lack of CPU %idle); the best tool on Solaris-based systems would be “prstat -mL” to examine the percent of time threads spent waiting on the CPU dispatcher queues (“LAT”). This is a much better measurement, as it also catches other cases that can’t be seen via the lack of %idle (eg, dispatcher queue latency due to processes reaching their CPU caps).

Slow Query Logger

Apart from examining file system latency as a distribution, it may be desirable to express it as a portion of the query time, so that “slow” queries can be clearly attributed to file system latency. The following script does this, by tracing both the query latency and summing file system latency, for any query where the total file system latency is over a threshold:

mysqld_pid_fslatency_slowlog.d

     1	#!/usr/sbin/dtrace -s
     2	/*
     3	 * mysqld_pid_fslatency_slowlog.d  Print slow filesystem I/O events.
     4	 *
     5	 * USAGE: ./mysql_pid_fslatency_slowlog.d mysqld_PID
     6	 *
     7	 * This traces mysqld filesystem I/O during queries, and prints output when
     8	 * the total I/O time during a query was longer than the MIN_FS_LATENCY_MS
     9	 * tunable.  This requires tracing every query, whether it performs FS I/O
    10	 * or not, which may add a noticable overhead.
    11	 *
    12	 * TESTED: these pid-provider probes may only work on some mysqld versions.
    13	 *	5.0.51a: ok
    14	 *
    15	 * 27-Mar-2011	brendan.gregg@joyent.com
    16	 */
    17
    18	#pragma D option quiet
    19
    20	inline int MIN_FS_LATENCY_MS = 1000;
    21
    22	dtrace:::BEGIN
    23	{
    24		min_ns = MIN_FS_LATENCY_MS * 1000000;
    25	}
    26
    27	pid$1::*dispatch_command*:entry
    28	{
    29		self->q_start = timestamp;
    30		self->io_count = 0;
    31		self->total_ns = 0;
    32	}
    33
    34	pid$1::os_file_read:entry,
    35	pid$1::os_file_write:entry,
    36	pid$1::my_read:entry,
    37	pid$1::my_write:entry
    38	/self->q_start/
    39	{
    40		self->fs_start = timestamp;
    41	}
    42
    43	pid$1::os_file_read:return,
    44	pid$1::os_file_write:return,
    45	pid$1::my_read:return,
    46	pid$1::my_write:return
    47	/self->fs_start/
    48	{
    49		self->total_ns += timestamp - self->fs_start;
    50		self->io_count++;
    51		self->fs_start = 0;
    52	}
    53
    54	pid$1::*dispatch_command*:return
    55	/self->q_start && (self->total_ns > min_ns)/
    56	{
    57		this->query = timestamp - self->q_start;
    58		printf("%Y filesystem I/O during query > %d ms: ", walltimestamp,
    59		    MIN_FS_LATENCY_MS);
    60		printf("query %d ms, fs %d ms, %d I/O\n", this->query / 1000000,
    61		    self->total_ns / 1000000, self->io_count);
    62	}
    63
    64	pid$1::*dispatch_command*:return
    65	/self->q_start/
    66	{
    67		self->q_start = 0;
    68		self->io_count = 0;
    69		self->total_ns = 0;
    70	}

A key difference with this script is that it only examines file system I/O if they are called during a query, as checked on line 38 (which checks that a thread-local variable q_start was set, which is only true during a query). The previous script, mysqld_pid_fslatency.d, showed all file system I/O latency – whether it was during a query or for another task in the database.

To capture some sample output, I modified line 20 to reduce the threshold to 100 milliseconds:

# ./mysqld_pid_fslatency_slowlog.d 29952
2011 May 16 23:34:00 filesystem I/O during query > 100 ms: query 538 ms, fs 509 ms, 83 I/O
2011 May 16 23:34:11 filesystem I/O during query > 100 ms: query 342 ms, fs 303 ms, 75 I/O
2011 May 16 23:34:38 filesystem I/O during query > 100 ms: query 479 ms, fs 471 ms, 44 I/O
2011 May 16 23:34:58 filesystem I/O during query > 100 ms: query 153 ms, fs 152 ms, 1 I/O
2011 May 16 23:35:09 filesystem I/O during query > 100 ms: query 383 ms, fs 372 ms, 72 I/O
2011 May 16 23:36:09 filesystem I/O during query > 100 ms: query 406 ms, fs 344 ms, 109 I/O
2011 May 16 23:36:44 filesystem I/O during query > 100 ms: query 343 ms, fs 319 ms, 75 I/O
2011 May 16 23:36:54 filesystem I/O during query > 100 ms: query 196 ms, fs 185 ms, 59 I/O
2011 May 16 23:37:10 filesystem I/O during query > 100 ms: query 254 ms, fs 209 ms, 83 I/O

In the few minutes this was running, there were 9 queries longer than 100 milliseconds due to file system I/O. With this output, we can immediately identify the reason for those slow queries – they spent most of their time waiting on the file system. Reaching this conclusion with other tools is much more difficult and time consuming – if possible (and practical) at all.

DTrace can be used to positively identify slow queries caused by file system latency.

But this is about more than DTrace – it’s about the metric itself – file system latency. Since this has been of tremendous use so far, it may make sense to add file system latency to the slow query log (requiring a MySQL source code change). If you are on MySQL 5.6 and later, it looks like you may also be able to get this information from the wait/io events from the new performance schema additions (I haven’t tried). If that isn’t viable, or you are on older MySQL, or a different application entirerly (MySQL was just my example application), you can keep using DTrace to dynamically fetch this information from the running source.

Interpreting Totals

For the queries in the above output, most of the query latency is due to file system latency (eg, 509 ms out of 538 ms = 95%). The last column printed out the I/O count, which helps answer the next question: is the file system latency due to many I/O, or a few slow I/O? For example:

If those descriptions sound a little vague, that’s because we’ve lost so much data when summarizing as an I/O count and a total latency. The script has achieved its goal – of identifying the issue – but to investigate the issue further I need to return to the distribution plots like those used by mysqld_pid_fslatency.d. By examining the full distribution, I could confirm if the 152 ms I/O was as isolated as it looks, or if in fact slow I/O are responsible for all of the latency seen above (eg, for the first line, was it 3 slow I/O + 80 fast I/O = 83 I/O?).

Script Caveats

Caveats and notes for this script are:

I’ll finish with one final note about these scripts: since they are pid provider-based, they can work from inside Solaris Zones or Joyent SmartMachines. A minor exception is with how I executed the first script: I added “-n ‘tick-10s { exit(0); }’” at the command line to exit after 10 seconds, which currently doesn’t work reliably in those environments due to a bug (tick only sometimes fires). Until that bug is fixed, drop that statement from the command line and it will still work fine, and require a Ctrl-C to end tracing.

Considering File System Latency

By examining latency at the file system, application issues can be immediately identified as file system (and probably disk) based, or not. This can set the investigation down the right path immediately, and not the wrong direction as was being suggested by iostat(1M)’s view of busy disks.

Two pid provider-based DTrace scripts were introduced in this post to do this: mysqld_pid_fslatency.d for summarizing the distribution of file system latency, and mysqld_pid_fslatency_slowlog.d for printing slow queries due to the file system.

The pid provider isn’t the only way to measure file system latency: it’s also possible from the syscall layer and from the file system code in the kernel. I’ll demonstrate those methods in the next post, and discuss how they differ to the pid provider method.

Print Friendly
Posted on May 18, 2011 at 9:28 am by Brendan Gregg · Permalink
In: Performance · Tagged with: , , , ,

9 Responses

Subscribe to comments via RSS

  1. Written by links for 2011-05-20 « Dan Creswell’s Linkblog
    on May 20, 2011 at 5:03 am
    Permalink

    [...] Brendan's blog » File System Latency: part 3 (tags: filesystem latency dtrace) [...]

  2. Written by More on measuring IO latency at Xaprb
    on May 24, 2011 at 3:17 pm
    Permalink

    [...] on to my earlier links to Brendan Gregg’s blog posts on measuring I/O latency, there is a third one discussing DTrace, and then a very detailed response from Mark Leith showing how to do it with the PERFORMANCE_SCHEMA [...]

  3. Written by Brendan's blog » File System Latency: part 4
    on May 24, 2011 at 7:54 pm
    Permalink

    [...] a series on storage I/O performance from the application perspective (see part 1, part 2 and part 3). In the previous post I showed how to trace file system latency from within MySQL using the pid [...]

  4. Written by Brendan's blog » File System Latency: part 5
    on June 16, 2011 at 4:38 pm
    Permalink

    [...] Latency, a series on storage I/O performance from the application perspective (see parts 1, 2, 3 and 4). Previously I explained why disk I/O metrics may not reflect application performance, and [...]

  5. Written by Brendan's blog » Off-CPU Performance Analysis
    on July 8, 2011 at 9:36 am
    Permalink

    [...] Threads can leave CPU for a number of reasons, including waiting for file system or network I/O, acquiring a synchronization lock, paging/swapping (virtual memory), an explicit timer and signals. They can also leave CPU for some reasons somewhat unrelated to the current thread’s execution, including involuntary context switching due to high demand for CPU resources, and interrupts. Whatever the reason, if this is during a synchronous code-path then it is introducing latency. Understanding what it is and the time spent waiting can help quickly pinpoint the origin of off-CPU performance issues. As an example of an off-CPU issue, I previously posted about file system latency. [...]

  6. Written by Anonymous
    on August 22, 2011 at 12:32 pm
    Permalink

    Any possibility any of the pid-provider probes used for mysql would work on oracle database server?

  7. Written by Richard Olivieri
    on August 22, 2011 at 12:33 pm
    Permalink

    Would these pid provider probes used in mysql_pid_fslatency.d work for an Oracle database?

  8. Written by Monitoring MySQL IO Latency with performance_schema
    on October 13, 2011 at 5:12 am
    Permalink

    [...] has followed up on his posts with a third on how to track these filesystem calls from a running MySQL process with DTrace, which again I [...]

  9. [...] DTrace isn’t just about running scripts; you can write your own custom scripts, run one-liners, and use higher-level tools that use DTrace behind the scenes, like Apple’s Instruments. And it’s for more than just your MacBook: if you are using servers that have DTrace available, you can use it to diagnose their performance and issues too, including tracing kernel and application code. I regularly use it to see how a MySQL database interacts with a kernel file system, for example. [...]

Subscribe to comments via RSS