Viewing the Invisible

Some of my posts are long and carefully explained, like my file system latency series, which ran to five parts. This post is different: here I’ll share an advanced topic for an advanced audience, and let the screenshots do most of the talking.

This assumes you know OS internals, x86 assembly, mdb and DTrace. Do not try any of the following without understanding and accepting the risk of using these tools.

The problem

I found some unexplained behavior while chasing down a ~10% performance loss in MySQL after enabling the performance schema. Profiling showed my_timer_cycles() newly hot on-CPU as a result of the config change.
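The profiling step isn’t shown here; it would have looked something like this DTrace user-level profiling one-liner (a sketch, not the exact command used):

# dtrace -n 'profile-997 /pid == $target && arg1/ { @[ufunc(arg1)] = count(); }
    tick-10s { trunc(@, 10); exit(0); }' -p 1711

This samples user-mode program counters at 997 Hz and prints the ten hottest functions after ten seconds. Tracing the execution time of my_timer_cycles():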

# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    pid$target::my_timer_cycles:return /self->vs/ {
    @["ns"] = quantize(vtimestamp - self->vs); self->vs = 0; }
    tick-10s { exit(0); }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  3  75844                        :tick-10s 

  ns
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         3
             128 |                                         5
             256 |@@@@@@@@@@@@@@@@@@                       1783997
             512 |@@@@@@@@@@@@@@@@@@@@@@                   2183562
            1024 |                                         38028
            2048 |                                         5706
            4096 |                                         7538
            8192 |                                         7103
           16384 |                                         4551
           32768 |                                         2058
           65536 |                                         508
          131072 |                                         213
          262144 |                                         39
          524288 |                                         10
         1048576 |                                         1
         2097152 |                                         0 

Ok, so it’s on-CPU largely because it’s called a boat-load of times: about four million in those ten seconds.
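To watch the call rate directly, a quick sketch, which here should report on the order of 400,000 calls/s:

# dtrace -n 'pid$target::my_timer_cycles:entry { @ = count(); }
    tick-1s { printa("calls/s: %@d\n", @); clear(@); }' -p 1711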

But why does it sometimes reach 1+ ms?

# echo my_timer_cycles::dis | mdb -p 1711
my_timer_cycles:                pushq  %rbp
my_timer_cycles+1:              movq   %rsp,%rbp
my_timer_cycles+4:              xorl   %eax,%eax
my_timer_cycles+6:              rdtsc
my_timer_cycles+8:              shlq   $0x20,%rdx
my_timer_cycles+0xc:            orq    %rdx,%rax
my_timer_cycles+0xf:            leave
my_timer_cycles+0x10:           ret

Hmmm. That’s about as register-bound as they come: rdtsc reads the 64-bit time stamp counter into %edx:%eax, and the shlq/orq pair reassembles it into %rax.

Now it’s really interesting.

I’ll focus here on investigating why this function can take 1 ms. It performs no loads or stores (it’s register bound), so the usual suspects, CPU and MMU stall cycles, are unlikely to be at play. Even if it were to miss in the instruction cache, it’s hard to explain a 1 ms outlier for a function called this often, which should stay cache-warm.
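If you wanted to test the I$-miss theory anyway, the DTrace cpc provider can count hardware cache-miss events (a sketch; this assumes a Solaris build with the cpc provider and a CPU exposing the PAPI_l1_icm event):

# dtrace -n 'cpc:::PAPI_l1_icm-user-10000 { @[cpu] = count(); }'

Each probe firing represents 10,000 user-mode level-1 I-cache misses, aggregated by CPU.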

Whatever is happening isn’t visible in the code above.

Hypothesis

There could be kernel-level events interrupting the execution of this function, including hardware device interrupts, cross calls, and involuntary context switches (or collectively, “interrupts”).
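As a quick check on the involuntary-switch part of that list, the sched provider can count preemptions by process (a sketch; sched:::preempt fires as the running thread is about to be preempted):

# dtrace -n 'sched:::preempt { @[execname] = count(); }'

Checking interrupt and cross call activity system-wide with mpstat(1M):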

# mpstat 1
[...]
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 18849 16847 14813  922 2118 1458    0 35294   29  55   0  16
  1    0   0    0  1503  159 14085  637 1680  483    0 40124   44  18   0  39
  2    0   0 1416  2192  211 21684 1398 2830  836    0 49985   44  20   0  36
  3    0   0    0  1486  144 13019  511 1394  470    0 38281   44  16   0  39
  4    0   0    0  2079  149 20340 1189 2426  797    0 47753   40  19   0  41
  5    0   0    0  1305  147 12126  467 1199  454    0 33649   37  15   0  47
  6    0   0    0  1868  197 18549  927 1974  760    0 43298   45  16   0  39
  7    0   0    0  1275  159 9103  317  892  301    0 26289   43  12   0  44
  8    0   0    0  1644  165 15883  845 1768  520    0 39550   41  19   0  41
  9    0   0    0  1357  183 11293  371 1076  414    0 32063   30  15   0  55
 10    0   0    0  1588  158 16086  710 1556  633    0 38003   37  17   0  46
 11    2   0    0   922  109 7679  213  717  322    0 23071   41  11   0  48
 12    0   0    0  1759  240 15800  775 1560  623    0 37871   37  15   0  48
 13    0   0   15  1070  259 7573  268  799  274    0 22658   34  13   0  53
 14    0   0 7784  1326  450 15993  681 1497  607    0 37083   32  16   0  52
 15    0   0    0 17247 15953 10697  372  979  397    0 33010   38  21   0  40

CPUs 0 and 15 stand out with high interrupt rates (the intr and ithr columns), and CPUs 2 and 14 with high cross calls (xcal). Breaking down the function’s latency distribution by CPU:

# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    pid$target::my_timer_cycles:return /self->vs/ { @["CPU:", cpu] =
    quantize(vtimestamp - self->vs); self->vs = 0; }
    tick-10s { exit(0); }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  4  75844                        :tick-10s 

  CPU:                                                     13
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@                        39591
             512 |@@@@@@@@@@@@@@@@@@@@@@@                  53690
            1024 |                                         1147
            2048 |                                         85
            4096 |                                         155
            8192 |                                         87
           16384 |                                         2
           32768 |                                         0
[...]

  CPU:                                                     15
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@@                      96785
             512 |@@@@@@@@@@@@@@@@@@@                      97022
            1024 |@                                        7105
            2048 |@                                        3272
            4096 |@                                        2955
            8192 |                                         1077
           16384 |                                         214
           32768 |                                         46
           65536 |                                         0
[...]

  CPU:                                                      2
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@@                     168860
             512 |@@@@@@@@@@@@@@@@@@@                      164438
            1024 |@                                        5473
            2048 |                                         3071
            4096 |                                         275
            8192 |                                         264
           16384 |                                         22
           32768 |                                         4
           65536 |                                         3
          131072 |                                         0        

  CPU:                                                      0
           value  ------------- Distribution ------------- count
              64 |                                         0
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@                      129491
             512 |@@@@@@@@@@@@@@@@@@@                      130355
            1024 |@                                        5547
            2048 |                                         2007
            4096 |                                         630
            8192 |                                         2998
           16384 |@                                        3547
           32768 |                                         1656
           65536 |                                         415
          131072 |                                         153
          262144 |                                         46
          524288 |                                         13
         1048576 |                                         1
         2097152 |                                         0

Ah-hah! CPU 0’s distribution has a long tail reaching the 1+ ms buckets, unlike the others shown.

What’s on 0?

# intrstat 1

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
       bnx#0 |         0  0.0        16  0.0         0  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#0 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#2 |     16889 37.7         0  0.0         0  0.0         0  0.0
  mega_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#2 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#3 |         0  0.0         0  0.0         0  0.0         0  0.0
[...]

      device |     cpu12 %tim     cpu13 %tim     cpu14 %tim     cpu15 %tim
-------------+------------------------------------------------------------
       bnx#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
       igb#0 |         0  0.0        11  0.0       155  0.1         0  0.0
       igb#2 |         0  0.0         0  0.0         0  0.0     16453  7.6
  mega_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#2 |         0  0.0         0  0.0         0  0.0         0  0.0
      uhci#3 |         0  0.0         0  0.0         0  0.0         0  0.0
[...]

Context

So igb#2, a network interface, is interrupting CPU 0 at around 17,000 interrupts/s, consuming 37.7% of that CPU. While mpstat(1M) and intrstat(1M) show interrupts system-wide, I’d like to see them in application context, to measure their effect on this user-level function. However, application debuggers and performance analysis tools typically only show what is happening in user-land; their visibility ends at the syscall boundary.

Kernel-level events are typically invisible to application performance tools.

Fortunately DTrace can do it; I’ll start by trying the sdt:::interrupt-start probe, which is what intrstat(1M) uses:

# dtrace -n 'sdt:::interrupt-start { @[execname] = count(); }'
dtrace: description 'sdt:::interrupt-start ' matched 1 probe
^C

  sched                                                         55608

No: these are all attributed to sched, the kernel. By the time interrupt-start fires, the CPU has typically switched to a dedicated kernel interrupt thread, so the context of the interrupted application has already been lost. Checking where the probe fires from:

# dtrace -n 'sdt:::interrupt-start { @[stack()] = count(); }'
[...]

              unix`dispatch_hilevel+0x1f
              unix`switch_sp_and_call+0x13
              unix`do_interrupt+0xde
              unix`_interrupt+0xb8
            13030

              unix`dispatch_hardint+0x33
              unix`switch_sp_and_call+0x13
            93038

I need to probe earlier, before the switch_sp_and_call(), while the interrupted thread is still current. do_interrupt() looks like a candidate:

# dtrace -n 'fbt::do_interrupt:entry { @[execname] = count(); }'
dtrace: description 'fbt::do_interrupt:entry ' matched 1 probe
^C

  poold                                                             8
  dtrace                                                           60
  rotatelogs                                                       66
  zeus.zxtm                                                       225
  zabbix_agentd                                                   682
  ruby                                                           2555
  memcached                                                      4970
  sysbench                                                      13966
  httpd                                                         28736
  sched                                                         44916
  mysqld                                                        92089

That’s better.

do_interrupt() runs with interrupts disabled — it feels a little strange to be able to trace this at all.

Viewing the Invisible

Counting interrupts during function execution:

# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    fbt::do_interrupt:entry /self->vs/ { self->i++; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @[self->i] = quantize(vtimestamp - self->vs);
        self->vs = 0; self->i = 0; }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
^C

        7
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3
          524288 |@@@@@@@@@@                               1
         1048576 |                                         0        

        8
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@                            1
          524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
         1048576 |                                         0        

        6
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@                                  2
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              8
          524288 |@@@@@@@                                  2
         1048576 |                                         0        

        5
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@                                       1
          131072 |@@@@@@@@@@@@@@@@@@@@                     9
          262144 |@@@@@@@@@@@@@@@@                         7
          524288 |                                         0
         1048576 |@@                                       1
         2097152 |                                         0        

        4
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@                                       3
           16384 |@@                                       2
           32768 |                                         0
           65536 |@@                                       2
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@               32
          262144 |@@@@@@@@                                 10
          524288 |                                         0        

        3
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        2
            8192 |@@                                       5
           16384 |@@                                       6
           32768 |@                                        3
           65536 |@@@@@@@@@@@@@@@@@@@@                     59
          131072 |@@@@@@@@@@@@@@                           40
          262144 |                                         1
          524288 |                                         0        

        2
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |                                         4
            4096 |@                                        14
            8192 |@@@@@@@@                                 94
           16384 |@@@@@                                    57
           32768 |@@@@                                     43
           65536 |@@@@@@@@@@@@@@@@@@@@@                    240
          131072 |@                                        9
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         0        

        1
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@                                       641
            2048 |@@@@@@@@                                 3332
            4096 |@@@@@@@@@@@                              4230
            8192 |@@@@@@@@@@                               4070
           16384 |@@@@@@@                                  2611
           32768 |@@@                                      1082
           65536 |                                         22
          131072 |                                         9
          262144 |                                         2
          524288 |                                         0        

        0
           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |@@@@@@@@@@@@@@@@@@                       921978
             512 |@@@@@@@@@@@@@@@@@@@@@                    1083485
            1024 |                                         17032
            2048 |                                         489
            4096 |                                         498
            8192 |                                         124
           16384 |                                         0

!!!

# dtrace -n 'pid$target::my_timer_cycles:entry { self->vs = vtimestamp; }
    fbt::do_interrupt:entry /self->vs/ { self->i++; }
    pid$target::my_timer_cycles:return /self->vs/ {
        @[self->i] = avg(vtimestamp - self->vs);
        self->vs = 0; self->i = 0; }' -p 1711
dtrace: description 'pid$target::my_timer_cycles:entry ' matched 3 probes
^C

                0              545
                1            11748
                2            67950
                3           146795
                4           190061
                5           271303
                6           295016
                7           463835
                8           823128
               10           878304

!

At some level this is expected: a network interface on CPU 0 has a high rate of interrupts, which sometimes slows this function. The averages make the relationship plain: with no interrupts the function averaged 545 ns; with one interrupt, 11.7 µs; and the average climbs steadily with the interrupt count, approaching 1 ms at eight or more. The more interrupts that occur during the function, the slower it is.

What’s amazing is that it’s possible to see this so clearly.
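One possible mitigation, not tested here, would be to fence mysqld off from the interrupt-heavy CPUs with a processor set, leaving CPUs 0 and 15 to field the igb#2 interrupts. A sketch, assuming the set id returned by psrset -c is 1:

# psrset -c 1 2 3 4 5 6 7 8 9 10 11 12 13 14
# psrset -b 1 1711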

Posted on June 27, 2011 at 9:30 am by Brendan Gregg · In: Performance

2 Responses


  1. Dimitri, June 28, 2011 at 2:45 am

    Hi Brendan,

    Usually when I’m running MySQL tests on Solaris I place ‘mysqld’ into a processor set, to avoid any kind of external interruption and to reduce LWP management overhead as well. Did you still observe similar Performance Schema overhead when a processor set was used? Interestingly, in my tests the DTrace profiler shows the “pfs_rw_lock_s_lock_func” function.

    Also, it would be great if you syndicated your “MySQL category” on the “Planet MySQL” site; it’s a pity that people miss your stuff if they’re not aware of it :-)

    Rgds,
    -Dimitri

  2. Baron Schwartz, July 10, 2011 at 7:16 pm

    I agree — amazing.
