2000x performance win

I recently helped analyze a performance issue in an unexpected but common place, where the fix improved performance of a task by around 2000x (two thousand times faster). As this is short, interesting and useful, I’ve reproduced it here in a lab environment to share details and screenshots.

Issue

In a production SmartOS cloud environment, a script is used to count entries in an active log. This script is executed from cron(1M), and the grep(1) command it uses is running very slowly indeed:

# time grep done log | wc -l
1492751

real    8m56.062s
user    8m12.275s
sys     0m0.218s
# ls -l log
-rw-r--r-- 1 root root 18M 2011-12-08 21:59 log

That took almost nine minutes: about 536 seconds for an 18 Mbyte file, or roughly 34 Kbytes/sec, which seems awfully slow. Dropping the “wc -l” and using “grep -c” instead didn’t make much difference.

Hypothesis

I have a method for attacking performance issues (which I’ll explain in a later article); it starts with checking for errors, because they can cause performance issues and are quick to check.

The slow time sounds like it could be disk I/O related: either slow reads caused by disk errors, or random disk reads. I first checked the error counts:

# iostat -En | grep Hard
sd0              Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
sd1              Soft Errors: 0 Hard Errors: 0 Transport Errors: 0

Which were fine, and then for random reads:

# iostat -xnz 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.3    1.3    1.0  0.0  0.0    0.0    0.0   0   0 ramdisk1
   12.3   81.6  355.4 3005.9  0.0  0.1    0.0    0.7   0   2 sd1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    7.0    0.0  245.7  0.0  0.0    0.0    0.1   0   0 sd1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    4.0    0.0   59.5  0.0  0.0    0.0    0.1   0   0 sd1
[...]

Which are not present. This doesn’t look disk related.

Analysis

A great way to see what threads are doing is with “prstat -mLc 1”, which uses thread microstate accounting (only available on Solaris-related systems):

# prstat -mLc 1
[...]
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 14566 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  13   0   0 app1/1
 12740 root      98 0.0 0.0 0.0 0.0 0.0 0.0 2.1   1  26   8   0 grep/1
 18998 root     3.0  12 0.0 0.0 0.0 0.0  85 0.0  45   0  5K   0 app2/1
 17347 root     0.1 0.2 0.0 0.0 0.0 0.0 100 0.0  16   0 170   0 prstat/1
 29693 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  13   0 128   0 sshd/1
 12741 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   7   0   7   0 wc/1
[...]

The output shows grep(1) spending 98% of its time in USR (running in user-land), and 2.1% in LAT (CPU dispatcher queue latency – ready to run). There is no high percentage of SLP (sleep) time for disk I/O, which we already suspected from using iostat(1M) earlier. (This lab system is also running a couple of dummy applications to simulate production load.)

Hot CPU issues are some of the easiest to solve, usually through profiling. Here, I’ve used DTrace to sample grep(1) user-land stack traces at 997 Hertz. I ran this one-liner for several seconds, then hit Ctrl-C to stop sampling and print the report:

# dtrace -n 'profile-997 /execname == "grep"/ { @[ustack()] = count(); }'
dtrace: description 'profile-997 ' matched 1 probe
^C
[...]
              libc.so.1`_UTF8_mbrtowc
              grep`check_multibyte_string+0x85
              grep`EGexecute+0xa6
              grep`grepbuf+0xf2
              grep`grepfile+0x4f3
              grep`main+0x1157
              grep`_start+0x6c
              406

              libc.so.1`mbrtowc+0x1d
              grep`EGexecute+0xa6
              grep`grepbuf+0xf2
              grep`grepfile+0x4f3
              grep`main+0x1157
              grep`_start+0x6c
              498

              grep`check_multibyte_string+0xd5
              grep`EGexecute+0xa6
              grep`grepbuf+0xf2
              grep`grepfile+0x4f3
              grep`main+0x1157
              grep`_start+0x6c
             1023

This shows the codepaths that were sampled. At the top of each stack is the on-CPU function, followed by its ancestry. The number is the count of samples for that stack trace.
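
If I only wanted a ranked list of the hot user-level functions, rather than whole stack traces, a variation on the same one-liner would aggregate on the sampled user program counter instead (a sketch; I didn’t need it for this issue):

# dtrace -n 'profile-997 /execname == "grep" && arg1/ { @[ufunc(arg1)] = count(); }'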

The hot functions included check_multibyte_string(), mbrtowc() and _UTF8_mbrtowc(). check_multibyte_string() is part of the grep(1) binary itself (its frames are prefixed with “grep`”), while mbrtowc() and _UTF8_mbrtowc() are from libc. I’ll start with mbrtowc() – given that it’s in libc, it probably has a man page.

# man mbrtowc
mbrtowc(3C)              Standard C Library Functions              mbrtowc(3C)

NAME
       mbrtowc - convert a character to a wide-character code (restartable)
[...]

Wide character conversion?

Hypothesis #2

The team I was working with suggested this could be locale related:

# echo $LANG
en_US.UTF-8
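
LANG is only part of the locale environment: LC_ALL, or any of the LC_* variables, would override it for their categories if set. A quick way to check every category at once is the locale(1) command (when only LANG is set, the other categories inherit its value):

# locale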

I’d also punched “grep mbrtowc” into Google and found “grep - Bugs: bug #14472, grep is slow in multibyte locales [Savannah]”. But that’s the GNU version of grep(1), which we didn’t think we were using.

# which grep
/opt/local/bin/grep
# grep --version
GNU grep 2.5.3
[...]

Ah. So we are using GNU grep(1), and we have LANG set to en_US.UTF-8.

Experiment & Fix

# time LANG=C grep done log | wc -l
1492751

real    0m0.304s
user    0m0.361s
sys     0m0.057s

That’s over 1700x faster.

And using the SmartOS grep(1):

# time /usr/bin/grep done log | wc -l
1492751

real    0m0.268s
user    0m0.318s
sys     0m0.059s

That’s 2000x faster.

On an idle system, these improvements are between 100x and 200x (still huge). It seems that with production load, the long-running grep(1) process takes even longer than usual. I’d suspect this is due in part to thread priorities and the kernel scheduler giving the longer-running grep(1) a lower priority, and less CPU time per second, than the short-lived grep(1). (If I needed to confirm this, I could examine the kernel scheduling events and priorities using DTrace.)
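
As a sketch of where I’d start (not something I needed to run for this issue), the sched provider’s change-pri probe fires on each priority adjustment the scheduler makes, and can summarize the priorities given to grep(1):

# dtrace -n 'sched:::change-pri /args[1]->pr_fname == "grep"/ {
    @["new priority"] = lquantize(args[2], 0, 60, 5); }'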

Cause

Both PATH and LANG had changed on the new systems, causing GNU grep(1) to be executed with a non-C locale. The environment was set in these locations:

# grep LANG= /etc/default/init
LANG=en_US.UTF-8
# grep PATH= ~/.profile 
MANPATH=/opt/local/man:/usr/share/man
PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/usr/sbin
# grep PATH= /etc/default/cron
PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/usr/sbin:/usr/sfw/bin
SUPATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/usr/sbin:/usr/sfw/bin
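
One way to apply the fix for this task, without touching the system-wide settings, is in the cron job’s script itself. As a sketch (the script contents and log path here are illustrative, not the production script):

#!/bin/sh
# Count completed entries in the active log (the path below is illustrative).
# Force the C locale so that GNU grep avoids the multibyte (UTF-8) code path,
# or call the SmartOS /usr/bin/grep explicitly instead.
LANG=C /usr/bin/grep -c done /var/log/app.log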

Happy grep(1)

In case you are curious, here’s what GNU grep is doing when LANG=C, also profiled with DTrace:

[...]
              grep`prtext+0x230
              grep`grepbuf+0x51
              grep`grepfile+0x4f3
              grep`main+0x1157
              grep`_start+0x6c
                7

              grep`EGexecute+0x4e5
              grep`grepbuf+0xf2
              grep`grepfile+0x4f3
              grep`main+0x1157
              grep`_start+0x6c
                7

              libc.so.1`mutex_lock_impl+0xf8
              libc.so.1`mutex_lock+0xc
              libc.so.1`cancel_safe_mutex_lock+0x9
              libc.so.1`_flockget+0x1b
              libc.so.1`fwrite+0x67
              grep`prline+0x39a
              grep`prtext+0x21d
              grep`grepbuf+0x51
              grep`grepfile+0x4f3
              grep`main+0x1157
              grep`_start+0x6c
               15

              libc.so.1`clear_lockbyte+0x2c
              libc.so.1`mutex_unlock_queue+0x43
              libc.so.1`mutex_unlock+0x12c
              libc.so.1`cancel_safe_mutex_unlock+0x1f
              libc.so.1`fwrite+0xa2
              grep`prline+0x39a
              grep`prtext+0x21d
              grep`grepbuf+0x51
              grep`grepfile+0x4f3
              grep`main+0x1157
              grep`_start+0x6c
               18

Much lower sample counts (since it ended faster), and no mbrtowc(). The hottest two stacks show lock activity (which is where I’d start if I wanted to improve GNU grep(1) performance further).
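
If I did want to chase that lock activity, one starting point (a sketch, not something this investigation needed) would be counting the fwrite(3C) calls behind it with the DTrace pid provider, since the stacks show each output line passing through a locked fwrite; plockstat(1M) could also summarize the user-level lock activity:

# dtrace -n 'pid$target:libc.so.1:fwrite:entry { @["fwrite() calls"] = count(); }' \
    -c '/opt/local/bin/grep done log'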

A bit faster

Using LANG=C and grep(1)’s built-in “-c” to do the counting instead of “wc -l” makes it run even faster: usually under 0.2s, and as fast as 0.12s:

# time LANG=C grep -c done log
1492751

real    0m0.120s
user    0m0.110s
sys     0m0.010s

Over 4000x.

Conclusion

Picking a non-C locale can hurt performance – something that has been known for many years. In this case, a GNU grep(1) bug inflated the translation overhead to slow down performance by a huge degree: up to 2000x. In short:

Leave LANG=C.

Aim DTrace at everything – even grep(1).

Posted on December 8, 2011 at 11:57 pm by Brendan Gregg

13 Responses


  1. Written by Leal on December 9, 2011 at 4:34 am

    Hi Brendan,

    I had this same issue back in 2009:
    http://www.eall.com.br/blog/?p=1369
    Exactly when OSOL was migrating to the default GNU stuff.

    Leal

  2. Written by richlowe on December 9, 2011 at 12:37 pm

    It’s a problem you never forget once it’s really hit you too. I suspect, as with other such things, GNU grep will be living down “Slow in UTF-8 locales” for years after they actually fix the problem.

    It’s probably already passed into folklore.

  3. Written by tyler on December 10, 2011 at 5:27 am

    I hit this routinely as well. I’m sympathetic to the challenges here, but it really stretches the imagination to believe that this degree of perf impact is anything other than a heinous bug. I’d merely shrug at 5x or 10x. But it’s so much worse than that.

    My favorite low-hanging issue that was reported half a decade ago and hasn’t been fixed is definitely the perl ithreads memory leak. But this one is a close second.

  4. Written by paul on December 10, 2011 at 3:43 pm

    This problem was fixed in GNU grep 2.7. See this: http://rg03.wordpress.com/2009/09/09/gnu-grep-is-slow-on-utf-8/

    • Written by Brendan Gregg on December 10, 2011 at 11:42 pm

      Thanks for the link. It’s interesting to see this on a Slackware system too, with GNU grep (before the bug fix) becoming over 4000x faster with LANG=C.

  5. Written by Steve on December 10, 2011 at 7:07 pm

    Hi, I tried this out and was only able to get a 2x improvement. Has anyone independently verified your claim?

    I only get a 2x improvement when switching LANG on Red Hat Linux on EC2:

    % export LANG=C
    % time grep done nohup.out | wc -l
    152929

    real 0m0.343s
    user 0m0.233s
    sys 0m0.112s

    % export LANG=en_US.UTF-8
    % time grep done nohup.out | wc -l
    152931

    real 0m0.771s
    user 0m0.673s
    sys 0m0.100s

    % grep --version
    GNU grep 2.6.3

    The author is using grep 2.5.3 and I am using 2.6.3, so we are not testing the same thing. You might want to try upgrading coreutils and see if that helps.

    • Written by Brendan Gregg on December 10, 2011 at 11:46 pm

      I’d guess that version of grep had the fix. The many-X improvement has been reproducible here, on both the customer system and in the lab. And I’ve seen similar examples in older blog posts (see the links by Leal and Paul).


  6. Written by Eric on December 10, 2011 at 9:41 pm

    You should be using fgrep / grep -F when you aren’t using regular expressions, especially on large files.

    • Written by Kjetil T. on December 12, 2011 at 2:10 am

      GNU grep is smart enough to see that your search string doesn’t contain any meta-characters, and to use the fast grep algorithm in that case. (Actually it doesn’t even need to check explicitly; it will apply Boyer-Moore to every constant substring.)

  7. Written by cabo on December 11, 2011 at 3:10 am

    “Leave LANG=C.”

    No. If that were the right answer for this kind of problem, you would still run your code on a /360 in EBCDIC.

    I’d say:

    “For production, use debugged code.”

  8. Written by 001 on December 11, 2011 at 3:24 pm

    Indeed Eric is right; that’s about 50% faster for me:

    time LANG=C grep -Fc done log

  9. Pingback: A 2000x Performance Win with DTrace Analytics « Joyeur, December 12, 2011

    [...] This was originally posted on the DTrace.org blog by [...]
