A quarter million NFS IOPS

Following the launch of the Sun Storage 7000 series, various performance results have been published. It’s important when reading these numbers to understand their context, and how that may apply to your workload. Here I’ll introduce some numbers regarding NFS read ops/sec, and explain what they mean.

A key feature of the Sun Storage 7410 is DRAM scalability, which can currently reach 128 Gbytes per head node. This can span a significant working set size, and so serve most (or even all) requests from the DRAM filesystem cache. If you aren’t familiar with the term working set size – this refers to the amount of data which is frequently accessed; for example, your website could be multiple Tbytes in size – but only tens of Gbytes are frequently accessed hour after hour, which is your working set.
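As a rough illustration (my own sketch, not a Fishworks tool), the interplay of working set size and DRAM cache size can be estimated like this, assuming accesses are spread uniformly across the working set and the cache holds the hottest data:

```python
# Rough sketch: estimate the fraction of read requests served from the
# DRAM filesystem cache. Assumes uniform access across the working set;
# real workloads with skewed access will do even better than this.

def dram_hit_ratio(working_set_gb, dram_cache_gb):
    """Fraction of requests expected to hit the DRAM cache."""
    if working_set_gb <= dram_cache_gb:
        return 1.0  # the entire working set fits in cache
    return dram_cache_gb / working_set_gb

# e.g. a multi-Tbyte website whose working set is only tens of Gbytes,
# on a head node with (say) 100 Gbytes of DRAM available for caching:
print(dram_hit_ratio(90, 100))   # -> 1.0: every read served from DRAM
print(dram_hit_ratio(300, 100))  # -> ~0.33: working set 3x the cache
```

The sizes in the example are made up for illustration; the point is that once the working set (not the total data size) fits in DRAM, the hit ratio goes to 1.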

Considering that serving most or all of your working set from DRAM may be a real possibility, it’s worth exploring this space. I’ll start by finding the upper bound – what’s the most NFS read ops/sec I can drive. Here are screenshots from Analytics that show sustained NFS read ops/sec from DRAM. Starting with NFSv3:

And now NFSv4:

Both beat 250,000 NFS random read ops/sec from a single head node – great to see!

Questions when considering performance numbers

To understand these numbers, you must understand their context. These are the sorts of questions you can ask yourself, along with the answers for the results above:

The above list covers many subtle issues, to help you avoid learning them the hard way.

Traps to watch out for regarding IOPS

For IOPS results, there are some specific additional questions to consider:

Being more realistic: 8 Kbyte I/O with latency

The aim of the above was to discuss context, and to show how to understand a great result – such as 250,000+ NFS IOPS – by knowing what questions to ask. The two key criticisms for this result would be that it was for 1 byte I/Os, and that latency wasn’t shown at all. Here I’ll redo this with 8 Kbyte I/Os, and show how Analytics can display the NFS I/O latency. I’ll also wind back to 10 clients, only use 1 of the 10 GigE ports on the 7410, and I’ll gradually add threads to the clients until each is running 20:

The steps in the NFSv3 ops/sec staircase are where I’m adding more client threads.
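For reference, the sort of client load generator used here can be sketched as follows. This is a hypothetical script – the actual tool isn’t shown in the post, and the mount path, file name, and counts are all assumptions. Each thread issues random 8 Kbyte reads from a large file on the NFS mount:

```python
# Hypothetical sketch of a per-client load generator: N threads each
# issue random 8 Kbyte reads from a file on an NFS mount. Ramp NTHREADS
# up gradually (as in the post) to produce the ops/sec staircase.
import os
import random
import threading

MOUNT_FILE = "/mnt/nfs/testfile"  # assumed path on the NFS share
IO_SIZE = 8192                    # 8 Kbyte reads
NTHREADS = 20                     # final thread count per client

def reader(path, iosize, count):
    """Issue `count` random reads of `iosize` bytes from `path`."""
    blocks = os.path.getsize(path) // iosize
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(count):
            os.pread(fd, iosize, random.randrange(blocks) * iosize)
    finally:
        os.close(fd)

def run(path=MOUNT_FILE, nthreads=NTHREADS, count=100000):
    threads = [threading.Thread(target=reader, args=(path, IO_SIZE, count))
               for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# run()  # kick off the reader threads against the mount
```

Note that, as discussed below, the client NFS driver may not pass these 8 Kbyte reads through unmodified.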

I’ve reached over 145,000 NFSv3 read ops/sec – and this is not the maximum the 7410 can do (I’ll need to use a second 10 GigE port to take this further). The latency does increase as more threads queue up; here it is plotted as a heat map with latency on the y-axis (the darker the pixel, the more I/Os were at that latency during that second). At the peak (which has been selected by the vertical line), most of the I/Os were faster than 55 us (0.055 milliseconds) – which can be seen in the numbers in the list on the left.

Note that this is the NFSv3 read ops/sec delivered to the 7410 after the client NFS driver has processed the 8 Kbyte I/Os, which decided to split some of the 8 Kbyte reads into 2 x 4 Kbyte NFS reads (pagesize). This means the workload became a mixed 4k and 8k read workload – for which 145,000 IOPS is still a good value. (I’m tempted to redo this for just 4 Kbyte I/Os to keep things simpler, but perhaps this is another useful lesson in the perils of benchmarking – the system doesn’t always do what it is asked.)
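As a back-of-envelope check (my own arithmetic, not a figure from Analytics), here is what 145,000 mixed 4k/8k NFS read ops/sec means in client terms, depending on what fraction of the 8 Kbyte client reads the NFS driver split into 2 x 4 Kbyte NFS reads:

```python
# Back-of-envelope: translate NFS-level ops/sec into client-level 8 Kbyte
# reads/sec and delivered Mbytes/sec, given the fraction of client reads
# that were split into two 4 Kbyte NFS ops by the client NFS driver.

def client_reads_and_bytes(nfs_ops, split_fraction):
    """Return (client 8K reads/sec, Mbytes/sec) for a given split fraction."""
    # each split read contributes 2 NFS ops; each unsplit read contributes 1
    client_reads = nfs_ops / (1 + split_fraction)
    return client_reads, client_reads * 8192 / 1e6

for frac in (0.0, 0.5, 1.0):
    reads, mb = client_reads_and_bytes(145000, frac)
    # e.g. split=100%: 72,500 client reads/s, ~594 Mbytes/s
    print(f"split={frac:.0%}: {reads:,.0f} client reads/s, {mb:,.0f} Mbytes/s")
```

Even in the worst case (every read split), that is 72,500 8 Kbyte reads/sec delivered to the clients from a single 10 GigE port.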

Reaching 145,000 4+ Kbyte NFS cached read ops/sec without blowing out latency is a great result – and it’s the latency that really matters (and from latency comes IOPS)… And on the topic of latency and IOPS – I do need to post a follow up for the next level after DRAM – no, not disks, it’s the L2ARC using SSDs in the Hybrid Storage Pool.

Posted on December 2, 2008 at 10:47 pm by Brendan Gregg · Permalink
In: Fishworks

11 Responses

Subscribe to comments via RSS

  1. Written by jason arneil
    on December 3, 2008 at 1:07 am

    Hi Brendan,
    Nice post, liking the analytics screenshots. When I was reading it, though, and got to the "Questions when considering performance numbers" part, an additional question sprang to mind:
    how much does the kit used in the benchmark cost me? Perhaps even in £/IO.

  2. Written by Brendan Gregg
    on December 3, 2008 at 2:55 am

    Jason – thanks, I added "What is its price/performance" – which is often the most important question to ask when considering such numbers. Indeed, producing the best price/performance has been a key design principle for the Sun Storage 7000 series (especially the HSP model).

  3. Written by Aaron Newcomb
    on December 3, 2008 at 2:18 pm

    Excellent work!
    And a special thanks for NOT hiding the fact that you were using 1 byte I/Os – a fact that some performance test results seem to gloss over or skip entirely.

  4. Written by Abby
    on December 13, 2008 at 10:15 pm

    Hi Brendan,
    On your 8K example, were you saturating gigabit on the clients? It would be nice to see 10 gig on the client side as well, and also CPU monitoring on the client and server.
    Looking forward to your SSD benchmarks follow-up.

  5. Written by Brendan Gregg
    on December 15, 2008 at 8:23 pm

    Abby – I wasn’t saturating the client NICs, although I would have been close to it had NFS not split some of those 8 Kbyte I/Os into 2 x 4 Kbyte I/Os. Discovering that behavior using Analytics should make an interesting follow up post when I get a chance.

  6. Written by Abby
    on December 17, 2008 at 12:14 pm

    Hi Brendan,
    To saturate gigabit (say 900 Mbit/s) would require 14,000 random IOPS with 8k reads and 28,000 random IOPS with 4k reads. Let’s say in your example the average read is ~6k; you would then need ~19,000 random IOPS to saturate gigabit.
    Gigabit (or 10 Gbit) latency: 35-70 microseconds ~ 15,000-30,000 IOPS.
    It seems like your latency for the 1 byte example was ~4 microseconds.
    Are you using Jumbo buffers?
    Do you have ACKS disabled on the client?
    What’s pushing your latency so high on the 4-8k example? What is the actual ping latency in your setup?
    It would be interesting to see results using IP over Infiniband or low-latency 10Gigabit.
    Thanks a lot, really appreciate your work and testing.

  7. Written by Abby
    on December 17, 2008 at 11:09 pm

    Forget the 4 microsecond latency… I forgot the 281,000 IOPS was across multiple threads and clients.

  8. Written by Brendan Gregg
    on December 18, 2008 at 6:50 pm

    Abby – answers to questions: Jumbo buffers – if you mean jumbo frames, then yes. ACKS disabled – do you mean outright disabled? I haven’t come across such a tunable – these Solaris clients are probably using SACK to reduce needed ACKS, but not disabled altogether (next time I run this I can dtrace/snoop to see). Ping latency is about 150 us from the clients. The heat map latency is showing the time from a received NFS request to when the response was sent, so the network latency is in addition to this. I wouldn’t say the 4-8k latency is high – most are faster than 55 us, which includes the time for the kernel to process NFS, fetch the data from ZFS, and bundle the NFS response.

  9. Written by Eric Grancher
    on January 1, 2009 at 12:34 pm

    good afternoon,
    thank you for these blog entries, really interesting.
    Is it understood / expected that the performance with NFSv4 is, for this I/O pattern (obviously not a general degradation!), 8% less than with NFSv3?
    thank you in advance,
    Eric Grancher

  10. Written by Brendan's blog » L2ARC Screenshots
    on September 25, 2011 at 12:55 am

    [...] a workload, I’m using 10 clients (described previously), 2 random read threads per client with an 8 Kbyte I/O size, and a 500 Gbyte total working set [...]

  11. Written by Brendan's blog » 1 Gbyte/sec NFS, streaming from disk
    on September 26, 2011 at 6:10 pm

    [...] previously explored maximum throughput and IOPS that I could reach on a Sun Storage 7410 by caching the working set entirely in DRAM, which may be [...]
