1 Gbyte/sec NFS, streaming from disk

I’ve previously explored the maximum throughput and IOPS I could reach on a Sun Storage 7410 by caching the working set entirely in DRAM – a realistic scenario, as the 7410 currently scales to 128 Gbytes of DRAM per head node. The more your working set exceeds 128 Gbytes, or keeps changing, the more I/O will be served from disk instead. Here I’ll explore the streaming disk performance of the 7410, and show some of the highest throughputs I’ve seen.

Roch posted performance invariants for capacity planning, which included values of 1 Gbyte/sec for streaming read from disk, and 500 Mbytes/sec for streaming write to disk (50% of the read value due to the nature of software mirroring). Amitabha posted Sun Storage 7000 Filebench results on a 7410 with 2 x JBODs, which reached 924 Mbytes/sec streaming read and 461 Mbytes/sec streaming write – consistent with Roch’s values. The results I’m posting here further reinforce those numbers, and include screenshots taken from the Sun Storage 7410 browser interface – specifically from Analytics – to show this in action.

Streaming read

The 7410 can currently scale to 12 JBODs (each with 24 disks), but when I performed this test I only had 6 available to use, which I configured with mirroring. While this isn’t a max config, I don’t think throughput will increase much further with more spindles – after about 3 JBODs there is plenty of raw disk throughput capacity. This is the same 7410, clients and network as I’ve described before, with more configuration notes below.

The following screenshot shows 10 clients reading 1 Tbyte in total over NFS, with 128 Kbyte I/Os and 2 threads per client. Each client is connected using 2 x 1 GbE interfaces, and the server is using 2 x 10 GbE interfaces:

Showing disk I/O bytes along with network bytes is important – it helps confirm that this streaming read test really did read from disk. If the intent is to go to disk, make sure it does – otherwise the workload could be hitting the server or client cache instead.
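
For reference, here’s a minimal sketch of the kind of sequential read client that generates a workload like this – 2 threads, 128 Kbyte reads. The mount point and file names are hypothetical, and it isn’t the actual load generator I used:

    # Minimal sketch of a streaming NFS read client: 2 threads, 128 Kbyte reads.
    # The mount point and file names are hypothetical; this is not the actual
    # load generator used for the tests above.
    import threading

    IO_SIZE = 128 * 1024                                  # 128 Kbyte read size
    FILES = ["/mnt/nfs/bigfile0", "/mnt/nfs/bigfile1"]    # one file per thread

    def stream_read(path):
        total = 0
        with open(path, "rb") as f:
            while True:
                buf = f.read(IO_SIZE)
                if not buf:
                    break
                total += len(buf)
        print("%s: read %d bytes" % (path, total))

    threads = [threading.Thread(target=stream_read, args=(p,)) for p in FILES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()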

I’ve highlighted the network peak in this screenshot, which was 1.16 Gbytes/sec network outbound – but this is just a peak. The average throughput can be seen by zooming in with Analytics (not shown here), which put the network outbound throughput at an average of 1.10 Gbytes/sec, and disk bytes at 1.07 Gbytes/sec. The difference is that the network result includes data payload plus network protocol headers, whereas the disk result is data payload plus ZFS metadata – which adds less than the protocol headers do. So 1.07 Gbytes/sec is closer to the average data payload read over NFS from disk.

It’s always worth sanity checking results however possible, and we can do that here using the times. This run took 16:23 to complete, and 1 Tbyte was moved – that’s an average of 1066 Mbytes/sec of data payload, or 1.04 Gbytes/sec. This time includes the little step at the end of the run, which the zoomed-in average didn’t include. The step is from a slow client completing (I found out which one using the Analytics statistic “NFS operations broken down by client” – and I now need to check why one of my clients is a little slower than the others!) Even including the slow client, 1.04 Gbytes/sec is a great result for delivered NFS reads from disk on a single head node.
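
The arithmetic, spelled out (a rough back-of-the-envelope check, not output from the appliance):

    # Back-of-the-envelope check: 1 Tbyte moved in 16 minutes 23 seconds.
    tbyte_mb = 1024 * 1024                 # Mbytes in a Tbyte
    secs = 16 * 60 + 23                    # 983 seconds
    mbytes_per_sec = tbyte_mb / float(secs)
    print("%d Mbytes/sec = %.2f Gbytes/sec" % (mbytes_per_sec, mbytes_per_sec / 1024))
    # -> 1066 Mbytes/sec = 1.04 Gbytes/sec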

Update: 2-Mar-09

In a comment posted on this blog, Rex noticed a slight throughput decay over time in the above screenshots, and wondered if it continued if left running longer. To test this, I had the clients loop over their input files (files much too big to fit in the 7410’s DRAM cache), and mounted the clients with forcedirectio (to avoid client-side caching); this kept the reads being served from disk, and I could leave it running for hours. The result:

The throughput is rock steady over this 24 hour interval.
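
The looping variation of the read client is trivial – something along these lines (hypothetical paths again; the cache bypass comes from the forcedirectio mount option, not from the reader itself):

    # Sketch of the looping variation: read the same input files over and over,
    # so the requests keep going to the server. Client-side caching is avoided
    # by the forcedirectio NFS mount option, not by anything in this reader.
    # Paths are hypothetical.
    IO_SIZE = 128 * 1024
    FILES = ["/mnt/nfs/bigfile0", "/mnt/nfs/bigfile1"]

    while True:
        for path in FILES:
            with open(path, "rb") as f:
                while f.read(IO_SIZE):
                    pass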

I had hoped to post screenshots showing the decay and drilling down with Analytics to explain the cause. The problem is, it doesn’t happen anymore – throughput is always steady. I have upgraded our 10 GbE network since the original tests – our older switches were getting congested and slowing throughput a little, which may have been the cause of the earlier decay.

Streaming write, mirroring

These write tests were with 5 JBODs (from a max of 12), but I don’t think the full 12 would improve write throughput much further. This is the same 7410, clients and network as I’ve described before, with more configuration notes below.

The following screenshot shows 20 clients writing 1 Tbyte in total over NFS, using 2 threads per client and 32 Kbyte I/Os. The disk pool has been configured with mirroring:

The statistics shown tell the story – the network throughput has averaged about 577 Mbytes/sec inbound, shown in the bottom zoomed graph. This network throughput includes protocol overheads, so the actual data throughput is a little less. The time for the 1 Tbyte transfer to complete is about 31 minutes (top graphs), which indicates the data throughput was about 563 Mbytes/sec. The Disk I/O bytes graph confirms that this is being delivered to disk, at a rate of 1.38 Gbytes/sec (measured by zooming in and reading the range average, as with the bottom graph). The rate of 1.38 Gbytes/sec is due to software mirroring ( (563 + metadata) x 2 ).
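
Spelling out that arithmetic (a rough check of the relationship described above, using the measured figures):

    # Rough check of the mirrored write run: 1 Tbyte of payload in about 31
    # minutes, with about 1.38 Gbytes/sec measured at the disks.
    tbyte_mb = 1024 * 1024                 # Mbytes in a Tbyte
    payload = tbyte_mb / (31 * 60.0)       # ~563 Mbytes/sec of data payload
    backend = 1.38 * 1024                  # ~1413 Mbytes/sec at the disks
    metadata = backend / 2 - payload       # implied per-copy metadata rate
    print("payload %d, back-end %d, implied metadata %d Mbytes/sec"
          % (payload, backend, metadata))
    # -> payload 563, back-end 1413, implied metadata 142 Mbytes/sec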

Streaming write, RAID-Z2

For comparison, the following shows the same test as above, but with double parity raid instead of mirroring:

The network throughput has dropped to 535 Mbytes/sec inbound, as there is more work for the 7410 to calculate parity during writes. As this took 34 minutes to write 1 Tbyte, our data rate is about 514 Mbytes/sec. The disk I/O throughput is much lower (notice the vertical range), averaging about 720 Mbytes/sec – a big difference from the previous test, due to RAID-Z2 vs mirroring.
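
And the corresponding rough arithmetic for the RAID-Z2 run:

    # Rough check of the RAID-Z2 write run: 1 Tbyte of payload in about 34
    # minutes, with about 720 Mbytes/sec measured at the disks.
    tbyte_mb = 1024 * 1024
    payload = tbyte_mb / (34 * 60.0)       # ~514 Mbytes/sec of data payload
    backend = 720.0                        # Mbytes/sec at the disks
    print("payload %d Mbytes/sec, back-end inflation %.2fx" % (payload, backend / payload))
    # -> payload 514 Mbytes/sec, back-end inflation 1.40x
    #    (versus roughly 2.5x for the mirrored run above)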

Configuration Notes

To get the maximum streaming performance, jumbo frames were used and the ZFS record size (“database size”) was left at 128 Kbytes. 3 x SAS HBA cards were used, with dual pathing configured to improve performance.

This isn’t a maximum config for the 7410, since I’m testing a single head node (not a cluster), and I don’t have the full 12 JBODs.

In these tests, ZFS compression wasn’t enabled (the 7000 series provides 5 choices: off, LZJB, GZIP-2, GZIP, GZIP-9). Enabling compression may improve performance, since there is less disk I/O, or it may not, due to the extra CPU cycles needed to compress/uncompress the data. This would need to be tested with the workload in mind.

Note that the read- and write-optimized SSD devices known as Readzilla and Logzilla weren’t used in this test. They currently help with random read workloads and synchronous write workloads, neither of which I was testing here.

Conclusion

As shown in the screenshots above, the streaming performance to disk for the Sun Storage 7410 matches what Roch posted, which, rounding down, is about 1 Gbyte/sec for streaming disk read and 500 Mbytes/sec for streaming disk write. This is the delivered throughput over NFS; the throughput to the disk I/O subsystem was also measured, to confirm that this really was performing disk I/O. The disk write throughput was also shown to be higher than the network throughput due to software mirroring or RAID-Z2 – averaging up to 1.38 Gbytes/sec. The real limits of the 7410 may be higher than these – this is just the highest I’ve been able to find with my farm of test clients.

Posted on January 9, 2009 at 3:26 pm by Brendan Gregg · In: Fishworks

8 Responses


  1. Written by Luqman on January 10, 2009 at 2:49 pm

    Thanks for this informative post. However, can you shed some light on RAID 5 performance, since it’s still in use in many production environments?

  2. Written by Rex di Bona on January 15, 2009 at 5:01 am

    Actually, the more interesting question, Brendan, is why the network and disk I/O performance numbers slowly deteriorate during the run. There is a slight, but very noticeable, downward slope to each of those graphs. Does it continue if the read run is longer? Were you by any chance screaming at your drives during the test?

  3. Written by Brendan Gregg on January 27, 2009 at 6:27 pm

    G’Day Rex,
    I have some theories – I need to redo the test and recheck with Analytics and DTrace to confirm. I don’t believe it continues forever – ideally I’d post a screenshot showing streaming reads over a few days to prove that; the problem is finding a few days where I can do this (this test server is in high demand!)
    Luqman – I did show RAID-Z2 write performance, which compared to mirrored gives you some idea. If I get the time I’ll post more tests – one problem is that if I tested every config combination, by the time I finished (months later) there may well be a hardware or software update to the series – and I’d have to start again! :)

  4. Written by Brendan Gregg on March 2, 2009 at 12:49 am

    Rex, I finally had the 6 JBODs hooked up to the original 7410 and could run a longer test over the weekend. I added a screenshot to this blog – it was steady over 24 hours (and longer). While that was good to see, I was also hoping for another opportunity to demo the power of Analytics – by graphing "NFS operations by I/O size" heat maps (as I suspected the client’s NFS read-ahead stumbled during the run, causing smaller requests but the same number of them – which fits all the plots showing the decay.) But it wasn’t to be – it didn’t reproduce; throughput is now always steady (just when I want it not to be!). I did upgrade our 10 GbE network, as it was getting congested, so that might have been a factor…

  5. Written by Bruce Bullis on April 22, 2009 at 12:52 pm

    We just purchased a clustered 7410 (2 CPUs and 64 Gbytes per head) and were doing some testing before we roll it into production. We are seeing something strange when writing NFS files: disk I/O bytes per second running at about 3 times the network bytes, for example 470 Mbytes/sec (disk) vs 150 Mbytes/sec (network).
    We have one pool of disks (2 JBODs with 44 x 1 Tbyte data/parity disks – 2 spares and 2 Logzillas). We have the Data Profile set to Double Parity RAID (RAID-Z2) and the Log profile set to Striped. We were expecting something closer to what you reported above for RAID-Z2 (720 Mbytes/sec disk to 514 Mbytes/sec net), and certainly not more than mirroring!
    Do you know any reason for this? Do we have something misconfigured? We are using a 128 Kbyte record size, a 1 Mbyte NFS buffer size, and NFS v4. We only do sequential access.
    Thanks for any information.

  6. Written by Sateesh Potturu on September 4, 2009 at 12:59 pm

    Brendan,
    I think test results with a minimal configuration will help people like us who started with a small configuration and are hoping to scale it with increasing load.
    We started with one 7410 controller with 64 Gbytes of RAM, a 100 Gbyte logzilla, 4 x 1 GbE NICs, and one J4400 array with 23 x 1 Tbyte 7200 RPM SATA-II disks and 1 x 18 Gbyte logzilla.
    We are using it for virtualization and we see predominantly writes (4047) and negligible (<100) reads. We are facing two problems in our tests:
    1. Similar to what Bruce mentioned, we are observing disk I/O to be many times that of network I/O. For example: network 5.4 Mbytes/sec and disk I/O 42.44 Mbytes/sec; in another case, network 8.85 Mbytes/sec and disk I/O 24.99 Mbytes/sec.
    2. The network throughput of NFS seems to be very low for us, but we are being told that what we are getting for a single controller with a single array is good. But with 1 array we should be getting at least 1/5th of 514 Mbytes/sec, right?
    I will be grateful if you can publish results for a minimal configuration like the one I mentioned above, and also look at the two queries I have above.
    Thanks in Advance

  7. Written by True Religion Jeans on September 19, 2009 at 10:30 pm

    Thanks for this informative post. However, can you shed the light on RAID 5 performance since it’s still in use in many production environments?

  8. Written by Brendan Gregg on September 28, 2009 at 7:36 pm

    @Bruce, @Sateesh: there are a few ways the back-end I/O throughput (to disk) can be inflated compared to the front-end (on the network). These include:
    A) Pool profile (most obvious): mirror doubles the throughput, RAID-Z2 not so much.
    B) ZFS recordsize, which by default is 128 Kbytes. If over the network you are performing random 1 Kbyte writes to a large file (such that it is using the full 128 Kbyte recsize), at the back-end ZFS will turn those 1 Kbyte writes into 128 Kbyte writes. You can tune the ZFS recordsize to better match your workload, although it will reduce streaming performance and increase DRAM overhead a little.
    C) In my demos I didn’t have Logzilla (write-optimized SSD) devices. If your workload hits the Logzillas, it will be written to them first and then flushed to disk. This can double the back-end I/O throughput. Use Analytics to see bytes broken down by disk, and see how hot those Logzillas are.
    D) Readzillas: these can populate at 40 or 80 Mbytes/sec, and will pick up data written to the storage system. That’s extra disk write bytes on top of your workload. Again, use Analytics to see bytes broken down by disk to identify this.
    There may be more; many of these you can figure out with Analytics.
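
    To make (B) concrete, here’s a quick sketch with hypothetical numbers (the client write rate is made up, and the worst case assumes no aggregation of writes into the same record):

        # Hypothetical illustration of the ZFS recordsize effect in (B):
        # small random writes to a large file can be rewritten as full records.
        recordsize = 128 * 1024        # default ZFS recordsize: 128 Kbytes
        io_size = 1 * 1024             # client performs random 1 Kbyte writes
        writes_per_sec = 1000          # made-up client write rate

        mb = 1024 * 1024.0
        front_end = io_size * writes_per_sec       # bytes/sec over the network
        back_end = recordsize * writes_per_sec     # worst case: full-record rewrites
        print("front-end %.1f Mbytes/sec -> back-end up to %.1f Mbytes/sec (%dx)"
              % (front_end / mb, back_end / mb, back_end / front_end))
        # -> front-end 1.0 Mbytes/sec -> back-end up to 125.0 Mbytes/sec (128x)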
