About two years ago Joyent began offering Linux instances, running under KVM, stored on ZFS, and secured by Zones (“double hull virtualization”). Since then, I’ve been doing more and more work on Linux performance as customers deploy on these instances. It’s been fascinating to work on both the illumos and Linux kernels at the same time (a Linux guest in an illumos host), with full stack visibility of the guests, and hypervisor, down to metal. It’s let me better understand the design choices for each OS by having another perspective, something I talked about in my recent SCaLE12x keynote, and in my Systems Performance book. Apart from different OSes, I’ve also worked directly with dozens of customers, with many different applications, databases, and programming languages — too many to list. It’s been amazing, and I’m grateful that I had the chance to work and innovate in this unique environment.
However, I’ve decided to leave Joyent and pursue a new opportunity (more details about this later). It involves, among other things, working higher up in the application level, as well as Linux full time. Compared to the production observability I typically get on SmartOS/DTrace, working on Linux will at times be a challenge, but I’m interested in challenging myself and taking that on. I don’t know what tracing tool I’ll be using in the long term, but I’m no stranger to the Linux tracers: SystemTap, perf_events, dtrace4linux, ktap, and LTTng, which I’ve been using in a lab environment to solve customer issues. It’s been challenging, but also rewarding, to analyze Linux in new and different ways, and improve its performance.
I know some in the Solaris/illumos/SmartOS communities may be saddened by this news. I hope people can be glad for what I have contributed, and I’ll certainly continue to participate in these communities when I can. Also note that my change of job doesn’t change my technical opinion on those platforms and their technologies (especially DTrace, ZFS, and Zones) – which are still great, for the exact reasons I’ve publicly documented over the years and spoken about at conferences. I’m proud to have worked with them, and with the smart and dedicated people who build, support, and use them.
Following on from my earlier 10 performance wins post, here is another group of 10 I have worked on.
|13||mongoperf||System||DTrace||System||ZFS tuning||up to 8x|
|14||backups||System||iostat||System||OS tuning||2x – 4x|
|16||ZFS||System||DTrace||System||OS tuning||up to 100x|
|17||Sphinx||System||Flame Graphs||Build||compiler options||75%|
Longer summaries for each below. See the first post for notes on the above columns.
Issues 13, 15, and 18 are for Linux instances under the KVM hypervisor (HW virtualization); the remainder are for SmartOS instances (OS virtualization).
While I began documenting these performance wins to share the tools I was using, it’s also useful to consider the methodologies as well. I’ll mention key methodologies in the descriptions that follow. For most issues, I’m using the USE method to look for resource bottlenecks, and the TSA method to direct latency investigations.
11. redis … 41%
Description: A potential customer benchmarked redis performance and found it not quite good enough for their needs.
Analysis: Microstate accounting showed that the single-threaded redis-server was hot on-CPU, mostly in system time (TSA method). DTrace profiling of kernel activity showed most time processing network I/O, with 20% in polling. This identified one potential win: the polling time can be improved by use of event ports, which was integrated in a newer version of Redis (2.6). To attack the bulk of the CPU time, in network I/O, locality was investigated by using DTrace to profile CPU usage. This showed poor CPU affinity, with the hot thread walking across even-numbered CPUs instead of staying put, which is a known kernel scheduler bug that has been fixed in a newer kernel version. Estimating the speedup for the fix, the thread was manually bound to a CPU, which improved performance by 41%.
12. rsync … 5x
Description: A routine task copied data between systems using rsync. The throughput was low, and this had become a blocker.
Analysis: prstat and microstate accounting showed that an thread was hot on-CPU on the remote sending system, spending 95% in user-time (TSA method). DTrace was used to profile on-CPU user stacks for sshd, which showed that it was mostly spending time in compression. By disabling compression (using “-o ‘Compression no’”), performance improved from 7.0 Mbytes/sec to 35.1 Mbytes/sec. DTrace was then used to profile both the sender and receiver to determine the next bottleneck (why stop at 35?), which was various other factors, including I/O overheads and encryption (arcfour). netstat was used to check the health of the network. (See issue 18 for more about rsync.)
13. mongoperf … up to 8x
Description: The performance of a KVM/Linux cloud system intended for MongoDB was being evaluated using mongoperf, which showed it was much slower than expected. Is something wrong, or is this system just slow?
Analysis: strace was used to show that mongoperf was executing read/write/fsync in a loop, resulting in synchronous disk I/O. This bypassed the file system caching and buffering that was expected to improve performance, and was not a realistic test of the intended MongoDB workload. To see if there were any performance issues anyway, I traced I/O latency using DTrace in both the Linux guest (dtrace4linux) and the SmartOS host, which found that it was usually fast (0.1 ms: buffered by the battery-backed storage subsystem). However, occasionally it would have high latency, over 500 ms. DTrace of I/O with ZFS spa_sync() showed some were queueing behind TXG flushes. With ZFS tuning to increase the frequency of flushes (by 5x) and therefore reduce the size, these outliers moved to around 60ms (tail-end latency on a shorter queue).
14. backups … 2x – 4x
Description: A scheduled backup process was transferring data slowly.
Analysis: iostat showed that the disks were very busy with reads (USE method). Large files were being copied, however, the iostat numbers strongly suggested a random read workload, which didn’t make sense. I checked the ZFS prefetch property, and found (and then remembered) that it had been disabled in our platform software. DTrace of the io provider probes, and ZFS internals, confirmed that this was a streaming workload that should benefit from prefetch. I created a simulation of the workload in the lab to test prefetch, and then tested in production, where it showed a 2x to 4x improvement. After some more testing, we turned on ZFS prefetch by default in the cloud platform software, to benefit all customers.
15. Percona … 16x
Description: A benchmark evaluation of Percona on Joyent KVM/Linux vs AWS was not favorable. Business would be lost unless performance could be competitive.
Analysis: The benchmark was of custom queries run via mysqlslap. These updated a single row of a database – the same row – and were run from hundreds of threads in parallel. While this seemed like an odd test that would result in contention, the question was: why was it slower or our instances? Analysis began on the same ubuntu 12.04 instances that were tested, with 8 vCPUs. mpstat and pidstat -t showed that there was a scalability issue: despite a high number of threads and concurrency, the CPUs were about 88% idle, with 8% user and 4% system time.
To get a handle on what was occurring, I wanted to profile CPU usage, and also trace user- and kernel-stacks during scheduler off-CPU events. I switched SmartOS so that I could use DTrace for this, and got the same database and load running. prstat and DTrace confirmed that mysql was experiencing heavy lock contention, and additional CPUs allowed additional threads to spin and contend, creating negative scalability. I tested this experimentally, on the Linux instance I offlined all CPUs except for one (echo 0 to the /sys/…cpu…/online files). This would hurt most workloads, however, it improved the benchmark by 16x, and made the result very competitive. A similar result was seen by binding the mysqld to one CPU using taskset. I’ve included this as an example of negative scalability, diagnosed experimentally. The real root cause (not described here) was later found using a different tool and fixed.
16. ZFS … up to 100x
Description: Semi-regular spikes in I/O latency on an SSD postgres server.
Analysis: The customer reported multi-second I/O latency for a server with flash memory-based solid state disks (SSDs). Since this SSD type was new in production, it was feared that there may be a new drive or firmware problem causing high latency. ZFS latency counters, measured at the VFS interface, confirmed that I/O latency was dismal, sometimes reaching 10 seconds for I/O. The DTrace-based iosnoop tool (DTraceToolkit) was used to trace at the block device level, however, no seriously slow I/O was observed from the SSDs. I plotted the iosnoop traces using R for evidence of queueing behind TXG flushes, but they didn’t support that theory either.
This was difficult to investigate since the slow I/O was intermittent, sometimes only occurring once per hour. Instead of a typical interactive investigation, I developed various ways to log activity from DTrace and kstats, so that clues for the issue could be examined afterwards from the logs. This included capturing which processes were executed using execsnoop, and dumping ZFS metrics from kstat, including arcstats. This showed that various maintenance processes were executing during the hour, and, the ZFS ARC, which was around 210 Gbytes, would sometimes drop by around 6 Gbytes. Having worked performance issues with shrinking ARCs before, I developed a DTrace script to trace ARC reaping along with process execution, and found that it was a match with a cp(1) command. This was part of the maintenance task, which was copying a 30 Gbyte file, hitting the ARC limit and triggering an ARC shrink. Shrinking involves holding ARC hash locks, which can cause latency, especially when shrinking 6 Gbytes worth of buffers. The zfs:zfs_arc_shrink_shift tunable was adjusted to reduce the shrink size, which also made them more frequent. The worst-case I/O improved from 10s to 100ms.
17. Sphinx … 75%
Description: The Sphinx search engine was twice as fast on AWS Linux than SmartOS, when all other factors were identical.
Analysis: The TSA Method showed over 90% of time was spent in USR mode. DTrace CPU profiling data was made into a Flame Graph, which showed Sphinx acting normally. The Flame Graph revealed the request functions, which were traced with the pid provider and both timestamp and vtimestamp deltas to confirm USR time. On AWS Linux, perf profiling data was collected and also made into a Flame Graph. By examining them both and spotting the difference, it was clear that not the same application code was running, and a case of a function being elided on Linux was found. A renewed investigation into compiler differences found that the Linux build used -O3 (optimizations), whereas the SmartOS build did not. This was added to the SmartOS build, roughly doubling performance.
18. rsync … 2x
Description: A customer needed better rsync throughput, which was low on Joyent Linux compared to AWS.
Analysis: Joyent Linux could rsync at around 16 MB/s, whereas AWS Linux was around 50 MB/s. Some network and ssh tuning was initially suggested, however, the customer wanted to know why such tuning wasn’t necessary on AWS, which was already performing well. This was debugged using static performance tuning: checking the configurations and versions of everything, as well as active analysis. Several areas to improve performance were identified, the greatest was that the default version of OpenSSH on Joyent Linux was 5.9p1, whereas on AWS it was 6.2p2, which ran over two times faster. This issue was discovered by the static performance tuning methodology, and not a tool. The other issues identified did involve many tools: sar, pidstat, dtrace4linux, perf, ktap, Flame Graphs. I wanted to use the OpenSSH version difference as the example here, as it was an example of a methodology-based win and not a tool.
19. ab … 5x
Description: During a capacity planning exercise, a customer noticed AWS had around 5x higher node.js HTTP throughput than Joyent SmartOS.
Analysis: Apache Bench (ab) was used to drive load to a node.js process on localhost. I used netstat to investigate TCP activity, which showed the connection rate began at about 5k per second, and after about two seconds dropped down to 1k per second. When performance was poor, both node and ab spent time in the sleep state, blocked on something else. This sounded like a TCP TIME_WAIT issue I’ve debugged earlier, but the number of open TCP sessions suggested otherwise (only 11k; was expecting 33k+). Using DTrace to trace syscall latency, showed this time was in portfs: the SmartOS event ports implementation (epoll). This means that time was spent blocked on something else, which then woke up the file descriptors. Using DTrace to trace the actual wakeup which unblocked node, showed that it was TCP retransmits, despite this being a localhost benchmark. A packet capture of the retransmits, and then more DTrace of the kernel code paths, showed that these were caused by packets arriving during TIME_WAIT. This was the TIME_WAIT issue I’ve debugged earlier, although manifesting differently. The problem is one of benchmarking from a single client IP address, and new SYNs colliding with earlier sessions still in TIME_WAIT. Running this from multiple clients – which is the real world use case – showed performance could be steady at 5k connections/sec. Longer writeup here.
20. Elasticsearch … 50%
Description: Performance of an Elasticsearch benchmark on SmartOS was only 15% faster than AWS. Why not much more?
Analysis: Given the various performance features with SmartOS (OS virtualization, ZFS ARC, CPU bursting), the expectation was much more than 15%. Various performance analysis tools were used for analysis while the benchmark was running (the “active benchmarking” methodology), with arcstat showing a moderate rate of ARC misses, and iostat showing a random read workload. Other resources were not stressed, and this looked like a disk-bound benchmark. However, over time, the ARC become warmer, and the benchmark numbers improved. After about 20 minutes of testing, a steady state was reached, where the working set was fully cached, and disk reads reached zero. At this point, the benchmark become CPU-bound, and limited by a cloud imposed CPU resource control (CPU cap), as shown by prstat and kstat. In the warm state, SmartOS was running 50% faster than earlier. DTrace was used to investigate further improvements, such as comparing the Elasticsearch I/O size (~1 Kbyte) with the ZFS record size (128 Kbyte default).
In: Performance · Tagged with: performance, wins
Benchmarking, and benchmarking the cloud, is incredibly error prone. I provided guidance though this minefield in the benchmarking chapter of my book (Systems Performance: Enterprise and the Cloud); that chapter can be read online on the InformIT site. I also gave a lightning talk about benchmarking gone wrong at Surge last year. In this post, I’m going to cut to the chase and show you the tools I commonly use for basic cloud benchmarking.
As explained in the benchmarking chapter, I do not run these tools passively. I perform Active Benchmarking, where I use a variety of other observability tools while the benchmark is running, to confirm that it is measuring what it is supposed to. For perform reliable benchmarking, you need to be rigorous. For some suggestions of observability tools that you can use, try starting with the OS checklists (Linux, Solaris, etc.) from the USE Method.
Why These Tools
The aim here is to benchmark the performance of cloud instances, either for evaluations, capacity planning, or for troubleshooting performance issues. My approach is to use micro-benchmarks, where a single component or activity is tested, and to test on different resource dimensions: CPUs, networking, file systems. The results can then be mapped à la carte: I may be investigating an production application workload that has high network throughput, moderate CPU usage, and negligible file system usage, and so I can weigh the importance of each accordingly. Additional goals for testing these dimensions in the cloud environment are listed in the following sections.
CPUs: noploop, sysbench
For CPUs, this is what I’d like to test, and why:
- Single-threaded performance: this should map to my expectations for processor speed and type. If it doesn’t, if there is noticeable jitter or throttling, then I’ll investigate that separately.
- Multi-threaded performance: to see how much CPU capacity this instance has, including whether there are cloud limits in play.
For single-threaded performance, I start by hacking up an assembly program (noploop) to investigate instruction retire rates, and disassemble the binary to confirm what is being measured. That gives me a baseline result for how fast the CPUs really are. I’ll write up that process when I get a chance.
sysbench can test single-threaded and multi-threaded performance by calculating prime numbers. This also brings memory I/O into play. You need to be running the same version of sysbench, with the same compilation options, to be able to compare results. Testing from 1 to 8 threads:
sysbench --num-threads=1 --test=cpu --cpu-max-prime=25000 run sysbench --num-threads=2 --test=cpu --cpu-max-prime=25000 run sysbench --num-threads=4 --test=cpu --cpu-max-prime=25000 run sysbench --num-threads=8 --test=cpu --cpu-max-prime=25000 run
The value for cpu-max-prime should be chosen so that the benchmark runs for at least 10 seconds. I don’t test for longer than 60 seconds, unless I’m looking for systemic perturbations like cronjobs.
I’ll run the same multi-threaded sysbench invocation a number of times, to look for repeatability. This could vary based on scheduler placement, CPU affinity, and memory groups.
The single-threaded results are important for single-threaded (or effectively single-threaded) applications, like node.js. Multi-threaded for applications like MySQL server.
While sysbench is running, you’ll want to analyze CPU usage. For example, on Linux, I’d use mpstat, sar, pidstat, and perf. On SmartOS, I’d use mpstat, prstat, and DTrace profiling.
For networking, this is what I’d like to test, and why:
- Throughput, single-threaded: find the maximum throughput of one connection.
- Throughput, multi-threaded: look for cloud instance limits (resource controls).
iperf works well for this. Example commands:
# server iperf -s -l 128k # client, 1 thread iperf -c server_IP -l 128k -i 1 -t 30 # client, 2 threads iperf -c server_IP -P 2 -l 128k -i 1 -t 30
Here I’ve included -i 1 to print per-second summaries, so I can watch for variance.
While iperf is running, you’ll want to analyze network and CPU usage. On Linux, I’d use nicstat, sar, and pidstat. On SmartOS, I’d use nicstat, mpstat, and prstat.
File systems: fio
For file systems, this is what I’d like to test, and why:
- Medium working set size: as a realistic test of the file system and its cache. I’d test both single-threaded and multi-threaded performance, and analyze to confirm whether these hit CPU or file system limits.
By “medium”, I mean a working set size somewhat larger than the instance memory size. Eg, for a 1 Gbyte instance, I’d create a total file set of 10 Gbytes, with a non-uniform access distribution so that it has a cache hit ratio in the 90%s. These characteristics are chosen to match what I’ve seen are typical of the cloud. If you know what your total file size will be, working set size, and access distribution, then by all means test that instead.
I’ve been impressed by fio by Jens Axboe. Here’s how I’d use it:
# throw-away: 5 min warm-up fio --runtime=300 --time_based --clocksource=clock_gettime --name=randread --numjobs=8 \ --rw=randread --random_distribution=pareto:0.9 --bs=8k --size=10g --filename=fio.tmp # file system random I/O, 10 Gbytes, 1 thread fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread --numjobs=1 \ --rw=randread --random_distribution=pareto:0.9 --bs=8k --size=10g --filename=fio.tmp # file system random I/O, 10 Gbytes, 8 threads fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread --numjobs=8 \ --rw=randread --random_distribution=pareto:0.9 --bs=8k --size=10g --filename=fio.tmp
This is all about finding the “Goldilocks” working set size. People often test too small or too big:
- A small working set size, smaller than the instance memory size, may cache entirely in the instance file system cache. This can then become a test of kernel file system code paths, bounded by CPU speed. I sometimes do this to investigate file system pathologies, such as lock contention, and perturbations from cache flushing/shrinking.
- A large working set size, much larger than the instance memory size, can cause the file system cache to mostly miss. This becomes a test of storage I/O.
These tests are useful if and only if you can explain why the results are interesting: how they map to your production environment.
While fio is running, you’ll want to analyze file system, disk, and CPU usage. On Linux, I’d use sar, iostat, pidstat, and a profiler (perf, …). On SmartOS, I’d use vfsstat, iostat, prstat, and DTrace.
There are many more benchmarking tools for different targets. In my experience, it’s best to assume that they are all broken or misleading until proven otherwise, by use of analysis and sanity checking.
I’m giving a webinar about benchmarking the cloud on February 6th, where I’ll explain the process in detail, and help you benchmark reliably and effectively! Sign up here.