The USE Method: Linux Performance Checklist
The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.
In this post, I’ll provide an example of a USE-based metric list for Linux operating systems (eg, Ubuntu, CentOS, Fedora). This is primarily intended for system administrators of the physical systems, who are using command line tools. Some of these metrics can be found in remote monitoring tools.
Physical Resources
| component | type | metric |
|---|---|---|
| CPU | utilization | per-cpu: mpstat -P ALL 1, “%idle”; sar -P ALL, “%idle”; system-wide: vmstat 1, “id”; sar -u, “%idle”; dstat -c, “idl”; per-process: top, “%CPU”; htop, “CPU%”; ps -o pcpu; per-kernel-thread: top/htop (“K” to toggle), where VIRT == 0 (heuristic). [1] |
| CPU | saturation | system-wide: vmstat 1, “r” > CPU count [2]; sar -q, “runq-sz” > CPU count; dstat -p, “run” > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows “Average” and “Maximum” delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp “queued(us)” [3] |
| CPU | errors | perf (LPE) if processor specific error events (CPC) are available; eg, AMD64’s “04Ah Single-bit ECC Errors Recorded by Scrubber” [4] |
| Memory capacity | utilization | system-wide: free -m, “Mem:” (main memory), “Swap:” (virtual memory); vmstat 1, “free” (main memory), “swap” (virtual memory); sar -r, “%memused”; dstat -m, “free”; per-process: top/htop, “RES” (resident main memory), “VIRT” (virtual memory), “Mem” for system-wide summary |
| Memory capacity | saturation | system-wide: vmstat 1, “si”/“so” (swapping); sar -B, “pgscank” + “pgscand” (scanning); sar -W; per-process: 10th field (min_flt) from /proc/PID/stat for minor-fault rate, or dynamic tracing [5] |
| Memory capacity | errors | dmesg for physical failures; dynamic tracing, eg, SystemTap uprobes for failed malloc()s |
| Network Interfaces | utilization | ip -s link, RX/TX tput / max bandwidth; /proc/net/dev, “bytes” RX/TX tput/max |
| Network Interfaces | saturation | ifconfig, “overruns”, “dropped”; netstat -s, “segments retransmited”; /proc/net/dev, RX/TX “drop”; dynamic tracing for other TCP/IP stack queueing [5] |
| Network Interfaces | errors | ifconfig, “errors”, “dropped”; netstat -i, “RX-ERR”/“TX-ERR”; ip -s link, “errors”; /proc/net/dev, “errs”, “drop”; extra counters may be under /sys/class/net/…; dynamic tracing of driver function returns [6] |
| Storage device I/O | utilization | system-wide: iostat -xz 1, “%util”; sar -d, “%util”; per-process: iotop; /proc/PID/sched “se.statistics.iowait_sum” |
| Storage device I/O | saturation | iostat -xnz 1, high “avgqu-sz”, or “await” >> “r_await” or “w_await”; sar -d same; LPE block probes for queue length/latency; dynamic/static tracing of I/O subsystem (incl. LPE block probes) |
| Storage device I/O | errors | /sys/devices/…/ioerr_cnt; smartctl; dynamic/static tracing of I/O subsystem response codes [7] |
| Storage capacity | utilization | swap: swapon -s; free; /proc/meminfo “SwapFree”/“SwapTotal”; file systems: “df -h” |
| Storage capacity | saturation | not sure this one makes sense – once it’s full, ENOSPC |
| Storage capacity | errors | file systems: strace for ENOSPC; dynamic tracing for ENOSPC; /var/log/messages errs, depending on FS |
| Storage controller | utilization | iostat -xz 1, sum devices and compare to known IOPS/tput limits per-card |
| Storage controller | saturation | see storage device saturation, … |
| Storage controller | errors | see storage device errors, … |
| Network controller | utilization | infer from ip -s link (or /proc/net/dev) and known controller max tput for its interfaces |
| Network controller | saturation | see network interface saturation, … |
| Network controller | errors | see network interface errors, … |
| CPU interconnect | utilization | LPE (CPC) for CPU interconnect ports, tput / max |
| CPU interconnect | saturation | LPE (CPC) for stall cycles |
| CPU interconnect | errors | LPE (CPC) for whatever is available |
| Memory interconnect | utilization | LPE (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters |
| Memory interconnect | saturation | LPE (CPC) for stall cycles |
| Memory interconnect | errors | LPE (CPC) for whatever is available |
| I/O interconnect | utilization | LPE (CPC) for tput / max if available; inference via known tput from iostat/ip/… |
| I/O interconnect | saturation | LPE (CPC) for stall cycles |
| I/O interconnect | errors | LPE (CPC) for whatever is available |
- [1] There can be some oddities with the %CPU from top/htop in virtualized environments; I’ll update with details later when I can.
- CPU utilization: a single hot CPU can be caused by a single hot thread, or mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel.
- uptime “load average” (or /proc/loadavg) wasn’t included for CPU metrics since Linux load averages include tasks in the uninterruptible state (usually I/O).
- [2] The man page for vmstat describes “r” as “The number of processes waiting for run time”, which is either incorrect or misleading (on recent Linux distributions it’s reporting those threads that are waiting, and threads that are running on-CPU; it’s just the wait threads in other OSes).
- [3] There may be a way to measure per-process scheduling latency with perf’s sched:sched_process_wait event, otherwise perf probe to dynamically trace the scheduler functions, although, the overhead under high load to gather and post-process many (100s of) thousands of events per second may make this prohibitive. SystemTap can aggregate per-thread latency in-kernel to reduce overhead, although, last I tried schedtimes.stp (on FC16) it produced thousands of “unknown transition:” warnings.
- LPE == Linux Performance Events, a powerful observability toolkit that reads CPC and can also use dynamic and static tracing. Its interface is the perf command.
- CPC == CPU Performance Counters (aka “Performance Instrumentation Counters” (PICs) or “Performance Monitoring Events” (PMUs) or “Hardware Events”), read via programmable registers on each CPU by perf (which it was originally designed to do). These have traditionally been hard to work with due to differences between CPUs. LPE perf makes life easier by providing aliases for commonly used counters. Be aware that there are usually many more made available by the processor, accessible by providing their hex values to perf stat -e. Expect to spend some quality time (days) with the processor vendor manuals when trying to use these. (My short video about CPC may be useful, despite not being on Linux).
- [4] There aren’t many error-related events in the recent Intel and AMD processor manuals; be aware that the public manuals may not show a complete list of events.
- [5] The goal is a measure of memory capacity saturation – the degree to which a process is driving the system beyond its ability (and causing paging/swapping). High fault latency works well, but there isn’t a standard LPE probe or existing SystemTap example of this (roll your own using dynamic tracing). Another metric that may serve a similar goal is minor-fault rate by process, which could be watched from /proc/PID/stat. This should be available in htop as MINFLT.
- [6] Dropped packets are included as both saturation and error indicators, since they can occur due to both types of events.
- [7] This includes tracing functions from different layers of the I/O subsystem: block device, SCSI, SATA, IDE, … Some static probes are available (LPE “scsi” and “block” tracepoint events), else use dynamic tracing.
- CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
- I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
- Dynamic Tracing: Allows custom metrics to be developed, live in production. Options on Linux include: LPE’s “perf probe”, which has some basic functionality (function entry and variable tracing), although in a trace-n-dump style that can cost performance; SystemTap (in my experience, almost unusable on CentOS/Ubuntu, but much more stable on Fedora); DTrace-for-Linux, either the Paul Fox port (which I’ve tried) or the OEL port (which Adam has tried), both projects very much in beta.
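As a minimal sketch of the “RX/TX tput / max bandwidth” calculation for network interfaces, the following Python parses two snapshots of /proc/net/dev and reports percent utilization per direction. The function names are hypothetical. The link speed is an assumption supplied by the caller (it is not in /proc/net/dev), and for simplicity one speed is applied to every interface.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        values = fields.split()
        # After the interface name: field 0 is RX bytes, field 8 is TX bytes.
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


def interface_utilization(before, after, interval_s, speed_mbit):
    """Percent (rx, tx) utilization for interfaces present in both snapshots.

    speed_mbit is an assumed link speed in Mbit/s, supplied by the caller.
    """
    max_bytes = speed_mbit * 1_000_000 / 8 * interval_s
    result = {}
    for name in before.keys() & after.keys():
        rx = after[name][0] - before[name][0]
        tx = after[name][1] - before[name][1]
        result[name] = (100.0 * rx / max_bytes, 100.0 * tx / max_bytes)
    return result
```

In use, the text would come from reading /proc/net/dev twice, interval_s seconds apart. Counter wraps and interfaces that appear or disappear between snapshots are not handled here.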
Software Resources
| component | type | metric |
|---|---|---|
| Kernel mutex | utilization | With CONFIG_LOCK_STAT=y, /proc/lock_stat “holdtime-total” / “acquisitions” (also see “holdtime-min”, “holdtime-max”) [8]; dynamic tracing of lock functions or instructions (maybe) |
| Kernel mutex | saturation | With CONFIG_LOCK_STAT=y, /proc/lock_stat “waittime-total” / “contentions” (also see “waittime-min”, “waittime-max”); dynamic tracing of lock functions or instructions (maybe); spinning shows up with profiling (perf record -a -g -F 997 ..., oprofile, dynamic tracing) |
| Kernel mutex | errors | dynamic tracing (eg, recursive mutex enter); other errors can cause kernel lockup/panic, debug with kdump/crash |
| User mutex | utilization | valgrind --tool=drd --exclusive-threshold=... (held time); dynamic tracing of lock to unlock function time |
| User mutex | saturation | valgrind --tool=drd to infer contention from held time; dynamic tracing of synchronization functions for wait time; profiling (oprofile, LPE, …) user stacks for spins |
| User mutex | errors | valgrind --tool=drd various errors; dynamic tracing of pthread_mutex_lock() for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, … |
| Task capacity | utilization | top/htop, "Tasks" (current); sysctl kernel.threads-max, /proc/sys/kernel/threads-max (max) |
| Task capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (sar -B "pgscan*"), else examine using dynamic tracing |
| Task capacity | errors | "can't fork()" errors; user-level threads: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: dynamic tracing of kernel_thread() ENOMEM |
| File descriptors | utilization | system-wide: sar -v, "file-nr" vs /proc/sys/fs/file-max; dstat --fs, "files"; or just /proc/sys/fs/file-nr; per-process: ls /proc/PID/fd \| wc -l vs ulimit -n |
| File descriptors | saturation | does this make sense? I don't think there is any queueing or blocking, other than on memory allocation. |
| File descriptors | errors | strace errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...). |
- [8] Kernel lock analysis used to be via lockmeter, which had an interface called "lockstat".
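For system-wide file descriptor utilization, a rough sketch of reading /proc/sys/fs/file-nr might look like the following. The file holds three numbers: allocated file handles, allocated-but-unused handles, and the system-wide maximum (the same value as /proc/sys/fs/file-max). The function name is hypothetical, and the sample values in the usage note are made up.

```python
def file_handle_utilization(file_nr_text):
    """Return (in_use, maximum, percent) from /proc/sys/fs/file-nr content."""
    allocated, unused, maximum = (int(f) for f in file_nr_text.split())
    in_use = allocated - unused
    return in_use, maximum, 100.0 * in_use / maximum
```

For example, passing the string "5120\t0\t100000\n" gives 5120 handles in use out of a maximum of 100000.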
What's Next
See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.
Acknowledgements
Resources used:
- dstat documentation
- 20 Linux monitoring tools every sysadmin should know
- Rosetta Stone for Unix
- SystemTap Example Scripts for schedtimes.stp
- PerfUserGuide Linux profiling with Perf (LPE)
- perf announcements on LKML by Ingo Molnar, Peter Zijlstra, ...
- Perf homepage
- lock_stat by Peter Zijlstra
- lockmeter, for historical interest (now done via lock_stat)
- oprofile by John Levon
- Valgrind DRD which has a good list (8.2.4) of detected lock errors
- Linux kernel source
- man pages
Filling in this checklist has required a lot of research, testing and experimentation. Please reference back to this post if it helps you develop related material.
It's quite possible I've missed something or included the wrong metric somewhere (sorry); I'll update the post to fix these up as they are understood.
The USE Method: Solaris Performance Checklist
The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.
In this post, I’ll provide an example of a USE-based metric list for the Solaris operating system (I’m writing this for later Solaris 10 or Oracle Solaris 11 systems; I’ll do illumos/SmartOS separately, later). This is primarily intended for system administrators of the physical systems.
Physical Resources
| component | type | metric |
|---|---|---|
| CPU | utilization | per-cpu: mpstat 1, “idl”; system-wide: vmstat 1, “id”; per-process: prstat -c 1 (“CPU” == recent), prstat -mLc 1 (“USR” + “SYS”); per-kernel-thread: lockstat -Ii rate, DTrace profile stack() |
| CPU | saturation | system-wide: uptime, load averages; vmstat 1, “r”; DTrace dispqlen.d (DTT) for a better “vmstat r”; per-process: prstat -mLc 1, “LAT” |
| CPU | errors | fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling) |
| Memory capacity | utilization | system-wide: vmstat 1, “free” (main memory), “swap” (virtual memory); per-process: prstat -c, “RSS” (main memory), “SIZE” (virtual memory) |
| Memory capacity | saturation | system-wide: vmstat 1, “sr” (bad now), “w” (was very bad); vmstat -p 1, “api” (anon page ins == pain), “apo”; per-process: prstat -mLc 1, “DFL”; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname |
| Memory capacity | errors | fmadm faulty and prtdiag for physical failures; fmstat -s -m cpumem-retire (ECC events); DTrace failed malloc()s |
| Network Interfaces | utilization | nicstat; kstat; dladm show-link -s -i 1 interface |
| Network Interfaces | saturation | nicstat; kstat for whatever custom statistics are available (eg, “nocanputs”, “defer”, “norcvbuf”, “noxmtbuf”); netstat -s, retransmits |
| Network Interfaces | errors | netstat -i, error counters; dladm show-phys; kstat for extended errors, look in the interface and “link” statistics (there are often custom counters for the card) |
| Storage device I/O | utilization | system-wide: iostat -xnz 1, “%b”; per-process: DTrace iotop |
| Storage device I/O | saturation | iostat -xnz 1, “wait”; DTrace iopending (DTT), sdqueue.d (DTB) |
| Storage device I/O | errors | iostat -En; DTrace I/O subsystem, eg, ideerr.d (DTB), satareasons.d (DTB), scsireasons.d (DTB), sdretry.d (DTB) |
| Storage capacity | utilization | swap: swap -s; file systems: “df -h”; plus other commands depending on FS type |
| Storage capacity | saturation | not sure this one makes sense – once it’s full, ENOSPC |
| Storage capacity | errors | DTrace; /var/adm/messages file system full messages |
| Storage controller | utilization | iostat -Cxnz 1, compare to known IOPS/tput limits per-card |
| Storage controller | saturation | look for kernel queueing: sd (iostat “wait” again), ZFS zio pipeline |
| Storage controller | errors | DTrace the driver, eg, mptevents.d (DTB); /var/adm/messages |
| Network controller | utilization | infer from nicstat and known controller max tput |
| Network controller | saturation | see network interface saturation |
| Network controller | errors | kstat for whatever is there / DTrace |
| CPU interconnect | utilization | cpustat (CPC) for CPU interconnect ports, tput / max (eg, see the amd64htcpu script) |
| CPU interconnect | saturation | cpustat (CPC) for stall cycles |
| CPU interconnect | errors | cpustat (CPC) for whatever is available |
| Memory interconnect | utilization | cpustat (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters |
| Memory interconnect | saturation | cpustat (CPC) for stall cycles |
| Memory interconnect | errors | cpustat (CPC) for whatever is available |
| I/O interconnect | utilization | busstat (SPARC only); cpustat for tput / max if available; inference via known tput from iostat/nicstat/… |
| I/O interconnect | saturation | cpustat (CPC) for stall cycles |
| I/O interconnect | errors | cpustat (CPC) for whatever is available |
- CPU utilization: a single hot CPU can be caused by a single hot thread, or mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel.
- lockstat and plockstat are DTrace-based since Solaris 10 FCS.
- vmstat “r”: this is coarse as it is only updated once per second.
- CPC == CPU Performance Counters (aka “Performance Instrumentation Counters” (PICs), or “Performance Monitoring Events”), read via programmable registers on each CPU, by cpustat(1M) or the DTrace “cpc” provider. These have traditionally been hard to work with due to differences between CPUs, but are getting much easier with the PAPI standard. Still, expect to spend some quality time (days) with the processor vendor manuals (what “cpustat -h” tells you to read), and to post-process cpustat with awk or perl. See my short talk (video) about CPC (2010). (Many years ago, I made a toolkit including CPC scripts – CacheKit – that was too much work to maintain.)
- Memory capacity utilization: interpreting vmstat’s “free” has been tricky across different Solaris versions (we documented it in the Perf & Tools book), due to different ways it was calculated, and tunables that affect when the system will kick-off the page scanner. It’ll also typically shrink as the kernel uses unused memory for caching (ZFS ARC).
- Be aware that kstat can report bad data (so can any tool); there isn’t really a test suite for kstat data, and engineers can add new code paths and forget to add the counters.
- DTT == DTraceToolkit scripts, DTB == DTrace book scripts.
- CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
- I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
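The “CPI greater than, say, 5” heuristic for memory interconnect utilization can be sketched as below. This assumes cycle and instruction counts have already been collected from CPC over the same interval; the function names are hypothetical, and the default threshold of 5 is only the rough figure suggested in the table, not a hard limit.

```python
def cycles_per_instruction(cycles, instructions):
    """CPI over an interval, given two counter deltas."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def memory_stall_suspected(cycles, instructions, cpi_threshold=5.0):
    """True if CPI exceeds the (rough, workload-dependent) threshold."""
    return cycles_per_instruction(cycles, instructions) > cpi_threshold
```

A high CPI only hints at stall cycles. Stall-cycle counters, where the processor provides them, are a more direct measure.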
Software Resources
| component | type | metric |
|---|---|---|
| Kernel mutex | utilization | lockstat -H (held time); DTrace lockstat provider |
| Kernel mutex | saturation | lockstat -C (contention); DTrace lockstat provider; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }' |
| Kernel mutex | errors | lockstat -E, eg recursive mutex enter (other errors can cause kernel lockup/panic, debug with mdb -k) |
| User mutex | utilization | plockstat -H (held time); DTrace plockstat provider |
| User mutex | saturation | plockstat -C (contention); prstat -mLc 1, "LCK"; DTrace plockstat provider |
| User mutex | errors | DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C) |
| Process capacity | utilization | sar -v, “proc-sz”; kstat, “unix:0:var:v_proc” for max, “unix:0:system_misc:nproc” for current; DTrace (`nproc vs `max_nprocs) |
| Process capacity | saturation | not sure this makes sense; you might get queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full |
| Process capacity | errors | “can’t fork()” messages |
| Thread capacity | utilization | user-level: kstat, “unix:0:lwp_cache:buf_inuse” for current, prctl -n zone.max-lwps -i zone ZONE for max; kernel: mdb -k or DTrace, “nthread” for current, limited by memory |
| Thread capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (vmstat “sr”), else examine using DTrace/mdb. |
| Thread capacity | errors | user-level: pthread_create() failures with EAGAIN, EINVAL, …; kernel: thread_create() blocks for memory but won’t fail. |
| File descriptors | utilization | system-wide (no limit other than RAM); per-process: pfiles vs ulimit or prctl -t basic -n process.max-file-descriptor PID; a quicker check than pfiles is ls /proc/PID/fd \| wc -l |
| File descriptors | saturation | does this make sense? I don’t think there is any queueing or blocking, other than on memory allocation. |
| File descriptors | errors | truss or DTrace (better) to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), …). |
- lockstat/plockstat often drop events due to load; I often roll my own to avoid this using the DTrace lockstat/plockstat provider (examples in the DTrace book).
- File descriptor utilization: while other OSes have a system-wide limit, Solaris doesn’t (at least at the moment, this could change; see my writeup about it).
What’s Next
See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.
The USE Method
A serious performance issue arises, and you suspect it’s caused by the server. What do you check first? Back when I was teaching operating system performance, I wanted a methodology my students could follow to find common issues quickly, without overlooking important areas. Like an emergency checklist in a flight manual, it would be something simple, straightforward, complete and fast. I eventually came up with the “USE” method (short for “Utilization Saturation and Errors”), which I’ve used many times successfully in enterprise environments, and more recently in cloud computing environments.
The goal of USE is to complete a quick check of server health, identifying resource bottlenecks. It provides a way to construct your own checklist, based on three metric types and a strategy for approaching a complex system. I find it solves about 80% of server issues with 5% of the effort, and, as I will demonstrate, it can be applied to systems other than servers.
The USE Method should be thought of as a tool, one that is part of a larger toolbox. There are many problem types it doesn’t solve, which will require other methods and longer time spans.
Problem Statement
Before the USE Method, the usual questions can be asked:
- What makes you think there is a performance problem?
- Has it ever performed well?
- What changed recently? Software or hardware? Load?
- Can it be expressed in terms of latency or run time?
- Does the problem affect other people or applications?
- What is the environment? What software and hardware is used? Versions? Configuration?
These are typical questions that technical support staff ask when first handling performance issues. While they may seem obvious, they do solve many issues immediately. Once you are past these, you are more likely to have a genuine problem.
The USE Method
The USE Method can be summarized as:
For every resource, check utilization, saturation and errors.
It’s intended to be used early in a performance investigation, to identify systemic bottlenecks.
Terminology definitions:
- resource: all physical server functional components (CPUs, disks, busses, …) [1]
- utilization: the average time that the resource was busy servicing work [2]
- saturation: the degree to which the resource has extra work which it can’t service, often queued
- errors: the count of error events
[1] It can be useful to consider some software resources as well, and see which metrics are possible.
[2] There is another definition where utilization describes the proportion of a resource that is used, and so 100% utilization means no more work can be accepted, unlike with the “busy” definition above.
The metrics are usually expressed in the following terms:
- utilization: as a percent over a time interval. eg, “one disk is running at 90% utilization”.
- saturation: as a queue length. eg, “the CPUs have an average run queue length of four.”
- errors: scalar counts. eg, “this network interface has had fifty late collisions.”
Errors should be investigated because they can degrade performance, and may not be immediately noticed when the failure mode is recoverable. This includes operations that fail and are retried, and devices from a pool of redundant devices that fail.
Does Low Utilization Mean No Saturation?
A short burst of high utilization can cause saturation and performance issues, even though utilization is low over a long interval. This may be counter-intuitive.
I had a recent example of this where a customer had problems with CPU saturation (latency) even though their monitoring tools showed CPU utilization was never higher than 80%. The monitoring tool was reporting five minute averages, during which CPU utilization hit 100% for seconds at a time.
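To illustrate how an average can hide bursts, here is a small sketch using made-up data: 300 one-second CPU utilization samples (five minutes) at 75%, with three five-second bursts at 100%. The numbers and function names are hypothetical, chosen only to mirror the situation described above.

```python
def average(samples):
    """Mean utilization over the whole interval."""
    return sum(samples) / len(samples)


def longest_saturated_run(samples, threshold=100.0):
    """Length of the longest run of consecutive samples at or above threshold."""
    longest = current = 0
    for s in samples:
        if s >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical data: five minutes of one-second samples with three bursts.
samples = [75.0] * 300
for start in (40, 150, 260):
    samples[start:start + 5] = [100.0] * 5

five_minute_average = average(samples)      # what a coarse monitor reports
peak = max(samples)                         # what the coarse monitor misses
burst_seconds = longest_saturated_run(samples)
```

Here the five minute average stays under 80%, while the peak is 100% for five seconds at a time, which is long enough for run queue latency to build.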
Resource List
To begin with, you need a list of resources to iterate through. Here is a generic list for servers:
- CPUs: sockets, cores, hardware threads (virtual CPUs)
- Memory: capacity
- Network interfaces
- Storage devices: I/O, capacity
- Controllers: storage, network cards
- Interconnects: CPUs, memory, I/O
Some components are two types of resources: storage devices are a service request resource (I/O) and also a capacity resource (population). Both types can become a system bottleneck. Request resources can be defined as queueing systems, which can queue and then service requests.
Some physical components have been left out, such as hardware caches (eg, MMU TLB/TSB, CPU). The USE Method is most effective for resources that suffer performance degradation under high utilization or saturation, leading to a bottleneck. Caches improve performance under high utilization.
Cache hit rates and other performance attributes can be checked after the USE Method – after systemic bottlenecks have been ruled out. If you are unsure whether to include a resource, include it, then see how well the metrics work.
Functional Block Diagram
Another way to iterate over resources is to find or draw a Functional Block Diagram for the system. These also show relationships, which can be very useful when looking for bottlenecks in the flow of data. Here is an example from the Sun Fire V480 Guide (page 82):

I love these diagrams, although they can be hard to come by. Hardware engineers can be the best resource – the people who actually build the things. Or you can try drawing your own.
While determining utilization for the various busses, annotate each bus on the functional diagram with its maximum bandwidth. This results in a diagram where systemic bottlenecks may be identified before a single measurement has been taken. (This is a useful exercise during hardware product design, when physical components can be changed.)
Interconnects
CPU, memory and I/O interconnects are often overlooked. Fortunately, they aren’t commonly the system bottleneck. Unfortunately, if they are, it can be difficult to do much about it (maybe you can upgrade the main board, or reduce load: eg, “zero copy” projects lighten memory bus load). With the USE Method, at least you become aware of what you weren’t considering: interconnect performance. See Analyzing the HyperTransport for an example of an interconnect issue which I identified with the USE Method.
Metrics
Given the list of resources, consider the metric types: utilization, saturation and errors.
Here are some examples. In the table below, think about each resource and metric type, and see if you can come up with a metric before reading the answer. The answers are described in generic Unix/Linux terms (you can be more specific):
| resource | type | metric |
|---|---|---|
| CPU | utilization | CPU utilization (either per-CPU or a system-wide average) |
| CPU | saturation | dispatcher queue length (aka run-queue length) |
| Memory capacity | utilization | available free memory (system-wide) |
| Memory capacity | saturation | anonymous paging or thread swapping (maybe “page scanning” too) |
| Network interface | utilization | RX/TX throughput / max bandwidth |
| Storage device I/O | utilization | device busy percent |
| Storage device I/O | saturation | wait queue length |
| Storage device I/O | errors | device errors (“soft”, “hard”, …) |
I’ve left off timing: these metrics are either averages per interval or counts. I’ve also left off how to fetch them: for your custom checklist, include which OS tool or monitoring software to use, and which statistic to read. For those metrics that aren’t available, write “?”. You will end up with a checklist that is easy and quick to follow, and is as complete as possible for your system.
Harder Metrics
Now for some harder combinations (again, try to think about these first!):
| resource | type | metric |
|---|---|---|
| CPU | errors | eg, correctable CPU cache ECC events or faulted CPUs (if the OS+HW supports that) |
| Memory capacity | errors | eg, failed malloc()s (although this is usually due to virtual memory exhaustion, not physical) |
| Network | saturation | saturation related NIC or OS errors; eg “nocanputs” |
| Storage controller | utilization | depends on the controller; it may have a max IOPS or throughput that can be checked vs current activity |
| CPU interconnect | utilization | per port throughput / max bandwidth (CPU performance counters) |
| Memory interconnect | saturation | memory stall cycles, high CPI (CPU performance counters) |
| I/O interconnect | utilization | bus throughput / max bandwidth (performance counters may exist on your HW; eg, Intel “uncore” events) |
These are getting tricky to measure – I often have to write my own software to do them (eg, the “amd64htcpu” script from Analyzing the HyperTransport).
Repeat for all combinations, and include instructions for fetching each metric. You’ll end up with a list of about thirty metrics, some of which can’t be measured, and some of which are tricky to measure. Fortunately, the most common issues are usually found with the easy ones (eg, CPU saturation, memory capacity saturation, network interface utilization, disk utilization), which can be checked first.
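The “repeat for all combinations” step can be sketched as a cross product of resources and metric types. The resource names below are one reading of the generic server list given earlier (with storage devices and controllers split into their two types), and blank_checklist is a hypothetical helper; each entry starts as “?” until a tool and statistic are filled in.

```python
from itertools import product

# One possible expansion of the generic resource list into ten resources.
RESOURCES = [
    "CPU",
    "Memory capacity",
    "Network interfaces",
    "Storage device I/O",
    "Storage capacity",
    "Storage controller",
    "Network controller",
    "CPU interconnect",
    "Memory interconnect",
    "I/O interconnect",
]
METRIC_TYPES = ["utilization", "saturation", "errors"]


def blank_checklist(resources=RESOURCES, metric_types=METRIC_TYPES):
    """One row per (resource, type) pair, with the metric still unknown."""
    return [
        {"resource": r, "type": t, "metric": "?"}
        for r, t in product(resources, metric_types)
    ]


checklist = blank_checklist()
```

With ten resources and three metric types, this yields thirty rows, ordered so that each resource's utilization, saturation and errors rows sit together.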
In follow-up posts I’ll include sample USE-derived checklists for different operating systems.
Software Resources
Some software resources can be considered in a similar way. This usually applies to smaller components of software, not entire applications. For example:
- mutex locks: utilization may be defined as the time the lock was held; saturation by those threads queued waiting on the lock.
- thread pools: utilization may be defined as the time threads were busy processing work; saturation by the number of requests waiting to be serviced by the thread pool.
- process/thread capacity: the system may have a limited number of processes or threads, the current usage of which may be defined as utilization; waiting on allocation may be saturation; and errors are when the allocation failed (eg, “cannot fork”).
- file descriptor capacity: similar to the above, but for file descriptors.
Don’t sweat this type. If the metrics work well, use them, otherwise software can be left to other methodologies (eg, latency).
Suggested Interpretations
The USE Method helps you identify which metrics to use. After learning how to read them from the operating system, your next task is to interpret their current values. For some, interpretation may be obvious (and well documented). Others are not so obvious, and may depend on workload requirements or expectations.
The following are some general suggestions for interpreting metric types:
- Utilization: 100% utilization is usually a sign of a bottleneck (check saturation and its effect to confirm). High utilization (eg, beyond 70%) can begin to be a problem for a couple of reasons:
  - When utilization is measured over a relatively long time period (multiple seconds or minutes), a total utilization of, say, 70% can hide short bursts of 100% utilization.
  - Some system resources, such as hard disks, cannot be interrupted during an operation, even for higher-priority work. Once their utilization is over 70%, queueing delays can become more frequent and noticeable. Compare this to CPUs, which can be interrupted (“preempted”) at almost any moment.
- Saturation: any degree of saturation can be a problem (non-zero). This may be measured as the length of a wait queue, or time spent waiting on the queue.
- Errors: non-zero error counters are worth investigating, especially if they are still increasing while performance is poor.
It’s easy to interpret the negative case: low utilization, no saturation, no errors. This is more useful than it sounds – narrowing down the scope of an investigation can quickly bring focus to the problem area.
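The suggestions above can be sketched as a small rule function. This is a minimal illustration: the function names, the wording of the findings, and the sample values are my own assumptions, and the 70% threshold is the rule of thumb from the text rather than a fixed limit. The second function shows how an average over an interval can hide bursts.

```python
def interpret(utilization_pct, saturation, errors, high_util_pct=70.0):
    """Apply the general interpretation suggestions to one resource."""
    findings = []
    if errors > 0:
        findings.append("errors: non-zero, worth investigating")
    if utilization_pct >= 100.0:
        findings.append("utilization: 100%, likely bottleneck")
    elif utilization_pct > high_util_pct:
        findings.append("utilization: high, check for bursts and queueing")
    if saturation > 0:
        findings.append("saturation: non-zero, work is waiting")
    return findings


def average_and_peak(samples):
    """Return (average, peak) for a list of per-second utilization samples."""
    return sum(samples) / len(samples), max(samples)


if __name__ == "__main__":
    # Hypothetical per-second disk utilization over ten seconds.
    samples = [100, 100, 100, 60, 60, 60, 60, 60, 50, 50]
    average, peak = average_and_peak(samples)
    print("average:", average, "peak:", peak)
    print(interpret(average, saturation=0, errors=0))
    print(interpret(peak, saturation=3, errors=0))
```

An empty list is the negative case: low utilization, no saturation, no errors.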
Cloud Computing
In a cloud computing environment, software resource controls may be in place to limit or throttle tenants who are sharing one system. At Joyent we primarily use OS virtualization (SmartOS), which imposes memory limits, CPU limits and storage I/O throttling. Each of these resource limits can be examined with the USE Method, similar to examining the physical resources.
For example, in our environment “memory capacity utilization” can be the tenant’s memory usage vs its memory cap. “memory capacity saturation” can be seen by anonymous paging activity, even though the traditional Unix page scanner may be idle.
Strategy
The USE Method is pictured as a flowchart below. Note that errors can be checked before utilization and saturation, as a minor optimization (they are usually quicker and easier to interpret).

The USE Method identifies problems which are likely to be system bottlenecks. Unfortunately, systems can be suffering more than one performance problem, and so the first one you find may be a problem but not the problem. Each discovery can be investigated using further strategies, before returning to the USE Method as needed to iterate over more resources.
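The iteration can be sketched in code as follows. This is only an outline under assumptions of my own: the metric dictionaries, the resource names, and the 70% threshold are hypothetical, and on a real system each value would come from the tools in the checklist. It yields every discovery rather than stopping at the first, since the first problem found may not be the problem.

```python
def use_method(resources, high_util_pct=70.0):
    """Iterate over resources, yielding (name, findings) for each discovery.

    `resources` maps a resource name to a dict with the keys
    'errors', 'utilization_pct' and 'saturation'.
    """
    for name, metrics in resources.items():
        findings = []
        # Errors are checked first: usually quicker and easier to interpret.
        if metrics["errors"] > 0:
            findings.append("errors")
        if metrics["utilization_pct"] > high_util_pct:
            findings.append("utilization")
        if metrics["saturation"] > 0:
            findings.append("saturation")
        if findings:
            yield name, findings


if __name__ == "__main__":
    # Hypothetical measurements for three resources.
    measured = {
        "cpu": {"errors": 0, "utilization_pct": 35.0, "saturation": 0},
        "disk": {"errors": 0, "utilization_pct": 92.0, "saturation": 4},
        "network": {"errors": 12, "utilization_pct": 10.0, "saturation": 0},
    }
    for resource, findings in use_method(measured):
        print(resource, "->", ", ".join(findings))
```

Each yielded discovery is a starting point for the further strategies, after which the loop continues with the remaining resources.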
Strategies for further analysis include workload characterization and drill-down analysis. After completing these (if needed), you should have evidence for whether the corrective action is to adjust the load applied or to tune the resource itself.
Workload Characterization
The workload can be characterized by answering questions such as:
- Who is causing the load? Process ID, user ID, remote IP address?
- Why is the load being called? Code path?
- What are other characteristics of the load? IOPS, throughput, type?
- How is the load changing over time?
This helps separate problems of load from problems of architecture, by identifying the former.
The best performance wins are from eliminating unnecessary work. Sometimes these bottlenecks are caused by applications malfunctioning (eg, a thread stuck in a loop), or bad configurations (system-wide backups running during the day), and with maintenance or reconfiguration the work can be eliminated. Characterizing the load can identify these issues.
Each of the above questions can be answered by more metrics for that particular resource, and can be documented as further analysis steps at the end of the USE Method checklist.
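As a small illustration of the “who” question, the sketch below groups I/O events by process ID and ranks processes by bytes moved. The event records are hypothetical: in practice they would come from a tracing or accounting tool, and the field names here are my own.

```python
from collections import defaultdict


def characterize_by_pid(events):
    """Summarize I/O events per process, largest byte count first.

    Each event is a dict with at least the keys 'pid' and 'bytes'.
    """
    per_pid = defaultdict(lambda: {"ios": 0, "bytes": 0})
    for event in events:
        summary = per_pid[event["pid"]]
        summary["ios"] += 1
        summary["bytes"] += event["bytes"]
    return sorted(per_pid.items(), key=lambda item: item[1]["bytes"], reverse=True)


if __name__ == "__main__":
    # Made-up events for illustration only.
    events = [
        {"pid": 4321, "bytes": 131072},
        {"pid": 977, "bytes": 4096},
        {"pid": 4321, "bytes": 131072},
        {"pid": 977, "bytes": 8192},
    ]
    for pid, summary in characterize_by_pid(events):
        print(pid, summary["ios"], "I/Os,", summary["bytes"], "bytes")
```

The same grouping could be keyed on user ID or remote IP address, or bucketed by time to show how the load changes.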
Drill-Down Analysis
If needed, drill-down analysis on the resource and workload can be performed. This involves peeling away layers of software or hardware to find the core of the issue – moving from a high-level view to deeper details.
Static and Dynamic Tracing
I began using the USE Method during 2003, and had drawn up checklists and annotated functional block diagrams to show how to read each metric. These included many question marks for metrics which I couldn’t observe with the tools available at the time. One that particularly bothered me was disk I/O by-process – to characterize load causing disk bottlenecks.
To solve this I developed psio (process status with I/O) using a static kernel tracing framework (prex/tnf) to both trace and summarize disk I/O by process. While it worked, I had many more question marks to go, and prex/tnf only had a few dozen instrumentation points (“probes”).
Then came Dynamic Tracing with DTrace.
The possibilities with DTrace are so vast it can be hard to know where to start. I already had a starting point – those metrics that were previously impossible to see, especially for workload characterization. Many of my first DTrace scripts did this – showing who is doing what. I also rewrote my psio tool in DTrace, then split it into two tools: iosnoop in 2004 and iotop in 2005. These are now available in some form on many OSes.
Dynamic Tracing allows any software function to be traced, timed and examined. For the USE Method, it means that some of the missing metrics can be observed. The most useful of the dynamic probes have become available as static trace points, which improves their interface stability.
Apollo
I said earlier that the USE Method could be applied beyond servers. Looking for a fun example, I thought of a system in which I have no expertise at all, and no idea where to start: the Apollo Lunar Module guidance system. The USE Method provides a simple procedure to try.
The first step is to find a list of resources, or better still, a functional block diagram. I found the following in the “Lunar Module – LM10 Through LM14 Familiarization Manual” (1969):
Some of these components may not exhibit utilization or saturation characteristics. After iterating through them, this can be redrawn to only include relevant components. (I’d also include more: the “erasable storage” section of memory, the “core set area” and “vac area” registers.)
I’ll start with the Apollo guidance computer (AGC) itself. For each metric, I browsed various LM docs to see what might make sense:
- AGC utilization: This could be defined as the number of CPU cycles doing jobs (not the “DUMMY JOB”) divided by the clock rate (2.048 MHz). This metric appears to have been well understood.
- AGC saturation: This could be defined as the number of jobs in the “core set area”, which are seven sets of registers to store program state. These allow a job to be suspended (by the “EXECUTIVE” program – what we’d call a “kernel” these days) if an interrupt for a higher priority job arrives. Once exhausted, the AGC reports a 1202 “EXECUTIVE OVERFLOW-NO CORE SETS” alarm.
- AGC errors: Many alarms are defined. These include a 1203 alarm “WAITLIST OVERFLOW-TOO MANY TASKS”, which is a performance issue of a different type: too many timed tasks are being processed before returning to normal job scheduling.
Some of these details may be familiar to space enthusiasts: 1201 (“NO VAC AREAS”) and 1202 alarms famously occurred during the Apollo 11 descent. (“VAC” is short for “vector accumulator”, extra storage for jobs that process vector quantities; I think Wikipedia’s description as “vacant” may be incorrect.)
Given Apollo 11’s 1201 alarm, the suggested strategies for analysis begin with workload characterization. The workload is mostly applied via interrupts, many of which can be seen in the functional diagram. This includes the rendezvous radar, used to track the Command Module, which was interrupting the AGC with work even though the LM was performing descent. This is an example of finding unnecessary work (or low priority work; some updates from the radar may have been desirable so that the LM AGC could immediately calculate an abort trajectory and CM rendezvous if needed).
As a harder example, I’ll examine the rendezvous radar as a resource. Errors are the easiest to identify. There are three types: “DATA NO GOOD”, “NO TRACK”, and “SHAFT- AND TRUNNION-AXIS ERROR” signals. Utilization is harder: one type may be utilization of the drive motors – defined as the time they were busy responding to angle commands (seen in the functional diagram via the “COUPLING DATA UNIT”). I’ll need to read the LM docs more to see if there are saturation characteristics, either with the drive motors or with the returned radar data.
In a short amount of time, using this methodology, I’ve gone from having no idea where to start, to having specific metrics to look for and research.
Other Methodologies
While the USE Method may find 80% of server issues, latency-based methodologies (eg, Method R) can approach finding 100% of all issues. However, these can take much more time if you are unfamiliar with software internals. They may be more suited for database administrators or application developers, who already have this familiarity. The USE Method is more suited for junior or senior system administrators, whose responsibility and expertise include the operating system (OS) and hardware. It can also be employed by these other staff when a quick check of system health is desired.
Tools Method
For comparison with the USE Method, I’ll describe a tools-based approach (I’ll call this “Tools Method”):
- List available performance tools (optionally install or purchase more).
- For each tool, list useful metrics it provides.
- For each metric, list possible interpretation rules.
The result of this is a prescriptive checklist showing which tool to run, which metrics to read, and how to interpret them. While this can be fairly effective, one problem is that it relies exclusively on available (or known) tools, which can provide an incomplete view of the system. The user is also unaware that they have an incomplete view – and so the problem will remain.
The USE Method, instead, iterates over the system resources to create a complete list of questions to ask, then searches for tools to answer them. A more complete view is constructed, and unknown areas are documented and their existence known (“known unknowns”). Based on USE, a similar checklist can be developed showing which tool to run (where available), which metric to read, and how to interpret it.
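One way to picture the difference is to build the checklist from the resources rather than from the tools, so that gaps show up explicitly. In this sketch the tool strings are taken from the Linux checklist; the function names are my own, and the CPU errors entry is left out purely to illustrate a gap, as if no tool were known on this system.

```python
METRIC_TYPES = ("utilization", "saturation", "errors")


def build_checklist(resources, known_tools):
    """Create one entry per (resource, metric type); None marks a gap."""
    return {
        (resource, metric): known_tools.get((resource, metric))
        for resource in resources
        for metric in METRIC_TYPES
    }


def known_unknowns(checklist):
    """List the questions for which no tool has been found yet."""
    return [key for key, tool in checklist.items() if tool is None]


if __name__ == "__main__":
    known_tools = {
        ("CPU", "utilization"): 'vmstat 1, "id"',
        ("CPU", "saturation"): 'vmstat 1, "r" > CPU count',
        ("Memory capacity", "utilization"): 'free -m, "Mem:"',
        ("Memory capacity", "saturation"): 'vmstat 1, "si"/"so"',
        ("Memory capacity", "errors"): "dmesg",
    }
    checklist = build_checklist(["CPU", "Memory capacity"], known_tools)
    for question in known_unknowns(checklist):
        print("no tool yet for:", question)
```

A tools-first list would contain only the five known entries, with nothing to indicate that a sixth question exists.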
Another problem can be when iterating through a large number of tools distracts from the goal – to find bottlenecks. The USE Method provides a strategy to find bottlenecks and errors efficiently, even with an unwieldy number of available tools and metrics.
Conclusion
The USE Method is a simple strategy you can use to perform a complete check of system health, identifying common bottlenecks and errors. It can be deployed early in the investigation before more time-consuming methodologies are used. The strength of USE is its speed and visibility: by considering all resources, you are unlikely to overlook any issues. Caveat: it will only find certain types of issues – bottlenecks and errors – and should be considered as one tool in a larger toolbox.
In this post, I explained the USE Method, provided generic examples of metrics, and suggested strategies for further analysis of performance issues: workload characterization and drill-down analysis. In follow-up posts, I’ll use the USE Method to develop checklists for specific operating systems.
Acknowledgments
- “Optimizing Oracle Performance” by Cary Millsap and Jeff Holt (2003) describes Method R (and other methodologies), which reminded me recently that I should write this methodology down.
- The PAE and ISV teams at Sun who helped apply the USE Method (before it was named) to the storage appliance series. We drew ASCII functional block diagrams annotated with metric names and bus speeds – these were harder to construct than you’d think (we should have asked the hardware teams for help sooner).
- My students from performance classes several years ago, to whom I taught this methodology and who provided feedback. (And I hope to teach occasional performance classes again at some point.)
- The Virtual AGC project, which became a fun distraction as I read through their document library, hosted by ibiblio.org. Particularly useful were the LMA790-2 “Lunar Module LM-10 Through LM-14 Vehicle Familiarization Manual” (page 48 has the functional block diagram), and the “Apollo Guidance and Navigation Lunar Module Student Study Guide”, which has a good explanation of the EXECUTIVE program including flow charts. (These docs are 109 and 9 Mbytes in size.)
- Deirdré Straughan for helping with another one of my long blog posts.