The USE Method: Solaris Performance Checklist

The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.
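
The loop described above (for every resource, check utilization, saturation and errors) can be sketched in a few lines of Python. This is only an illustration: the `UseReading` type, the `use_findings` function and the 70% utilization threshold are hypothetical names and values chosen for the example, not part of the USE Method itself. Errors are listed first because they are usually the quickest to interpret.

```python
from dataclasses import dataclass


@dataclass
class UseReading:
    """One resource's USE metrics (hypothetical container for this sketch)."""
    resource: str
    utilization_pct: float  # 0-100, averaged over the sampling interval
    saturation: float       # queue length or wait measure; 0 means none seen
    errors: int             # error event count


def use_findings(readings, util_threshold=70.0):
    """Return (resource, metric type) pairs that deserve a closer look.

    util_threshold is an assumed example value, not a recommendation.
    """
    findings = []
    for r in readings:
        if r.errors > 0:
            findings.append((r.resource, "errors"))
        if r.saturation > 0:
            findings.append((r.resource, "saturation"))
        if r.utilization_pct >= util_threshold:
            findings.append((r.resource, "utilization"))
    return findings
```

Anything this returns is a starting point for the further strategies mentioned above, not a diagnosis.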

In this post, I’ll provide an example of a USE-based metric list for the Solaris family of operating systems. I’m writing this for later releases of Solaris 10, Oracle Solaris 11, and illumos-based systems such as SmartOS and OmniOS. This is primarily intended for system administrators of the physical systems (not tenants of cloud or zone instances; I’ll write that checklist later).

Physical Resources

component             type         metric
CPU                   utilization  per-cpu: mpstat 1, “idl”; system-wide: vmstat 1, “id”; per-process: prstat -c 1 (“CPU” == recent), prstat -mLc 1 (“USR” + “SYS”); per-kernel-thread: lockstat -Ii rate, DTrace profile stack()
CPU                   saturation   system-wide: uptime, load averages; vmstat 1, “r”; DTrace dispqlen.d (DTT) for a better “vmstat r”; per-process: prstat -mLc 1, “LAT”
CPU                   errors       fmadm faulty; cpustat (CPC) for whatever error counters are supported (eg, thermal throttling)
Memory capacity       utilization  system-wide: vmstat 1, “free” (main memory), “swap” (virtual memory); per-process: prstat -c, “RSS” (main memory), “SIZE” (virtual memory)
Memory capacity       saturation   system-wide: vmstat 1, “sr” (bad now), “w” (was very bad); vmstat -p 1, “api” (anon page ins == pain), “apo”; per-process: prstat -mLc 1, “DFL”; DTrace anonpgpid.d (DTT), vminfo:::anonpgin on execname
Memory capacity       errors       fmadm faulty and prtdiag for physical failures; fmstat -s -m cpumem-retire (ECC events); DTrace failed malloc()s
Network interfaces    utilization  nicstat (latest version here); kstat; dladm show-link -s -i 1 interface
Network interfaces    saturation   nicstat; kstat for whatever custom statistics are available (eg, “nocanputs”, “defer”, “norcvbuf”, “noxmtbuf”); netstat -s, retransmits
Network interfaces    errors       netstat -i, error counters; dladm show-phys; kstat for extended errors, look in the interface and “link” statistics (there are often custom counters for the card); DTrace for driver internals
Storage device I/O    utilization  system-wide: iostat -xnz 1, “%b”; per-process: DTrace iotop
Storage device I/O    saturation   iostat -xnz 1, “wait”; DTrace iopending (DTT), sdqueue.d (DTB)
Storage device I/O    errors       iostat -En; DTrace I/O subsystem, eg, ideerr.d (DTB), satareasons.d (DTB), scsireasons.d (DTB), sdretry.d (DTB)
Storage capacity      utilization  swap: swap -s; file systems: df -h; plus other commands depending on FS type
Storage capacity      saturation   not sure this one makes sense – once it’s full, ENOSPC
Storage capacity      errors       DTrace; /var/adm/messages file system full messages
Storage controller    utilization  iostat -Cxnz 1, compare to known IOPS/tput limits per-card
Storage controller    saturation   look for kernel queueing: sd (iostat “wait” again), ZFS zio pipeline
Storage controller    errors       DTrace the driver, eg, mptevents.d (DTB); /var/adm/messages
Network controller    utilization  infer from nicstat and known controller max tput
Network controller    saturation   see network interface saturation
Network controller    errors       kstat for whatever is there / DTrace
CPU interconnect      utilization  cpustat (CPC) for CPU interconnect ports, tput / max (eg, see the amd64htcpu script)
CPU interconnect      saturation   cpustat (CPC) for stall cycles
CPU interconnect      errors       cpustat (CPC) for whatever is available
Memory interconnect   utilization  cpustat (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters
Memory interconnect   saturation   cpustat (CPC) for stall cycles
Memory interconnect   errors       cpustat (CPC) for whatever is available
I/O interconnect      utilization  busstat (SPARC only); cpustat for tput / max if available; inference via known tput from iostat/nicstat/…
I/O interconnect      saturation   cpustat (CPC) for stall cycles
I/O interconnect      errors       cpustat (CPC) for whatever is available
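
Several of the system-wide metrics above come from the same `vmstat 1` output: “id” for CPU utilization, “r” for CPU saturation and “sr” for memory saturation. As a rough sketch of reading them together, the Python below parses captured `vmstat` text by column name. The function names are hypothetical, and the `SAMPLE` text uses made-up numbers laid out in the usual Solaris column order purely for illustration. It skips the first data line, since `vmstat` reports the since-boot summary there.

```python
# Illustrative input only: invented values in Solaris vmstat column layout.
SAMPLE = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 5732568 1620644 15 62 0 0 0 0 0 1 0 0 0 411 693 278 1 1 98
 3 0 0 5700000 1600000 10 50 0 0 0 0 0 0 0 0 0 900 2000 600 60 20 20
 1 0 0 5700000 1590000 10 50 0 0 0 0 25 0 0 0 0 800 1800 500 50 10 40
"""


def parse_vmstat(text):
    """Turn vmstat output into a list of {column: int} dicts, one per line.

    "sy" appears twice in the header (faults and cpu); the later one wins
    in the dict, which is fine here because this sketch never reads it.
    """
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "kthr":
            continue  # blank line or the group-title line
        if fields[0] == "r":
            header = fields  # column-name line (repeats periodically)
            continue
        if header and len(fields) == len(header):
            rows.append(dict(zip(header, (int(f) for f in fields))))
    return rows


def use_from_vmstat(rows):
    """Summarize interval samples, dropping the since-boot first line."""
    samples = rows[1:] if len(rows) > 1 else rows
    if not samples:
        raise ValueError("no vmstat samples to summarize")
    return {
        "cpu_utilization_pct": sum(100 - r["id"] for r in samples) / len(samples),
        "cpu_saturation_r": max(r["r"] for r in samples),
        "mem_saturation_sr": max(r["sr"] for r in samples),
    }
```

In practice the text would come from something like `vmstat 1 10` captured to a file; sustained nonzero “r” or “sr” values are the hint to dig further with the per-process tools listed in the table.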

Software Resources

component             type         metric
Kernel mutex          utilization  lockstat -H (held time); DTrace lockstat provider
Kernel mutex          saturation   lockstat -C (contention); DTrace lockstat provider; spinning shows up with dtrace -n 'profile-997 { @[stack()] = count(); }'
Kernel mutex          errors       lockstat -E, eg recursive mutex enter (other errors can cause kernel lockup/panic, debug with mdb -k)
User mutex            utilization  plockstat -H (held time); DTrace plockstat provider
User mutex            saturation   plockstat -C (contention); prstat -mLc 1, "LCK"; DTrace plockstat provider
User mutex            errors       DTrace plockstat and pid providers, for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... see pthread_mutex_lock(3C)
Process capacity      utilization  sar -v, “proc-sz”; kstat, “unix:0:var:v_proc” for max, “unix:0:system_misc:nproc” for current; DTrace (`nproc vs `max_nprocs)
Process capacity      saturation   not sure this makes sense; you might get queueing on pidlinklock in pid_allocate(), as it scans for available slots once the table gets full
Process capacity      errors       “can’t fork()” messages
Thread capacity       utilization  user-level: kstat, “unix:0:lwp_cache:buf_inuse” for current, prctl -n zone.max-lwps -i zone ZONE for max; kernel: mdb -k or DTrace, “nthread” for current, limited by memory
Thread capacity       saturation   threads blocking on memory allocation; at this point the page scanner should be running (vmstat “sr”), else examine using DTrace/mdb.
Thread capacity       errors       user-level: pthread_create() failures with EAGAIN, EINVAL, …; kernel: thread_create() blocks for memory but won’t fail.
File descriptors      utilization  system-wide (no limit other than RAM); per-process: pfiles vs ulimit or prctl -t basic -n process.max-file-descriptor PID; a quicker check than pfiles is ls /proc/PID/fd | wc -l
File descriptors      saturation   does this make sense? I don’t think there is any queueing or blocking, other than on memory allocation.
File descriptors      errors       truss or DTrace (better) to look for errno == EMFILE on syscalls returning fds (eg, open(), accept(), …).
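
For the capacity-style software resources, utilization is simply current usage divided by the limit. As a small sketch for process capacity, the Python below reads the two kstat statistics named in the table from `kstat -p` style output (one `module:instance:name:statistic` key and its value per line). The function names and the `SAMPLE` values are hypothetical, invented for illustration.

```python
# Illustrative input only: invented values in "kstat -p" key/value layout.
SAMPLE = "unix:0:system_misc:nproc\t120\nunix:0:var:v_proc\t30000\n"


def parse_kstat_p(text):
    """Map each 'module:instance:name:statistic' key to its value string."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


def process_capacity_pct(stats):
    """Process table utilization: current processes as a percent of the max."""
    current = int(stats["unix:0:system_misc:nproc"])
    maximum = int(stats["unix:0:var:v_proc"])
    return 100.0 * current / maximum
```

The input could be gathered with `kstat -p unix:0:var:v_proc unix:0:system_misc:nproc`; the same current-over-limit shape applies to the thread and file descriptor rows, with their own sources for each number.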

What’s Next

See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.

Posted on March 1, 2012 at 7:30 am by Brendan Gregg
In: Performance

2 Responses

  1. Written by Kebabbert
    on March 8, 2012 at 1:48 am

    Wow! Great list!!!! Thank you for sharing this info. :o)

  2. Written by Harsha Nippani
    on March 22, 2012 at 8:49 am

    Brendan,
    I have been using dtrace to successfully identify bottlenecks on SUN servers (particularly Global zones with 20-30 containers). I am sure the “USE” method takes it further in quickly isolating performance problems when dealing with resource contention. I like the template that the USE method provides, and we can always expand on it.

    I have been using the “fishbone” methodology in my troubleshooting efforts in enterprise environments. I am now going to leverage both (Fishbone and USE), and I am sure sysadmins will be thrilled to see the quick results that this tool will deliver, particularly when users are screaming about “slowness in the system”.

    Appreciate your breakdown of key metrics.

    -Harsha Nippani
