I recently wanted to gather some numbers on CPU and memory system performance, for AMD64 CPUs. I reached a point where I searched the Internet for other Solaris AMD64 PIC (Performance Instrumentation Counters) analysis, and found little. I hope to improve this with some blog entries. In this part I’ll introduce PIC observability, and demonstrate measuring CPI (cycles per instruction) for different workloads.

To see why PICs are important, the following are the sort of questions that PIC analysis can answer:

  1. What is the Level 2 cache hit rate?
  2. What is the Level 2 cache miss volume?
  3. What is the hit rate and miss volume for the TLB?
  4. What is my memory bus utilization?

Questions 1 and 2 relate to the CPU hardware cache, where Level 2 is the E$ (meaning either “external” cache or “embedded” cache, depending on the CPU architecture). For optimal performance we want to see a high hit rate, and more importantly, a low miss volume.
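To illustrate why miss volume matters more than hit rate, here is a minimal sketch that derives both from a pair of counter values. The function name hit_rate and the sample numbers are hypothetical, and which PICs supply "accesses" and "misses" for a given cache depends on your CPU type.

```shell
# hit_rate ACCESSES MISSES
# Prints the cache hit rate as a percentage, plus the miss volume.
# (Hypothetical helper: the two counts come from whichever PICs your
# CPU provides for cache accesses and cache misses.)
hit_rate() {
    awk -v a="$1" -v m="$2" 'BEGIN {
        if (a <= 0) { print "n/a"; exit }      # avoid dividing by zero
        printf "%.2f%% hits, %.0f misses\n", 100 * (a - m) / a, m
    }'
}

hit_rate 1000000 10000          # prints: 99.00% hits, 10000 misses
hit_rate 1000000000 10000000    # prints: 99.00% hits, 10000000 misses
```

Both calls report the same 99% hit rate, but the second workload suffers a thousand times as many misses, and it is the misses that cost time.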

Question 3 concerns a component of the memory management unit – the translation lookaside buffer (TLB). This processes and caches virtual to physical memory page translations. It can consume a lot of CPU (the worst I’ve seen is 60%), and it can be tuned. A good document for understanding this further is Taming Your Emu by Richard McDougall.

Question 4 seems obvious – the memory bus can be a bottleneck for system performance, so, how utilized is it? Answering this isn’t easy, but it is usually possible by examining CPU PICs.


There are many AMD64 CPU PICs available, which can be viewed using tools such as cpustat and cputrack. Running cpustat -h dumps the list:

# cpustat -h
cpustat [-c events] [-p period] [-nstD] [interval [count]]
-c events specify processor events to be monitored
-n        suppress titles
-p period cycle through event list periodically
-s        run user soaker thread for system-only events
-t        include tsc register
-D        enable debug mode
-h        print extended usage information
Use cputrack(1) to monitor per-process statistics.
CPU performance counter interface: AMD Opteron & Athlon64
event specification syntax:
event[0-3]: FP_dispatched_fpu_ops FP_cycles_no_fpu_ops_retired
FP_dispatched_fpu_ops_ff LS_seg_reg_load
LS_uarch_resync_self_modify LS_uarch_resync_snoop
LS_buffer_2_full LS_locked_operation LS_retired_cflush
LS_retired_cpuid DC_access DC_miss DC_refill_from_L2
DC_refill_from_system DC_copyback DC_dtlb_L1_miss_L2_hit
DC_dtlb_L1_miss_L2_miss DC_misaligned_data_ref
DC_uarch_late_cancel_access DC_uarch_early_cancel_access
DC_1bit_ecc_error_found DC_dispatched_prefetch_instr
DC_dcache_accesses_by_locks BU_memory_requests
BU_data_prefetch BU_system_read_responses
BU_quadwords_written_to_system BU_cpu_clk_unhalted
BU_internal_L2_req BU_fill_req_missed_L2 BU_fill_into_L2
IC_fetch IC_miss IC_refill_from_L2 IC_refill_from_system
IC_itlb_L1_miss_L2_hit IC_itlb_L1_miss_L2_miss
IC_uarch_resync_snoop IC_instr_fetch_stall
IC_return_stack_hit IC_return_stack_overflow
FR_retired_x86_instr_w_excp_intr FR_retired_uops
FR_retired_branches_mispred FR_retired_taken_branches
FR_retired_far_ctl_transfer FR_retired_resyncs
FR_retired_near_rets FR_retired_near_rets_mispred
FR_retired_fpu_instr FR_retired_fastpath_double_op_instr
FR_intr_masked_cycles FR_intr_masked_while_pending_cycles
FR_taken_hardware_intrs FR_nothing_to_dispatch
FR_dispatch_stall_fpu_full FR_dispatch_stall_ls_full
FR_fpu_exception FR_num_brkpts_dr0 FR_num_brkpts_dr1
FR_num_brkpts_dr2 FR_num_brkpts_dr3
NB_mem_ctrlr_page_access NB_mem_ctrlr_page_table_overflow
NB_mem_ctrlr_bypass_counter_saturation NB_ECC_errors
NB_sized_commands NB_probe_result NB_gart_events
NB_ht_bus0_bandwidth NB_ht_bus1_bandwidth
NB_ht_bus2_bandwidth NB_sized_blocks NB_cpu_io_to_mem_io
attributes: edge pc inv cmask umask nouser sys
See Chapter 10 of the "BIOS and Kernel Developer's Guide for the
AMD Athlon 64 and AMD Opteron Processors," AMD publication #26094

There are over seventy names above such as “FP_dispatched_fpu_ops” which describe the PICs available. On my AMD Opteron CPUs you can measure four of these at a time, which can be provided in the arguments to cpustat, eg,

# cpustat -c IC_fetch,DC_access,DC_dtlb_L1_miss_L2_hit,DC_dtlb_L1_miss_L2_miss 0.25
time cpu event      pic0      pic1      pic2      pic3
0.257   0  tick   6406429   8333198     45826      5515
0.257   1  tick   3333442   3942694     24682      4409
0.507   1  tick   6450964   8229104     44046      5713
0.507   0  tick   2359697   2828683     14365      4415
0.757   0  tick   2490406   3060416     16458      4901
0.757   1  tick   7292986   9530806     68956      6490
1.007   0  tick   2514008   3063049     15037      3863
1.007   1  tick   6057048   7747580     42415      6083

In the above example I printed four PICs every 0.25 seconds, for each CPU (I’m on a 2 x virtual CPU server). The CPU column shows that the output is slightly shuffled – a harmless side effect from the way cpustat was coded (it pbinds a libcpc consumer onto each CPU in the available processor set, and all threads write to STDOUT in any order). These PICs are provided by programmable hardware registers – so there is no ideal way around the four-at-a-time limit. You can shuffle measurements between different sets of PICs, which cpustat supports.
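Since the per-CPU lines arrive in any order, it can be easier to read system-wide totals for each interval. The following is a rough sketch of that idea; the function name sum_by_time is hypothetical, and it assumes (as in the output above) that every CPU reports the same timestamp within an interval.

```shell
# sum_by_time: read cpustat output on stdin and print one line per
# interval, with each pic column summed across all CPUs.
# Assumption: lines for the same interval share an identical timestamp.
sum_by_time() {
    awk '
    function emit(   i, line) {
        line = t
        for (i = 4; i <= nf; i++) {
            line = line sprintf(" %.0f", sum[i])
            sum[i] = 0
        }
        print line
    }
    $3 != "tick" { next }               # skip the header and other lines
    seen && $1 != t { emit() }          # timestamp changed: print totals
    {
        t = $1 ""; nf = NF; seen = 1
        for (i = 4; i <= NF; i++) sum[i] += $i
    }
    END { if (seen) emit() }
    '
}

# Demo using the first interval from the output above
sum_by_time <<'EOF'
time cpu event      pic0      pic1      pic2      pic3
0.257   0  tick   6406429   8333198     45826      5515
0.257   1  tick   3333442   3942694     24682      4409
EOF
# prints: 0.257 9739871 12275892 70508 9924
```

On a live system the same function could sit at the end of a cpustat pipeline in place of the here-document.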

Reference Documentation

Since different CPUs provide different PICs, the guide mentioned at the bottom of the cpustat -h output will list what PICs your CPU type provides. It is important to read these guides carefully – for example, PICs that track cache misses may have some exceptions to what is considered a “miss”.

I spent a while with AMD’s #26094 guide, but I found that the PIC descriptions raise more questions than they answer (try to find basics such as “instruction count”). If you find yourself in a similar situation, it can help to create known workloads and then examine which metrics move by a similar amount. I used this approach to confirm which PICs provided cycle counts and instruction counts.

I did eventually find two good resources on AMD PICs,

You may notice some really interesting PICs mentioned, such as memory locality observability in the newer revs of AMD CPUs.

If you are interested in PIC analysis for any CPU type, see chapter 8 “Performance Counters” in Solaris Performance and Tools, by Richard McDougall, Jim Mauro and myself. One of the metrics we made sure to include in the book was CPI (cycles per instruction), as it proves to be a useful starting point for understanding CPU behavior.

Example – CPI

The cycles per instruction metric (sometimes measured as IPC – instructions per cycle) is a useful ratio and (depending on CPU type) fairly easy to measure. If the measured CPI ratio is low, more instructions can be dispatched in a given time, which usually means higher performance. High CPI means instructions are stalling, usually on main memory bus activity.
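As a small worked example of the ratio, this sketch computes CPI and its inverse, IPC, from a cycle count and an instruction count. The function name cpi and the input values are made up for illustration; they are not measurements.

```shell
# cpi CYCLES INSTRUCTIONS
# Prints cycles per instruction (CPI) and instructions per cycle (IPC).
# (Hypothetical helper; the counts would come from two PICs.)
cpi() {
    awk -v c="$1" -v i="$2" 'BEGIN {
        if (c <= 0 || i <= 0) { print "n/a"; exit }
        printf "CPI %.2f IPC %.2f\n", c / i, i / c
    }'
}

cpi 1000 3000    # prints: CPI 0.33 IPC 3.00  (three instructions per cycle)
cpi 8530 1000    # prints: CPI 8.53 IPC 0.12  (mostly stalled)
```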

The output of cpustat can be formatted with a little scripting; the following script “amd64cpiu” uses a little shell and Perl to aggregate and print the output:

#!/bin/sh
#
# amd64cpiu - measure CPI and Utilization on AMD64 processors.
#
# USAGE: amd64cpiu [interval]
#   eg,
#        amd64cpiu 0.1          # for 0.1 second intervals
#
# CPI is cycles per instruction, a metric that increases due to activity
# such as main memory bus lookups.

# ident "@(#)       1.1     07/02/17 SMI"

interval=${1:-1}        # default interval, 1 second

set -- `kstat -p unix:0:system_misc:ncpus`      # assuming no psets,
cpus=$2                                         # number of CPUs

pics='BU_cpu_clk_unhalted'                      # cycles
pics=$pics,'FR_retired_x86_instr_w_excp_intr'   # instructions

# cpustat -t output fields: time cpu event tsc pic0 pic1
/usr/sbin/cpustat -tc $pics $interval | perl -e '
        printf "%16s %8s %8s\n", "Instructions", "CPI", "%CPU";
        while (<>) {
                next if ++$lines == 1;          # skip the header line
                @f = split;
                $total += $f[3];                # tsc: all cycles, busy or idle
                $cycles += $f[4];               # unhalted (busy) cycles
                $instructions += $f[5];         # retired instructions

                # print one summary line once every CPU has reported
                if ((($lines - 1) % '$cpus') == 0) {
                        printf "%16u %8.2f %8.2f\n", $instructions,
                            $instructions ? $cycles / $instructions : 0,
                            $total ? 100 * $cycles / $total : 0;
                        $total = 0;
                        $cycles = 0;
                        $instructions = 0;
                }
        }
'

This script prints a column for CPI and for percent CPU utilization. I’ve used the PICs that were suggested in the AMD article – and from testing they do appear to be the best ones for measuring CPI.

Here amd64cpiu is used to examine a CPU bound workload of fast register based instructions,

# ./
Instructions      CPI     %CPU
16509657954     0.34    97.56
16550162001     0.33    98.54
16523746049     0.34    98.41
16510783100     0.34    98.32
16497135723     0.34    98.29

The CPI is around 0.34. This is about the best (lowest) CPI to be expected from the AMD64 architecture, which attempts to run three instructions per clock cycle, giving a floor of about 0.33.

Now for a memory bound workload of sequential 1 byte memory reads,

# ./
Instructions      CPI     %CPU
4883935299     1.12    97.60
4852961204     1.12    97.03
4884120645     1.13    97.69
4898818096     1.12    97.92
4895064839     1.12    97.80

Things are starting to become slower due to the extra overhead of memory requests. Many reads will be satisfied from the level 1 cache, some from the slower level 2 cache, and occasionally a cache line will be read from main memory. This additional overhead slows us to around 1.12 CPI, and we are running fewer instructions for the same %CPU.

Watch what happens when our memory workload performs 1 byte scattered reads (100 Kbytes apart),

# ./
Instructions      CPI     %CPU
653300388     8.53    98.36
648496314     8.53    98.37
644163952     8.54    97.75
648941939     8.53    98.35
648507176     8.53    98.37

Many of the reads will not be in the CPU caches, and so most now require a memory bus lookup. Our CPI is around 8.53, some 25 times slower than register based CPU instructions. Our %CPU is still around the same, but this buys us fewer instructions in total.
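The "25 times" figure is simply the ratio of the two CPI values. A quick sketch of the arithmetic, using a hypothetical helper named cpi_ratio and the CPI values measured above:

```shell
# cpi_ratio BASE_CPI OTHER_CPI
# Prints how many times more cycles each instruction costs at OTHER_CPI
# compared with BASE_CPI. (Hypothetical helper for illustration.)
cpi_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (a <= 0) { print "n/a"; exit }
        printf "%.1fx\n", b / a
    }'
}

cpi_ratio 0.34 1.12    # sequential reads vs registers: prints 3.3x
cpi_ratio 0.34 8.53    # scattered reads vs registers:  prints 25.1x
```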

As you can see, CPI is shedding light on memory bus activity – it’s very cool, and from such a simple metric.

Now for a real application: Here I watch as Sun’s C compiler chews through a source tree,

# ./
Instructions      CPI     %CPU
2624028943     1.26    58.52
2992167837     1.19    63.17
2327129316     1.26    52.08
2046997158     1.27    46.14
2414376864     1.23    52.80
3305351199     1.23    70.72

That’s not so bad – any memory access instructions must be hitting caches fairly often (something that we can confirm by measuring other PICs).

Beware of output such as the following:

# ./
Instructions      CPI     %CPU
22695257     1.82     0.73
22197894     1.75     0.69
49626271     2.16     1.90
102731779     2.21     4.04
104795796     1.49     2.78

The CPUs are fairly idle (less than 5% utilization), and so CPI is less likely to be useful to indicate performance issues.
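One way to avoid reading too much into idle intervals is to mark them in the output. This is a sketch only: flag_idle is a hypothetical filter, and the default 5% threshold is an arbitrary cut-off chosen to match the example above, not a recommended value.

```shell
# flag_idle [MIN_PCT]
# Reads "Instructions CPI %CPU" lines on stdin and marks any interval
# whose %CPU is below MIN_PCT (default 5, an arbitrary choice).
flag_idle() {
    awk -v min="${1:-5}" '
    $1 !~ /^[0-9]+$/ { print; next }            # pass the header through
    { print $1, $2, $3 (($3 + 0 < min + 0) ? " <- idle" : "") }
    '
}

# Demo with one idle and one busy interval from the outputs above
flag_idle <<'EOF'
Instructions      CPI     %CPU
22695257     1.82     0.73
2624028943     1.26    58.52
EOF
```

The idle line comes out with " <- idle" appended, and the busy line passes through unmarked (with its columns collapsed to single spaces).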

Suggested Actions – CPI

While many PICs produce interesting measurements, it’s much more useful if there is some action we can take based on the results. The following is a list of suggestions based on CPI.

Firstly, to even consider this list you need to have a known and measured performance issue. If one or more CPUs are 100% busy, then that may be a performance issue and it can be useful to check CPI; if your CPUs are idle, then it probably won’t be useful to check. As for a measured performance issue – it can be especially helpful to be able to quantify an issue, eg average latency is 150 ms; tools such as DTrace can take these measurements.

I hope this has been helpful. And there are many more cool metrics to observe on AMD64 CPUs – CPI is just the beginning.

Posted on February 27, 2007 at 6:51 pm by Brendan Gregg · Permalink
In: Performance

One Response

  1. Written by Dieter an Mey
    on January 29, 2008 at 12:21 pm

    Hi Brendan,
    Richard Smith’s presentation has moved.
    Please add a "1" in
    best regards