The Greatest Tool that Never Worked: har

har is the Hardware Activity Reporter. I’ve never seen it work, but it did help me solve a crucial performance issue.

I saw har’s output in an article written in 2001 by Frédéric Parienté, the tool’s author:

# har -r 1 3
mips	bus	cpi	dcm	icm	ecm	dsr	isr	fsr	bsr
2	0.0	3.1	33.7	25.1	15.4	15.6	31.9	0.0	3.1
0	0.0	3.6	31.3	22.8	1.0	24.4	25.4	0.1	6.1
0	0.0	3.3	34.2	21.9	0.0	24.6	19.6	0.0	5.3

har produces a rolling output like vmstat. The har metrics include:

bus percentage utilization of address bus from CPUs  
cpi cycles per instruction
dcm data-cache misses
ecm external-cache misses
dsr data stall rate

While I could access many of these metrics from other tools, har made them much easier to examine. But what really caught my eye was the “bus” metric – showing address bus utilization. I previously didn’t know that this was even possible to measure.

In 2009 I was working on the first ZFS-based storage product, and tuning its performance to be the best in the industry (our competitors included NetApp and EMC). To determine performance limiters, I was working through the USE Method. Checking resource types such as CPU and disks was easy; busses and interconnects were harder.

Using the system functional diagram (pictured right) as a checklist, I knew I had checked all physical components except the busses. Based on the known aggregated throughput to the I/O devices, we weren’t expecting the I/O busses to be an issue. But what about the memory busses and the HyperTransport-based CPU interconnect?

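I measured cycles-per-instruction (CPI), which was over 11 under load. This is high: the CPUs were spending many cycles on each instruction, most likely stalled rather than doing work, which suggested a memory bus issue.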
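
As a minimal sketch of the arithmetic behind that number: CPI is the CPU cycle count divided by the instruction count over the same interval. The counter values below are made up for illustration (they are not measurements from this system), and the function name is hypothetical.

```python
def cycles_per_instruction(cycles, instructions):
    """Return CPI for one sampling interval, or None if no instructions ran."""
    if instructions <= 0:
        return None
    return cycles / instructions


# Hypothetical counter deltas for a one-second interval (illustrative only).
sample_cycles = 2_300_000_000
sample_instructions = 200_000_000

cpi = cycles_per_instruction(sample_cycles, sample_instructions)
print(f"CPI: {cpi:.1f}")
```

The counts themselves would come from the CPU performance counters; how they are read is platform specific and is not shown here.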

That’s when I remembered har – which could report address bus (or memory bus) utilization directly.

har didn’t work on this platform.

But the memory of har – a tool I’d never run – taught me that I wasn’t crazy for wanting to measure this metric. It was possible on another platform. Could it also be measured here? What would it take to port har to this platform? (Assuming I could find the source code – I only had binaries.)

This motivated me to dig through the AMD BIOS and Kernel Developer’s Guide and learn more about the CPU performance counters, knowing that my time spent had a chance of paying off. It wasn’t easy, and required careful testing, but I did eventually develop tools for measuring the throughput and utilization of all the busses: I/O, memory, and CPU interconnect. These tools included amd64htcpu:

 walu# ./amd64htcpu 1
     Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
          0      3170.82       595.28      2504.15         0.00
          1      2738.99      2051.82       562.56         0.00
          2      2218.48         0.00      2588.43         0.00
          3      2193.74      1852.61         0.00         0.00
[...]

Each CPU had four HyperTransport 1 (HT) ports, and the transmit (TX) for each is reported by this tool. It showed that it was the CPU interconnect that was approaching its limit. The decoder for this tool is the following, showing what each number measures:

     Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
          0       CPU0-1        MCP55       CPU0-2         0.00
          1       CPU1-0       CPU1-3         IO55         0.00
          2       CPU2-3       CPU2-3       CPU2-0         0.00
          3       CPU3-2       CPU3-1       CPU3-2         0.00
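
The two listings above can be combined mechanically. The following is a rough sketch, not part of amd64htcpu: it parses the sample output, labels each port using the decoder, and ranks the links by transmit throughput. The function and variable names are hypothetical, and the HT3 column is treated as unused because the decoder shows no link there.

```python
# Sample amd64htcpu output, copied from the listing above.
SAMPLE_OUTPUT = """\
     Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
          0      3170.82       595.28      2504.15         0.00
          1      2738.99      2051.82       562.56         0.00
          2      2218.48         0.00      2588.43         0.00
          3      2193.74      1852.61         0.00         0.00
"""

# Decoder copied from the listing above; None marks the unused HT3 port.
DECODER = {
    0: ["CPU0-1", "MCP55", "CPU0-2", None],
    1: ["CPU1-0", "CPU1-3", "IO55", None],
    2: ["CPU2-3", "CPU2-3", "CPU2-0", None],
    3: ["CPU3-2", "CPU3-1", "CPU3-2", None],
}


def parse_links(text):
    """Yield (socket, port, label, tx_mbps) for every decoded HT port."""
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a five-column data row.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        socket = int(fields[0])
        for port, value in enumerate(fields[1:]):
            label = DECODER[socket][port]
            if label is not None:
                yield socket, port, label, float(value)


# Rank links by transmit throughput, busiest first.
links = sorted(parse_links(SAMPLE_OUTPUT), key=lambda link: link[3], reverse=True)
for socket, port, label, mbps in links[:3]:
    print(f"socket {socket} HT{port} ({label}): {mbps:.2f} MB/s")
```

Turning these throughputs into utilization percentages would also need the maximum bandwidth of each link, which is not stated here.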

Upgrading the system to HyperTransport 3 improved performance between 25% and 75% (I wrote about this previously, including more on amd64htcpu). While the USE Method identified that I should be examining the memory bus, it was the har screenshot that suggested that it was possible.

If you like my amd64htcpu tool (script here), the bad news is that it probably won’t work for you, as it is for a particular platform and OS. But har didn’t work for me, either. It’s valuable to know that measuring a certain metric is even possible: think of these screenshots as proof-of-concepts.

To learn about more tools that probably don’t work for you (especially if you’re using Linux or Windows today), I recommend my book on DTrace, which contains hundreds of tools and screenshots showing what dynamic tracing can do. Many of these won’t even work on a given version of Solaris without tweaking. The book’s value may not lie in the tools, but in the ideas that they encompass, and the screenshots that convey them. Just like har – the greatest tool that never worked (for me).

Posted on May 27, 2013 at 6:34 pm by Brendan Gregg
In: Performance

9 Responses


  1. Written by Guy
    on May 27, 2013 at 7:20 pm

    Brendan, it’s articles like these that make me really glad I found and followed your blog. Always insightful.

  2. Written by Hazz
    on May 27, 2013 at 10:36 pm

    As always, a thumbs up for the detailed notes. I bought the DTrace book but, after one year, have never had the time to put it into practice.

  3. Written by David Collier-Brown
    on May 28, 2013 at 6:32 am

    Excellent thought!
    One of the other sources of unexpected metrics is Teamquest, which gathers up time-related metrics like “(cal_monitor.)controller0 Effective Service Time”, which hints at things that are available but sometimes unexpected.

  4. Written by Frederic Pariente
    on May 29, 2013 at 5:07 am

    Hi Brendan, the HAR source code is available at https://kenai.com/projects/har as long as Project Kenai remains up-n-running. AFAIK HAR itself is still distributed as part of dimStat at http://dimitrik.free.fr/. Hope that you will eventually see it work ;-) It needs porting to more recent hardware though. HTH, Frederic.

  5. Written by Anonymous
    on June 3, 2013 at 6:02 am

    Is it possible to write a generic script that would work on any x86 hardware?

    • Written by Brendan Gregg
      on June 7, 2013 at 7:32 pm

      The PAPI interface for CPU performance counters provides generic names for counters, like PAPI_tot_cyc, PAPI_tot_ins, etc, which should work on any platform that supports them (which isn’t hard to add; once a system supports performance counters, the PAPI names are just extra aliases).

      PAPI support happened long after har, and has been improving in recent years. However, I’ve yet to see PAPI counters for bus utilization or stall cycles, like I used in my amd64 scripts, so this amount of specific detail may only ever be processor specific.

  6. Written by Kyle Hailey
    on June 7, 2013 at 5:25 pm

    Nice info.
    Never thought about measuring cycles per instruction as a proxy for memory bus issues. Cool.
    Never heard of har either.
    There have been times where I wanted to know when the system was running into bus saturation and wasn’t sure how to get that data. In one case, all the components were equivalent except the backplane, and we were getting lower transaction rates on the slower backplane, so it was pretty clear that was the issue. But had I only had the machine with the slower backplane, I wouldn’t have known the issue.

    BTW saw Tufte tweeted one of your blog posts. Awesome :)

    Best Wishes
    Kyle Hailey

  7. Written by norwind
    on June 11, 2013 at 4:18 am

    It is easy now to get QPI utilization, memory controller throughput, and IIO traffic for Intel CPUs. You can get this information from the Intel uncore programming guide.

    • Written by Brendan Gregg
      on June 19, 2013 at 1:19 am

      Great. Do it. Write a little tool that has these columns:

      - QPI utilization, for each port
      - Memory controller throughput, for each port
      - I/O traffic throughput, for each I/O controller (preferably each expander port, but I’ll settle for just the controller)

      I’ve had issues in the past with kernel support for reading the Intel uncore registers. I hope it’s easy now.
