Catching disk latency in the act

Today, Brendan made a very interesting discovery about the potential sources of disk latency in the datacenter. Here’s a video we made of Brendan explaining (and demonstrating) his discovery:



This may seem silly, but it’s not farfetched: Brendan actually made this discovery while exploring drive latency that he had seen in a lab machine due to a missing screw on a drive bracket. (!) Brendan has more details on the discovery, demonstrating how he used the Fishworks analytics to understand and visualize it.

If this has piqued your curiosity about the nature of disk mechanics, I encourage you to read Jon Elerath’s excellent ACM Queue article, Hard disk drives: the good, the bad and the ugly! As Jon notes, noise is a known cause of what is called a non-repeatable runout (NRRO) — though it’s unclear if Brendan’s shouting is exactly the kind of noise-induced NRRO that Jon had in mind…

Posted on December 31, 2008 at 4:57 pm by bmc · Permalink
In: Fishworks

19 Responses


  1. Written by Derek
    on December 31, 2008 at 7:06 pm
    Permalink

    Well, it was bound to happen sometime: Bryan’s incredible energy and hyper-speed spoken output has finally driven his co-workers crazy. ;)
    That’s interesting–exactly how much performance degradation were you seeing? And I mean in transactions or megabytes per second, not time lost due to co-workers rolling around the floor laughing…

  2. Written by Bryan Cantrill
    on December 31, 2008 at 7:40 pm
    Permalink

    Derek,
    Check out Brendan’s blog entry where he has the graphs posted; the hit to throughput is tremendous. (Throughput drops from ~1 GB/sec to practically nothing while Brendan is shouting.) The next experiment is obviously to take the biggest amp we can find and see if sustained loud noise will induce sustained high latency and low bandwidth. Another question we don’t know the answer to: is this due to frequency or volume or some combination? Science demands answers! I’m only half kidding, as there is one question which is legitimately on my mind: can the high noise levels found in most data centers be potentially responsible for NRROs? And can high noise levels shorten drive life? If so, are there ways to configure a datacenter such that this issue is either exacerbated or eliminated? Or is there just something magical about Brendan’s primal scream?

  3. Written by Derek
    on December 31, 2008 at 9:21 pm
    Permalink

    Interesting… I wonder if shouting at a single disk would result in as dramatic a drop in performance? (Now I’m going to have to try that.) Also, it looks like Brendan actually touches the drive brackets when he’s shouting. There’s another whole branch of research right there!
    >>can the high noise levels found in most data centers be potentially responsible for NRROs?
    There’s a thought, although I’d guess (as a complete failure at high school physics) that Brendan’s screaming is more intense and focused than the overall hum in a datacenter. The drives are already stabilized to some degree by way of the bracket and the chassis, so how much further can one stabilize a drive without burying it in bricks?
    Do you folks ever sleep? ;) Keep us posted if you discover anything and have a great (if hoarse) new year.

  4. Written by Kevin Hutchinson
    on December 31, 2008 at 10:58 pm
    Permalink

    As a corollary, can you increase the performance of your JBODs by playing them some gentle Mozart piano sonatas? ;-) Happy New Year!

  5. Written by benr
    on December 31, 2008 at 11:08 pm
    Permalink

    That’s awesome! :)

  6. Written by Karl Rossing
    on December 31, 2008 at 11:25 pm
    Permalink

    So does this mean I should go out and buy some noise cancellation headphones for our storage?
    A more serious question: What would be the effect of vibrations on a datacentre next to a major highway or railway when traffic shakes the datacentre/racks? Or when a bus, truck or car hits a manhole cover next to the datacentre?

  7. Written by Jacob Becker
    on January 1, 2009 at 12:02 am
    Permalink

    Wow, I think why this amazes me so much is that it makes sense when you think about it, but to actually have the instruments to measure it… wow. Nice work guys.

  8. Written by Joerg M.
    on January 1, 2009 at 6:10 am
    Permalink

    Karl, it depends on the ground. There is a reason why chip fabs can’t be built just anywhere: there are examples of defect counts correlating with the time of day because of an urban train line near a fab.
    I would assume that you could measure the effect in hard disks as well. But Brendan’s scream seems to be very effective.
    Maybe three hard disks make a formidable seismometer ;)

  9. Written by Phillip Bruce
    on January 1, 2009 at 12:25 pm
    Permalink

    Now that you have found this on a JBOD, have you tested it on other types of arrays? If so, which ones?
    Just a suggestion:
    1. Use a dB meter to measure your scream.
    2. Use a dB meter to measure the sound of your systems.
    Does the location of the storage device matter, for example whether it sits between other servers or between other storage devices?
    Could placing a storage device near other noisy datacenter infrastructure, such as diesel generators, affect storage in the way you have shown?
    It sounds to me like we need to keep storage devices away from exterior vibrations that could cause data loss. So the question is: should we be thinking about how we lay out our datacenters, given that exterior noise can cause data disruption?

  10. Written by Greg Price
    on January 1, 2009 at 1:17 pm
    Permalink

    > A more serious question: What would be the effect of vibrations on a datacentre next to a
    > major highway or railway when traffic shakes the datacentre/racks? Or when a bus, truck or
    > car hits a manhole cover next to the datacentre?
    I don’t think it’s a concern for two reasons: The axis along which the vibration is delivered won’t be focused as it was in this case, but more importantly the energy level of the vibration experienced by the drive would be a *lot* less, particularly at the frequencies that are likely to cause problems.
    I’ll let you in on a secret – the Fishworks lab is directly on the corner of two busy streets, with bus stops below. It also has a large bus terminus on the opposite side of the street! It’s not an issue.
    > It sounds to me like we need to keep storage devices away from exterior vibrations
    > that could cause data loss. So the question is: should we be thinking about how we lay out
    > our datacenters, given that exterior noise can cause data disruption?
    Keep in perspective the amount of energy and frequency Brendan was directing (with cupped hands touching the drive) versus the energy level experienced in any typical building due to vibration – it’s not a problem unless your disk drive is in front of the speaker stack at a Van Halen concert.

  11. Written by Greg
    on January 1, 2009 at 1:28 pm
    Permalink

    Can you make the video available for direct download? I’d like to show coworkers, but YouTube is blocked.
    Thanks!

  12. Written by Phillip Bruce
    on January 1, 2009 at 3:46 pm
    Permalink

    Yes, a Van Halen concert or any hard rock concert could certainly pose a similar problem.
    Even Jimi Hendrix’s "Purple Haze" will send some good vibrations. :)
    Military environments would certainly need to understand that kind of impact.
    Besides, datacenters are noisy enough, and that video is proof of it, even without the extra screaming at your disks. :)
    Also, if you do download that video you need a VLC-compliant player. You can get one from http://download.videolan.org/pub/videolan/vlc/0.9.8a/vlc-0.9.8a.tar.bz2
    Then you can use the YouTube Video Download Tool to get the video.
    http://www.downloadyoutubevideos.com/
    Phillip

  13. Written by Saravanan
    on January 1, 2009 at 10:07 pm
    Permalink

    Does it mean the existing noise in the data center already causes some disk latency? Or am I asking a stupid question?

  14. Written by Brendan Gregg
    on January 1, 2009 at 11:30 pm
    Permalink

    Derek, there is a throughput drop for one second – but that’s for the disk subsystem from ZFS, not the delivered performance over NFS. Since this is a heavy streaming write test, ZFS is asynchronously flushing data from DRAM to disk, but the clients don’t wait for that to complete. So whether that takes longer may not affect the client application performance at all (it can a little in this case, as it is a constant streaming write.) As for synchronous writes – the 7000 series supports Logzilla, which is SSD and should be immune to vibration (I assume – I’ve never shouted at an SSD to find out. :)
    It’s also worth noting that we believe that disks are more vulnerable to vibration during writes than reads, since for writes the disk must write the data properly – for reads the data must just pass the sector CRC.
    We doubt that data center noise can cause this – the video is shot in a very noisy data center such that I needed to shout the entire time! And we never notice the tell-tale outlier disk latency caused by vibration just from our data center alone (even when the blade server in the neighboring rack is doing POST, which sounds like a jet aircraft.) We only think this happens if you cup your hands to disks and shout very loudly, as they are doing a heavy write workload.
    Still, I’d rather have Analytics to confirm if vibration is an issue or not – which is what the video is about. People may have extreme circumstances where vibration is an issue, but lack the tools to identify it.

  15. Written by Phillip Bruce
    on January 2, 2009 at 12:41 pm
    Permalink

    Brendan,
    I would think SSD would never have an issue with this because it has no moving parts. Still, you can have I/O so heavy that even caching no longer makes sense to use. I have seen folks turn OFF caching simply because of this.
    So, was caching turned on? I would think you get the vibration issue regardless, versus having a drive that is ENTIRELY SSD, on which vibration problems should never occur.
    The only thing you have to worry about with vibrations is how well secured the memory is in the drive unit itself. Why? It would depend on the position of the memory SIMMs in the drives. Example: memory placed flat, as on motherboards, versus being placed vertically in a daughter board configuration. Or a configuration with no sockets at all, with everything soldered to the motherboard, might be a better solution. Most SSDs are still using 200 to 240 pin DIMM sockets. Simply put, if they are not secure enough, for example if a tech doesn’t lock them down, that can be a reason for memory errors to occur, even in other NON-datacenter environments.
    I tend to agree with you about the testing. I don’t know what tool you’re using to generate the I/O; dd, bonnie, Medusa tools, iometer, vdbench or other available tools could be used to test the drive further and see if you get the same results.

  16. Written by Bart
    on January 2, 2009 at 8:16 pm
    Permalink

    Try generating a simple sine wave into a .wav file and playing that out of your laptop into an iPod boombox… it will save Brendan’s voice, and permit more reproducible experiments. I’m curious as to which frequencies cause the problem; the acceleration due to sound waves is clearly preventing the heads from settling…
    - Bart
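
For anyone who wants to try Bart's suggestion, here is a minimal sketch of generating a sine tone into a .wav file using only the Python standard library. The function names, the 16-bit mono format, the 44.1 kHz sample rate, and the half-scale amplitude are all assumptions made for illustration; nothing in the post or the comments says which frequencies or volumes matter, so any values you pass in are guesses to experiment with.

```python
import math
import struct
import wave


def sine_wave_frames(freq_hz, duration_s, sample_rate=44100, amplitude=0.5):
    """Return raw 16-bit little-endian mono PCM bytes for a sine tone.

    amplitude is a fraction of full scale (0.0 to 1.0); 0.5 is an
    arbitrary choice that leaves headroom on the playback device.
    """
    n_samples = round(duration_s * sample_rate)
    peak = int(amplitude * 32767)  # 32767 is the largest 16-bit signed value
    return b"".join(
        struct.pack("<h", int(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_samples)
    )


def write_tone(path, freq_hz, duration_s, sample_rate=44100, amplitude=0.5):
    """Write a sine tone to a WAV file and return the number of samples written."""
    frames = sine_wave_frames(freq_hz, duration_s, sample_rate, amplitude)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return len(frames) // 2
```

Calling something like `write_tone("tone_1000hz.wav", 1000.0, 5.0)` for each frequency of interest would give a set of files to play back one at a time while watching disk latency, which makes the experiment repeatable in a way that shouting is not.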

  17. Written by David Carlton
    on January 2, 2009 at 9:25 pm
    Permalink

    I’ve seen prototype disk arrays where the disks next to the fan had worse performance than the other disks due to vibration issues, too. Took a while to figure out the details there, a pity we didn’t have Fishworks then.

  18. Written by Liz
    on January 6, 2009 at 7:16 am
    Permalink

    Does apologizing and offering it flowers and RAM fix the problem?

  19. Written by Wordpress Themes
    on August 2, 2010 at 6:26 am
    Permalink

    Nice post, and this entry helped me a lot in my college assignment. Thank you for your information.
