Adam Leventhal's blog » ZFS trivia: metaslabs and growing vdevs

ZFS trivia: metaslabs and growing vdevs

Lately, I’ve been rooting around in the bowels of ZFS as we’ve explored some long-standing performance pathologies. To that end I’ve been fortunate to learn at the feet of Matt Ahrens, who was half of the ZFS founding team, and George Wilson, who has forgotten more about ZFS than most people will ever know. I wanted to start sharing some of the interesting details I’ve unearthed.

For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200:

void
vdev_metaslab_set_size(vdev_t *vd)
{
        /*
         * Aim for roughly 200 metaslabs per vdev.
         */
        vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
        vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
}

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev.c#1553

Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or fewer? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.

The space in the vdev is allotted proportionally and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger disk or if an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs — an answer whose simplicity was only obvious in retrospect.

For example, let’s say we start with a 2TB disk; then we’ll have 200 metaslabs of 10GB each. If we then grow the LUN to 4TB, we’ll have 400 metaslabs. If we started instead from a 200GB LUN that we eventually grew to 4TB, we’d end up with 4,000 metaslabs (each 1GB). Further, if we started with a 40TB LUN (why not) and grew it by 100GB, ZFS would not have enough space to allocate a full metaslab and we’d therefore not be able to use that additional space.

At Delphix our metaslabs can become highly fragmented because most of our datasets use an 8K record size (read up on space maps to understand how metaslabs are managed — truly fascinating), and our customers often expand LUNs as a mechanism for adding more space. It’s not clear how much room there is for improvement, but these are curious phenomena that we intend to investigate along with the structure of space maps, the idiosyncrasies of the allocation path, and other aspects of ZFS as we continue to understand and improve performance. Stay tuned.

Posted on November 8, 2012 at 5:24 pm by ahl · Permalink
In: ZFS

8 Responses


  1. Written by Peter Tribble
    on November 9, 2012 at 12:04 pm
    Permalink

    Which makes me wonder what the best strategy is for growing a pool based on LUNs presented from a SAN – add more empty LUNs to the pools, or grow the LUNs you already have?

    • Written by ahl
      on November 9, 2012 at 6:16 pm
      Permalink

We typically advise our customers to expand by growing their LUNs rather than by adding more LUNs. ZFS at present doesn’t handle imbalanced LUNs especially well, particularly when some LUNs are more than 80-90% full. It’s something that we’re actively working on here at Delphix.

  2. Written by David
    on November 11, 2012 at 9:03 pm
    Permalink

    Is one of the reasons we don’t have the ability to shrink a zfs pool today due to the fact that one or more metaslabs may exist towards the end of the pool (in the shrinkage area), and there may be some complexity in dropping the metaslab[s] (not to mention deal with the data)?

    Any plans to deal with the inability to “shrink a zpool”?

    • Written by ahl
      on November 12, 2012 at 4:00 am
      Permalink

      Usually when people talk about shrinking a pool they’re talking about removing entire devices rather than shrinking those devices. Delphix doesn’t have any immediate plans to allow customers to shrink their pool; I don’t think that it’s likely that anyone in the open source community will do it; and I’d say it’s practically impossible that we’d see it in Oracle Solaris.

  3. Written by Ron
    on November 12, 2012 at 9:20 am
    Permalink

    Do the ZFS userspace tools provide a way (or some other interface) to view the size of metaslabs, and how many exist etc?

    • Written by ahl
      on November 12, 2012 at 4:34 pm
      Permalink

      If you can export your pool you can do: zdb -m -e

      On a running system you can do something like this:

# echo ::spa -c | mdb -k | awk -F= '/metaslab_shift/{ print "1<<0x" $2 "=D"; }' | mdb
      134217728

  4. [...] I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A [...]

  5. Written by SmartOS News: Dec 13, 2012 – SmartOS.org
    on December 13, 2012 at 10:43 pm
    Permalink

    [...] fundamentals: transaction groups – “I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A [...]
