Adam Leventhal's blog

ZFS trivia: metaslabs and growing vdevs

November 8, 2012

Lately, I’ve been rooting around in the bowels of ZFS as we’ve explored some long-standing performance pathologies. To that end I’ve been fortunate to learn at the feet of Matt Ahrens, who was half of the ZFS founding team, and George Wilson, who has forgotten more about ZFS than most people will ever know. I wanted to start sharing some of the interesting details I’ve unearthed.

For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200:

void
vdev_metaslab_set_size(vdev_t *vd)
{
        /*
         * Aim for roughly 200 metaslabs per vdev.
         */
        vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
        vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
}

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev.c#1553
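
Two kernel definitions do the heavy lifting here: highbit() returns the 1-based index of the most significant set bit, so 1 << highbit(x) is a power of two no smaller than x, and SPA_MAXBLOCKSHIFT (the shift of the largest supported block size, 128K at the time) keeps a metaslab from ever being smaller than a single maximum-sized block. Here's a user-space sketch of highbit(), my reconstruction rather than the kernel source:

#include <stdint.h>

/*
 * User-space stand-in for the kernel's highbit(): return the 1-based
 * index of the most significant set bit (highbit(1) == 1), or 0 for 0.
 */
static int
highbit(uint64_t x)
{
        int h = 0;

        while (x != 0) {
                h++;
                x >>= 1;
        }
        return (h);
}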

Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or fewer? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.

The space in the vdev is allotted proportionally and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger disk or if an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs — an answer whose simplicity was only obvious in retrospect.

For example, let’s say we start with a 2TB disk; then we’ll have 200 metaslabs of 10GB each. If we then grow the LUN to 4TB, we’ll have 400 metaslabs. If we started instead from a 200GB LUN that we eventually grew to 4TB, we’d end up with 4,000 metaslabs (each 1GB). Further, if we started with a 40TB LUN (why not) and grew it by 100GB, ZFS would not have enough space to allocate a full metaslab, and we’d therefore not be able to use that additional space.
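
To make that arithmetic concrete, here's a small user-space sketch (mine, not ZFS code; it pins the metaslab size at creation time to asize/200 and deliberately ignores the power-of-two rounding that highbit() applies) that computes how many metaslabs fit after each of the expansions described above:

#include <stdio.h>
#include <stdint.h>

#define GB      (1ULL << 30)
#define TB      (1ULL << 40)

/*
 * Illustrative only: the metaslab size is fixed when the vdev is
 * created and never changes as the vdev grows.
 */
static uint64_t
metaslab_size(uint64_t initial_asize)
{
        return (initial_asize / 200);
}

int
main(void)
{
        uint64_t ms;

        /* 2TB vdev grown to 4TB: metaslab size stays ~10GB, count doubles. */
        ms = metaslab_size(2 * TB);
        printf("2TB->4TB:   %llu metaslabs of ~%lluG\n",
            (unsigned long long)(4 * TB / ms), (unsigned long long)(ms / GB));

        /* 200GB vdev grown to 4TB: ~1GB metaslabs, so roughly 4,000 of them. */
        ms = metaslab_size(200 * GB);
        printf("200GB->4TB: %llu metaslabs of ~%lluG\n",
            (unsigned long long)(4 * TB / ms), (unsigned long long)(ms / GB));

        /*
         * 40TB vdev grown by 100GB: the new space is smaller than one
         * ~200GB metaslab, so no new metaslab fits.
         */
        ms = metaslab_size(40 * TB);
        printf("40TB+100GB: %llu new metaslabs (metaslab size ~%lluG)\n",
            (unsigned long long)(100 * GB / ms), (unsigned long long)(ms / GB));

        return (0);
}

The last case prints zero new metaslabs: the extra 100GB is smaller than a single ~200GB metaslab, so none of it is usable.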

At Delphix our metaslabs can become highly fragmented because most of our datasets use an 8K record size (read up on space maps to understand how metaslabs are managed — truly fascinating), and our customers often expand LUNs as a mechanism for adding more space. It’s not clear how much room there is for improvement, but these are curious phenomena that we intend to investigate along with the structure of space maps, the idiosyncrasies of the allocation path, and other aspects of ZFS as we continue to understand and improve performance. Stay tuned.

8 Responses

  1. Which makes me wonder what the best strategy is for growing a pool based on LUNs presented from a SAN – add more empty LUNs to the pools, or grow the LUNs you already have?

    1. We typically advise our customers to expand by growing their LUNs rather than by adding more LUNs. ZFS at present doesn’t handle imbalanced LUNs especially well, particularly when some LUNs are more than 80-90% full. It’s something that we’re actively working on here at Delphix.

  2. Is one of the reasons we don’t have the ability to shrink a ZFS pool today due to the fact that one or more metaslabs may exist towards the end of the pool (in the shrinkage area), and there may be some complexity in dropping the metaslab[s] (not to mention dealing with the data)?

    Any plans to deal with the inability to “shrink a zpool”?

    1. Usually when people talk about shrinking a pool they’re talking about removing entire devices rather than shrinking those devices. Delphix doesn’t have any immediate plans to allow customers to shrink their pool; I don’t think that it’s likely that anyone in the open source community will do it; and I’d say it’s practically impossible that we’d see it in Oracle Solaris.

  3. Do the ZFS userspace tools provide a way (or some other interface) to view the size of metaslabs, how many exist, etc.?

    1. If you can export your pool you can do: zdb -m -e

      On a running system you can do something like this:

      # echo ::spa -c | mdb -k | awk -F= '/metaslab_shift/{ print "1<<0x" $2 "=D"; }' | mdb
      134217728
