Adam Leventhal's blog » ZFS trivia: metaslabs and growing vdevs

ZFS trivia: metaslabs and growing vdevs

Lately, I’ve been rooting around in the bowels of ZFS as we’ve explored some long-standing performance pathologies. To that end I’ve been fortunate to learn at the feet of Matt Ahrens, who was half of the ZFS founding team, and George Wilson, who has forgotten more about ZFS than most people will ever know. I wanted to start sharing some of the interesting details I’ve unearthed.

For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200:

void
vdev_metaslab_set_size(vdev_t *vd)
{
        /*
         * Aim for roughly 200 metaslabs per vdev.
         */
        vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
        vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
}

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev.c#1553

Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or fewer? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.

The space in the vdev is allotted proportionally and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger disk or if an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs — an answer whose simplicity was only obvious in retrospect.

For example, let’s say we start with a 2TB disk; then we’ll have 200 metaslabs of 10GB each. If we then grow the LUN to 4TB, we’ll have 400 metaslabs. If we started instead from a 200GB LUN that we eventually grew to 4TB, we’d end up with 4,000 metaslabs (each 1GB). Further, if we started with a 40TB LUN (why not) and grew it by 100GB, ZFS would not have enough space to allocate a full metaslab and we’d therefore not be able to use that additional space.

At Delphix our metaslabs can become highly fragmented because most of our datasets use an 8K record size (read up on space maps to understand how metaslabs are managed — truly fascinating), and our customers often expand LUNs as a mechanism for adding more space. It’s not clear how much room there is for improvement, but these are curious phenomena that we intend to investigate along with the structure of space maps, the idiosyncrasies of the allocation path, and other aspects of ZFS as we continue to understand and improve performance. Stay tuned.

Posted on November 8, 2012 at 5:24 pm by ahl · Permalink
In: ZFS

8 Responses


  1. Written by Peter Tribble
    on November 9, 2012 at 12:04 pm
    Permalink

    Which makes me wonder what the best strategy is for growing a pool based on LUNs presented from a SAN – add more empty LUNs to the pools, or grow the LUNs you already have?

    • Written by ahl
      on November 9, 2012 at 6:16 pm
      Permalink

We typically advise our customers to expand by growing their LUNs rather than by adding more LUNs. ZFS at present doesn’t handle imbalanced LUNs especially well, particularly when some LUNs are more than 80-90% full. It’s something that we’re actively working on here at Delphix.

  2. Written by David
    on November 11, 2012 at 9:03 pm
    Permalink

    Is one of the reasons we don’t have the ability to shrink a zfs pool today due to the fact that one or more metaslabs may exist towards the end of the pool (in the shrinkage area), and there may be some complexity in dropping the metaslab[s] (not to mention deal with the data)?

    Any plans to deal with the inability to “shrink a zpool”?

    • Written by ahl
      on November 12, 2012 at 4:00 am
      Permalink

      Usually when people talk about shrinking a pool they’re talking about removing entire devices rather than shrinking those devices. Delphix doesn’t have any immediate plans to allow customers to shrink their pool; I don’t think that it’s likely that anyone in the open source community will do it; and I’d say it’s practically impossible that we’d see it in Oracle Solaris.

  3. Written by Ron
    on November 12, 2012 at 9:20 am
    Permalink

    Do the ZFS userspace tools provide a way (or some other interface) to view the size of metaslabs, and how many exist etc?

    • Written by ahl
      on November 12, 2012 at 4:34 pm
      Permalink

      If you can export your pool you can do: zdb -m -e

      On a running system you can do something like this:

# echo ::spa -c | mdb -k | awk -F= '/metaslab_shift/{ print "1<<0x" $2 "=D"; }' | mdb
      134217728

  4. [...] I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A [...]

  5. Written by SmartOS News: Dec 13, 2012 – SmartOS.org
    on December 13, 2012 at 10:43 pm
    Permalink

    [...] fundamentals: transaction groups – “I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A [...]
