ZFS trivia: metaslabs and growing vdevs
Lately, I’ve been rooting around in the bowels of ZFS as we’ve explored some long-standing performance pathologies. To that end I’ve been fortunate to learn at the feet of Matt Ahrens, who was half of the ZFS founding team, and George Wilson, who has forgotten more about ZFS than most people will ever know. I wanted to start sharing some of the interesting details I’ve unearthed.
For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200:
void
vdev_metaslab_set_size(vdev_t *vd)
{
	/*
	 * Aim for roughly 200 metaslabs per vdev.
	 */
	vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
	vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
}
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev.c#1553
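To see how that shift turns into a metaslab size and count, here’s a minimal standalone sketch (my illustration, not the kernel code; this highbit() mirrors the illumos routine, returning the 1-indexed position of the highest set bit):

#include <stdio.h>
#include <stdint.h>

#define SPA_MAXBLOCKSHIFT 17 /* 128K, the largest ZFS block size */

/* 1-indexed position of the highest set bit; mirrors illumos highbit(). */
static int
highbit(uint64_t i)
{
    int h = 0;
    while (i != 0) {
        h++;
        i >>= 1;
    }
    return (h);
}

int
main(void)
{
    uint64_t asize = 2ULL << 40;            /* a 2TB vdev */
    int ms_shift = highbit(asize / 200);    /* aim for ~200 metaslabs */

    if (ms_shift < SPA_MAXBLOCKSHIFT)
        ms_shift = SPA_MAXBLOCKSHIFT;

    printf("metaslab size %llu bytes, %llu metaslabs\n",
        (unsigned long long)(1ULL << ms_shift),
        (unsigned long long)(asize >> ms_shift));
    return (0);
}

Note that because the shift rounds the metaslab size up to a power of two, the actual count lands between 100 and 200 for any reasonably large vdev; for a 2TB vdev this sketch prints 128 metaslabs of 16GB each. The round numbers in the examples below are just for illustration.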
Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or fewer? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.
The space in the vdev is allotted proportionally and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger one or when an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs — an answer whose simplicity was only obvious in retrospect.
For example, let’s say we start with a 2TB disk; then we’ll have 200 metaslabs of 10GB each. If we then grow the LUN to 4TB, we’ll have 400 metaslabs. If we had started instead from a 200GB LUN that we eventually grew to 4TB, we’d end up with 4,000 metaslabs (each 1GB). Further, if we started with a 40TB LUN (why not) and grew it by 100GB, ZFS would not have enough space to allocate a full metaslab and we’d therefore not be able to use that additional space.
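To make the expansion arithmetic concrete, here’s a minimal sketch using the round decimal numbers from that last example (and ignoring the power-of-two rounding of the shift). The key point is that the metaslab size is fixed when the vdev is first sized, so leftover space smaller than one metaslab goes unused:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t GB = 1000000000ULL;  /* round decimal units, as above */
    const uint64_t TB = 1000 * GB;

    /* Metaslab size fixed at creation: 40TB / 200 = 200GB. */
    uint64_t ms_size = 40 * TB / 200;

    /* Growing by less than one metaslab strands the new space. */
    printf("40TB: %llu metaslabs\n",
        (unsigned long long)(40 * TB / ms_size));               /* 200 */
    printf("40TB + 100GB: %llu metaslabs\n",
        (unsigned long long)((40 * TB + 100 * GB) / ms_size));  /* still 200 */
    return (0);
}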
At Delphix our metaslabs can become highly fragmented because most of our datasets use an 8K record size (read up on space maps to understand how metaslabs are managed — truly fascinating), and our customers often expand LUNs as a mechanism for adding more space. It’s not clear how much room there is for improvement, but these are curious phenomena that we intend to investigate along with the structure of space maps, the idiosyncrasies of the allocation path, and other aspects of ZFS as we continue to understand and improve performance. Stay tuned.
In: ZFS · Tagged with: GeorgeWilson, MattAhrens, metaslab, spacemap, ZFS
on November 9, 2012 at 12:04 pm
Which makes me wonder what the best strategy is for growing a pool based on LUNs presented from a SAN – add more empty LUNs to the pool, or grow the LUNs you already have?
on November 9, 2012 at 6:16 pm
We typically advise our customers to expand by growing their LUNs rather than by adding more LUNs. ZFS at present doesn’t handle imbalanced LUNs especially well, particularly when some LUNs are more than 80-90% full. It’s something that we’re actively working on here at Delphix.
on November 11, 2012 at 9:03 pm
Is one of the reasons we don’t have the ability to shrink a ZFS pool today the fact that one or more metaslabs may exist towards the end of the pool (in the shrinkage area), and there may be some complexity in dropping the metaslab[s] (not to mention dealing with the data)?
Any plans to deal with the inability to “shrink a zpool”?
on November 12, 2012 at 4:00 am
Usually when people talk about shrinking a pool they’re talking about removing entire devices rather than shrinking those devices. Delphix doesn’t have any immediate plans to allow customers to shrink their pool; I don’t think that it’s likely that anyone in the open source community will do it; and I’d say it’s practically impossible that we’d see it in Oracle Solaris.
on November 12, 2012 at 9:20 am
Do the ZFS userspace tools provide a way (or is there some other interface) to view the size of metaslabs, how many exist, etc.?
on November 12, 2012 at 4:34 pm
If you can export your pool you can do: zdb -m -e <pool>
On a running system you can do something like this:
# echo ::spa -c | mdb -k | awk -F= '/metaslab_shift/{ print "1<<0x" $2 "=D"; }' | mdb
134217728
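(That value is the metaslab size in bytes: 134217728 = 2^27, i.e. this pool has 128MB metaslabs.)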
on December 13, 2012 at 6:18 am
[...] I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A [...]
on December 13, 2012 at 10:43 pm
[...] fundamentals: transaction groups – “I’ve continued to explore ZFS as I try to understand performance pathologies, and improve performance. A [...]