Adam Leventhal's blog



Lots of jargon sloshes around the conference rooms at tech firms; plenty of it seeps into other domains as well. Most of it is fairly unobjectionable. We’re all happy to be submariners, forever sending pings at each other. Taking things offline is probably preferable to taking them outside. And I’ll patiently wait for data to page into a brain that knows little to nothing about virtual memory. We all collectively look the other way when people utilize things that could have more simply been used, or leverage things that probably didn’t even bear mentioning.

What I can’t stand is resourcing.

Resources can be mined, drilled, or pumped out of the ground. They can be traded on exchanges. You can find them in libraries. You can have closets filled with resources: paper clips, toilet paper, whiteboard markers (but where are the damned erasers?!). You might earn resources from a lucky roll of the dice. Resources are the basic stuff of planning and budgeting. But why oh why do we insist on referring to engineers as resources?

I'll trade my sheep for your ore.

An engineering manager asked me the other day, “does that project have the right resources?” What resources are those? Pens? Computers? Rare earth magnets? No, of course he meant engineers! And referring to engineers as resources suggests that they’re just as interchangeable and just as undifferentiated. While each engineer is not such a delicate snowflake—unique and beautiful—as to preclude some overlap, no engineer wants to be thought of as interchangeable; no engineer should be thought of as interchangeable as few engineers are interchangeable.

The folks in Human Resources at least deign to acknowledge that the resources that preoccupy their tabulations and ministrations are, after all, humans, and for that reason alone worthy of specialization. They attract a different type of specialist than, say, the resource-minders in the IT department who similarly need to keep their resources happy, cool, and supplied with a high bandwidth Internet connection. Yet we are all rendered resources in the eyes of Finance who more than once have let me trade real estate savings for engineering hires. FTEs (our preferred label) are still a unique type of resource, one that tends to appreciate over time. Which is just as well because otherwise we’d all be given away to underprivileged schools after three years, boxed up with the old laptops and other resources.

Referring to our colleagues as resources is dehumanizing, callous, and offensive. Language influences perception; these aren’t cogs, and they can’t be swapped like for like. Treating them like cogs leads to mistakes in judgement and I’ve seen it: smart engineers and smart managers who move columns around in a spreadsheet forgetting that satisfying formulas is only one goal and not the most primary one. These cogs have their own hopes, dreams, faults, and skills.

Let’s kill this one off. Let’s staff projects for success. When we need help let’s ask for additional people, or, if we’re more discerning than that, let’s ask for developers or program managers or masseurs. Managers, let’s manage teams of engineers; let’s learn what makes them different and celebrate those differences rather than guiding them to sameness. While we’re being magnanimous we can even extend this courtesy to contractors—yes, Finance, I know, we don’t pay for the warranty (health care plan). And when possible try to remember a name or two; the resources tend to like it.

Project Mayhem suffers a resourcing gap through unwanted attrition.

Like many programmers I like to try out new languages. After lunch with Alex Crichton, one of the Rust contributors, I started writing my favorite program in Rust. Rust is a “safe” systems language that introduces concepts of data ownership and mutability to semantically prevent whole categories of problems. It’s primarily developed at Mozilla Research in service of a next generation rendering engine, and while I presume that the name is a poke in the eye of Google’s Chrome, no one was brave enough to confirm that lest their next Uber ride reroute them to Bagram.

My standard “hello world” is an anagrammer / Scrabble cheater. Why? In most languages you can get it done in a few dozen lines of code, and it uses a variety of important language and library features: lists, maps, file IO, console IO, strings, sorting, etc. Rust is great, interesting in the way that I found object-oriented or functional programming interesting when I first learned about them. Its notions of data ownership, borrowing, and mutability, I think, lead to some of the same aha moments as closures, for example. I found Rust to be quirky enough though that I thought I might be able to save others the pain of their first program, advancing them to the glorious, safe efficiency of their second by relating my experience.

So with the help of Stack Overflow I wrote the first chunk:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 14
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 17
 18         while b.read_line(&mut s).is_ok() {
 19                 println!("{}", s);
 20         }
 21 }

So far so good? Well I ran it and it didn’t seem to be terminating…

$ time ./scrabble >/dev/null
<time passes>

What’s happening?

$ ./scrabble | head
aa

aa
aah

aa
aah
aahed

aa
thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/stdio.rs:404

Okay — first lesson: String::clear(). As the documentation clearly states, BufReader::read_line() appends to an existing string; my own expectations and preconceptions are beside the point.

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 14
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 17
 18         while b.read_line(&mut s).is_ok() {
 19                 s.pop();
 20                 println!("{}", s);
 21                 s.clear();
 22         }
 23 }

Better? Yes:

$ ./scrabble | head
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark
thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/stdio.rs:404

Correct? No:

$ time ./scrabble >/dev/null
<time passes>

It turns out that BufReader::read_line() indeed is_ok() even at EOF: it returns Ok(0) rather than an error. Again, documented but, to me, counter-intuitive. And it turns out that this is a somewhat divisive topic. No matter; how about an explicit ‘while true’ loop that breaks on an error or an empty read? Well it works, but the ever persnickety rustc finds ‘while true’ too blue-collar of a construct:

$ rustc scrabble.rs
scrabble.rs:18:2: 25:3 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
scrabble.rs:18     while true {
scrabble.rs:19         if !b.read_line(&mut s).is_ok() || s.len() == 0 {
scrabble.rs:20             break;
scrabble.rs:21         }
scrabble.rs:22         s.pop();
scrabble.rs:23         println!("{}", s);
                ...
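For reference, here is a minimal sketch of the shape rustc is nudging toward: a plain loop that clears the buffer before each read and treats a zero-byte read as EOF. The function name read_words and the in-memory Cursor input are my own stand-ins for illustration (they are not part of the program above), and this assumes a reasonably recent compiler.

```rust
use std::io::{self, BufRead, Cursor};

/// Read every line from `reader` using read_line(), stopping at EOF.
/// `read_words` is a hypothetical helper, not part of the original program.
fn read_words<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut words = Vec::new();
    let mut buf = String::new();
    loop {
        // read_line() appends, so start each iteration with an empty buffer.
        buf.clear();
        // Ok(0) means EOF; a real I/O error is propagated by `?`.
        if reader.read_line(&mut buf)? == 0 {
            break;
        }
        // Strip the trailing newline (and any other trailing whitespace).
        words.push(buf.trim_end().to_string());
    }
    Ok(words)
}

fn main() {
    // An in-memory stand-in for the word list file.
    let input = Cursor::new("aa\naah\naahed\n");
    let words = read_words(input).expect("reading from memory should not fail");
    println!("{:?}", words);
}
```

The same function would accept the BufReader wrapped around the file, since it only asks for something that implements BufRead.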

Trying to embrace the fastidious methodology (while ever tempted to unsafe-and-let-execution-be-the-judge) I gave up on read_line() and its controversial EOF and error semantics to try out BufReader::lines():

 18         for s in b.lines() {
 19                 println!("{}", s);
 20         }

$ rustc scrabble2.rs
scrabble2.rs:18:18: 18:19 error: the trait `core::fmt::Display` is not implemented for the type `core::result::Result<collections::string::String, std::io::error::Error>` [E0277]
scrabble2.rs:18         println!("{}", s);
                                       ^
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
scrabble2.rs:18:3: 18:21 note: expansion site
scrabble2.rs:18:18: 18:19 note: `core::result::Result<collections::string::String, std::io::error::Error>` cannot be formatted with the default formatter; try using `:?` instead if you are using a format string
scrabble2.rs:18         println!("{}", s);
                                       ^
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
scrabble2.rs:18:3: 18:21 note: expansion site
error: aborting due to previous error

Okay; that was apparently very wrong. The BufReader::lines() iterator gives us Result<String>s which we need to unwrap(). No problem.

 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }

scrabble.rs:15:6: 15:11 warning: variable does not need to be mutable, #[warn(unused_mut)] on by default
scrabble.rs:15     let mut b = BufReader::new(file);

Fine, rustc, you’re the boss. Now it’s simpler and it’s cranking:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6 use std::collections::HashMap;
  7
  8 fn main() {
  9         let path = Path::new("../word.lst");
 10         let file = match File::open(&path) {
 11                 Err(why) => panic!("failed to open {}: {}", path.display(),
 12                     Error::description(&why)),
 13                 Ok(f) => f,
 14         };
 15
 16         let b = BufReader::new(file);
 17
 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }
 22 }
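As an aside, unwrap() is not the only way to deal with the Results that lines() hands back. A hedged sketch, assuming you would rather surface an I/O error than panic: collecting an iterator of Result<String> into a Result<Vec<String>> stops at the first error. The helper name collect_lines and the Cursor input are illustrative only.

```rust
use std::io::{self, BufRead, Cursor};

/// Gather all lines, returning the first I/O error instead of panicking.
/// `collect_lines` is a hypothetical helper for illustration.
fn collect_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    // lines() yields io::Result<String>; collect() can fold those into
    // a single io::Result<Vec<String>>.
    reader.lines().collect()
}

fn main() {
    let input = Cursor::new("aa\naah\naahed\n");
    match collect_lines(input) {
        Ok(lines) => {
            for s in lines {
                println!("{}", s);
            }
        }
        Err(why) => eprintln!("failed to read: {}", why),
    }
}
```

For a Scrabble cheater the panic is probably fine; this just shows that the Result is there to be used.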

Now let’s build up our map. We’ll create a map from the sorted characters to the list of anagrams. For that we’ll use matching, another handy construct.

 23                 let mut v: Vec<char> = s.chars().collect();
 24                 v.sort();
 25                 let ss: String = v.into_iter().collect();
 26
 27                 match dict.get(&ss) {
 28                         Some(mut v) => v.push(s),
 29                         _ => {
 30                                 let mut v = Vec::new();
 31                                 v.push(s);
 32                                 dict.insert(ss, v);
 33                         },
 34                 }

What could be simpler? I love this language! But not so fast…

scrabble.rs:28:19: 28:20 error: cannot borrow immutable borrowed content `*v` as mutable
scrabble.rs:28             Some(mut v) => v.push(s),
                                           ^
scrabble.rs:32:5: 32:9 error: cannot borrow `dict` as mutable because it is also borrowed as immutable
scrabble.rs:32                 dict.insert(ss, v);
                                ^~~~
scrabble.rs:27:9: 27:13 note: previous borrow of `dict` occurs here; the immutable borrow prevents subsequent moves or mutable borrows of `dict` until the borrow ends
scrabble.rs:27         match dict.get(&ss) {
                              ^~~~
scrabble.rs:34:4: 34:4 note: previous borrow ends here
scrabble.rs:27         match dict.get(&ss) {
...
scrabble.rs:34         }
                        ^
error: aborting due to 2 previous errors

This is where in C I’d start casting away const. Not an option here. Okay, but I remember these notions of ownership, borrowing, and mutability as concepts early in the Rust overview. At the time it seemed like one of those explanations of git that sounds like more of a functional analysis of cryptocurrency. But perhaps there were some important nuggets in there…

Mutability, check! The HashMap::get() yielded an immutable borrow that would exist for as long as its return value was in scope. Easily solved by changing it to a get_mut():

scrabble.rs:32:5: 32:9 error: cannot borrow `dict` as mutable more than once at a time
scrabble.rs:32                 dict.insert(ss, v);
                               ^~~~
scrabble.rs:27:9: 27:13 note: previous borrow of `dict` occurs here; the mutable borrow prevents subsequent moves, borrows, or modification of `dict` until the borrow ends
scrabble.rs:27         match dict.get_mut(&ss) {
                             ^~~~
scrabble.rs:34:4: 34:4 note: previous borrow ends here
scrabble.rs:27         match dict.get_mut(&ss) {
...
scrabble.rs:34         }
                       ^
error: aborting due to previous error

Wrong again. Moving me right down the Kübler-Ross model from anger into bargaining. You’re saying that I can’t mutate it because I can already mutate it? What do I have, rustc, that you want? How about if I pull the insert() out of the context of that get_mut()?

 27                 let mut bb = false;
 28
 29                 match dict.get_mut(&ss) {
 30                         Some(mut v) => v.push(s),
 31                         _ => {
 32                                 bb = true;
 33                         },
 34                 }
 35                 if bb {
 36                         let mut v = Vec::new();
 37                         v.push(s);
 38                         dict.insert(ss, v);
 39                 }

Inelegant, yes, but Rust was billed as safe-C, not elegant-C, right?

scrabble.rs:37:11: 37:12 error: use of moved value: `s`
scrabble.rs:37             v.push(s);
                                  ^
scrabble.rs:30:26: 30:27 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:30             Some(mut v) => v.push(s),
                                                 ^
error: aborting due to previous error

So by pushing the anagram into the list at line 30 we lost ownership, and even though that definitely didn’t happen in the case of us reaching line 37, rustc isn’t having it. Indeed there doesn’t seem to be a way to both get an existing value and to insert a value in one lexical vicinity. At this point I felt like I was in some bureaucratic infinite loop, doomed to shuttle to and fro between windows at the DMV, always holding the wrong form. Any crazy person will immediately be given a mutable map, but asking for a mutable map immediately classifies you as sane.

After walking away for a day to contemplate, here’s the compromise I came to:

 27                 if dict.contains_key(&ss) {
 28                         dict.get_mut(&ss).unwrap().push(s);
 29                 } else {
 30                         let mut v = Vec::new();
 31                         v.push(s);
 32                         dict.insert(ss, v);
 33                 }

And everyone was happy! But it turns out that there’s an even Rustier way of doing this (thanks to Delphix intern, John Ericson) with a very specific API:

                let mut v = dict.entry(sort_str(&s)).or_insert(Vec::new());
                v.push(s);

This is starting to look a lot less like safe C and a lot more like the stacking magic of C++. No matter; I’m just trying to cheat at Scrabble, not debate philosophy. Now that I’ve got my map built, let’s prompt the user and do the lookup. We’ll put the string sorting logic into a function:

  8 fn sort_str(s: String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }

scrabble.rs:32:36: 32:37 error: use of moved value: `s`
scrabble.rs:32             dict.get_mut(&ss).unwrap().push(s);
                                                           ^
scrabble.rs:29:21: 29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:29         let ss = sort_str(s);
                                         ^
scrabble.rs:35:11: 35:12 error: use of moved value: `s`
scrabble.rs:35             v.push(s);
                                  ^
scrabble.rs:29:21: 29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:29         let ss = sort_str(s);
                                         ^
error: aborting due to 2 previous errors

This was wrong because we need to pass s as a reference or else it’s moved into the function and destroyed; this needs to happen both in the function signature and at the call site.

  8 fn sort_str(s: &String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }
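To make the move-versus-borrow distinction concrete, here is a small sketch. It is a variation on the function above rather than the version I used: it takes &str instead of &String, which I understand to be the more general signature since a &String coerces to &str at the call site. The point is only that the caller still owns s after the call.

```rust
/// Return the characters of `s` in sorted order.
/// Takes a borrowed &str (a variation on the &String version above),
/// so the caller keeps ownership of the original string.
fn sort_str(s: &str) -> String {
    let mut v: Vec<char> = s.chars().collect();
    v.sort();
    v.into_iter().collect()
}

fn main() {
    let s = String::from("rust");
    // Borrow `s`; &String coerces to &str here.
    let key = sort_str(&s);
    // `s` was only borrowed, so it can still be used (or moved) afterwards.
    println!("{} -> {}", s, key);
    let moved = s; // moving `s` is still allowed at this point
    println!("{}", moved);
}
```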

As an aside I’d note how goofy I think it is that the absence of a semi-colon denotes function return. And that using an explicit return is sneered at as “un-idiomatic”. I’ve been told that this choice enables deeply elegant constructs with closures and that I’m simply behind the times. Fair enough. Now we’ll read the user-input:

 41         for line in stdin().lock().lines() {
 42                 let s = line.unwrap();
 43
 44                 match dict.get(&sort_str(&s)) {
 45                         Some(v) => {
 46                                 print!("anagrams for {}: ", s);
 47                                 for a in v {
 48                                         print!("{} ", a);
 49                                 }
 50                                 println!("");
 51                         },
 52                         _ => println!("no dice"),
 53                 }
 54         }

scrabble.rs:43:14: 43:21 error: borrowed value does not live long enough
scrabble.rs:43     for line in stdin().lock().lines() {
                               ^~~~~~~
scrabble.rs:43:2: 57:2 note: reference must be valid for the destruction scope surrounding statement at 43:1...
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
scrabble.rs:43:2: 57:2 note: ...but borrowed value is only valid for the statement at 43:1
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
scrabble.rs:43:2: 57:2 help: consider using a `let` binding to increase its lifetime
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
error: aborting due to previous error

Okay! Too cute! Got it. Here’s the final program with some clean up here and there:

use std::fs::File;
use std::path::Path;
use std::error::Error;
use std::io::BufReader;
use std::io::BufRead;
use std::collections::HashMap;
use std::io::stdin;

fn sort_str(s: &String) -> String {
        let mut v: Vec<char> = s.chars().collect();
        v.sort();
        v.into_iter().collect()
}

fn main() {
        let path = Path::new("../word.lst");
        let file = match File::open(&path) {
                Err(why) => panic!("failed to open {}: {}", path.display(),
                    Error::description(&why)),
                Ok(f) => f,
        };

        let b = BufReader::new(file);

        let mut dict: HashMap<String, Vec<String>> = HashMap::new();

        for line in b.lines() {
                let s = line.unwrap();
                dict.entry(sort_str(&s)).or_insert(Vec::new()).push(s);
        }

        let sin = stdin();

        for line in sin.lock().lines() {
                let s = line.unwrap();

                match dict.get(&sort_str(&s)) {
                        Some(v) => {
                                print!("anagrams for {}: ", s);
                                for a in v {
                                        print!("{} ", a);
                                }
                                println!("");
                        },
                        _ => println!("no dice"),
                }
        }
}
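If you want to poke at the map logic without a word list on disk, here is a hedged sketch of the same idea split into functions that work on in-memory data. The names sort_key, build_dict, and anagrams are mine, invented for illustration; or_default() is a standard-library shorthand that I believe behaves like or_insert(Vec::new()) for this map.

```rust
use std::collections::HashMap;

/// Sorted characters of a word; anagrams share the same key.
fn sort_key(word: &str) -> String {
    let mut v: Vec<char> = word.chars().collect();
    v.sort();
    v.into_iter().collect()
}

/// Build the map from sorted characters to the list of anagrams.
/// (Hypothetical helper; the program above does this inline in main.)
fn build_dict<I>(words: I) -> HashMap<String, Vec<String>>
where
    I: IntoIterator<Item = String>,
{
    let mut dict: HashMap<String, Vec<String>> = HashMap::new();
    for w in words {
        // entry() gets or creates the slot in one step, then we push.
        dict.entry(sort_key(&w)).or_default().push(w);
    }
    dict
}

/// Look up the anagrams for a set of letters; empty slice if none.
fn anagrams<'a>(dict: &'a HashMap<String, Vec<String>>, letters: &str) -> &'a [String] {
    dict.get(&sort_key(letters))
        .map(|v| v.as_slice())
        .unwrap_or(&[])
}

fn main() {
    let words = ["tea", "eat", "ate", "dog"].iter().map(|w| w.to_string());
    let dict = build_dict(words);
    for query in ["eta", "god", "xyz"] {
        let found = anagrams(&dict, query);
        if found.is_empty() {
            println!("no dice");
        } else {
            println!("anagrams for {}: {}", query, found.join(" "));
        }
    }
}
```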

Lessons

Rust is not Python. I knew that Rust wasn’t Python… or Java, or Perl, etc. But it still took me a while to remember and embrace that. You have to think about memory management even when you get to do less of it explicitly. For programs with messy notions of data ownership I can see Rust making for significantly cleaner code, easier to understand, and more approachable to new engineers. The concepts of ownership, borrowing, and mutability aren’t “like” anything. It took the mistakes of that first program to teach me that. Hopefully you can skip straight to your second Rust program.

Postscript

Before I posted this I received some suggestions from my colleagues at Delphix about how to improve the final code. I resolved to focus on the process (the journey, if you will) rather than the result. That said, I now realize that I was myself a victim of learning from some poor examples (from Stack Overflow in particular). There’s nothing more durable than poor but serviceable examples; we’ve all seen inefficient copy/pasta littered throughout a code base. So with the help again from John Ericson and the Twitterverse at large here’s my final version as a GitHub gist (if I was going to do it over again I’d stick each revision in GitHub for easier navigation). Happy copying!

I gave a presentation today on the methods and reasons of blogging for Delphix Engineering.

One of my points was that presentations make for simple blog posts–practice what you preach!

We built DTrace to solve problems; at the start, the problems we understood best were our own. In the Solaris Kernel Group we started by instrumenting the kernel and system calls, the user/kernel boundary. Early use required detailed knowledge of kernel internals. As DTrace use grew—within the team, in Sun and then beyond—we extended DTrace to turn every function and every instruction in user programs into probes. We added stable points of instrumentation both in the kernel and in user-land so that no deep knowledge of program or kernel internals would be required.

Oracle has been evolving their port of DTrace to OEL, prioritizing the stable points of instrumentation most relevant for the widest group of users. While DTrace started with providers that unlocked tens of thousands of points of instrumentation, the Oracle port enables a small number of comprehensible probes. Since I last tried out their port they’ve fixed some bugs, and added support for stable I/O and process probes, as well as user-land static probes.

[root@screven ~]# uname -a
Linux screven 3.8.13-16.el6uek.x86_64 #1 SMP Fri Sep 20 11:54:42 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@screven ~]# cat test.d
provider test {
        probe foo(int);
};
[root@screven ~]# cat main.c
#include "test.h"

int
main(int argc, char **argv)
{
        TEST_FOO(100);
        return (0);
}
[root@screven ~]# dtrace -h -s test.d
[root@screven ~]# gcc -c main.c
[root@screven ~]# dtrace -G -s test.d main.o
[root@screven ~]# gcc -o main main.o test.o
[root@screven ~]# dtrace -c ./main -n 'test$target:::foo{ trace(arg0); }'
dtrace: description 'test$target:::foo' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    643                         main:foo               100

USDT, as it’s called, was a relatively late addition in the initial development of DTrace. We added it initially to support probes in user-land locking primitives (the plockstat(1M) command uses it just as the lockstat(1M) command was converted to use kernel SDT probes). We were right in thinking that USDT would be useful for providing probes in infrastructure software such as Apache and MySQL; we didn’t anticipate how incredibly valuable it would be for supporting dynamic languages such as JavaScript (including Node), Python, Java, and bash.

USDT built on both the learning and code from years of DTrace development. By effectively starting there, OEL benefits from a decade of integrations and investigations. DTrace users on all platforms will benefit from the growth of our community. I look forward to seeing the new investigations on OEL and new integrations in all types of applications.

In the frenzied, insular world of a Silicon Valley startup it can be easy to lose perspective on the broader community in which we live and work. Among the great hackathon projects to come from our bi-annual engineering event was the idea of “Angel Sharks”, a group of volunteers at Delphix who provide opportunities for volunteering and community giving. Earlier this year, this group organized volunteer events around the launch of new Delphix releases.

We just completed our first “Week of Giving”. While many at Delphix already donate their time and money, the Angel Sharks organized giving and corporate matching. Our theme for 2014 was hunger; we focused on the SF-Marin Food Bank as our featured organization.

Over 50% of Delphix employees participated worldwide; a high bar that I’d like to see us exceed next year. Some activities of note were volunteering at food banks in the SF Bay Area, Atlanta and Boulder, toy donations to Toys for Tots, the Salvation Army Giving Tree, and the Starlight Foundation, and a silent auction that both brought the Delphix community closer together and raised over $3,000 for the SF-Marin Food Bank. More than $21,000 was raised in total with 30% of employees making matching requests in just three weeks! The Week of Giving brought a great energy and community spirit to the company; I’m excited to have giving as part of our DNA as a young company.

The SF-Marin Food Bank feeds 225,000 people annually with 47m lbs of food, and 96% of donations go directly to their programs. Donations are down for the year while need has increased by 1m lbs. You can donate here. I volunteered twice this year with my Delphix colleagues, and once with my wife and son (8 years old); I highly recommend it for both corporate and family outings.

Happy holidays from the Delphix family!

 

In previous posts I discussed the problems with the legacy ZFS write throttle that cause degraded performance and wildly variable latencies. I then presented the new OpenZFS write throttle and I/O scheduler that Matt Ahrens and I designed. In addition to solving several problems in ZFS, the new approach was designed to be easy to reason about, measure, and adjust. In this post I’ll cover performance analysis and tuning — using DTrace of course. These details are intended for those using OpenZFS and trying to optimize performance — if you have only a casual interest in ZFS consider yourself warned!

Buffering dirty data

OpenZFS limits the amount of dirty data on the system according to the tunable zfs_dirty_data_max. Its default value is 10% of memory up to 4GB. The tradeoffs are pretty simple:

Lower | Higher
Less memory reserved for use by OpenZFS | More memory reserved for use by OpenZFS
Able to absorb less workload variation before throttling | Able to absorb more workload variation before throttling
Less data in each transaction group | More data in each transaction group
Less time spent syncing out each transaction group | More time spent syncing out each transaction group
More metadata written due to less amortization | Less metadata written due to more amortization

Most workloads contain variability. Think of the dirty data as a buffer for that variability. Let’s say the LUNs assigned to your OpenZFS storage pool are able to sustain 100MB/s in aggregate. If a workload consistently writes at 100MB/s then only a very small buffer would be required. If instead the workload oscillates between 200MB/s and 0MB/s for 10 seconds each, then a small buffer would limit performance. A buffer of 800MB would be large enough to absorb the full 20 second cycle over which the average is 100MB/s. A buffer of only 200MB would cause OpenZFS to start to throttle writes — inserting artificial delays — after less than 2 seconds during which the LUNs could flush 200MB of dirty data while the client tried to generate 400MB.

Track the amount of outstanding dirty data within your storage pool to know which way to adjust zfs_dirty_data_max:

txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}

txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool
dtrace: script 'dirty.d' matched 2 probes
CPU ID FUNCTION:NAME
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used

The write throttle kicks in once the amount of dirty data exceeds zfs_delay_min_dirty_percent of the limit (60% by default). If the amount of dirty data fluctuates above and below that threshold, it might be possible to avoid throttling by increasing the size of the buffer. If the metric stays low, you may reduce zfs_dirty_data_max. Weigh this tuning against other uses of memory on the system (a larger value means that there’s less memory for applications or the OpenZFS ARC for example).
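To put numbers on that threshold, here is a rough back-of-the-envelope sketch. It is not OpenZFS code: the function name is made up, the arithmetic is done in whole megabytes, and the sample values are simply the ones from the dirty.d output above, checked against 60% of a 4096MB limit.

```rust
/// Dirty-data level (in MB) above which the write throttle would start
/// delaying writes, given the limit and the delay percentage.
/// Hypothetical helper; integer math, so the result is rounded down.
fn throttle_threshold_mb(dirty_data_max_mb: u64, delay_min_dirty_percent: u64) -> u64 {
    dirty_data_max_mb * delay_min_dirty_percent / 100
}

fn main() {
    let max_mb = 4096; // zfs_dirty_data_max from the sample output
    let threshold = throttle_threshold_mb(max_mb, 60); // 60% default
    println!("throttle threshold: {}MB", threshold);

    // Samples taken from the txg-syncing output above.
    for dirty in [966u64, 774, 954, 888, 858] {
        let state = if dirty > threshold {
            "above threshold"
        } else {
            "below threshold"
        };
        println!("{:4}MB of {:4}MB used: {}", dirty, max_mb, state);
    }
}
```

In that sample every reading sits well below the threshold, which by the reasoning above is the case where reducing zfs_dirty_data_max might be worth considering.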

A larger buffer also means that flushing a transaction group will take longer. This is relevant for certain OpenZFS administrative operations (sync tasks) that occur when a transaction group is committed to stable storage such as creating or cloning a new dataset. If the interactive latency of these commands is important, consider how long it would take to flush zfs_dirty_data_max bytes to disk. You can measure the time to sync transaction groups (recall, there are up to three active at any given time) like this:

txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

# dtrace -s duration.d pool
dtrace: script 'duration.d' matched 2 probes
CPU ID FUNCTION:NAME
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG). zfs_dirty_data_max puts a hard limit on the amount of data in memory that has not yet been written to the main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast devices such as the DDRDrive make for great log devices. As an aside, consider the ostensible upgrade that Oracle brought to the ZFS Storage Appliance a few years ago replacing the 18GB “Logzilla” with a 73GB upgrade.

I/O scheduler

Where ZFS had a single IO queue for all IO types, OpenZFS has five IO queues, one for each of the different IO types: sync reads (for normal, demand reads), async reads (issued from the prefetcher), sync writes (to the intent log), async writes (bulk writes of dirty data), and scrub (scrub and resilver operations). Note that bulk dirty data described above are scheduled in the async write queue. See vdev_queue.c for the related tunables:

uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

Each of these queues has tunable values for the min and max number of outstanding operations of the given type that can be issued to a leaf vdev (LUN). The tunable zfs_vdev_max_active limits the number of IOs issued to a single vdev. If its value is less than the sum of the zfs_vdev_*_max_active tunables, then the minimums come into play. The minimum number of each queue will be scheduled and the remainder of zfs_vdev_max_active is issued from the queues in priority order.
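The two-pass selection described above can be sketched roughly as follows. This is a simplified illustration, not the code from vdev_queue.c: the enum, the struct, and the function name are hypothetical, the priority order is assumed to match the order of the tunables listed earlier, and the async write maximum is treated as fixed here even though the real scheduler varies it with the amount of dirty data.

```c
#include <stddef.h>

/* Illustrative I/O classes; this order is ASSUMED to be the priority order. */
enum io_class {
	SYNC_READ, SYNC_WRITE, ASYNC_READ, ASYNC_WRITE, SCRUB, NUM_CLASSES
};

struct class_limits {
	unsigned min_active;
	unsigned max_active;
};

/* Defaults copied from the tunables listed above. */
static const struct class_limits default_limits[NUM_CLASSES] = {
	[SYNC_READ] = { 10, 10 },
	[SYNC_WRITE] = { 10, 10 },
	[ASYNC_READ] = { 1, 3 },
	[ASYNC_WRITE] = { 1, 10 },
	[SCRUB] = { 1, 2 },
};

/*
 * Pick the class to issue from next, or -1 if nothing can be issued.
 * pending[c] is the number of queued I/Os of class c; active[c] is the
 * number already outstanding to the device.
 */
static int
next_class_to_issue(const struct class_limits *limits,
    const unsigned *pending, const unsigned *active, unsigned vdev_max_active)
{
	unsigned total = 0;
	int c;

	for (c = 0; c < NUM_CLASSES; c++)
		total += active[c];
	if (total >= vdev_max_active)
		return (-1);

	/* First pass: any class that has not yet reached its minimum. */
	for (c = 0; c < NUM_CLASSES; c++)
		if (pending[c] > 0 && active[c] < limits[c].min_active)
			return (c);

	/* Second pass: fill the remainder, in priority order, up to each max. */
	for (c = 0; c < NUM_CLASSES; c++)
		if (pending[c] > 0 && active[c] < limits[c].max_active)
			return (c);

	return (-1);
}
```

With the defaults, a pending scrub I/O is issued ahead of a second async read while no scrub I/O is outstanding, because scrub has not yet reached its minimum of one.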

At a high level, the appropriate values for these tunables will be specific to your LUNs. Higher maximums lead to higher throughput with potentially higher latency. On some devices such as storage arrays with distinct hardware for reads and writes, some of the queues can be thought of as independent; on other devices such as traditional HDDs, reads and writes will likely impact each other.

A simple way to tune these values is to monitor I/O throughput and latency under load. Increase values by 20-100% until you find a point where throughput no longer increases, but latency is acceptable.

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

io:::start
{
        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;
}

io:::done
/ts[args[0]->b_edev, args[0]->b_lblkno]/
{
        this->delta = (timestamp - ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000;
        this->name = (args[0]->b_flags & (B_READ | B_WRITE)) == B_READ ?
            "read " : "write ";

        @q[this->name] = quantize(this->delta);
        @a[this->name] = avg(this->delta);
        @v[this->name] = stddev(this->delta);
        @i[this->name] = count();
        @b[this->name] = sum(args[0]->b_bcount);

        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;
}

END
{
        printa(@q);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@b, (timestamp - start) / 1000000000 * 1024);

        printf("%-30s %11s %11s %11s %11s\n", "", "avg latency", "stddev",
            "iops", "throughput");
        printa("%-30s %@9uus %@9uus %@9u/s %@8uk/s\n", @a, @v, @i, @b);
}

# dtrace -s rw.d -c 'sleep 60'

  read
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         23
             128 |@                                        655
             256 |@@@@                                     1638
             512 |@@                                       743
            1024 |@                                        380
            2048 |@@@                                      1341
            4096 |@@@@@@@@@@@@                             5295
            8192 |@@@@@@@@@@@                              5033
           16384 |@@@                                      1297
           32768 |@@                                       684
           65536 |@                                        400
          131072 |                                         225
          262144 |                                         206
          524288 |                                         127
         1048576 |                                         19
         2097152 |                                         0

  write
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         47
             128 |                                         469
             256 |                                         591
             512 |                                         327
            1024 |                                         924
            2048 |@                                        6734
            4096 |@@@@@@@                                  43416
            8192 |@@@@@@@@@@@@@@@@@                        102013
           16384 |@@@@@@@@@@                               60992
           32768 |@@@                                      20312
           65536 |@                                        6789
          131072 |                                         860
          262144 |                                         208
          524288 |                                         153
         1048576 |                                         36
         2097152 |                                         0

                               avg latency      stddev        iops  throughput
write                              19442us     32468us      4064/s   261889k/s
read                               23733us     88206us       301/s    13113k/s

Async writes

Dirty data governed by zfs_dirty_data_max is written to disk via async writes. The I/O scheduler treats async writes a little differently than other operations. The number of concurrent async writes scheduled depends on the amount of dirty data on the system. Recall that there is a fixed (but tunable) limit of dirty data in memory. With a small amount of dirty data, the scheduler will only schedule a single operation (zfs_vdev_async_write_min_active); the idea is to preserve low latency of synchronous operations when there isn’t much write load on the system. As the amount of dirty data increases, the scheduler will push the LUNs harder to flush it out by issuing more concurrent operations.

The old behavior was to schedule a fixed number of operations regardless of the load. This meant that the latency of synchronous operations could fluctuate significantly. While writing out dirty data ZFS would slam the LUNs with writes, contending with synchronous operations and increasing their latency. After the syncing transaction group had completed, there would be a period of relatively low async write activity during which synchronous operations would complete more quickly. This phenomenon was known as “picket fencing” due to the square wave pattern of latency over time. The new OpenZFS I/O scheduler is optimized for consistency.

In addition to tuning the minimum and maximum number of concurrent operations sent to the device, there are two other tunables related to asynchronous writes: zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent. Along with the min and max operation counts (zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active), these four tunables define a piece-wise linear function that determines the number of operations scheduled as depicted in this lovely ASCII art graph excerpted from the comments:

 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent

In a relatively steady state we’d like to see the amount of outstanding dirty data stay in a narrow band between the min and max percentages, by default 30% and 60% respectively.
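The shape in that graph is simple enough to write down. Here is a minimal sketch of the interpolation, assuming integer arithmetic and taking all four tunables plus zfs_dirty_data_max as parameters; the function and parameter names are mine, and the real function in vdev_queue.c may differ in its details.

```c
#include <stdint.h>

/*
 * Number of concurrent async writes to allow for a given amount of dirty
 * data, following the piece-wise linear function in the graph above.
 * All arguments stand in for the corresponding tunables.
 */
static uint32_t
max_async_writes(uint64_t dirty, uint64_t dirty_data_max,
    uint32_t min_active, uint32_t max_active,
    uint64_t min_dirty_percent, uint64_t max_dirty_percent)
{
	uint64_t min_bytes = dirty_data_max * min_dirty_percent / 100;
	uint64_t max_bytes = dirty_data_max * max_dirty_percent / 100;

	/* Flat region on the left (also guards against a degenerate range). */
	if (dirty <= min_bytes || max_active <= min_active)
		return (min_active);
	/* Flat region on the right. */
	if (dirty >= max_bytes)
		return (max_active);

	/* Sloped region: linear interpolation between the two points. */
	return ((uint32_t)((dirty - min_bytes) * (max_active - min_active) /
	    (max_bytes - min_bytes)) + min_active);
}
```

With the defaults (1 and 10 operations, 30% and 60%), dirty data sitting at 45% of zfs_dirty_data_max yields 5 concurrent async writes.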

Tune zfs_vdev_async_write_max_active as described above to maximize throughput without hurting latency. The only reason to increase zfs_vdev_async_write_min_active is if additional writes have little to no impact on latency. While this could be used to make sure data reaches disk sooner, an alternative approach is to decrease zfs_vdev_async_write_active_min_dirty_percent thereby starting to flush data despite less dirty data accumulating.

To tune the min and max percentages, watch both latency and the number of scheduled async write operations. If the operation count fluctuates wildly and impacts latency, you may want to flatten the slope by decreasing the min and/or increasing the max (note that you will likely want to increase zfs_delay_min_dirty_percent if you increase zfs_vdev_async_write_active_max_dirty_percent; see below).

#pragma D option aggpack
#pragma D option quiet

fbt::vdev_queue_max_async_writes:entry
{
        self->spa = args[0];
}
fbt::vdev_queue_max_async_writes:return
/self->spa && self->spa->spa_name == $$1/
{
        @ = lquantize(args[1], 0, 30, 1);
}

tick-1s
{
        printa(@);
        clear(@);
}

fbt::vdev_queue_max_async_writes:return
/self->spa/
{
        self->spa = 0;
}

# dtrace -s q.d dcenter

min .--------------------------------. max | count
< 0 : ▃▆ : >= 30 | 23279

min .--------------------------------. max | count
< 0 : █ : >= 30 | 18453

min .--------------------------------. max | count
< 0 : █ : >= 30 | 27741

min .--------------------------------. max | count
< 0 : █ : >= 30 | 3455

min .--------------------------------. max | count
< 0 : : >= 30 | 0

Write delay

In situations where LUNs cannot keep up with the incoming write rate, OpenZFS artificially delays writes to ensure consistent latency (see the previous post in this series). Until a certain amount of dirty data accumulates there is no delay. When enough dirty data accumulates OpenZFS gradually increases the delay. By delaying writes OpenZFS effectively pushes back on the client to limit the rate of writes by forcing artificially higher latency. There are two tunables that pertain to delay: how much dirty data there needs to be before the delay kicks in, and the factor by which that delay increases as the amount of outstanding dirty data increases.

The tunable zfs_delay_min_dirty_percent determines when OpenZFS starts delaying writes. The default is 60%; note that we don’t start delaying client writes until the IO scheduler is pushing out data as fast as it can (zfs_vdev_async_write_active_max_dirty_percent also defaults to 60%).

The other relevant tunable, zfs_delay_scale, is really the only magic number here. It roughly corresponds to the inverse of the maximum number of operations per second (denominated in nanoseconds), and is used as a scaling factor.
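To make the role of the scaling factor concrete, here is a sketch of the delay curve as I understand it: the delay is zero below the threshold, then grows as zfs_delay_scale * (dirty - min) / (max - dirty), capped at zfs_delay_max_ns. Treat the exact formula as an assumption to check against the source for your platform; the function name and parameter names are mine.

```c
#include <stdint.h>

/*
 * Per-write delay in nanoseconds for a given amount of dirty data.
 * Arguments stand in for zfs_dirty_data_max, zfs_delay_min_dirty_percent,
 * zfs_delay_scale, and zfs_delay_max_ns. The formula is an ASSUMPTION
 * based on my reading of the implementation.
 */
static uint64_t
write_delay_ns(uint64_t dirty, uint64_t dirty_data_max,
    uint64_t delay_min_dirty_percent, uint64_t delay_scale,
    uint64_t delay_max_ns)
{
	uint64_t delay_min_bytes =
	    dirty_data_max * delay_min_dirty_percent / 100;
	uint64_t delay;

	/* No delay until enough dirty data has accumulated. */
	if (dirty <= delay_min_bytes)
		return (0);
	/* At or past the limit, use the maximum delay. */
	if (dirty >= dirty_data_max)
		return (delay_max_ns);

	delay = delay_scale * (dirty - delay_min_bytes) /
	    (dirty_data_max - dirty);
	return (delay < delay_max_ns ? delay : delay_max_ns);
}
```

One handy property of this curve: when the amount of dirty data is exactly halfway between the delay threshold and zfs_dirty_data_max, the delay equals zfs_delay_scale. Doubling zfs_delay_scale doubles the delay at every point below the cap.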

Delaying writes is an aggressive step to ensure consistent latency. It is required if the client really is pushing more data than the system can handle, but unnecessarily delaying writes degrades overall throughput. There are two goals to tuning delay: reduce or remove unnecessary delay, and ensure consistent delays when needed.

First check to see how often writes are delayed. This simple DTrace one-liner does the trick:

# dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'

If a relatively small percentage of writes are delayed, consider increasing the amount of dirty data allowed (zfs_dirty_data_max) or even pushing out the point at which delays start (zfs_delay_min_dirty_percent). When increasing zfs_dirty_data_max consider the other users of DRAM on the system, and also note that a small number of small delays does not impact performance significantly.

If many writes are being delayed, the client really is trying to push data faster than the LUNs can handle. In that case, check for consistent latency, again, with a DTrace one-liner:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'

If the variance is high, or if many write operations are being delayed for the maximum zfs_delay_max_ns (100ms by default), try increasing zfs_delay_scale by a factor of 2 or more, or try delaying earlier by reducing zfs_delay_min_dirty_percent (remember to also reduce zfs_vdev_async_write_active_max_dirty_percent).

Summing up

Our experience at Delphix tuning the new write throttle has been so much better than in the old ZFS world: each tunable has a clear and comprehensible purpose, their relationships are well-defined, and the issues in tension pulling values up or down are both easy to understand and — most importantly — easy to measure. I hope that this tuning guide helps others trying to get the most out of their OpenZFS systems whether on Linux, FreeBSD, Mac OS X, illumos — not to mention the support engineers for the many products that incorporate OpenZFS into a larger solution.
