Delphix Technology Scholarship for Women

Walk around almost any software development shop or university CS department and you’ll be struck by the underrepresentation of women. At least you would be were this not an expected norm of our industry. And of course much has been written about this recently hot topic in Silicon Valley. What do companies and organizations do about it? At Delphix our culture is one of focus and purpose; our approach to diversity follows in that spirit.

We’re now in the third year of offering the Delphix Scholarship for Women, our contribution to encouraging more women to pursue a technical degree and enter the industry. The scholarship recognizes women whose projects embody some of the cultural attributes we value highly at Delphix such as exploration, creativity, and end-to-end ownership.

In previous years we’ve particularly recognized women whose projects addressed needs they saw first-hand, solutions that they wanted to use themselves. We love seeing how applicants identify problems and work creatively on novel solutions, owning the whole process from inception to delivery. Submissions close on Sunday, November 15, 2015 at 11:59pm PST; I can’t wait to see what you come up with.

Posted on November 5, 2015 at 11:04 pm by ahl · Permalink · Leave a comment
In: Delphix

Delphix Sync

Oracle OpenWorld is always a huge event for us at Delphix; it brings together our customers and partners like nothing else. And so it was the perfect venue to launch our new Delphix user group: Sync.

In its history Delphix has learned best from our customers. As I heard Marc Benioff opine at Dropbox’s inaugural event this morning, customers are the greatest source of a company’s innovation. At Delphix we’ve built an incredible multi-tool; our customers have taught us dozens of new ways of using it to deliver value to their organizations. Further, the most compelling education has always been user-to-user. Sync is the forum for local, global, live, and virtual events to build connections between users as well as directly into Delphix development.

Our Sync event last week featured customer speakers from McGraw Hill Financial, US Foods and Stubhub. A major theme was Delphix as a catalyst for devops implementation in the enterprise. We also shared performance best practices from Vinay Srihari, our VP of architecture, and engaged in a discussion about the product roadmap with deep-dives in particular areas.

There are many more events to come. We’ve already got a virtual event on masking and a local Seattle event in the works. I look forward to growing the community of Delphix users, and bringing them together. Stay connected or offer to host a Sync event by sending mail to

Posted on November 4, 2015 at 11:14 pm by ahl · Permalink · Leave a comment
In: Delphix

I am not a resource

Lots of jargon sloshes around the conference rooms at tech firms; plenty of it seeps into other domains as well. Most of it is fairly unobjectionable. We’re all happy to be submariners, forever sending pings at each other. Taking things offline is probably preferable to taking them outside. And I’ll patiently wait for data to page into a brain that knows little to nothing about virtual memory. We all collectively look the other way when people utilize things that could have more simply been used, or leverage things that probably didn’t even bear mentioning.

What I can’t stand is resourcing.

Resources can be mined, drilled, or pumped out of the ground. They can be traded on exchanges. You can find them in libraries. You can have closets filled with resources: paper clips, toilet paper, white board markers (but where are the damned erasers?!). You might earn resources from a lucky roll of the dice. Resources are the basic stuff of planning and budgeting. But why oh why do we insist on referring to engineers as resources?

I'll trade my sheep for your ore.

An engineering manager asked me the other day, “does that project have the right resources?” What resources are those? Pens? Computers? Rare earth magnets? No, of course he meant engineers! And referring to engineers as resources suggests that they’re just as interchangeable and just as undifferentiated. While each engineer is not such a delicate snowflake—unique and beautiful—as to preclude some overlap, no engineer wants to be thought of as interchangeable; no engineer should be thought of as interchangeable as few engineers are interchangeable.

The folks in Human Resources at least deign to acknowledge that the resources that preoccupy their tabulations and ministrations are, after all, humans, and for that reason alone worthy of specialization. They attract a different type of specialist than, say, the resource-minders in the IT department who similarly need to keep their resources happy, cool, and supplied with a high bandwidth Internet connection. Yet we are all rendered resources in the eyes of Finance who more than once have let me trade real estate savings for engineering hires. FTEs (our preferred label) are still a unique type of resource, one that tends to appreciate over time. Which is just as well because otherwise we’d all be given away to underprivileged schools after three years, boxed up with the old laptops and other resources.

Referring to our colleagues as resources is dehumanizing, callous, and offensive. Language influences perception; these aren’t cogs, and they can’t be swapped like for like. Treating them like cogs leads to mistakes in judgement and I’ve seen it: smart engineers and smart managers who move columns around in a spreadsheet forgetting that satisfying formulas is only one goal and not the most important one. These cogs have their own hopes, dreams, faults, and skills.

Let’s kill this one off. Let’s staff projects for success. When we need help let’s ask for additional people, or, if we’re more discerning than that, let’s ask for developers or program managers or masseurs. Managers, let’s manage teams of engineers; let’s learn what makes them different and celebrate those differences rather than guiding them to sameness. While we’re being magnanimous we can even extend this courtesy to contractors—yes, Finance, I know, we don’t pay for the warranty (health care plan). And when possible try to remember a name or two; the resources tend to like it.

Project Mayhem suffers a resourcing gap through unwanted attrition.

Posted on September 24, 2015 at 3:52 pm by ahl · Permalink · 2 Comments
In: Other

First Rust Program Pain (So you can avoid it…)

Like many programmers I like to try out new languages. After lunch with Alex Crichton, one of the Rust contributors, I started writing my favorite program in Rust. Rust is a “safe” systems language that introduces concepts of data ownership and mutability to semantically prevent whole categories of problems. It’s primarily developed at Mozilla Research in service of a next generation rendering engine, and while I presume that the name is a poke in the eye of Google’s Chrome, no one was brave enough to confirm that lest their next Uber ride reroute them to Bagram.

My standard “hello world” is an anagrammer / Scrabble cheater. Why? In most languages you can get it done in a few dozen lines of code, and it uses a variety of important language and library features: lists, maps, file IO, console IO, strings, sorting, etc. Rust is great, interesting in the way that I found object-oriented or functional programming interesting when I first learned about them. Its notions of data ownership, borrowing, and mutability I think lead to some of the same aha moments as closures for example. I found Rust to be quirky enough though that I thought I might be able to save others the pain of their first program, advancing them to the glorious, safe efficiency of their second by relating my experience.

So with the help of Stack Overflow I wrote the first chunk:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 18         while b.read_line(&mut s).is_ok() {
 19                 println!("{}", s);
 20         }
 21 }

So far so good? Well I ran it and it didn’t seem to be terminating…

$ time ./scrabble >/dev/null
<time passes>

What’s happening?

$ ./scrabble | head



thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/

Okay — first lesson: String::clear(). As the documentation clearly states, BufReader::read_line() appends to an existing string; my own expectations and preconceptions are beside the point.

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 18         while b.read_line(&mut s).is_ok() {
 19                 s.pop();
 20                 println!("{}", s);
 21                 s.clear();
 22         }
 23 }

Better? Yes:

$ ./scrabble | head
thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/

Correct? No:

$ time ./scrabble >/dev/null
<time passes>

It turns out that BufReader::read_line() indeed is_ok() even at EOF. Again, documented but—to me—counter-intuitive. And it turns out that this is a somewhat divisive topic. No matter; how about something else? Well it works, but the ever persnickety rustc finds ‘while true’ too blue-collar of a construct:

$ rustc 25:3 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
    while true {
        if !b.read_line(&mut s).is_ok() || s.len() == 0 {
            break;
        }
        s.pop();
        println!("{}", s);
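For reference, here is a minimal sketch of the `loop` form that the warning asks for, treating a zero-byte read as EOF. It reads from an in-memory `Cursor` rather than `../word.lst` so that it stands alone, and the helper name `read_words` is made up for this illustration.

```rust
use std::io::{BufRead, Cursor};

// Hypothetical helper: collect every line from any buffered reader.
fn read_words<R: BufRead>(mut reader: R) -> Vec<String> {
    let mut words = Vec::new();
    let mut s = String::new();
    loop {
        // read_line() appends to `s` and returns the number of bytes read;
        // Ok(0) means EOF, so is_ok() alone can never end the loop.
        match reader.read_line(&mut s) {
            Ok(0) => break,
            Ok(_) => {
                // Strip the trailing newline, if there is one.
                if s.ends_with('\n') {
                    s.pop();
                }
                words.push(s.clone());
                // Reset the buffer; otherwise the next line is appended to it.
                s.clear();
            }
            Err(why) => panic!("read failed: {}", why),
        }
    }
    words
}

fn main() {
    let input = Cursor::new("tea\neat\nate");
    for w in read_words(input) {
        println!("{}", w);
    }
}
```

The explicit `Ok(0)` arm carries the EOF check that `is_ok()` cannot express, and the `Err` arm keeps genuine I/O failures from being silently treated as the end of the file.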

Trying to embrace the fastidious methodology (while ever tempted to unsafe-and-let-execution-be-the-judge) I gave up on read_line() and its controversial EOF and error semantics to try out BufReader::lines():

 18         for s in b.lines() {
 19                 println!("{}", s);
 20         }
$ rustc 18:19 error: the trait `core::fmt::Display` is not implemented for the type `core::result::Result<collections::string::String, std::io::error::Error>` [E0277]
        println!("{}", s);
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
18:21 note: expansion site
18:19 note: `core::result::Result<collections::string::String, std::io::error::Error>` cannot be formatted with the default formatter; try using `:?` instead if you are using a format string
        println!("{}", s);
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
18:21 note: expansion site
error: aborting due to previous error

Okay; that was apparently very wrong. The BufReader::lines() iterator gives us Result<String>s which we need to unwrap(). No problem.

 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }

15:11 warning: variable does not need to be mutable, #[warn(unused_mut)] on by default
    let mut b = BufReader::new(file);

Fine, rustc, you’re the boss. Now it’s simpler and it’s cranking:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6 use std::collections::HashMap;
  8 fn main() {
  9         let path = Path::new("../word.lst");
 10         let file = match File::open(&path) {
 11                 Err(why) => panic!("failed to open {}: {}", path.display(),
 12                     Error::description(&why)),
 13                 Ok(f) => f,
 14         };
 16         let b = BufReader::new(file);
 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }
 22 }

Now let’s build up our map. We’ll create a map from the sorted characters to the list of anagrams. For that we’ll use matching, another handy construct.

 23                 let mut v: Vec<char> = s.chars().collect();
 24                 v.sort();
 25                 let ss: String = v.into_iter().collect();
 27                 match dict.get(&ss) {
 28                         Some(mut v) => v.push(s),
 29                         _ => {
 30                                 let mut v = Vec::new();
 31                                 v.push(s);
 32                                 dict.insert(ss, v);
 33                         },
 34                 }

What could be simpler? I love this language! But not so fast…

28:20 error: cannot borrow immutable borrowed content `*v` as mutable
            Some(mut v) => v.push(s),
                           ^
32:9 error: cannot borrow `dict` as mutable because it is also borrowed as immutable
                dict.insert(ss, v);
                ^~~~
27:13 note: previous borrow of `dict` occurs here; the immutable borrow prevents subsequent moves or mutable borrows of `dict` until the borrow ends
        match dict.get(&ss) {
              ^~~~
34:4 note: previous borrow ends here
        match dict.get(&ss) {
...
        }
error: aborting due to 2 previous errors

This is where in C I’d start casting away const. Not an option here. Okay, but I remember these notions of ownership, borrowing, and mutability as concepts early in the Rust overview. At the time it seemed like one of those explanations of git that sounds like more of a functional analysis of cryptocurrency. But perhaps there were some important nuggets in there…

Mutability, check! The HashMap::get() yielded an immutable borrow that would exist for as long as its return value was in scope. Easily solved by changing it to a get_mut():

32:9 error: cannot borrow `dict` as mutable more than once at a time
                dict.insert(ss, v);
                ^~~~
27:13 note: previous borrow of `dict` occurs here; the mutable borrow prevents subsequent moves, borrows, or modification of `dict` until the borrow ends
        match dict.get_mut(&ss) {
              ^~~~
34:4 note: previous borrow ends here
        match dict.get_mut(&ss) {
...
        }
error: aborting due to previous error

Wrong again. Moving me right down the Kübler-Ross model from anger into bargaining. You’re saying that I can’t mutate it because I can already mutate it? What do I have, rustc, that you want? How about if I pull the insert() out of the context of that get_mut()?

 27                 let mut bb = false;
 29                 match dict.get_mut(&ss) {
 30                         Some(mut v) => v.push(s),
 31                         _ => {
 32                                 bb = true;
 33                         },
 34                 }
 35                 if bb {
 36                         let mut v = Vec::new();
 37                         v.push(s);
 38                         dict.insert(ss, v);
 39                 }

Inelegant, yes, but Rust was billed as safe-C, not elegant-C, right?

37:12 error: use of moved value: `s`
            v.push(s);
                   ^
30:27 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
            Some(mut v) => v.push(s),
error: aborting due to previous error

So by pushing the anagram into the list at line 30 we lost ownership, and even though that definitely didn’t happen in the case of us reaching line 37, rustc isn’t having it. Indeed there doesn’t seem to be a way to both get an existing value and to insert a value in one lexical vicinity. At this point I felt like I was in some bureaucratic infinite loop, doomed to shuttle to and fro between windows at the DMV, always holding the wrong form. Any crazy person will immediately be given a mutable map, but asking for a mutable map immediately classifies you as sane.

After walking away for a day to contemplate, here’s the compromise I came to:

 27                 if dict.contains_key(&ss) {
 28                         dict.get_mut(&ss).unwrap().push(s);
 29                 } else {
 30                         let mut v = Vec::new();
 31                         v.push(s);
 32                         dict.insert(ss, v);
 33                 }

And everyone was happy! But it turns out that there’s an even Rustier way of doing this (thanks to Delphix intern, John Ericson) with a very specific API:

                let mut v = dict.entry(sort_str(&s)).or_insert(Vec::new());
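To make that one-liner concrete, here is a small sketch of the entry API building the anagram map from a hard-coded word list. The word list and the names `sort_chars` and `build_dict` are invented for this example, and `or_insert_with(Vec::new)` is a close cousin of the `or_insert(Vec::new())` call shown above.

```rust
use std::collections::HashMap;

// Hypothetical helper: the letters of a word in sorted order.
fn sort_chars(s: &str) -> String {
    let mut v: Vec<char> = s.chars().collect();
    v.sort();
    v.into_iter().collect()
}

// Hypothetical helper: map sorted letters to every word with those letters.
fn build_dict(words: &[&str]) -> HashMap<String, Vec<String>> {
    let mut dict: HashMap<String, Vec<String>> = HashMap::new();
    for &w in words {
        // entry() takes a single mutable borrow of the map and hands back
        // the slot for this key; or_insert_with() fills it only when vacant.
        // There is no separate get_mut() and insert() to fight over `dict`.
        dict.entry(sort_chars(w))
            .or_insert_with(Vec::new)
            .push(w.to_string());
    }
    dict
}

fn main() {
    let dict = build_dict(&["tea", "eat", "ate", "rust"]);
    println!("{:?}", dict.get("aet"));
}
```

Because the lookup and the insertion happen inside one borrow, neither the "borrowed more than once" error nor the "use of moved value" error from the earlier attempts has anywhere to arise.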

This is starting to look a lot less like safe C and a lot more like the stacking magic of C++. No matter; I’m just trying to cheat at Scrabble, not debate philosophy. Now that I’ve got my map built, let’s prompt the user and do the lookup. We’ll put the string sorting logic into a function:

  8 fn sort_str(s: String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }

32:37 error: use of moved value: `s`
            dict.get_mut(&ss).unwrap().push(s);
                                            ^
29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
        let ss = sort_str(s);
                          ^
35:12 error: use of moved value: `s`
            v.push(s);
                   ^
29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
        let ss = sort_str(s);
error: aborting due to 2 previous errors

This was wrong because we need to pass s as a reference or else it’s moved into the function and destroyed when the function returns; the change needs to happen both in the function signature and at the call site.

  8 fn sort_str(s: &String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }
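A brief sketch of why the reference matters: the caller only lends `s` to the function, so it can still move `s` afterwards. As an assumption beyond the post, the parameter here is `&str` rather than `&String` (a `&String` argument coerces to it), and `sort_letters` is a made-up name to keep it distinct from `sort_str`.

```rust
// Same logic as sort_str, but borrowing a string slice.
fn sort_letters(s: &str) -> String {
    let mut v: Vec<char> = s.chars().collect();
    v.sort();
    v.into_iter().collect()
}

fn main() {
    let s = String::from("rust");
    // &s only lends the string; ownership stays with main().
    let ss = sort_letters(&s);
    // Because s was borrowed rather than moved, it can still be moved here.
    let v = vec![s];
    println!("{} -> {:?}", ss, v);
}
```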

As an aside I’d note how goofy I think it is that the absence of a semi-colon denotes function return. And that using an explicit return is sneered at as “un-idiomatic”. I’ve been told that this choice enables deeply elegant constructs with closures and that I’m simply behind the times. Fair enough. Now we’ll read the user-input:

 41         for line in stdin().lock().lines() {
 42                 let s = line.unwrap();
 44                 match dict.get(&sort_str(&s)) {
 45                         Some(v) => {
 46                                 print!("anagrams for {}: ", s);
 47                                 for a in v {
 48                                         print!("{} ", a);
 49                                 }
 50                                 println!("");
 51                         },
 52                         _ => println!("no dice"),
 53                 }
 54         }

43:21 error: borrowed value does not live long enough
    for line in stdin().lock().lines() {
                ^~~~~~~
57:2 note: reference must be valid for the destruction scope surrounding statement at 43:1...
    for line in stdin().lock().lines() {
        let s = line.unwrap();
        match dict.get(&sort_str(&s)) {
            Some(v) => {
                print!("anagrams for {}: ", s);
                ...
57:2 note: ...but borrowed value is only valid for the statement at 43:1
    for line in stdin().lock().lines() {
        let s = line.unwrap();
        match dict.get(&sort_str(&s)) {
            Some(v) => {
                print!("anagrams for {}: ", s);
                ...
57:2 help: consider using a `let` binding to increase its lifetime
    for line in stdin().lock().lines() {
        let s = line.unwrap();
        match dict.get(&sort_str(&s)) {
            Some(v) => {
                print!("anagrams for {}: ", s);
error: aborting due to previous error

Okay! Too cute! Got it. Here’s the final program with some clean up here and there:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6 use std::collections::HashMap;
  7 use std::io::stdin;
  9 fn sort_str(s: &String) -> String {
 10         let mut v: Vec<char> = s.chars().collect();
 11         v.sort();
 12         v.into_iter().collect()
 13 }
 15 fn main() {
 16         let path = Path::new("../word.lst");
 17         let file = match File::open(&path) {
 18                 Err(why) => panic!("failed to open {}: {}", path.display(),
 19                     Error::description(&why)),
 20                 Ok(f) => f,
 21         };
 23         let b = BufReader::new(file);
 25         let mut dict: HashMap<String, Vec<String>> = HashMap::new();
 27         for line in b.lines() {
 28                 let s = line.unwrap();
 29                 dict.entry(sort_str(&s)).or_insert(Vec::new()).push(s);
 30         }
 32         let sin = stdin();
 34         for line in sin.lock().lines() {
 35                 let s = line.unwrap();
 37                 match dict.get(&sort_str(&s)) {
 38                         Some(v) => {
 39                                 print!("anagrams for {}: ", s);
 40                                 for a in v {
 41                                         print!("{} ", a);
 42                                 }
 43                                 println!("");
 44                         },
 45                         _ => println!("no dice"),
 46                 }
 47         }
 48 }


Rust is not Python. I knew that Rust wasn’t Python… or Java, or Perl, etc. But it still took me a while to remember and embrace that. You have to think about memory management even when you get to do less of it explicitly. For programs with messy notions of data ownership I can see Rust making for significantly cleaner code, easier to understand, and more approachable to new engineers. The concepts of ownership, borrowing, and mutability aren’t “like” anything. It took the mistakes of that first program to teach me that. Hopefully you can skip straight to your second Rust program.


Before I posted this I received some suggestions from my colleagues at Delphix about how to improve the final code. I resolved to focus on the process, the journey if you will, rather than the result. That said I now realize that I was myself a victim of learning from some poor examples (from Stack Overflow in particular). There’s nothing more durable than poor but serviceable examples; we’ve all seen inefficient copy/pasta littered throughout a code base. So with help again from John Ericson and the Twitterverse at large here’s my final version as a GitHub gist (if I were going to do it over again I’d stick each revision in GitHub for easier navigation). Happy copying!

Posted on June 22, 2015 at 7:07 pm by ahl · Permalink · 16 Comments
In: Software

On Blogging (Briefly)

I gave a presentation today on the methods and reasons of blogging for Delphix Engineering.

One of my points was that presentations make for simple blog posts–practice what you preach!

Posted on March 4, 2015 at 11:26 pm by ahl · Permalink · One Comment
In: Delphix

DTrace OEL Dynamic Language Support

We built DTrace to solve problems; at the start, the problems we understood best were our own. In the Solaris Kernel Group we started by instrumenting the kernel and system calls, the user/kernel boundary. Early use required detailed knowledge of kernel internals. As DTrace use grew—within the team, in Sun and then beyond—we extended DTrace to turn every function and every instruction in user programs into probes. We added stable points of instrumentation both in the kernel and in user-land so that no deep knowledge of program or kernel internals would be required.

Oracle has been evolving their port of DTrace to OEL, prioritizing the stable points of instrumentation most relevant for the widest group of users. While DTrace started with providers that unlocked tens of thousands of points of instrumentation, the Oracle port enables a small number of comprehensible probes. Since I last tried out their port they’ve fixed some bugs, and added support for stable I/O and process probes, as well as user-land static probes.

[root@screven ~]# uname -a
Linux screven 3.8.13-16.el6uek.x86_64 #1 SMP Fri Sep 20 11:54:42 PDT 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@screven ~]# cat test.d
provider test {
        probe foo(int);
};
[root@screven ~]# cat main.c
#include "test.h"

int
main(int argc, char **argv)
{
        TEST_FOO(100);
        return (0);
}
[root@screven ~]# dtrace -h -s test.d
[root@screven ~]# gcc -c main.c
[root@screven ~]# dtrace -G -s test.d main.o
[root@screven ~]# gcc -o main main.o test.o
[root@screven ~]# dtrace -c ./main -n 'test$target:::foo{ trace(arg0); }'
dtrace: description 'test$target:::foo' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    643                         main:foo               100

USDT, as it’s called, was a relatively late addition in the initial development of DTrace. We added it initially to support probes in user-land locking primitives (the plockstat(1M) command uses it just as the lockstat(1M) command was converted to use kernel SDT probes). We were right in thinking that USDT would be useful for providing probes in infrastructure software such as Apache and MySQL; we didn’t anticipate how incredibly valuable it would be for supporting dynamic languages such as JavaScript (including Node), Python, Java, and bash.

USDT built on both the learning and code from years of DTrace development. By effectively starting there, OEL benefits from a decade of integrations and investigations. DTrace users on all platforms will benefit from the growth of our community. I look forward to seeing the new investigations on OEL and new integrations in all types of applications.

Posted on December 27, 2014 at 8:27 pm by ahl · Permalink · One Comment
In: DTrace

Delphix Week of Giving

In the frenzied, insular world of a Silicon Valley startup it can be easy to lose perspective on the broader community in which we live and work. Among the great hackathon projects to come from our bi-annual engineering event was the idea of “Angel Sharks”, a group of volunteers at Delphix who provide opportunities for volunteering and community giving. Earlier this year, this group organized volunteer events around the launch of new Delphix releases.

We just completed our first “Week of Giving”. While many at Delphix already donate their time and money, the Angel Sharks organized giving and corporate matching. Our theme for 2014 was hunger; we focused on the SF-Marin Food Bank as our featured organization.

Over 50% of Delphix employees participated worldwide; a high bar that I’d like to see us exceed next year. Some activities of note were volunteering at food banks in the SF Bay Area, Atlanta and Boulder, toy donations to Toys for Tots, the Salvation Army Giving Tree, and the Starlight Foundation, and a silent auction that both brought the Delphix community closer together and raised over $3,000 for the SF-Marin Food Bank. More than $21,000 was raised in total with 30% of employees making matching requests in just three weeks! The Week of Giving brought a great energy and community spirit to the company; I’m excited to have giving as part of our DNA as a young company.

The SF-Marin Food Bank feeds 225,000 people annually with 47m lbs of food, and 96% of donations go directly to their programs. Donations are down for the year while need has increased by 1m lbs. You can donate here. I volunteered twice this year with my Delphix colleagues, and once with my wife and son (8 years old); I highly recommend it for both corporate and family outings.

Happy holidays from the Delphix family!


Posted on December 24, 2014 at 10:09 pm by ahl · Permalink · Comments Closed
In: Delphix

Tuning the OpenZFS write throttle

In previous posts I discussed the problems with the legacy ZFS write throttle that cause degraded performance and wildly variable latencies. I then presented the new OpenZFS write throttle and I/O scheduler that Matt Ahrens and I designed. In addition to solving several problems in ZFS, the new approach was designed to be easy to reason about, measure, and adjust. In this post I’ll cover performance analysis and tuning — using DTrace of course. These details are intended for those using OpenZFS and trying to optimize performance — if you have only a casual interest in ZFS consider yourself warned!

Buffering dirty data

OpenZFS limits the amount of dirty data on the system according to the tunable zfs_dirty_data_max. Its default value is 10% of memory up to 4GB. The tradeoffs are pretty simple:

Lower                                                    | Higher
Less memory reserved for use by OpenZFS                  | More memory reserved for use by OpenZFS
Able to absorb less workload variation before throttling | Able to absorb more workload variation before throttling
Less data in each transaction group                      | More data in each transaction group
Less time spent syncing out each transaction group       | More time spent syncing out each transaction group
More metadata written due to less amortization           | Less metadata written due to more amortization


Most workloads contain variability. Think of the dirty data as a buffer for that variability. Let’s say the LUNs assigned to your OpenZFS storage pool are able to sustain 100MB/s in aggregate. If a workload consistently writes at 100MB/s then only a very small buffer would be required. If instead the workload oscillates between 200MB/s and 0MB/s for 10 seconds each, then a small buffer would limit performance. A buffer of 800MB would be large enough to absorb the full 20 second cycle over which the average is 100MB/s. A buffer of only 200MB would cause OpenZFS to start to throttle writes — inserting artificial delays — after less than 2 seconds during which the LUNs could flush 200MB of dirty data while the client tried to generate 400MB.
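The 200MB case can be checked with a back-of-the-envelope model. The sketch below assumes constant write and flush rates and ignores everything else OpenZFS does (transaction group boundaries, the gradual onset of the throttle); the function name and parameters are invented for the illustration and are not OpenZFS code.

```rust
// Illustrative model only; none of these names come from OpenZFS.
// Seconds until `limit_mb` of dirty data accumulates when clients write at
// `write_mbps` and the pool flushes at `flush_mbps`. None means the pool
// keeps up and the limit is never reached.
fn secs_to_reach(limit_mb: f64, write_mbps: f64, flush_mbps: f64) -> Option<f64> {
    let surplus = write_mbps - flush_mbps;
    if surplus <= 0.0 {
        None
    } else {
        Some(limit_mb / surplus)
    }
}

fn main() {
    // 200MB buffer, 200MB/s burst, LUNs sustaining 100MB/s.
    println!("{:?}", secs_to_reach(200.0, 200.0, 100.0));
    // Steady 100MB/s workload: dirty data never accumulates.
    println!("{:?}", secs_to_reach(200.0, 100.0, 100.0));
}
```

Under this model the 200MB buffer is completely full after two seconds of the burst, which is why throttling has to begin before then.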

Track the amount of outstanding dirty data within your storage pool to know which way to adjust zfs_dirty_data_max:

txg-syncing
{
        this->dp = (dsl_pool_t *)arg0;
}
txg-syncing
/this->dp->dp_spa->spa_name == $$1/
{
        printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
            `zfs_dirty_data_max / 1024 / 1024);
}

# dtrace -s dirty.d pool
dtrace: script 'dirty.d' matched 2 probes
11 8730 txg_sync_thread:txg-syncing 966MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 774MB of 4096MB used
10 8730 txg_sync_thread:txg-syncing 954MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 888MB of 4096MB used
0 8730 txg_sync_thread:txg-syncing 858MB of 4096MB used

The write throttle kicks in once the amount of dirty data exceeds zfs_delay_min_dirty_percent of the limit (60% by default). If the amount of dirty data fluctuates above and below that threshold, it might be possible to avoid throttling by increasing the size of the buffer. If the metric stays low, you may reduce zfs_dirty_data_max. Weigh this tuning against other uses of memory on the system (a larger value means that there’s less memory for applications or the OpenZFS ARC for example).
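As a quick worked check of that rule against the dirty.d output above, the sketch below computes the throttle threshold and compares it with the observed peak. The function name is made up, the arithmetic is simply "percent of the limit" as described in the text, and the values are the ones printed by the script.

```rust
// Illustrative only: the name and structure are not taken from OpenZFS.
// Dirty-data level (in MB) at which the write throttle starts delaying writes.
fn throttle_threshold_mb(dirty_data_max_mb: u64, delay_min_dirty_percent: u64) -> u64 {
    dirty_data_max_mb * delay_min_dirty_percent / 100
}

fn main() {
    // Values from the dirty.d output; 60% is the stated default.
    let max_mb: u64 = 4096;
    let samples_mb: [u64; 5] = [966, 774, 954, 888, 858];
    let threshold = throttle_threshold_mb(max_mb, 60);
    let peak = samples_mb.iter().copied().max().unwrap();
    println!(
        "threshold {}MB, peak {}MB, throttled: {}",
        threshold,
        peak,
        peak > threshold
    );
}
```

With every sample well under the threshold, this pool would be a candidate for a smaller zfs_dirty_data_max, subject to the memory tradeoffs already described.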

A larger buffer also means that flushing a transaction group will take longer. This is relevant for certain OpenZFS administrative operations (sync tasks) that occur when a transaction group is committed to stable storage such as creating or cloning a new dataset. If the interactive latency of these commands is important, consider how long it would take to flush zfs_dirty_data_max bytes to disk. You can measure the time to sync transaction groups (recall, there are up to three active at any given time) like this:

txg-syncing
/((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        start = timestamp;
}

txg-synced
/start && ((dsl_pool_t *)arg0)->dp_spa->spa_name == $$1/
{
        this->d = timestamp - start;
        printf("sync took %d.%02d seconds", this->d / 1000000000,
            this->d / 10000000 % 100);
}

# dtrace -s duration.d pool
dtrace: script 'duration.d' matched 2 probes
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds
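To put numbers on this, the sketch below summarizes the duration.d output and estimates a worst-case flush time. It is a back-of-the-envelope model only: the 4096MB limit comes from the dirty.d output earlier, the 100MB/s rate is the illustrative LUN throughput used earlier in this post, and the function names are hypothetical.

```python
import re

# Sample lines copied from the duration.d output above.
SAMPLE = """\
5 8729 txg_sync_thread:txg-synced sync took 5.86 seconds
2 8729 txg_sync_thread:txg-synced sync took 6.85 seconds
11 8729 txg_sync_thread:txg-synced sync took 6.25 seconds
1 8729 txg_sync_thread:txg-synced sync took 6.32 seconds
11 8729 txg_sync_thread:txg-synced sync took 7.20 seconds
1 8729 txg_sync_thread:txg-synced sync took 5.14 seconds
"""


def sync_times(text):
    """Extract the per-txg sync durations (in seconds) from duration.d output."""
    return [float(s) for s in re.findall(r"sync took (\d+\.\d+) seconds", text)]


def flush_seconds(dirty_max_mb, throughput_mb_per_s):
    """Time to write out a completely full buffer at a sustained rate."""
    return dirty_max_mb / throughput_mb_per_s


times = sync_times(SAMPLE)
mean = sum(times) / len(times)
print(f"{len(times)} syncs: mean {mean:.2f}s, max {max(times):.2f}s")
print(f"4096MB at 100MB/s: {flush_seconds(4096, 100):.1f}s to flush")
```

If the estimated full-buffer flush time is far longer than the latency you can tolerate for commands like dataset creation, that argues for a smaller zfs_dirty_data_max.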

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG). zfs_dirty_data_max puts a hard limit on the amount of data in memory that has yet to be written to the main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast devices such as the DDRDrive make for great log devices. As an aside, consider the ostensible upgrade that Oracle brought to the ZFS Storage Appliance a few years ago replacing the 18GB “Logzilla” with a 73GB upgrade.

I/O scheduler

Where ZFS had a single IO queue for all IO types, OpenZFS has five IO queues, one for each of the different IO types: sync reads (for normal, demand reads), async reads (issued from the prefetcher), sync writes (to the intent log), async writes (bulk writes of dirty data), and scrub (scrub and resilver operations). Note that bulk dirty data described above are scheduled in the async write queue. See vdev_queue.c for the related tunables:

uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 1;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

Each of these queues has tunable values for the min and max number of outstanding operations of the given type that can be issued to a leaf vdev (LUN). The tunable zfs_vdev_max_active limits the number of IOs issued to a single vdev. If its value is less than the sum of the zfs_vdev_*_max_active tunables, then the minimums come into play. The minimum number of each queue will be scheduled and the remainder of zfs_vdev_max_active is issued from the queues in priority order.
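The interaction between the per-queue limits and zfs_vdev_max_active can be sketched as a two-pass allocation. This is a simplified Python model, not the actual vdev_queue.c logic (which picks one I/O at a time); the priority order is assumed to follow the order in which the queues are listed above, and the allocate function is hypothetical.

```python
# (name, min_active, max_active) from the vdev_queue.c defaults quoted above,
# in the priority order assumed for this sketch.
QUEUES = [
    ("sync_read", 10, 10),
    ("sync_write", 10, 10),
    ("async_read", 1, 3),
    ("async_write", 1, 10),
    ("scrub", 1, 2),
]


def allocate(pending, max_active):
    """Return {queue: issued} for one leaf vdev under a two-pass model."""
    issued = {}
    budget = max_active
    # Pass 1: every queue is served up to its minimum.
    for name, lo, _hi in QUEUES:
        n = min(pending.get(name, 0), lo, budget)
        issued[name] = n
        budget -= n
    # Pass 2: the remainder is handed out in priority order, up to each maximum.
    for name, _lo, hi in QUEUES:
        extra = min(pending.get(name, 0) - issued[name], hi - issued[name], budget)
        issued[name] += extra
        budget -= extra
    return issued


# Every queue has plenty of work pending.
busy = {name: 100 for name, _lo, _hi in QUEUES}
print(allocate(busy, 30))    # limit below the sum of the maximums (35)
print(allocate(busy, 1000))  # limit high enough that only the maximums matter
```

With a limit of 30 the minimums consume 23 slots and the remaining 7 go to async reads and then async writes, so scrub stays at its minimum.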

At a high level, the appropriate values for these tunables will be specific to your LUNs. Higher maximums lead to higher throughput with potentially higher latency. On some devices such as storage arrays with distinct hardware for reads and writes, some of the queues can be thought of as independent; on other devices such as traditional HDDs, reads and writes will likely impact each other.

A simple way to tune these values is to monitor I/O throughput and latency under load. Increase values by 20-100% until you find a point where throughput no longer increases, but latency is acceptable.

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

io:::start
{
        ts[args[0]->b_edev, args[0]->b_lblkno] = timestamp;
}

io:::done
/ts[args[0]->b_edev, args[0]->b_lblkno]/
{
        this->delta = (timestamp - ts[args[0]->b_edev, args[0]->b_lblkno]) / 1000;
        this->name = (args[0]->b_flags & (B_READ | B_WRITE)) == B_READ ?
            "read " : "write ";

        @q[this->name] = quantize(this->delta);
        @a[this->name] = avg(this->delta);
        @v[this->name] = stddev(this->delta);
        @i[this->name] = count();
        @b[this->name] = sum(args[0]->b_bcount);

        ts[args[0]->b_edev, args[0]->b_lblkno] = 0;
}

END
{
        printa(@q);

        normalize(@i, (timestamp - start) / 1000000000);
        normalize(@b, (timestamp - start) / 1000000000 * 1024);

        printf("%-30s %11s %11s %11s %11s\n", "", "avg latency", "stddev",
            "iops", "throughput");
        printa("%-30s %@9uus %@9uus %@9u/s %@8uk/s\n", @a, @v, @i, @b);
}

# dtrace -s rw.d -c 'sleep 60'

  read
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         23
             128 |@                                        655
             256 |@@@@                                     1638
             512 |@@                                       743
            1024 |@                                        380
            2048 |@@@                                      1341
            4096 |@@@@@@@@@@@@                             5295
            8192 |@@@@@@@@@@@                              5033
           16384 |@@@                                      1297
           32768 |@@                                       684
           65536 |@                                        400
          131072 |                                         225
          262144 |                                         206
          524288 |                                         127
         1048576 |                                         19
         2097152 |                                         0

  write
           value  ------------- Distribution ------------- count
              32 |                                         0
              64 |                                         47
             128 |                                         469
             256 |                                         591
             512 |                                         327
            1024 |                                         924
            2048 |@                                        6734
            4096 |@@@@@@@                                  43416
            8192 |@@@@@@@@@@@@@@@@@                        102013
           16384 |@@@@@@@@@@                               60992
           32768 |@@@                                      20312
           65536 |@                                        6789
          131072 |                                         860
          262144 |                                         208
          524288 |                                         153
         1048576 |                                         36
         2097152 |                                         0        

                               avg latency      stddev        iops  throughput
write                              19442us     32468us      4064/s   261889k/s
read                               23733us     88206us       301/s    13113k/s
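The "increase until throughput stops improving" procedure can be expressed as a small selection rule. The Python below is only a sketch: the measurements are invented placeholders (only the 260000k/s and 19000us figures are loosely modeled on the write row above), and pick_setting, the 5% minimum gain, and the latency budget are all assumptions chosen for illustration.

```python
def pick_setting(measurements, max_latency_us, min_gain=0.05):
    """Pick the last setting that still bought throughput at acceptable latency.

    measurements: (max_active, throughput_k_per_s, avg_latency_us) tuples,
    ordered by increasing max_active.
    """
    best = None
    for setting, throughput, latency in measurements:
        if latency > max_latency_us:
            break  # latency is no longer acceptable
        if best is not None and throughput < best[1] * (1 + min_gain):
            break  # throughput has stopped increasing meaningfully
        best = (setting, throughput, latency)
    return best


# Hypothetical results from four runs of rw.d, raising the maximum ~50% each time.
runs = [
    (10, 180000, 12000),
    (15, 230000, 15000),
    (22, 260000, 19000),
    (33, 262000, 31000),
]
print(pick_setting(runs, max_latency_us=25000))
```

In this made-up data the step from 22 to 33 adds almost no throughput and pushes latency past the budget, so 22 is the value to keep.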

Async writes

Dirty data governed by zfs_dirty_data_max is written to disk via async writes. The I/O scheduler treats async writes a little differently than other operations. The number of concurrent async writes scheduled depends on the amount of dirty data on the system. Recall that there is a fixed (but tunable) limit of dirty data in memory. With a small amount of dirty data, the scheduler will only schedule a single operation (zfs_vdev_async_write_min_active); the idea is to preserve low latency of synchronous operations when there isn’t much write load on the system. As the amount of dirty data increases, the scheduler will push the LUNs harder to flush it out by issuing more concurrent operations.

The old behavior was to schedule a fixed number of operations regardless of the load. This meant that the latency of synchronous operations could fluctuate significantly. While writing out dirty data ZFS would slam the LUNs with writes, contending with synchronous operations and increasing their latency. After the syncing transaction group had completed, there would be a period of relatively low async write activity during which synchronous operations would complete more quickly. This phenomenon was known as “picket fencing” due to the square wave pattern of latency over time. The new OpenZFS I/O scheduler is optimized for consistency.

In addition to tuning the minimum and maximum number of concurrent operations sent to the device, there are two other tunables related to asynchronous writes: zfs_vdev_async_write_active_min_dirty_percent and zfs_vdev_async_write_active_max_dirty_percent. Along with the min and max operation counts (zfs_vdev_async_write_min_active and zfs_vdev_async_write_max_active), these four tunables define a piece-wise linear function that determines the number of operations scheduled as depicted in this lovely ASCII art graph excerpted from the comments:

 * The number of concurrent operations issued for the async write I/O class
 * follows a piece-wise linear function defined by a few adjustable points.
 *        |                   o---------| <-- zfs_vdev_async_write_max_active
 *   ^    |                  /^         |
 *   |    |                 / |         |
 * active |                /  |         |
 *  I/O   |               /   |         |
 * count  |              /    |         |
 *        |             /     |         |
 *        |------------o      |         | <-- zfs_vdev_async_write_min_active
 *       0|____________^______|_________|
 *        0%           |      |       100% of zfs_dirty_data_max
 *                     |      |
 *                     |      `-- zfs_vdev_async_write_active_max_dirty_percent
 *                     `--------- zfs_vdev_async_write_active_min_dirty_percent

In a relatively steady state we’d like to see the amount of outstanding dirty data stay in a narrow band between the min and max percentages, by default 30% and 60% respectively.
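The graph translates directly into a few lines of arithmetic. This Python sketch uses the default values quoted in this post (1 and 10 operations, 30% and 60%); the function name is hypothetical and the integer math is an assumption meant to resemble kernel-style arithmetic rather than a copy of the real code.

```python
def async_write_active(dirty, dirty_max, min_active=1, max_active=10,
                       min_pct=30, max_pct=60):
    """Number of concurrent async writes allowed for a given amount of dirty data."""
    lo = dirty_max * min_pct // 100   # below this: stay at the minimum
    hi = dirty_max * max_pct // 100   # above this: stay at the maximum
    if dirty <= lo:
        return min_active
    if dirty >= hi:
        return max_active
    # Linear interpolation along the sloped segment of the graph.
    return min_active + (dirty - lo) * (max_active - min_active) // (hi - lo)


# Dirty data as a share of a 1000MB limit.
for dirty_mb in (100, 300, 450, 500, 600, 900):
    print(f"{dirty_mb:4d}MB dirty -> {async_write_active(dirty_mb, 1000)} active writes")
```

Widening the gap between the two percentages flattens the slope, so the operation count changes more gradually as dirty data accumulates.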

Tune zfs_vdev_async_write_max_active as described above to maximize throughput without hurting latency. The only reason to increase zfs_vdev_async_write_min_active is if additional writes have little to no impact on latency. While this could be used to make sure data reaches disk sooner, an alternative approach is to decrease zfs_vdev_async_write_active_min_dirty_percent thereby starting to flush data despite less dirty data accumulating.

To tune the min and max percentages, watch both latency and the number of scheduled async write operations. If the operation count fluctuates wildly and impacts latency, you may want to flatten the slope by decreasing the min and/or increasing the max (note that you will likely want to increase zfs_delay_min_dirty_percent if you increase zfs_vdev_async_write_active_max_dirty_percent; see the write delay section below).

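#pragma D option aggpack
#pragma D option quiet

fbt::vdev_queue_max_async_writes:entry
{
        self->spa = args[0];
}

fbt::vdev_queue_max_async_writes:return
/self->spa && self->spa->spa_name == $$1/
{
        @ = lquantize(args[1], 0, 30, 1);
}

tick-1s
{
        printa(@);
        clear(@);
}

fbt::vdev_queue_max_async_writes:return
{
        self->spa = 0;
}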

# dtrace -s q.d dcenter

min .--------------------------------. max | count
< 0 : ▃▆ : >= 30 | 23279

min .--------------------------------. max | count
< 0 : █ : >= 30 | 18453

min .--------------------------------. max | count
< 0 : █ : >= 30 | 27741

min .--------------------------------. max | count
< 0 : █ : >= 30 | 3455

min .--------------------------------. max | count
< 0 : : >= 30 | 0

Write delay

In situations where LUNs cannot keep up with the incoming write rate, OpenZFS artificially delays writes to ensure consistent latency (see the previous post in this series). Until a certain amount of dirty data accumulates there is no delay. When enough dirty data accumulates OpenZFS gradually increases the delay. By delaying writes OpenZFS effectively pushes back on the client to limit the rate of writes by forcing artificially higher latency. There are two tunables that pertain to delay: how much dirty data there needs to be before the delay kicks in, and the factor by which that delay increases as the amount of outstanding dirty data increases.

The tunable zfs_delay_min_dirty_percent determines when OpenZFS starts delaying writes. The default is 60%; note that we don’t start delaying client writes until the IO scheduler is pushing out data as fast as it can (zfs_vdev_async_write_active_max_dirty_percent also defaults to 60%).

The other relevant tunable, zfs_delay_scale, is really the only magic number here. It roughly corresponds to the inverse of the maximum number of operations per second (denominated in nanoseconds), and is used as a scaling factor.
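This post does not spell out the delay formula. As a hedged sketch, the comments in dmu_tx.c describe a curve of the form delay = zfs_delay_scale * (dirty - min) / (max - dirty), capped at zfs_delay_max_ns. The Python below assumes that formula along with a default zfs_delay_scale of 500000; the function name is hypothetical.

```python
def write_delay_ns(dirty, dirty_max, delay_scale=500_000,
                   min_dirty_pct=60, delay_max_ns=100_000_000):
    """Approximate per-write delay, in nanoseconds, for a given amount of dirty data."""
    delay_min = dirty_max * min_dirty_pct // 100
    if dirty <= delay_min:
        return 0                      # below the threshold: no delay
    if dirty >= dirty_max:
        return delay_max_ns           # buffer full: maximum delay
    delay = delay_scale * (dirty - delay_min) // (dirty_max - dirty)
    return min(delay, delay_max_ns)


# Dirty data as a share of a 1000MB limit.
for dirty_mb in (500, 700, 800, 900, 999):
    ns = write_delay_ns(dirty_mb, 1000)
    print(f"{dirty_mb:4d}MB dirty -> {ns / 1000:.0f}us delay per write")
```

Under these assumptions, halfway between the threshold and the limit (800MB here) each write is delayed by exactly zfs_delay_scale nanoseconds: 500us, or about 2000 operations per second. Doubling zfs_delay_scale doubles the delay at every point on the curve.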

Delaying writes is an aggressive step to ensure consistent latency. It is required if the client really is pushing more data than the system can handle, but unnecessarily delaying writes degrades overall throughput. There are two goals to tuning delay: reduce or remove unnecessary delay, and ensure consistent delays when needed.

First check to see how often writes are delayed. This simple DTrace one-liner does the trick:

# dtrace -n fbt::dsl_pool_need_dirty_delay:return'{ @[args[1] == 0 ? "no delay" : "delay"] = count(); }'

If a relatively small percentage of writes are delayed, consider increasing the amount of dirty data allowed (zfs_dirty_data_max) or even pushing out the point at which delays start (zfs_delay_min_dirty_percent). When increasing zfs_dirty_data_max consider the other users of DRAM on the system, and also note that a small number of small delays does not impact performance significantly.

If many writes are being delayed, the client really is trying to push data faster than the LUNs can handle. In that case, check for consistent latency, again, with a DTrace one-liner:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'

With high variance or if many write operations are being delayed for the maximum zfs_delay_max_ns (100ms by default) then try increasing zfs_delay_scale by a factor of 2 or more, or try delaying earlier by reducing zfs_delay_min_dirty_percent (remember to also reduce zfs_vdev_async_write_active_max_dirty_percent).

Summing up

Our experience at Delphix tuning the new write throttle has been so much better than in the old ZFS world: each tunable has a clear and comprehensible purpose, their relationships are well-defined, and the issues in tension pulling values up or down are both easy to understand and — most importantly — easy to measure. I hope that this tuning guide helps others trying to get the most out of their OpenZFS systems whether on Linux, FreeBSD, Mac OS X, illumos — not to mention the support engineers for the many products that incorporate OpenZFS into a larger solution.

Posted on August 31, 2014 at 4:16 pm by ahl · Permalink · 3 Comments
In: ZFS

Lessons from a decade of blogging

I started my blog June 17, 2004, tempted by the opportunity of Sun’s blogging policy, and cajoled by Bryan Cantrill’s presentation to the Solaris Kernel Team “Guerrilla Marketing” (net: Sun has forgotten about Solaris so let’s get the word out). I was a skeptical blogger. I even resisted the contraction “blog”, insisting on calling it “Adam Leventhal’s Weblog” as if linguistic purity would somehow elevate me above the vulgar blogspotter opining over toothpaste brands. (That linguistic purity did not, however, carry over into my early writing — my goodness it was painful to open that unearthed time capsule.)

A little about my blog. When I started blogging I was worried that I’d need to post frequently to build a readership. That was never going to happen. Fortunately aggregators (RSS feeds then; Twitter now) and web searches are far more relevant. My blog is narrow. There’s a lot about DTrace (a technology I helped develop), plenty in the last four years about Delphix (my employer), and samplings of flash memory, Galois fields, RAID, and musings on software and startups. The cumulative intersection consists of a single person. But — and this is hard to fathom — I’ve hosted a few hundred thousand unique visitors over the years. Aggregators pick up posts soon after posting; web searches drive traffic for years even on esoteric topics.

Ten years and 172 posts later, I wanted to see what lessons I could discern. So I turned to Google Analytics.

Most popular

3. I was surprised to see that my posts on double- and triple-parity RAID for ZFS have been among the most consistently read over the years since posting in 2006 and 2009 respectively. The former is almost exclusively an explanation of abstract algebra that I was taught in 2000, applied in 2006, and didn’t understand properly until 2009, when I wrote the post. The latter is catharsis from discovering errors in the published basis for our RAID implementation. I apparently considered it a personal affront.

2. When Oracle announced their DTrace port to Linux in 2011 a pair of posts broke the news and then deflated expectations — another personal affront — as the Oracle Linux efforts fell short of expectations (and continue to today). I had learned the lesson earlier that DTrace + a more popular operating system always garnered more interest.

1. In 2008 I posted about a defect in Apple’s DTrace implementation that was the result of its paranoid DRM protection. This was my perfect storm of blogging popularity: DTrace, more popular OS (Mac OS X!), Apple-bashing, and DRM! The story was snapped up by Slashdot (Reddit of the mid-2000s) as “Apple Crippled Its DTrace Port” and by The Register’s Ashlee Vance (The Register’s Chris Mellor of the mid-2000s) as “Apple cripples Sun’s open source jewel: Hollywood love inspires DTrace bomb.” It’s safe to say that I’m not going to see another week with 49,312 unique visitors any time soon. And to be clear I’m deeply grateful to that original DTrace team at Apple, the subject of a different post.

And many more…

Some favorites of mine and of readers (views, time on site, and tweets) over the years:

2004 Solaris 10 11-20. Here was a fun one. Solaris 10 was a great release. Any of the top ten features would have been the headliner in a previous release so I did a series on some of the lesser features that deserved to make the marquee. (If anyone would like to fill in number 14, dynamic System V IPC, I’d welcome the submission.)

2004 Inside nohup -p. The nohup command had remained virtually untouched since being developed at Bell Labs by the late Joseph Ossanna (described as “a peach and a ramrod”). I enjoyed adding some 21st century magic, and suffocating the reader with the details.

2005 DTrace is open. It truly was an honor to have DTrace be the first open source component of Solaris. That I took the opportunity to descend to crush depth was a testament to the pride I took in that code. (tsj and Kamen, I’m seeing your comments now for the first time and will respond shortly.)

2005 Sanity and FUD. This one is honestly adorable. Only a naive believer could have been such a passionate defender of what would become Oracle Solaris.

2005 DTrace in the JavaOne Keynote. It was a trip to present to over 10,000 people at Moscone. I still haven’t brought myself to watch the video. Presentation tip: to get comfortable speaking to an audience of size N simply speak to an audience of size 10N.

2005 The mysteries of _init. I geeked out about some of the voodoo within the linker. And I’m glad I did because a few weeks ago that very post solved a problem for one of my colleagues. I found myself reading the post with fascination (of course having forgotten it completely).

2008 Hybrid Storage Pools in CACM. In one of my first published articles, I discussed how we were using flash memory — a niche product at the time — as a component in enterprise storage. Now, of course, flash has always been the obvious future of storage; no one had yet realized that at the time.

2012 Hardware Engineer. At Fishworks (building the ZFS Storage Appliance at Sun) I got the nickname “Adam Leventhal, Hardware Engineer” for my preternatural ability to fit round pegs in square holes; this post catalogued some of those experiments.

2013 The Holistic Engineer. My thoughts on what constitutes a great engineer; this has become a frequently referenced guidepost within Delphix engineering.

2013 Delphix plus three years. Obviously I enjoy anniversaries. This was both a fun one to plan and write, and the type of advice I wish I had taken to heart years ago.

You said something about lessons?

The popularity of those posts about DTrace for Mac OS X and Linux had suggested to me that controversy is more interesting than data. While that may be true, I think the real driver was news. With most tech publications regurgitating press releases, people appreciate real investigation and real analysis. (Though Google Analytics does show that popularity is inversely proportional to time on site i.e. thorough reading.)

If you want people to read (and understand) your posts, run a draft through one of those online grade-level calculators. Don’t be proud of writing at a 12th grade level; rewrite until 6th graders can understand. For complex subjects that may be difficult, but edit for clarity. Simpler is better.

Everyone needs an editor. I find accepting feedback to be incredibly difficult — painful — but it yields a better result. Find someone you trust to provide the right kind of feedback.

Early on blogging seemed hokey. Today it still can feel hokey — dispatches that feel directed at no one in particular. But I’d encourage just about any engineer to start a blog. It forces you to organize your ideas in a different and useful way, and it connects you with the broader community of users, developers, employees, and customers. For the past ten years I’ve walked into many customers who now start the conversation aware of topics and technology I care about.

Finally, reading those old blog posts was painful. I got (slightly) better the only way I knew how: repetition. Get the first 100 posts out of the way so that you can move on to the next 100. Don’t worry about readership. Don’t worry about popularity. Interesting content will find an audience, but think about your reader. Just start writing.

Posted on June 17, 2014 at 9:24 am by ahl · Permalink · Comments Closed
In: Other

Enterprise support and the term abroad

Delphix customers include top companies across a wide range of industries, most of them executing around the clock. Should a problem arise they require support from Delphix around the clock as well. To serve our customers’ needs we’ve drawn from industry best-practices while recently mixing in an unconventional approach to providing the best possible customer service regardless of when a customer encounters a problem.

There are three common approaches to support: outsourcing, shifts, and “follow the sun”. Outsourcing is economical but quality and consistency suffer especially for difficult cases. Asking outstanding engineers to cover undesirable shifts is unappealing. An on-call rotation (shifts “lite”) may be more tolerable but can be inadequate — and stressful — in a crisis. Hiring a geographically dispersed team — whose natural work day “follows the sun” — provides a more durable solution but has its own challenges. Interviewing is tough. Training is tougher. And maintaining education and consistency across the globe is nearly impossible.

Live communication simplifies training. New support engineers learn faster with live — ideally local — mentors, experts on a wide range of relevant technologies. The team is more able to stay current on the product and tools by working collaboratively. In a traditional “follow the sun” model, the first support engineer in a new locale is doubly disadvantaged — the bulk of the team is unavailable during the work day, and there’s no local experienced team for collaboration.

At Delphix, we don’t outsource our support engineering. We do hire around the globe, and we do have an on-call schedule. We’ve also drawn inspiration from an innovative approach employed by Moneypenny, a UK-based call center. Moneypenny had resisted extending their service to off-hours because they didn’t want to incur the detrimental effects of shift work on employees’ health and attitude. They didn’t want to outsource work because they were afraid customer satisfaction would suffer. Instead they took the novel step of opening an Auckland office, 12 hours offset, and sending employees for 4-6 months on a voluntary basis.

I was idly listening to NPR in the car when I heard the BBC report on Moneypenny. Their customers and employees raved about the approach. It was such a simple and elegant solution to the problem of around the clock support; I pulled over to consider the implications for Delphix Support. The cost of sending a support engineer to a remote destination would be paltry compared with the negative consequences associated with other approaches to support: weak hires, inconsistent methodologies, insufficient mentorship, not to mention underserved, angry, or lost customers. And the benefits to customers and the rest of the team would again far exceed the expense.

We call it the Delphix Support “term abroad.” As with a term abroad in school, it’s an opportunity for one of our experienced support engineers to work in a foreign locale. Delphix provides lodging in a sufficiently remote timezone with the expectation of a fairly normal work schedule. As with Moneypenny, that means that Delphix is able to provide the same high level of technical support at all times of day. In addition, that temporarily remote engineer can help to build a local team by recruiting, interviewing, and mentoring.

David — the longest tenured member of the Delphix support team — recently returned from a term abroad to the UK where he joined Scott, a recent hire and UK native. Scott spent a month working with David and others at our Menlo Park headquarters. Then David joined Scott in the UK to continue his mentorship and training. Both worked cases that would have normally paged the on-call engineer. A day after arriving in the UK, in fact, David and Scott handled two cases that would have otherwise woken up an engineer based in the US.

Early results give us confidence that the term abroad is going to be a powerful and complementary tool. Delphix provides the same high quality support at all hours, while expanding globally and increasing the satisfaction of the team. And it makes Delphix Support an even more attractive place to work for those who want to opt in to a little global adventure.

Posted on June 13, 2014 at 12:03 am by ahl · Permalink · One Comment
In: Delphix