Adam Leventhal's blog


This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.

Encryption

Encryption is clearly a core feature of APFS. This comes from the diverse requirements of Apple’s various devices: multiple keys within file systems on the iPhone, for example, or per-user keys on laptops. I heard the term “innovative” quite a bit at WWDC, but here the term is aptly applied to APFS. It supports several different encryption choices for a file system:

  • Unencrypted
  • Single-key for metadata and user data
  • Multi-key with different choices for metadata, files, and even sections of a file (“extents”)

Multi-key encryption is particularly relevant for portables where all data might be encrypted, but unlocking your phone provides access to an additional key and therefore additional data. Unfortunately this doesn’t seem to be working in the first beta of macOS Sierra (specifying fileEncryption when creating a new volume with diskutil results in a file system that reports “Is Encrypted” as “No”).
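
To make the multi-key idea concrete, here’s a toy model in Rust (entirely my own sketch, with hypothetical names, not APFS’s actual data structures) of per-extent key selection:

    // Toy model: each extent records which key id protects it, so unlocking
    // one key can expose some extents of a file while others remain sealed.
    struct Extent {
        offset: u64, // byte offset within the file
        len: u64,    // extent length in bytes
        key_id: u32, // id of the key protecting this extent's ciphertext
    }

    struct File {
        extents: Vec<Extent>,
    }

    // A file is fully readable only once every extent's key is unlocked.
    fn readable(f: &File, unlocked: &[u32]) -> bool {
        f.extents.iter().all(|e| unlocked.contains(&e.key_id))
    }

    fn main() {
        let f = File {
            extents: vec![
                Extent { offset: 0, len: 4096, key_id: 1 },
                Extent { offset: 4096, len: 4096, key_id: 2 },
            ],
        };
        assert!(!readable(&f, &[1]));   // device key alone isn't enough...
        assert!(readable(&f, &[1, 2])); // ...until the unlock key is added
    }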

Related to encryption, I noticed an undocumented feature while playing around with diskutil (which prompts you for interactive confirmation of the destructive power of APFS unless this is added to the command line: -IHaveBeenWarnedThatAPFSIsPreReleaseAndThatIMayLoseData; I’m not making this up). APFS (apparently) supports constant-time cryptographic file system erase, called “effaceable” in the diskutil output. This presumably generates a secret key that cannot be extracted from APFS and encrypts the file system with it. A secure erase then need only delete the key rather than scrambling and re-scrambling the full disk to ensure total eradication. Various iOS docs refer to this capability requiring some specialized hardware; it will be interesting to see what the option means on macOS. Either way, let’s not mention this to the FBI or NSA, agreed?
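
The mechanics deserve a sketch. This Rust fragment is purely illustrative (my guess at the shape of the idea, not anything Apple has described): if every block on disk is ciphertext under a volume key, then “erase” reduces to destroying that key.

    // Illustrative only: a volume whose on-disk contents are all encrypted
    // under a single key held in effaceable storage.
    struct Volume {
        key: Option<[u8; 32]>, // the volume encryption key
    }

    impl Volume {
        // Erase in (nearly) constant time: zero and drop the key. The
        // ciphertext still on disk is now unrecoverable; no full-disk
        // scramble required.
        fn secure_erase(&mut self) {
            if let Some(k) = self.key.as_mut() {
                for b in k.iter_mut() {
                    *b = 0;
                }
            }
            self.key = None;
        }
    }

    fn main() {
        let mut v = Volume { key: Some([0x42; 32]) };
        v.secure_erase();
        assert!(v.key.is_none());
    }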

Snapshots and Backup

APFS brings a much-desired file system feature: snapshots. A snapshot lets you freeze the state of a file system at a particular moment and continue to use and modify that file system while preserving the old data. It does so in a space-efficient fashion where, effectively, changes are tracked and only new data takes up additional space. This has the potential to be extremely valuable for backup by efficiently tracking the data that has changed since the last backup.
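
To see why snapshots can be so cheap, consider this toy copy-on-write model (mine, not APFS’s on-disk format): a snapshot clones only the block map, blocks stay shared until overwritten, and diffing the maps yields exactly the changed blocks a backup would need to send.

    use std::collections::HashMap;
    use std::rc::Rc;

    // Toy copy-on-write "file system": block number -> shared block contents.
    #[derive(Clone)]
    struct Fs {
        blocks: HashMap<u64, Rc<Vec<u8>>>,
    }

    impl Fs {
        // A snapshot is just a clone of the map; all block data stays shared.
        fn snapshot(&self) -> Fs {
            self.clone()
        }

        // A write allocates a new block; any snapshot keeps the old one alive.
        fn write(&mut self, blkno: u64, data: Vec<u8>) {
            self.blocks.insert(blkno, Rc::new(data));
        }

        // Blocks no longer shared with the snapshot are exactly what an
        // incremental backup would need to send.
        fn changed_since(&self, snap: &Fs) -> Vec<u64> {
            let mut changed = Vec::new();
            for (n, b) in &self.blocks {
                let shared = snap.blocks.get(n).map_or(false, |o| Rc::ptr_eq(o, b));
                if !shared {
                    changed.push(*n);
                }
            }
            changed
        }
    }

    fn main() {
        let mut fs = Fs { blocks: HashMap::new() };
        fs.write(0, vec![1, 2, 3]);
        fs.write(1, vec![4, 5, 6]);
        let snap = fs.snapshot();
        fs.write(1, vec![7, 8, 9]); // only this block diverges
        assert_eq!(fs.changed_since(&snap), vec![1]);
    }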

ZFS includes snapshots and serialization mechanisms that make it efficient to back up file systems or transfer file systems to a remote location. Will APFS work like that? Probably not, answered Dominic Giampaolo, APFS lead developer. ZFS sends all changed data, while Time Machine can have exclusion lists and the like. That seems surmountable, but we’ll see what Apple does. APFS right now is incompatible with Time Machine due to its lack of directory hard links, a fairly disgusting mechanism that Time Machine depends on and that likely contributes to its questionable reliability. Hopefully APFS will bring some efficient serialization for Time Machine backup.

While Eric Tamura, APFS dev manager, demonstrated snapshots at WWDC, the required utilities aren’t included in the macOS Sierra beta. I used DTrace (technology I’m increasingly amazed that Apple ported from OpenSolaris) to find a tantalizingly-named new system call fs_snapshot; I’ll leave it to others to reverse engineer its proper use.

Management

APFS brings another new feature known as space sharing. A single APFS “container” that spans a device can have multiple “volumes” (file systems) within it. Apple contrasts this with the static allocation of disk space to support multiple HFS+ instances, which seems both specious and an uncommon use case. Both ZFS and btrfs have a similar concept of a shared pool of storage with nested file systems for administration and management.

Speaking with Dominic and other members of the APFS team, we discussed how volumes are the unit by which users can control things like snapshots and encryption. You’d want multiple volumes to correspond with different policies around those settings. For example, while you might want to snapshot and back up your system each day, the massive /private/var/vm/sleepimage (for saving memory when hibernating) should live on its own and not be backed up.

Space sharing is more of an operational detail than a game-changing feature. You can think of it as special folders with snapshot and encryption controls… which is probably why Apple’s marketing department has yet to make me a job offer. Unfortunately this feature isn’t working in the macOS Sierra beta, so I wasn’t able to have more than one volume per container. Adding new volumes can fail with an opaque error (-69625 mean anything to you?), but using a larger disk image resolved the problem.

 

Next in this series: Space Efficiency and Clones

Apple announced a new file system that will make its way into all of its OS variants (macOS, tvOS, iOS, watchOS) in the coming years. Media coverage to this point has been mostly breathless elongations of Apple’s developer documentation. With a dearth of detail I decided to attend the presentation and Q&A with the APFS team at WWDC. Dominic Giampaolo and Eric Tamura, two members of the APFS team, gave an overview to a packed room; along with other members of the team, they patiently answered questions later in the day. With those data points and some first-hand usage, I wanted to provide an overview and analysis both as a user of Apple-ecosystem products and as a long-time operating system and file system developer.

I’ve divided my review into several sections that span a few posts. I’d encourage you to jump around to topics of interest or skip right to the conclusion (or to the tweet summary). Highest praise goes to encryption; ire to data integrity.

Basics

APFS, the Apple File System, was started in 2014 with Dominic as its lead engineer. It’s a stand-alone, from-scratch implementation (an earlier version of this post noted a dependency on Core Storage, but Dominic set me straight). I asked him about looking for inspiration in other modern file systems such as BSD’s HAMMER, Linux’s btrfs, or OpenZFS (Solaris, illumos, FreeBSD, Mac OS X, Ubuntu Linux, etc.), all of which have features similar to what APFS intends to deliver. (And note that Apple built a fairly complete port of ZFS, though Dominic was apparently not part of the group advocating for it.) Dominic explained that, as a self-described file system guy (he built the file system in BeOS, unfairly relegated to obscurity when Apple opted to purchase NeXTSTEP instead), he was aware of them, but didn’t delve too deeply for fear, he said, of tainting himself.

Dominic praised the APFS testing team as being exemplary. This is absolutely critical. A common adage is that it takes a decade to mature a file system. And my experience with ZFS more or less confirms this. Apple will be delivering APFS broadly with only 3-4 years of development, so it will need to accelerate quickly to maturity.

Paying Down Debt

HFS was introduced in 1985 when the Mac 512K (of memory! Holy smokes!) was Apple’s flagship. HFS+, a significant iteration, shipped in 1998 on the G3 PowerMacs with 4GB hard drives. Since then storage capacities have increased by factors of 1,000,000 and 1,000 respectively. HFS+ has been pulled in a bunch of competing directions with different forks for different devices (e.g. the iOS team created their own HFS variant, working so covertly that not even the Mac OS team knew) and different features (e.g. journaling, case insensitivity). It’s old; it’s a mess; and, critically, it’s missing a bunch of features that are really considered the basic cost of doing business for most operating systems. Wikipedia lists nanosecond timestamps, checksums, snapshots, and sparse file support among those missing features. Add to that the obvious gap of large-device support and you’ve got a big chunk of the APFS feature list.

APFS first and foremost pays down the unsustainable technical debt that Apple has been carrying in HFS+. (In 2001 ZFS grew from a similar need, where UFS had been evolving since 1977.) It unifies the multifarious forks. It introduces the expected features. In general it first brings the derelict building up to code.

Compression is an obvious gap in the APFS feature list; it’s a common feature in many file systems. It’s conceptually quite easy, I told the development team (we had it in ZFS from the outset), so why not include it? To appeal to Dominic’s BeOS nostalgia I even recalled my job interview with Be in 2000 when they talked about how compression actually improved overall performance since data I/O is far more expensive than computation (obvious now, but novel then). The Apple folks agreed, and—in typical Apple fashion—neither confirmed nor denied while strongly implying that it’s definitely a feature we can expect in APFS. I’ll be surprised if compression isn’t included in its public launch.

 

Next in this series: Encryption, Snapshots, and Backup

Like many programmers I like to try out new languages. After lunch with Alex Crichton, one of the Rust contributors, I started writing my favorite program in Rust. Rust is a “safe” systems language that introduces concepts of data ownership and mutability to semantically prevent whole categories of problems. It’s primarily developed at Mozilla Research in service of a next generation rendering engine, and while I presume that the name is a poke in the eye of Google’s Chrome, no one was brave enough to confirm that lest their next Uber ride reroute them to Bagram.

My standard “hello world” is an anagrammer / Scrabble cheater. Why? In most languages you can get it done in a few dozen lines of code, and it uses a variety of important language and library features: lists, maps, file IO, console IO, strings, sorting, etc. Rust is great, interesting in the way that I found object-oriented or functional programming interesting when I first learned about them. Its notions of data ownership, borrowing, and mutability lead, I think, to some of the same aha moments as closures, for example. I found Rust quirky enough, though, that I thought I might be able to save others the pain of their first program, advancing them to the glorious, safe efficiency of their second by relating my experience.

So with the help of Stack Overflow I wrote the first chunk:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 14
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 17
 18         while b.read_line(&mut s).is_ok() {
 19                 println!("{}", s);
 20         }
 21 }

So far so good? Well I ran it and it didn’t seem to be terminating…

$ time ./scrabble >/dev/null
<time passes>

What’s happening?

$ ./scrabble | head
aa

aa
aah

aa
aah
aahed

aa
thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/stdio.rs:404

Okay — first lesson: String::clear(). As the documentation clearly states, BufReader::read_line() appends to an existing string; my own expectations and preconceptions are beside the point.

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6
  7 fn main() {
  8         let path = Path::new("../word.lst");
  9         let file = match File::open(&path) {
 10                 Err(why) => panic!("failed to open {}: {}", path.display(),
 11                     Error::description(&why)),
 12                 Ok(f) => f,
 13         };
 14
 15         let mut b = BufReader::new(file);
 16         let mut s = String::new();
 17
 18         while b.read_line(&mut s).is_ok() {
 19                s.pop();
 20                 println!("{}", s);
 21                 s.clear();
 22         }
 23 }

Better? Yes:

$ ./scrabble | head
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark
thread '<main>' panicked at 'failed printing to stdout: Broken pipe (os error 32)', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libstd/io/stdio.rs:404

Correct? No:

$ time ./scrabble >/dev/null
<time passes>

It turns out that BufReader::read_line() is indeed is_ok() even at EOF. Again, documented but—to me—counter-intuitive. And it turns out that this is a somewhat divisive topic. No matter; how about something else? Well it works, but the ever-persnickety rustc finds ‘while true’ too blue-collar a construct:

$ rustc scrabble.rs
scrabble.rs:18:2: 25:3 warning: denote infinite loops with loop { ... }, #[warn(while_true)] on by default
scrabble.rs:18     while true {
scrabble.rs:19         if !b.read_line(&mut s).is_ok() || s.len() == 0 {
scrabble.rs:20             break;
scrabble.rs:21         }
scrabble.rs:22         s.pop();
scrabble.rs:23         println!("{}", s);
                ...
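
For what it’s worth, read_line() returns Ok(0) once there’s nothing left to read, so a loop keyed off the byte count rather than is_ok() sidesteps both problems; a sketch, reusing the b and s from above:

    loop {
        s.clear();
        match b.read_line(&mut s) {
            Ok(0) => break,           // Ok(0) means EOF
            Ok(_) => print!("{}", s), // the line keeps its trailing newline
            Err(e) => panic!("read failed: {}", e),
        }
    }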

Trying to embrace the fastidious methodology (while ever tempted to unsafe-and-let-execution-be-the-judge) I gave up on read_line() and its controversial EOF and error semantics to try out BufReader::lines():

 18         for s in b.lines() {
 19                 println!("{}", s);
 20         }
$ rustc scrabble2.rs
scrabble2.rs:18:18: 18:19 error: the trait `core::fmt::Display` is not implemented for the type `core::result::Result<collections::string::String, std::io::error::Error>` [E0277]
scrabble2.rs:18         println!("{}", s);
                                       ^
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
scrabble2.rs:18:3: 18:21 note: expansion site
scrabble2.rs:18:18: 18:19 note: `core::result::Result<collections::string::String, std::io::error::Error>` cannot be formatted with the default formatter; try using `:?` instead if you are using a format string
scrabble2.rs:18         println!("{}", s);
                                       ^
note: in expansion of format_args!
<std macros>:2:25: 2:58 note: expansion site
<std macros>:1:1: 2:62 note: in expansion of print!
<std macros>:3:1: 3:54 note: expansion site
<std macros>:1:1: 3:58 note: in expansion of println!
scrabble2.rs:18:3: 18:21 note: expansion site
error: aborting due to previous error

Okay; that was apparently very wrong. The BufReader::lines() iterator gives us Result<String>s which we need to unwrap(). No problem.

 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }
scrabble.rs:15:6: 15:11 warning: variable does not need to be mutable, #[warn(unused_mut)] on by default
scrabble.rs:15     let mut b = BufReader::new(file);

Fine, rustc, you’re the boss. Now it’s simpler and it’s cranking:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6 use std::collections::HashMap;
  7
  8 fn main() {
  9         let path = Path::new("../word.lst");
 10         let file = match File::open(&path) {
 11                 Err(why) => panic!("failed to open {}: {}", path.display(),
 12                     Error::description(&why)),
 13                 Ok(f) => f,
 14         };
 15
 16         let b = BufReader::new(file);
 17
 18         for line in b.lines() {
 19                 let s = line.unwrap();
 20                 println!("{}", s);
 21         }
 22 }

Now let’s build up our map. We’ll create a map from the sorted characters to the list of anagrams. For that we’ll use matching, another handy construct.

 23                 let mut v: Vec<char> = s.chars().collect();
 24                 v.sort();
 25                 let ss: String = v.into_iter().collect();
 26
 27                 match dict.get(&ss) {
 28                         Some(mut v) => v.push(s),
 29                         _ => {
 30                                 let mut v = Vec::new();
 31                                 v.push(s);
 32                                 dict.insert(ss, v);
 33                         },
 34                 }

What could be simpler? I love this language! But not so fast…

scrabble.rs:28:19: 28:20 error: cannot borrow immutable borrowed content `*v` as mutable
scrabble.rs:28             Some(mut v) => v.push(s),
                                           ^
scrabble.rs:32:5: 32:9 error: cannot borrow `dict` as mutable because it is also borrowed as immutable
scrabble.rs:32                 dict.insert(ss, v);
                                ^~~~
scrabble.rs:27:9: 27:13 note: previous borrow of `dict` occurs here; the immutable borrow prevents subsequent moves or mutable borrows of `dict` until the borrow ends
scrabble.rs:27         match dict.get(&ss) {
                              ^~~~
scrabble.rs:34:4: 34:4 note: previous borrow ends here
scrabble.rs:27         match dict.get(&ss) {
...
scrabble.rs:34         }
                        ^
error: aborting due to 2 previous errors

This is where in C I’d start casting away const. Not an option here. Okay, but I remember these notions of ownership, borrowing, and mutability as concepts from early in the Rust overview. At the time it seemed like one of those explanations of git that sounds more like a functional analysis of cryptocurrency. But perhaps there were some important nuggets in there…

Mutability, check! The HashMap::get() yielded an immutable borrow that would exist for as long as its return value was in scope. Easily solved by changing it to get_mut():

scrabble.rs:32:5: 32:9 error: cannot borrow `dict` as mutable more than once at a time
scrabble.rs:32                 dict.insert(ss, v);
                               ^~~~
scrabble.rs:27:9: 27:13 note: previous borrow of `dict` occurs here; the mutable borrow prevents subsequent moves, borrows, or modification of `dict` until the borrow ends
scrabble.rs:27         match dict.get_mut(&ss) {
                             ^~~~
scrabble.rs:34:4: 34:4 note: previous borrow ends here
scrabble.rs:27         match dict.get_mut(&ss) {
...
scrabble.rs:34         }
                       ^
error: aborting due to previous error

Wrong again. Moving me right down the Kübler-Ross model from anger into bargaining. You’re saying that I can’t mutate it because I can already mutate it? What do I have, rustc, that you want? How about if I pull the insert() out of the context of that get_mut()?

 27                 let mut bb = false;
 28
 29                 match dict.get_mut(&ss) {
 30                         Some(mut v) => v.push(s),
 31                         _ => {
 32                                 bb = true;
 33                         },
 34                 }
 35                 if bb {
 36                         let mut v = Vec::new();
 37                         v.push(s);
 38                         dict.insert(ss, v);
 39                 }

Inelegant, yes, but Rust was billed as safe-C, not elegant-C, right?

scrabble.rs:37:11: 37:12 error: use of moved value: `s`
scrabble.rs:37             v.push(s);
                                  ^
scrabble.rs:30:26: 30:27 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:30             Some(mut v) => v.push(s),
                                                 ^
error: aborting due to previous error

So by pushing the anagram into the list at line 30 we lost ownership, and even though that definitely didn’t happen in the case of us reaching line 37, rustc isn’t having it. Indeed there doesn’t seem to be a way to both get an existing value and insert a value in one lexical vicinity. At this point I felt like I was in some bureaucratic infinite loop, doomed to shuttle to and fro between windows at the DMV, always holding the wrong form. Any crazy person will immediately be given a mutable map, but asking for a mutable map immediately classifies you as sane.

After walking away for a day to contemplate, here’s the compromise I came to:

 27                 if dict.contains_key(&ss) {
 28                         dict.get_mut(&ss).unwrap().push(s);
 29                 } else {
 30                         let mut v = Vec::new();
 31                         v.push(s);
 32                         dict.insert(ss, v);
 33                 }

And everyone was happy! But it turns out that there’s an even Rustier way of doing this (thanks to Delphix intern John Ericson) with a very specific API:

                let mut v = dict.entry(sort_str(&s)).or_insert(Vec::new());
                v.push(s);

This is starting to look a lot less like safe C and a lot more like the stacking magic of C++. No matter; I’m just trying to cheat at Scrabble, not debate philosophy. Now that I’ve got my map built, let’s prompt the user and do the lookup. We’ll put the string-sorting logic into a function:

  8 fn sort_str(s: String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }
scrabble.rs:32:36: 32:37 error: use of moved value: `s`
scrabble.rs:32             dict.get_mut(&ss).unwrap().push(s);
                                                           ^
scrabble.rs:29:21: 29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:29         let ss = sort_str(s);
                                         ^
scrabble.rs:35:11: 35:12 error: use of moved value: `s`
scrabble.rs:35             v.push(s);
                                  ^
scrabble.rs:29:21: 29:22 note: `s` moved here because it has type `collections::string::String`, which is non-copyable
scrabble.rs:29         let ss = sort_str(s);
                                         ^
error: aborting due to 2 previous errors

This was wrong because we need to pass s as a reference; otherwise it’s moved into the function and destroyed there. The change needs to happen both in the function signature and at the call site.

  8 fn sort_str(s: &String) -> String {
  9         let mut v: Vec<char> = s.chars().collect();
 10         v.sort();
 11         let ss: String = v.into_iter().collect();
 12         ss
 13 }

As an aside, I’d note how goofy I think it is that the absence of a semicolon denotes function return. And that using an explicit return is sneered at as “un-idiomatic”. I’ve been told that this choice enables deeply elegant constructs with closures and that I’m simply behind the times. Fair enough. Now we’ll read the user input:

 41         for line in stdin().lock().lines() {
 42                 let s = line.unwrap();
 43
 44                 match dict.get(&sort_str(&s)) {
 45                         Some(v) => {
 46                                 print!("anagrams for {}: ", s);
 47                                 for a in v {
 48                                         print!("{} ", a);
 49                                 }
 50                                 println!("");
 51                         },
 52                         _ => println!("no dice"),
 53                 }
 54         }
scrabble.rs:43:14: 43:21 error: borrowed value does not live long enough
scrabble.rs:43     for line in stdin().lock().lines() {
                               ^~~~~~~
scrabble.rs:43:2: 57:2 note: reference must be valid for the destruction scope surrounding statement at 43:1...
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
scrabble.rs:43:2: 57:2 note: ...but borrowed value is only valid for the statement at 43:1
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
scrabble.rs:43:2: 57:2 help: consider using a `let` binding to increase its lifetime
scrabble.rs:43     for line in stdin().lock().lines() {
scrabble.rs:44         let s = line.unwrap();
scrabble.rs:45
scrabble.rs:46         match dict.get(&sort_str(&s)) {
scrabble.rs:47             Some(v) => {
scrabble.rs:48                 print!("anagrams for {}: ", s);
               ...
error: aborting due to previous error

Okay! Too cute! Got it. Here’s the final program with some clean up here and there:

  1 use std::fs::File;
  2 use std::path::Path;
  3 use std::error::Error;
  4 use std::io::BufReader;
  5 use std::io::BufRead;
  6 use std::collections::HashMap;
  7 use std::io::stdin;
  8
  9 fn sort_str(s: &String) -> String {
 10         let mut v: Vec<char> = s.chars().collect();
 11         v.sort();
 12         v.into_iter().collect()
 13 }
 14
 15 fn main() {
 16         let path = Path::new("../word.lst");
 17         let file = match File::open(&path) {
 18                 Err(why) => panic!("failed to open {}: {}", path.display(),
 19                     Error::description(&why)),
 20                 Ok(f) => f,
 21         };
 22
 23         let b = BufReader::new(file);
 24
 25         let mut dict: HashMap<String, Vec<String>> = HashMap::new();
 26
 27         for line in b.lines() {
 28                 let s = line.unwrap();
 29                 dict.entry(sort_str(&s)).or_insert(Vec::new()).push(s);
 30         }
 31
 32         let sin = stdin();
 33
 34         for line in sin.lock().lines() {
 35                 let s = line.unwrap();
 36
 37                 match dict.get(&sort_str(&s)) {
 38                         Some(v) => {
 39                                 print!("anagrams for {}: ", s);
 40                                 for a in v {
 41                                         print!("{} ", a);
 42                                 }
 43                                 println!("");
 44                         },
 45                         _ => println!("no dice"),
 46                 }
 47         }
 48 }

Lessons

Rust is not Python. I knew that Rust wasn’t Python… or Java, or Perl, etc. But it still took me a while to remember and embrace that. You have to think about memory management even when you get to do less of it explicitly. For programs with messy notions of data ownership I can see Rust making for code that’s significantly cleaner, easier to understand, and more approachable for new engineers. The concepts of ownership, borrowing, and mutability aren’t “like” anything. It took the mistakes of that first program to teach me that. Hopefully you can skip straight to your second Rust program.

Postscript

Before I posted this I received some suggestions from my colleagues at Delphix about how to improve the final code. I resolved to focus on the process—the journey if you will—rather than the result. That said, I now realize that I was myself a victim of learning from some poor examples (from Stack Overflow in particular). There’s nothing more durable than poor but serviceable examples; we’ve all seen inefficient copy/pasta littered throughout a code base. So with help again from John Ericson and the Twitterverse at large, here’s my final version as a GitHub gist (if I were going to do it over again I’d stick each revision in GitHub for easier navigation). Happy copying!
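
For readers arriving from a later era of Rust: here’s roughly how the same program looks in more modern Rust (my own sketch, not the gist verbatim), with Error::description() gone and the entry API doing the heavy lifting:

    use std::collections::HashMap;
    use std::fs::File;
    use std::io::{stdin, BufRead, BufReader};

    // Sort a word's characters to produce its anagram-class key.
    fn sort_str(s: &str) -> String {
        let mut v: Vec<char> = s.chars().collect();
        v.sort();
        v.into_iter().collect()
    }

    fn main() {
        let file = File::open("../word.lst").expect("failed to open word list");

        // Map from sorted letters to all words that share them.
        let mut dict: HashMap<String, Vec<String>> = HashMap::new();
        for line in BufReader::new(file).lines() {
            let s = line.expect("failed to read word list");
            dict.entry(sort_str(&s)).or_insert_with(Vec::new).push(s);
        }

        let sin = stdin();
        for line in sin.lock().lines() {
            let s = line.expect("failed to read stdin");
            match dict.get(&sort_str(&s)) {
                Some(v) => println!("anagrams for {}: {}", s, v.join(" ")),
                None => println!("no dice"),
            }
        }
    }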

A prospective new college hire recently related an odd comment from his professor: systems programming is dead. I was nonplussed; what could the professor have meant? Systems is clearly very much alive. Interesting and important projects march under the banner of systems. But as I tried to construct a less emotional rebuttal, I realized I lacked a crisp definition of what systems programming is.

Wikipedia defines systems software in the narrowest terms: the stuff that interacts with hardware. But that covers a tiny fraction of modern systems. So what is systems software? It depends on when you’re asking the question. At one time, the web server was the application; now it’s the systems software on which many web-facing applications are built. At one time a database was the application; now it’s systems software that supports a variety of custom and off-the-shelf applications. Before my time, shells were probably considered a bleeding-edge application; now they’re systems software on which some of the lowest-level plumbing of modern operating systems is built.

Any layer on which people build applications of increasing complexity is systems software. Most software that endures the transition to systems software does so whether its authors intended it or not. People in the software industry often talk about standing on the shoulders of giants; the systems software accumulated and refined over decades are those giants.

Stable interfaces define systems software. The programs that consume those interfaces expect the underlying systems software to be perfect every time. Initially innovation might happen in the interfaces themselves — the concurrent model of Node.js is a great example. As software matures, the interfaces become commodified; innovation happens behind those stable interfaces. Systems is only “dead” at its edges. Interfaces might be flexible and well-designed, or sclerotic and poorly designed. Regardless, new or improved systems software can increase performance, enhance observability, or simply fit a different economic niche.

There are a few different types of systems software. First there’s supporting systems software, systems software written as necessary foundation for some new application. This is systems software written with a purpose and designed to solve an unsolved — or poorly solved — problem. Chronologically, examples include UNIX, compilers, and libraries like jQuery. You write it because you need it, but it’s solving a problem that’s likely not unique to your particular application.

Then there’s accidental systems software. Stick everything from Apache to Excel to the Bourne shell in that category. These didn’t necessarily set out to be the foundation on which increasingly complex software would be written, but they definitely are. I’m sure there were times when indoctrination into systems-hood was painful, where the authors wanted to change interfaces, but good systems software respects its consumers and carries them forward. Somewhat famously, make preserved its arcane syntax because two consumers already existed. JavaScript started as a glorified web design tool; now it sits several layers beneath complex client-side applications. Even more recently, developers of Node.js (itself JavaScript-based) changed a commonly used interface in a way that broke many applications. Historical mistakes can be annoying to live with, but — as the Node.js team determined — compatibility trumps cleanliness.

The largest bucket is replacement systems software. Linux, Java, ZFS, and DTrace fall into this category. At the time of their development, each was a notionally compatible replacement for something that already existed. Linux, of course, reimplemented the UNIX design to provide a free, compatible alternative. Java set about building a better runtime (the stable interface being a binary provided to customers to execute) designed to abstract away the operating system entirely. ZFS represented a completely new way of thinking about filesystems, but it did so within the tight constraints of POSIX interfaces and storage hardware. DTrace added new and unique observability to most of the stable interfaces that applications build on.

Finally, there’s intentional systems software. This is systems software by design, but unlike supporting systems software, there’s no consumer. Intentional systems software takes an “if you build it, they will come” approach. This is highly prone to failure — without an existence proof that your software solves a problem and exposes the right interfaces, it’s very difficult to know if you’re building the right thing.

Why define these categories? Knowing which you’re working with can inform your decisions. If you’ve written accidental systems software that has had systems-ness thrust upon it, realize that future versions need to respect the consumers — or willfully cast them aside. When writing replacement systems software, recognize the constraints on the system, and know exactly where you’re constrained and where you can innovate (or reconsider whether you should just use the existing solution). If you’ve written supporting systems software, know that others will inevitably need solutions to the same problems. Either invest in maintaining it and keeping it best of breed; resign yourself to the fact that it will need to be replaced as others invest in a different solution; or open source it and hope (or advocate) that it becomes that ubiquitous solution.

TL;DR?

What’s systems software? It is the increasingly complex, increasingly capable, increasingly diverse foundation on which applications are built. It’s that long and growing tail of the corpus of software at large. The interfaces might be static, but it’s a rich venue for innovation. As more and more critical applications build on an interface, the more value there is in improving the systems software beneath it. Systems software is defined by the constraints; it’s a mission and a mindset with unique challenges, and unique potential.

The idea of the holistic engineer embodies the point of view that an engineer needs to consider the whole system, the whole body of work that makes a product successful. It bears no relation to holistic health — and it’s not some even newer age quackery. There are many specialist roles in the software industry — marketing, product management, project management, documentation, education, support, etc. — but the best software engineers are generalists who can assume a portion of each specialty. Further, some software is particularly well-suited for generalists who can combine a deep understanding of the market, the technology, and the implementation.

Software products are born of many different types of organizations, and even within similar organizations roles might have different names. Here’s a generic example with some names on the roles. New products and features start with product managers. Their role is to talk to customers and sales, educate themselves on the market, and determine the right product or enhancement. The handoff to engineering takes the form of a product requirements document (PRD) — it might sound like jargon, but the term is more or less universal. Software engineers execute against that PRD; QA engineers design tests that assert conformance to the PRD while developers steer the product from point A to point B as described by product management. Documentation writers and learning services take the PRD and the software to generate collateral that teaches customers how to use it. Product marketing makes the PowerPoints; sales presents them to customers.

And that’s where babies come from.

It’s not a perfect process, but it’s birthed many successful products. The shortcoming is that it can bury engineers under filters. Instead of learning about actual customer problems, engineers hear some processed form of what the customer said. Instead of raw critique of a new feature, engineers hear a softened and truncated form. The more technical the product and the market, the more those filters impede innovation and hamper the trajectory of the product.

The holistic engineer augments the jobs of those specialists, participating in each phase of product development. They join in those early conversations with customers, and share the responsibility of market comprehension. They partner to construct the requirements and design that those engineers will then implement. Along the way, engineers of course validate decisions with sales and customers — this is Agile writ large — but engineers also participate in the outbound documentation, training, and marketing activities.

From start to finish, the process is designed to fuel innovation by arming creative engineers with data and understanding. Customers often tell you what they want; they rarely tell you what they need. The more technical or disruptive the product, the more value an engineer has in those conversations, extracting the essence of the problem from the noise of preconceptions. The relationships with customers and the full context around their problems keep engineers grounded as the inevitable gaps emerge in the product specification. Holistic engineers also help to educate the rest of the company and the rest of the world about new products and features. The process of explaining technology informs the way engineers design and build products. When we’re having a hard time explaining a feature or presenting a product, we need to revise our design. We’ve all heard engineering accused of building a product that was too complicated for the market, or engineers complain that a product failed because it was poorly marketed; both are symptoms of poor coordination. Giving engineers holistic responsibility guards against this problem — if the product is failing, the onus is on them to solve it.

Most important though are the feelings of ownership and agency associated with the whole-body approach. The holistic engineer is explicitly tasked with making a product succeed. That’s not to say that he or she goes it alone — specialists in all functions have major roles — rather the engineer is empowered to move the product through all stages; the other side of that coin is that there’s no opportunity to shrug off a responsibility as belonging to someone else.

In this model, everyone in every role at the company has the opportunity to engage in product management. Indeed, there’s still value in explicit product management. Channels of communication need to be easy and open for people with ideas to connect to people who will distill them into implementation. And it’s not enough to just create the right environment; hiring processes need to identify broad thinkers, and mentorship needs to nurture and reward holistic execution. Not every engineer can — or wants to — take on those additional responsibilities, but the best thrive with market and technology awareness, unencumbered by filters. They want responsibility and authority to make their ideas succeed.

The idea of the holistic engineer isn’t theoretical; it’s the model we stumbled into in the Solaris Kernel Group, and later implemented deliberately at Fishworks. There, a small team took on wide-ranging responsibilities to build a product that’s now doing $400m/year for Oracle. At Delphix we’re again inculcating and hiring for holistic thinking. At all three I’ve seen engineers develop new products and features that address customer needs that would never have emerged from customers’ initial requests. It’s not easy to find the right kind of engineers, but if a company can empower the right engineers in the right ways — and they can live up to the responsibility — the payoff is a better product, built more efficiently.

RAID algorithms have become a particular fascination of mine, and I recently read a very interesting paper that describes an optimization for RAID reconstruction (by Xiang, Xu, Lui, Chang, Pan, and Li). Before writing double- and triple-parity RAID algorithms for ZFS, I spent a fair bit of time researching the subject and have stayed interested since. Before describing the reconstruction optimization, some preamble is required. RAID algorithms can be divided into two buckets: one-dimensional algorithms, and multi-dimensional algorithms (terms of my own choosing; I haven’t seen this distinction discussed in the literature).

One-dimensional RAID

A one-dimensional algorithm is one in which all data in a single RAID stripe is used to compute all parities. The RAID algorithm used by ZFS falls into this category as do most algorithms derived from Reed-Solomon coding. For a given RAID stripe’s set of data blocks, $D$, we can compute the $n$th parity block with some function $p(D, n)$. For example, ZFS, roughly, uses the formula \[ p(D,n) = \sum_{i=1}^{\left|{D}\right|} 2^{(i-1)(n-1)} \cdot D_{i} \] Here, addition and multiplication are defined over a Galois field – the explanation would be far longer than it would be interesting or relevant so I’ll omit it from this post. It is worth noting that this particular approach only works for three parity disks or fewer, but that too is an entirely different subject (albeit an interesting one). Reconstruction of a missing block in a one-dimensional algorithm requires reading the available data and performing some computation; each stripe may be reconstructed separately (and thus, in parallel).
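
To make that concrete, here’s a sketch of the parity computation in Rust, assuming the usual GF(2^8) representation with generator polynomial 0x11d (a common choice; addition in this field is plain XOR):

    // Multiply in GF(2^8) with generator polynomial x^8+x^4+x^3+x^2+1 (0x11d).
    fn gf_mul(mut a: u8, mut b: u8) -> u8 {
        let mut p = 0u8;
        while b != 0 {
            if b & 1 != 0 {
                p ^= a;
            }
            let carry = a & 0x80 != 0;
            a <<= 1;
            if carry {
                a ^= 0x1d; // reduce modulo the generator polynomial
            }
            b >>= 1;
        }
        p
    }

    // p(D, n) for n >= 1, per the formula above; with a zero-based block
    // index i the coefficient is 2^(i * (n-1)). For n = 1 every coefficient
    // is 1 and this degenerates to ordinary XOR parity.
    fn parity(data: &[Vec<u8>], n: u32) -> Vec<u8> {
        let mut out = vec![0u8; data[0].len()];
        for (i, block) in data.iter().enumerate() {
            let mut coef = 1u8;
            for _ in 0..(i as u32 * (n - 1)) {
                coef = gf_mul(coef, 2); // naive exponentiation; fine for a sketch
            }
            for (o, &d) in out.iter_mut().zip(block.iter()) {
                *o ^= gf_mul(coef, d);
            }
        }
        out
    }

    fn main() {
        let stripe = vec![vec![0x11u8, 0x22], vec![0x33, 0x44], vec![0x55, 0x66]];
        println!("P = {:x?}", parity(&stripe, 1)); // row parity: plain XOR
        println!("Q = {:x?}", parity(&stripe, 2)); // weighted by powers of 2
    }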

Multi-dimensional RAID

A multi-dimensional algorithm is one in which parts of multiple logical RAID stripes may contribute to parity calculation. Examples of this include IBM’s EVENODD and NetApp’s slight simplification, Row-Diagonal Parity (RDP). These are most easily conveyed through diagrams:

[Figure: EVENODD parity layout]

[Figure: RDP parity layout]

With both EVENODD and RDP, calculation of the first parity block is a simple XOR of the data blocks in that RAID stripe. The second parity block is calculated by XORing data values across RAID stripes, more or less diagonally. Both of these techniques place constraints on the width of a RAID stripe.
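
In lieu of the diagrams, here’s a rough sketch of the RDP construction (my rendering of the published scheme, simplified to one byte per block): for prime p, a stripe has p-1 rows, p-1 data columns, a row-parity column, and a diagonal-parity column; diagonal d collects the blocks, row parity included, where (row + column) mod p = d, and one diagonal is deliberately never stored.

    // RDP parity for prime p: data is a (p-1) x (p-1) grid (rows x data
    // columns), one byte per block for simplicity.
    fn rdp_parity(data: &[Vec<u8>], p: usize) -> (Vec<u8>, Vec<u8>) {
        let rows = p - 1;
        let mut row_par = vec![0u8; rows];
        let mut diag_par = vec![0u8; rows];

        // Row parity: XOR across each stripe, as in one-dimensional RAID.
        for r in 0..rows {
            for c in 0..rows {
                row_par[r] ^= data[r][c];
            }
        }

        // Diagonal parity: diagonal d is the XOR of the blocks (with the
        // row-parity column acting as column p-1) where (r + c) mod p == d;
        // diagonal p-1 is the "missing" diagonal and is never stored.
        for r in 0..rows {
            for c in 0..p {
                let v = if c < rows { data[r][c] } else { row_par[r] };
                let d = (r + c) % p;
                if d < rows {
                    diag_par[d] ^= v;
                }
            }
        }
        (row_par, diag_par)
    }

    fn main() {
        let p = 5; // prime: 4 rows, 4 data disks, plus 2 parity disks
        let data: Vec<Vec<u8>> = (0..p - 1)
            .map(|r| (0..p - 1).map(|c| (r * 16 + c) as u8).collect())
            .collect();
        let (row, diag) = rdp_parity(&data, p);
        println!("row parity {:x?}, diagonal parity {:x?}", row, diag);
    }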

Optimizing RAID reconstruction for fewer reads

The paper, A Hybrid Approach of Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation, describes an optimization for reconstruction under multi-dimensional RAID algorithms. The key insight is that with parity calculations that effectively overlap, a clever reconstruction algorithm can use fewer blocks, thus incurring fewer disk reads. As described in the paper, normally when a given disk fails, all remaining data blocks and blocks from the first parity disk are used for reconstruction:

[Figure: simple recovery]

It is, however, possible to read fewer total blocks by taking advantage of the fact that certain blocks can be used multiple times. In the reconstruction below, blocks with circles are used for “row” reconstruction, and blocks with squares are used for “diagonal” reconstruction.

[Figure: optimized recovery]

Note that the simple approach requires reading 36 blocks (none from disk 7) whereas the reconstruction described in the paper requires reading only 27 blocks. This applies generally: the new approach requires 25% fewer blocks to perform the same reconstruction. And the paper includes a method of balancing the reads between disks.

Disappointing results

Unfortunately, optimizing for fewer reads didn’t translate to significant performance improvements in the overall recovery. For RDP it was about 12% better in the best case, but typically closer to 7%. For EVENODD the improvement was less than 5% in all cases. Why? The naïve recovery algorithm streams data off the healthy hard drives, performs a simple computation, and streams good data onto the replacement drive. Streaming is what drives do best – 3.5” or 2.5”; 7,200, 10K, or 15K RPM; SATA, SAS, or FC – they all stream pretty well. There may be some contention for I/O resources, but either that contention isn’t severe, or the “skips” in the optimized I/O pattern give back much of the savings by interrupting the normal streaming efficiency.
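
A crude model (my framing, not the paper’s) makes the intuition concrete. If each surviving disk holds $B$ bytes of needed blocks and streams at rate $r$, then \[ t_{\text{naive}} \approx \frac{B}{r}, \qquad t_{\text{optimized}} \approx \frac{0.75\,B}{r} + n_{\text{skip}} \cdot t_{\text{seek}}, \] where $n_{\text{skip}}$ counts the gaps the optimized read pattern introduces. On a hard drive $t_{\text{seek}}$ is measured in milliseconds, so the second term devours much of the 25% savings; on flash it’s effectively zero, which sets up the observation below.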

Applicability in flash

There’s another medium, however, that has throughput and IOPS to spare where this RAID reconstruction approach could be highly effective: flash. With SSDs, it’s possible to see throughput that strains the limits of I/O systems; reading less data could be a significant improvement, and the non-contiguous read patterns wouldn’t degrade performance as they do with HDDs. For all-flash arrays, this sort of optimization may be one of many in its class; with a surplus of IOPS, compute, and memory, the RAID algorithms designed for slow disks, slow CPUs, and a dearth of DRAM, may need to be scrapped and rethought.
