From Prometheus to Sisyphus

When Apple announced their new file system, APFS, in June, I hustled to be in the front row of the WWDC presentation, questions with the presenters, and then the open Q&A session. I took a week to write up my notes which turned into as 12 page behemoth of a blog post — longer than my college thesis. Despite reassurances from the tweeps, I was sure that the blog post was an order of magnitude longer than the modern attention span. I was wrong; so wrong that Ars Technica wanted to republish the blog post. Never underestimate the interest in all things Apple.

In that piece I left one big thread dangling. Apple shipped APFS as a technology preview, but they left out access to one of the biggest new features: snapshots. Digging around I noticed that there was a curiously named new system call, “fs_snapshot”, but explicitly didn’t investigate: the post was already too long (I thought), I had spent enough time on it, and someone else (surely! surely?) would want to pull on that thread.

Slow News Day

Every so often I’d poke around for APFS news, but there was very little new. Last month folks discovered that APFS was coming to iOS sooner rather than later. But there wasn’t anything new to play with or any revelations on how APFS would work.

I would search for “APFS snapshots”, “fs_snapshot”, anything I could think of to see if anyone had figured out how to make snapshots work on APFS. Nothing.

A few weekends ago, I decided to yank on that thread myself.

Prometheus

I started from the system call, wandered through Apple’s open source kernel, leaned heavily on DTrace, and eventually figured it out. Apple had shipped snapshots in APFS, they just hadn’t made it easy to get there. The folks at Ars were excited for a follow-up, and my investigation turned into this: “Testing out snapshots in Apple’s next-generation APFS file system.”

Snapshots were there; the APIs were laid bare; I was going to bring fire to all the Mac fans; John Siracusa and Andy Ihnatko would carry me on their shoulders down the streets of the Internet.

Sisyphus

On the eve that this new piece was about to run I was nervously scrolling through Twitter as I took the bus home from work. Now that I had invested the time to research and write-up APFS snapshots I didn’t want someone else beating me to the punch.

Then I found this and my heart dropped:

 

Skim past the craziness of MFS and the hairball of HFS, and start digging through the APFS section. Slide 49, “APFS Snapshots” and there it is “apfs_snapshot” — not a tool that anyone laboriously reverse engineered, deciphering system calls and semi-published APIs — a tool shipped from Apple and included in macOS by default. F — .

Apple had secreted this utility away (along with some others) in /System/Library/Filesystems/apfs.fs/Contents/Resources/

What To Do?

The article that was initially about a glorious act of discovery had become an article about the reinvention of the wheel. Conversely, vanishingly few people would recognize this as rediscovery since the apfs_snapshot tool was so obscure (9 hits on Google!).

We toned down the already modest chest-thumping and published the article this morning to a pretty nice response so far. I might have happier as an FAKE NEWS Prometheus, blissfully unaware of the pre-existence of fire, but I would have been mortified when the inevitable commenter, one of the few who had used apfs_snapshot, crushed me with my own boulder.

Posted on February 12, 2017 at 5:23 pm by ahl · Permalink · One Comment
In: DTrace

DTrace at Home

I had been procrastinating making the family holiday card. It was a combination of having a lot on my plate and dreading the formulation of our annual note recapping the year; there were some great moments, but I’m glad I don’t have to do 2016 again. It was just before midnight and either I’d make the card that night or leave an empty space on our friends’ refrigerators. Adobe Illustrator had other ideas:

Unable to set maximum number of files to be opened.

I’m not the first person to hit this. The problem seems to have existed since CS6 was released in 2016. None of the solutions was working for me, and — inspired by Sara Mauskopf’s excellent post — I was rapidly running out of the time bounds for the project. Enough; I’d just DTrace it.

A colleague scoffed the other day, “I mean, how often do you actually use DTrace?” In his mind DTrace was for big systems, critical system, when dollars and lives were at stake. My reply: I use DTrace every day. I can’t imagine developing software without DTrace, and I use it when my laptop (not infrequently) does something inexplicable (I’m forever grateful to the Apple team that ported it to Mac OS X).

First I wanted to make sure I had the name of the Illustrator process right:

$ sudo dtrace -n ‘syscall:::entry{ @[execname] = count(); }’
dtrace: description ‘syscall:::entry’ matched 500 probes
^C
pboard 1
watchdogd 2
awdd 3
...
com.apple.WebKit 7065
Google Chrome He 7128
Google Chrome 8099
Adobe Illustrato 36674

Glad I checked: “Adobe Illustrato”. Now we can be pretty sure that Illustrator is failing on setrlimit(2) and blowing up as result. Let’s confirm that it is in fact returning -1:

$ sudo dtrace -n 'syscall::setrlimit:return/execname == "Adobe Illustrato"/{ printf("%d %d", arg1, errno); }'
dtrace: description 'syscall::setrlimit:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    532                 setrlimit:return -1 1

There it is. And setrlimit(2) is failing with errno 1 which is EPERM (value too high for non-root user). I already tuned up the files limit pretty high. Let’s confirm that it is in fact setting the files limit and check the value to which it’s being set. To write this script I looked at the documentation for setrlimit(2) (hooray for man pages!) to determine that the position of the resource parameter (arg0) and the type of the value parameter (struct rlimit). I needed the DTrace copyin() subroutine to grab the structure from the process’s address space:

$ sudo dtrace -n 'syscall::setrlimit:entry/execname == "Adobe Illustrato"/{ this->r = *(struct rlimit *)copyin(arg1, sizeof (struct rlimit)); printf("%x %x %x", arg0, this->r.rlim_cur, this->r.rlim_max);  }'
dtrace: description 'syscall::setrlimit:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    531                 setrlimit:entry 1008 2800 7fffffffffffffff

Looking through /usr/include/sys/resource.h we can see that 1008 corresponds to the number of files (RLIMIT_NOFILE | _RLIMIT_POSIX_FLAG). Illustrator is trying to set that value to 0x7fffffffffffffff or 2⁶³-1. Apparently too big; I filed any latent curiosity for another day.

The quickest solution was to use DTrace again to whack a smaller number into that struct rlimit. Easy:

$ sudo dtrace -w -n 'syscall::setrlimit:entry/execname == "Adobe Illustrato"/{ this->i = (rlim_t *)alloca(sizeof (rlim_t)); *this->i = 10000; copyout(this->i, arg1 + sizeof (rlim_t), sizeof (rlim_t)); }'
dtrace: description 'syscall::setrlimit:entry' matched 1 probe
dtrace: could not enable tracing: Permission denied

Oh right. Thank you SIP. This isa new laptop (at least a new motherboard due to some bizarre issue) which probably contributed to Illustrator not working when once it did. Because it’s new I haven’t yet disabled the part of SIP that prevents you from using DTrace on the kernel or in destructive mode (e.g. copyout()). It’s easy enough to disable, but I’m reboot-phobic — I hate having to restart my terminals — so I went to plan B: lldb.

First I used DTrace to find the code that was calling setrlimit(2): using some knowledge of the x86 ISA/ABI:

$ sudo dtrace -n 'syscall::setrlimit:return/execname == "Adobe Illustrato" && arg1 == -1/{ printf("%x", *(uintptr_t *)copyin(uregs[R_RSP], sizeof (uintptr_t)) - 5) }'
dtrace: description 'syscall::setrlimit:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0    532                 setrlimit:return 1006e5b72
  0    532                 setrlimit:return 1006e5b72

I ran it a few times to confirm the address of the call instruction and to make sure the location wasn’t being randomized. If I wasn’t in a rush I might have patched the binary, but Apple’s Mach-O Object format always confuses me. Instead I used lldb to replace the call with a store of 0 to %eax (to evince a successful return value) and some nops as padding (hex values I remember due to personal deficiencies):

(lldb) break set -n _init
Breakpoint 1: 47 locations.
(lldb) run
...
(lldb) di -s 0x1006e5b72 -c 1
0x1006e5b72: callq  0x1011628e0     ; symbol stub for: setrlimit
(lldb) memory write 0x1006e5b72 0x31 0xc0 0x90 0x90 0x90
(lldb) di -s 0x1006e5b72 -c 4
0x1006e5b72: xorl   %eax, %eax
0x1006e5b74: nop
0x1006e5b75: nop
0x1006e5b76: nop

Next I just process detach and got on with making that holiday card…

DTrace Every Day

DTrace was designed for solving hard problems on critical systems, but the need to understand how systems behave exists in development and on consumer systems. Just because you didn’t write a program doesn’t mean you can’t fix it.

Posted on December 18, 2016 at 4:32 am by ahl · Permalink · 2 Comments
In: DTrace

A Filesystem on Noms

Since Noms dropped last week the dev community has seemed into it. “Git for data” — it simultaneously evokes something very familiar and yet unconstrained. Something that hasn’t been well-noted is how much care the team has taken to make Noms fun to build with, and it is.

Noms is a content-addressable, decentralized, append-only database. It borrows concepts from a variety of interesting data systems. Obviously databases are represented: Noms is a persistent, transactional data repository. You can also see the fundamentals of git and other decentralized source code management tools. Noms builds up a chain of commits; those chains can be extended, forked, and shared, while historical data are preserved. Noms shares much in common with modern filesystems such as ZFS, btrfs, or Apple’s forthcoming APFS. Like those filesystems, Noms uses copy-on-write, never modifying data in situ; it forms a self-validating hierarchy of data; and it (intrinsically) supports snapshots and efficient movement of data between snapshots.

After learning a bit about Noms I thought it would be interesting to use it as the foundation for a filesystem. I wanted to learn about Noms, and contribute a different sort of example that might push the project in new and valuable ways. The Noms founders, Aaron and Raf, were fired up so I dove in…

What’s Modern?

When people talk about a filesystem being “modern” there’s a list of features that they often have in mind. Let’s look at how the Noms database stacks up:

Snapshots

A filesystem snapshot preserves the state of the filesystem for some future use — typically data recovery or fast cloning. Since Noms is append-only, every version is preserved. Snapshots are, therefore, a natural side effect. You can make a Noms “snapshot” — any commit in a dataset’s history — writeable by syncing it to a new dataset. Easy.

Dedup

The essence of dedup is that unique data should be stored exactly once. If you duplicate a file, a folder, or an entire filesystem the storage consumption should be close to zero. Noms is content addressable, unique data is only ever stored once. Every Noms dataset intrinsically removes duplicated data.

Consistency

A feature of a filesystem — arguably the feature of a filesystem — is that it shouldn’t ever lose or corrupt your data. One common technique to ensure consistency is to write new data to a new location rather than overwriting old data — so called copy-on-write (COW). Noms is append-only, it doesn’t throw out (or overwrite) old data; copying modified is required and explicit. Noms also recursively checksums all data — a feature of ZFS and btrfs, notably absent from APFS.

Backup

The ability to backup your data from a filesystem is almost as important as keeping it intact in the first place. ZFS, for example, lets you efficiently serialize and send the latest changes between systems. When pulling or pushing changes git also efficiently serializes just the changed data. Noms does something similar with its structured data. Data differences are efficiently computed to optimize for minimal data transfer.

Noms has all the core components of a modern filesystem. My goal was to write the translation layer to expose filesystem semantics on top of the Noms interfaces.

Designing a Schema

Initially, Badly

It’s in the name: Noms eats all the data. Feed it whatever data you like, and let Noms infer a schema as you go. For a filesystem though I wanted to define a fixed structure. I started with a schema modeled on a simplified ZFS. Filesystems keep track of files and directories with a structure called an “inode” each of which has a unique integer identifier, the “inode number”. ZFS keeps track of files and directories with DMU objects named by their integer ID. The schema would use a Map<number, Inode> to serve the same function (spoiler: read on and don’t copy this!):

struct Filesystem {
     inodes: Map<Number, struct Inode {
          attr: struct Attr { /* e.g. permissions, modification time, etc. */ }
          contents: Union {
               struct File { data: Ref /* Noms pile of bytes */ } |
               struct Directory { contents: Map<String, Number> }
          }
     }>
     rootInode: Number
     maxInode: Number
}

Nice and simple. Files are just Noms Blobs represented by a sequence of bytes. Directories are a Map of strings (the name of the directory entry) to the inode number; the inode number can be used to find the actual content by looking in the top-level map.

Schema philosophy

This made sense for a filesystem. Did it make sense for Noms? I wasn’t trying to put the APFS team out of work, rather I was creating a portal from the shell or Finder into Noms. To evaluate the schema, I had the benefit of direct access to the Noms team (and so can all developers at http://slack.noms.io/). I learned two guiding principles for data in Noms:

Noms data should be self-validating. As much as possible the application should rely on noms to ensure consistency. My schema failed this test because the relationship between inode numbers contained in directories and the entires of the inodes map was something my code alone could maintain and validate.

Noms data should be deterministic. A given collection of data should have a single representation; the Noms structures should be path and history independent. Two apparently identical datasets should be identical in the eyes of Noms to support efficient storage and transfer, and identical data should produce an identical hash at the root of the dataset. Here, again, my schema fell short because the inode number assigned to a given file or directory depended on how other objects were created. Two different users with two identical filesystems wouldn’t be able to simply sync the way they would with two identical git branches.

A Better Schema

My first try made for a fine filesystem, just not a Noms filesystem. With a better understanding of the principles, and with help from the Noms team, I built this schema:

struct Filesystem {
     root: struct Inode {
          attr: struct Attr { /* e.g. permissions, modification time, etc. */ }
          contents: Union {
               struct File { data: Ref<Blob> /* Noms pile of bytes */ } |
               struct Directory: { contents: Map<string, Cycle<1>> }
          }
     }
}

Obviously simpler; the thing that bears explanation is the use of so-called “Cycle” types. A Cycle is a means of expressing a recursive relationship within Noms types. The integer parameter specifies the ancestor struct to which the cycle refers. Consider a linked list type:

struct LinkedList {
    data: Blob
    next: Cycle<0>
}

The “next” field refers to immediately containing struct, LinkedList. In our filesystem schema, a Directory’s contents are represented by a map of strings (directory entry names) to Inodes, Cycle<1> referring to the struct “above” or “containing” the Directory type. (Read on for a visualization of this.)

Writing It

To build the filesystem I picked a FUSE binding for Go, dug into the Noms APIs, and wrestled my way through some Go heartache.

Working with Noms requires a slightly different mindset than other data stores. Recall in particular that Noms data is immutable. Adding a new entry into a Map creates a new Map. Setting a member of a Struct creates a new Struct. Changing nested structures such as the one defined by our schema requires unzipping it, and then zipping it back together. Here’s a Go snippet that demonstrates that methodology for creating a new directory:

Demo

Showing it off has all the normal glory of a systems demo! Check out the documentation for requirements.

Create and mount a filesystem from a new local Noms dataset:

$ go build
$ mkdir /var/tmp/mnt
$ go run nomsfs.go /var/tmp/nomsfs::fs /var/tmp/mnt
running...

You can open the folder and drop data into it.

Your database fell into my filesystem!

Now let’s take a look at the underlying Noms dataset:

$ noms show http://demo.noms.io/ahl_blog::fs
struct Commit {
  meta: struct {},
  parents: Set<Ref<Cycle<0>>>,
  value: struct Filesystem {
    root: struct Inode {
      attr: struct Attr {
        ctime: Number,
        gid: Number,
        mode: Number,
        mtime: Number,
        uid: Number,
        xattr: Map<String, Blob>,
      },
      contents: struct Directory {
        entries: Map<String, Cycle<1>>,
      } | struct Symlink {
        targetPath: String,
      } | struct File {
        data: Ref<Blob>,
      },
    },
  },
}({
  meta:  {},
  parents: {
    5v82rie0be68915n1q7pmcdi54i9tmgs,
  },
  value: Filesystem {
    root: Inode {
      attr: Attr {
        ctime: 1.4705227450393803e+09,
        gid: 502,
        mode: 511,
        mtime: 1.4705227450393803e+09,
        uid: 110853,
        xattr: {},
      },
      contents: Directory {
        entries: {
          "usenix_winter91_faulkner.pdf": Inode {
            attr: Attr {
              ctime: 1.4705228859273868e+09,
              gid: 502,
              mode: 420,
              mtime: 1.468425783e+09,
              uid: 110853,
              xattr: {
                "com.apple.FinderInfo": 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // 32 B
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00,
                "com.apple.quarantine": 30 30 30 31 3b 35 37 38 36 36 36 33 37 3b 53 61  // 21 B
                66 61 72 69 3b,
              },
            },
            contents: File {
              data: dmc45152ie46mn3ls92vvhnm41ianehn,
            },
          },
        },
      },
    },
  },
})

You can see the type at the top and then the actual filesystem contents. Let’s look at more complicated example where I’ve taken part of the Noms source tree and copied it to nomsfs. You can navigate around its structure courtesy of the Splore utility (take particular note of nested directories that show the recursive data definition described above):

Embedded ‘Splore! http://splore.noms.io/?db=https://demo.noms.io/ahl_blog&hash=2nhi5utm4s38hka22vt9ilv5i3l8r2ol

You can see the all of the various states that the filesystem has been through — each state change — using noms log http://demo.noms.io/ahl_blog::fsnoms. You can sync it to your local computer with noms sync http://demo.noms.io/ahl_blog::fsnoms /var/tmp/fs or checkout some previous state from the log (just like a filesystem snapshot). Diff two states from the log or make your own changes and diff it with the original using noms diff.

Nom Nom Nom

It took less than 1000 lines of Go code to make Noms appear as a Window in the Finder, eating data as quickly as I could drag and drop (try it!). Imagine what Noms might look like behind other known data interfaces; it could bring git semantics to existing islands of data. Noms could form the basis of a new type of data lake — maybe one that’s simple and powerful enough to bring real results. Beyond the marquee features, Noms is fun to build with. I’m already working on my next Noms application.

Posted on August 9, 2016 at 3:54 pm by ahl · Permalink · 5 Comments
In: Software · Tagged with: , , , , ,

I Love Go; I Hate Go

I liked Go right away. It was close enough to C and Java to be instantly familiar, the examples and tutorials were straightforward, and I was quickly writing real code. I’ve wanted to learn Go since its popularity was surging few years ago. In no danger of being judged an early adopter, I happily found a great project that—as it happened—had to be in Go (more in a future post).

I Love Go

My first priority was not looking stupid. The folks I’d be doing this for are actual Go developers; I wanted my code to fit in without imposing with too many questions. They had no style guide. Knowing that my 80-column width sensibilities expose an unattractive nostalgia, I went looking for a max line length. There wasn’t one. Then I discovered gofmt, the simple tool that Go employs to liberate developers from the tyranny of stylistic choice. It takes Go code and spits it back out in the One True Style. I love it. I was raised in an engineering culture with an exacting style guide, but any style guide has gaps. Factions form; style-originalists face off against those who view (incorrectly!) the guide as a living document. I updated my decades-old .vimrc to gofmt on save. These Go tyrants were feeling like my kind of tyrant.

One of the things that turned me off of C++ (98, 11, and 14) is its increasing amount of magic. Go is comparatively straightforward. When I reach for something I find that it’s where I expect it to be. Libraries and examples aren’t mysterious. Error messages are non-mysterious (other than quickly resolved confusion about methods with “pointer receivers”). Contrast this with Rust whose errors read like mind-bindingly inscrutable tax forms.

Go’s containment-based inheritance is easy enough to reason about. Its interfaces are similarly no-nonsense and pragmatic. You don’t have to define a structure as implementing an interface. You can use an interface to describe anything that implements it. Simple. This is particularly useful for testing. In a messy environment with components beyond your control you can apply interfaces to other people’s code, and then mock it out.

The toolchain is, again, simple to use—the benefit of starting from scratch—and makes for fast compilation, quick testing, and easy integration with a good-sized ecosystem. I stopped worrying about dependencies, rebuilding, etc. knowing that go run would find errors wherever I had introduced them and do so remarkably quickly.

I Hate Go

Go is opinionated. Most successful products have that strong sense of what they are and what they aren’t; Go draws as sharp a line as any language. I was seduced by its rightheadedness around style, but with anything or anyone that opinionated you’ll find some of those opinions weird and others simply off-putting.

Reading the official documentation, I found myself in the middle of a section prefixed with the phrase “if GOOS is set to plan9”. Wow. I’m a few standard deviations from the norm in terms of being an OS nerd, but I’ve never even seen Plan 9. I knew that the Plan 9 folks got the band back together to create Go; it’s great that their pop audiences don’t dissuade them from playing off their old B-sides. Quirky, but there’s nothing wrong with that.

I wanted to check an invariant; how does Go do assertions? Fuck you, you’re a bad programmer. That’s how. The Go authors feel so strongly about asserts being typically misused that they refuse to provide them. So folks use one (or more) of some workable libraries.

I created an infinite recursion, overflowing the stack. Go produces the first 100 stack frames and that’s it. Maybe you can change that, but I couldn’t figure out how. (“go stackoverflow” is about the most useless thing you can search for; chapeau, Go and Stackoverflow respectively.) I could be convinced that I only want 100 stack frames, but not the last 100, not the same 4 over and over again. I ended up limiting the stack size with runtime.debug.SetMaxStack(), Goldilocks-ing it between too big to catch the relevant frames and too small to allow for normal operation.

I tried using other tools (ahem, DTrace) to print the stack, but, of course, the Go compiler omits frame pointers rendering the stacks unobservable to debuggers. Ditto arguments due to ABI-non-compliant calling conventions, but that’s an aside. The environment variable GOEXPERIMENT=framepointer is supposed to compile with frame pointers, but it was a challenge to rebuild the world. All paths seemed to lead me to my former-colleague’s scathing synopsis: Golang is Trash.

As fun as it is to write code in Go, debugging in Go is no fun at all. I may well just be ignorant of the right tooling. But there sure isn’t a debugger with the simple charm of go build for compilation, go test for testing, or go run for execution.

Immaturity and Promise

Have you ever been in a relationship where minor disagreements immediately escalated to “maybe we should break up?” The Go documentation even seems ready to push you to some other language at the slightest affront. Could I have asserts? Sure, if you’re a bad programmer. Perhaps ABI compliance has its merits? I’m sure you could find that in some other language. Could you give me the absolute value of this int? Is something wrong with your ‘less than’ key?

I’m sure time will, as it tends to, bring pragmatism. I appreciate that Go does have strong opinions (it’s just hard to remember that when I disagree with them). Weak opinions are what turn languages into unreadable mishmashes of overlapping mechanism. My favorite example of this is Perl. My first real programming job was in Perl. I was the most avid reader teenage of the Perl llama and camel books. Ask me about my chat with Larry Wall, Perl’s creator, if you see a beer in my hand. In an interview, Larry said, “In Perl 6, we decided it would be better to fix the language than fix the user”. Contrast this with Go’s statement on assertions:

Go doesn’t provide assertions. They are undeniably convenient, but our experience has been that programmers use them as a crutch to avoid thinking about proper error handling and reporting.

Perl wants to be whatever the user wants to be. The consumate pleaser. Go sees the user as flawed and the strictures of the language as the cure. It’s an authoritarian, steadfast in its ideals, yet too sensitive to find compromise (sound like anyone we all know?).

Despite my frustrations I really enjoy writing code in Go. It’s clean and supported by a great community and ecosystem. I’m particularly heartened that Go 1.7 will compile with frame pointers by default. Diagnosing certain types of bugs is still a pain in the neck, but I’m sure it will improve and I’ll see where I can pitch in.

Posted on August 2, 2016 at 3:19 am by ahl · Permalink · 23 Comments
In: Software

APFS in Detail: Conclusions

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.

Summing Up

I’m not sure Apple absolutely had to replace HFS+, but likely they had passed an inflection point where continuing to maintain and evolve the 30+ year old software was more expensive than building something new. APFS is a product born of that assessment.

Based on what Apple has shown I’d surmise that its core design goals were:

Those are great goals that will benefit all Apple users, and based on the WWDC demos APFS seems to be on track (though the macOS Sierra beta isn’t quite as far along).

In the process of implementing a new file system the APFS team has added some expected features. HFS was built when 400KB floppies ruled the Earth (recognized now only as the ubiquitous and anachronistic save icon). Any file system started in 2014 should of course consider huge devices, and SSDs–check and check. Copy-on-write (COW) snapshots are the norm; making the Duplicate command in the Finder faster wasn’t much of a detour. The use case is unclear, it’s a classic garbage can theory solution, a solution in search of a problem, but it doesn’t hurt and it makes for a fun demo. The beach ball of doom earned its nickname; APFS was naturally built to avoid it.

There are some seemingly absent or ancillary design goals: performance, openness, and data integrity. Squeezing the most IOPS or throughput out of a device probably isn’t critical on watchOS, and it’s relevant only to a small percentage of macOS users. It will be interesting to see how APFS performs once it ships (measuring any earlier would only misinform the public and insult the APFS team). The APFS development docs have a bullet on open source: “An open source implementation is not available at this time.” I don’t expect APFS to be open source at this time or any other, but prove me wrong, Apple. If APFS becomes world-class I’d love to see it in Linux and FreeBSD–maybe Microsoft would even jettison their ReFS experiment. My experience with OpenZFS has shown that open source accelerates that path to excellence. It’s a shame that APFS lacks checksums for user data and doesn’t provide for data redundancy. Data integrity should be job one for a file system, and I believe that that’s true for a watch or phone as much as it is for a server.

At stability, APFS will be an improvement, for Apple users of all kinds, on every device. There are some clear wins and some missed opportunities. Now that APFS has been shared with the world the development team is probably listening. While Apple is clearly years past the decision to build from scratch rather than adopting existing modern technology, there’s time to raise the priority of data integrity and openness. I’m impressed by Apple’s goal of using APFS by default within 18 months. Regardless of how it goes, it will be an exciting transition.

Posted on June 19, 2016 at 7:37 pm by ahl · Permalink · 37 Comments
In: Software · Tagged with: 

APFS in Detail: Data Integrity

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.

Data Integrity

Arguably the most important job of a file system is preserving data integrity. Here’s my data, don’t lose it, don’t change it. If file systems could be trusted absolutely then the “only” reason for backup would be the idiot operators (i.e. you and me). There are a few mechanisms that file systems employ to keep data safe.

Redundancy

APFS makes no claims with regard to data redundancy. As Apple’s Eric Tamura noted at WWDC, most Apple devices have a single storage device (i.e. one logical SSD) making RAID, for example, moot. Instead redundancy comes from lower layers such as Apple RAID (apparently a thing), hardware RAID controllers, SANs, or even the “single” storage devices themselves..

As an aside note that SSDs in most Apple products where APFS will run include multiple more-or-less independent NAND chips. High-end SSDs do implement data redundancy within the device, but it comes at the price of reduced capacity and performance. As noted above, the “flash-optimization” of APFS doesn’t actually extend much below the surface of the standard block device interface, but the raw materials for innovation are there.

Also, APFS removes the most common way of a user achieving local data redundancy: copying files. A copied file in APFS actually creates a lightweight clone with no duplicated data. Corruption of the underlying device would mean that both “copies” were damaged whereas with full copies localized data corruption would affect just one.

Crash Consistency

Computer systems can fail at any time—crashes, bugs, power outages, etc.—so file systems need to anticipate and recover from these scenarios. The old old old school method is to plod along and then have a special utility to check and repair the file system during boot (fsck, short for file system check). More modern systems labor to achieve an always consistent format, or only narrow windows of inconsistency, obviating the need for the full, expensive fsck. ZFS, for example, builds up new state on disk and then atomically transitions from the previous state to the new one with a single atomic operation.

Overwriting data creates the most obvious opening for inconsistency. If the file system needs to overwrite several regions there is a window where some regions represent the new state and some represent the former state. Copy-on-write (COW) is a method to avoid this by always allocating new regions and then releasing old ones for reuse rather than modifying data in-place. APFS claims to implement a “novel copy-on-write metadata scheme”; APFS lead developer Dominic Giampaolo emphasized the novelty of this approach without delving into the details. In conversation later, he made it clear that APFS does not employ the ZFS mechanism of copying all metadata above changed user data which allows for a single, atomic update of the file system structure.

It’s surprising to see that APFS includes fsck_apfs—even after asking Dominic I’m not sure why it would be necessary. For comparison I don’t believe there’s been an instance where fsck for ZFS would have found a problem that the file system itself didn’t already know how to detect. But Dominic was just as confused about why ZFS would forego fsck, so perhaps it’s just a matter of opinion.

Checksums

Notably absent from the APFS intro talk was any mention of checksums. A checksum is a digest or summary of data used to detect (and correct) data errors. The story here is surprisingly nuanced. APFS checksums its own metadata but not user data. The justification for checksumming metadata is strong: there’s relatively not much of it (so the checksums don’t consume much storage) and losing metadata can cast a potentially huge shadow of data loss. If, for example, metadata for a top level directory is corrupted then potentially all data on the disk could be rendered inaccessible. ZFS duplicates metadata (and triple duplicates top-level metadata) for exactly this reason.

Explicitly not checksumming user data is a little more interesting. The APFS engineers I talked to cited strong ECC protection within Apple storage devices. Both flash SSDs and magnetic media HDDs use redundant data to detect and correct errors. The engineers contend that Apple devices basically don’t return bogus data. NAND uses extra data, e.g. 128 bytes per 4KB page, so that errors can be corrected and detected. (For reference, ZFS uses a fixed size 32 byte checksum for blocks ranging from 512 bytes to megabytes. That’s small by comparison, but bear in mind that the SSD’s ECC is required for the expected analog variances within the media.) The devices have a bit error rate that’s tiny enough to expect no errors over the device’s lifetime. In addition, there are other sources of device errors where a file system’s redundant check could be invaluable. SSDs have a multitude of components, and in volume consumer products they rarely contain end-to-end ECC protection leaving the possibility of data being corrupted in transit. Further, their complex firmware can (does) contain bugs that can result in data loss.

The Apple folks were quite interested in my experience with regard to bit rot (aging data silently losing integrity) and other device errors. I’ve seen many instances where devices raised no error but ZFS (correctly) detected corrupted data. Apple has some of the most stringent device qualification tests for its vendors; I trust that they really do procure the best components. Apple engineers I spoke with claimed that bit rot was not a problem for users of their devices, but if your software can’t detect errors then you have no idea how your devices really perform in the field. ZFS has found data corruption on multi-million dollar storage arrays; I would be surprised if it didn’t find errors coming from TLC (i.e. the cheapest) NAND chips in some of Apple’s devices. Recall the (fairly) recent brouhaha regarding storage problems in the high capacity iPhone 6. At least some of Apple’s devices have been imperfect.

As someone who has data he cares about on a Mac, who has seen data lost from HFS, and who knows that even expensive, enterprise-grade equipment can lose data, I would gladly sacrifice 16 bytes per 4KB–less than 1% of my device’s size.

Scrub

As data ages you might occasionally want to check for bit rot. Likely fsck_apfs can accomplish this; as noted though there’s no data redundancy and no checksums for user data, so scrub would only help to find problems and likely wouldn’t help to correct them. And if it makes it any easier for Apple to reverse course, let’s say it’s for the el cheap-o drive I bought from Fry’s not for the gold-plated device I got from Apple.

 

Next in this series: Conclusions

Posted on June 19, 2016 at 7:37 pm by ahl · Permalink · 41 Comments
In: Software · Tagged with: , ,

APFS in Detail: Performance

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.


Performance

APFS claims to be optimized for flash. Flash memory (NAND) is the stuff in your speedy SSD. Apple changed the computing industry when it put flash into the iPod and iPhone, volumes for which fundamentally changed the economics of flash. This consumer change impacted the enterprise (as it often does), giving rise to hybrid and all-flash arrays. Ten years ago flash cost as much as DRAM; now it’s challenging the economics of hard disks.

SSDs mimic the block interface of conventional hard drives, but the underlying technology is completely different. In particular while magnetic media can read or write sectors arbitrarily, flash erases large chunks (blocks) and reads and writes smaller chunks (pages). The management is done by what’s called the flash translation layer (FTL), software that makes blocks and pages appear more like a hard drive. An FTL is very similar to a file system, creating a virtual mapping (a translation) between block addresses and locations within the media. Apple controls the full stack including the SSD, FTL, and file system; they could have built something differentiated, optimizing this components to work together. What APFS does, however, is simply write in patterns known to be more easily handled by NAND. It’s a file system with flash-aware characteristics rather than one written explicitly for the native flash interfaces, more or less what you’d expect in 2016.

Also on the topic of flash, APFS includes TRIM support. TRIM is a command in the ATA protocol that allows a file system to indicate to an SSD (specifically, its FTL) that some space has been freed. SSDs require significant free space and perform better when there’s more of it; they include more physical space than they advertise. For example, my 1TB SSD includes 1TB (240 = 10244) bytes of flash but only reports 931GB of available space, sneakily matching the storage industry’s self-serving definition of 1TB (10004 = 1 trillion bytes). With more free space, FTLs can trade off space efficiency for performance and longevity. TRIM has become expected of file systems; it’s unsurprising that APFS supports it. The problem with TRIM though is that it’s only useful when there’s free space: it’s something of a benchmark special. Once your disk is mostly full (as mine are in my laptop and phone basically at all times) TRIM doesn’t do anything for you. I doubt that TRIM will bring any discernible benefit for APFS users beyond the placebo effect of feature parity.

APFS also focuses on latency; Apple’s number one goal is to avoid the beachball of doom. APFS addresses this with I/O QoS (quality of service) to prioritize accesses that are immediately visible to the user over background activity that doesn’t have the same time-constraints. This is inarguably a benefit to users and a sophisticated file system capability.

 

Next in this series: Data Integrity

Posted on June 19, 2016 at 7:36 pm by ahl · Permalink · 11 Comments
In: Software · Tagged with: , , , , , ,

APFS in Detail: Space Efficiency and Clones

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.

Space Efficiency

A modern trend in file systems has been to store data more efficiently to effectively increase the size of your device. Common approaches include compression (which, as noted above, is very very likely coming) and deduplication. Dedup finds common blocks and avoids storing them multiply. This is potentially highly beneficial for file servers where many users or many virtual machines might have copies of the same file; it’s probably not useful for the single-user or few-user environments that Apple cares about. (Yes, they have server-ish offerings but their heart clearly isn’t into it.) It’s also furiously hard to do well as I learned painfully while supporting ZFS.

Apple’s sort-of-unique contribution to space efficiency is constant time cloning of files and directories. As a quick aside, “files” in macOS are often really directories; it’s a convenient lie they tell to allow logically related collections of files to be treated as an indivisible unit. Right click an application and select “Show Package Contents” to see what I mean. Accordingly, I’m going to use the term “file” rather than “file or directory” in sympathy for the patient readers who have made it this far.

With APFS, if you copy a file within the same file system (or possibly the same container; more on this later), no data is actually duplicated. Instead a constant amount of metadata is updated and the on-disk data is shared. Changes to either copy cause new space to be allocated (so-called “copy on write” or COW).

I haven’t seen this offered in other file systems, and it clearly makes for a good demo, but it got me wondering about the use case (UPDATE: btrfs supports this and calls the feature “reflinks”–link by reference). Copying files between devices (e.g. to a USB stick for sharing) still takes time proportional to the amount of data copied of course. Why would I want to copy a file locally? The common case I could think of is the layman’s version control: “thesis”, “thesis-backup”, “thesis-old”, “thesis-saving because I’m making edits while drunk”.

There are basically three categories of files:

For the average user, most files fall into that first category. So with APFS I can make a copy of my document and get the benefits of space sharing, but those benefits will be eradicated as soon as I save the new revision. Perhaps users of larger files have a greater need for this and have a better idea of how it might be used.

Personally, my only use case is taking a file, say time-shifted Game of Thrones episodes falling into the “fair use” section of copyright law, and sticking it in Dropbox. Currently I need to choose to make a copy or permanently move the file to my Dropbox folder. Clones would let me do this more easily. But then so would hard links (a nearly ubiquitous file system feature that lets a file appear in multiple directories).

Clones open the door for potential confusion. While copying a file may take up no space, so too deleting a file may free no space. Imagine trying to free space on your system, and needing to hunt down the last clone of a large file to actually get your space back.

APFS engineers don’t seem to have many use cases in mind; at WWDC they asked for suggestions from the assembled developers (the best I’ve heard is for copied VMs; not exactly a mass-market problem). If the focus is generic revision control, I’m surprised that Apple didn’t shoot for a more elegant solution. One could imagine functionality with APFS that allows a user to enable per-file Time Machine, change tracking for any file. This would create a new type of file where each version is recorded transparently and automatically. You could navigate to previous versions, prune the history, or delete the whole pile of versions at once (with no stray clones to hunt down). In fact, Apple introduced something related 5 years ago, but I’ve literally never seen or heard of it until researching this post (show of hands if you’ve clicked “Browse All Versions…”). APFS could clean up its implementation, simplify its use, and bring generic support for all applications. None of this solves my Game of Thrones storage problem, but I’m not even sure it’s much of a problem…

Side note: Finder copy creates space-efficient clones, but cp from the command line does not.

 

Next in this series: Performance

Posted on June 19, 2016 at 7:35 pm by ahl · Permalink · 24 Comments
In: Software · Tagged with: ,

APFS in Detail: Encryption, Snapshots, and Backup

This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.

Encryption

Encryption is clearly a core feature of APFS. This comes from diverse requirements from the various devices, for example multiple keys within file systems on the iPhone or per-user keys on laptops. I heard the term “innovative” quite a bit at WWDC, but here the term is aptly applied to APFS. It supports several different encryption choices for a file system:

Multi-key encryption is particularly relevant for portables where all data might be encrypted, but unlocking your phone provides access to an additional key and therefore additional data. Unfortunately this doesn’t seem to be working in the first beta of macOS Sierra (specifying fileEncryption when creating a new volume with diskutil results in a file system that reports “Is Encrypted” as “No”).

Related to encryption, I noticed an undocumented feature while playing around with diskutil (which prompts you for interactive confirmation of the destructive power of APFS unless this is added to the command-line: -IHaveBeenWarnedThatAPFSIsPreReleaseAndThatIMayLoseData; I’m not making this up). APFS (apparently) supports constant time cryptographic file system erase, called “effaceable” in the diskutil output. This presumably builds a secret key that cannot be extracted from APFS and encrypts the file system with it. A secure erase then need only delete the key rather than needing to scramble and re-scramble the full disk to ensure total eradication. Various iOS docs refer to this capability requiring some specialized hardware; it will be interesting to see what the option means on macOS. Either way, let’s not mention this to the FBI or NSA, agreed?

Snapshots and Backup

APFS brings a much-desired file system feature: snapshots. A snapshot lets you freeze the state of a file system at a particular moment and continue to use and modify that file system while preserving the old data. It does so in a space-efficient fashion where, effectively, changes are tracked and only new data takes up additional space. This has the potential to be extremely valuable for backup by efficiently tracking the data that has changed since the last backup.

ZFS includes snapshots and serialization mechanisms that make it efficient to backup file systems or transfer file systems to a remote location. Will APFS work like that? Probably not, answered Dominic Giampaolo, APFS lead developer. ZFS sends all changed data while Time Machine can have exclusion lists and the like. That seems surmountable, but we’ll see what Apple does. APFS right now is incompatible with Time Machine due to the lack of directory hard links, a fairly disgusting implementation that likely contributes to Time Machine’s questionable reliability. Hopefully APFS will create some efficient serialization for Time Machine backup.

While Eric Tamura, APFS dev manager, demonstrated snapshots at WWDC, the required utilities aren’t included in the macOS Sierra beta. I used DTrace (technology I’m increasingly amazed that Apple ported from OpenSolaris) to find a tantalizingly-named new system call fs_snapshot; I’ll leave it to others to reverse engineer its proper use.

Management

APFS brings another new feature known as space sharing. A single APFS “container” that spans a device can have multiple “volumes” (file systems) within it. Apple contrasts this with the static allocation of disk space to support multiple HFS+ instances, which seems both specious and an uncommon use case. Both ZFS and btrfs have a similar concept of a shared pool of storage with nested file systems for administration and management.

Speaking with Dominic and other members of the APFS team, we discussed how volumes are the unit by which users can control things like snapshots and encryption. You’d want multiple volumes to correspond with different policies around those settings. For example while you might want to snapshot and backup your system each day, the massive /private/var/vm/sleepimage (for saving memory when hibernating) should live on its own and not be backed up.

Space sharing is more like an operational detail than a game changing feature. You can think of it like special folders with snapshot and encryption controls… which is probably why Apple’s marketing department has yet to make me a job offer. Unfortunately this feature isn’t working in the macOS Sierra beta, so I wasn’t able to have more than one volume per container. Adding new volumes can fail with an opaque error (-69625 mean anything to you?), but using a larger disk image resolve the problem.

 

Next in this series: Space Efficiency and Clones

Posted on June 19, 2016 at 7:34 pm by ahl · Permalink · 6 Comments
In: Software · Tagged with: 

APFS in Detail: Overview

Apple announced a new file system that will make its way into all of its OS variants (macOS, tvOS, iOS, watchOS) in the coming years. Media coverage to this point has been mostly breathless elongations of Apple’s developer documentation. With a dearth of detail I decided to attend the presentation and Q&A with the APFS team at WWDC. Dominic Giampaolo and Eric Tamura, two members of the APFS team, gave an overview to a packed room; along with other members of the team, they patiently answered questions later in the day. With those data points and some first hand usage I wanted to provide an overview and analysis both as a user of Apple-ecosystem products and as a long-time operating system and file system developer.

I’ve divided my review into several sections that span a few posts. I’d encourage you to jump around to topics of interest or skip right to the conclusion (or to the tweet summary). Highest praise goes to encryption; ire to data integrity.

Basics

APFS, the Apple File System, was itself started in 2014 with Dominic as its lead engineer. It’s a stand-alone, from-scratch implementation (an earlier version of this post noted a dependency on Core Storage, but Dominic set me straight). I asked him about looking for inspiration in other modern file systems such as BSD’s HAMMER, Linux’s btrfs, or OpenZFS (Solaris, illumos, FreeBSD, Mac OS X, Ubuntu Linux, etc.), all of which have features similar to what APFS intends to deliver. (And note that Apple built a fairly complete port of ZFS, though Dominic was not apparently part of the group advocating for it.) Dominic explained that while, as a self-described file system guy (he built the file system in BeOS, unfairly relegated to obscurity when Apple opted to purchase NeXTSTEP instead), he was aware of them, but didn’t delve too deeply for fear, he said, of tainting himself.

Dominic praised the APFS testing team as being exemplary. This is absolutely critical. A common adage is that it takes a decade to mature a file system. And my experience with ZFS more or less confirms this. Apple will be delivering APFS broadly with 3-4 years of development so will need to accelerate quickly to maturity.

Paying Down Debt

HFS was introduced in 1985 when the Mac 512K (of memory! Holy smokes!) was Apple’s flagship. HFS+, a significant iteration, shipped in 1998 on the G3 PowerMacs with 4GB hard drives. Since then storage capacities have increased by factors of 1,000,000 and 1,000 respectively. HFS+ has been pulled in a bunch of competing directions with different forks for different devices (e.g. the iOS team created their own HFS variant, working so covertly that not even the Mac OS team knew) and different features (e.g. journaling, case insensitive). It’s old; it’s a mess; and, critically, it’s missing a bunch of features that are really considered the basic cost of doing business for most operating systems. Wikipedia lists nanosecond timestamps, checksums, snapshots, and sparse file support among those missing features. Add to that the obvious gap of large device support and you’ve got a big chunk of the APFS feature list.

APFS first and foremost pays down the unsustainable technical debt that Apple has been carrying in HFS+. (In 2001 ZFS grew from a similar need where UFS had been evolved since 1977.) It unifies the multifarious forks. It introduces the expected features. In general it first brings the derelict building up to code.

Compression is an obvious gap in the APFS feature list that is common in many file systems. It’s conceptually quite easy, I told the development team (we had it in ZFS from the outset), so why not include it? To appeal to Dominic’s BeOS nostalgia I even recalled my job interview with Be in 2000 when they talked about how compression actually improved overall performance since data I/O is far more expensive than computation (obvious now, but novel then). The Apple folks agreed, and—in typical Apple fashion—neither confirmed nor denied while strongly implying that it’s definitely a feature we can expect in APFS. I’ll be surprised if compression isn’t included in its public launch.

 

Next in this series: Encryption, Snapshots, and Backup

Posted on June 19, 2016 at 7:29 pm by ahl · Permalink · 42 Comments
In: Software · Tagged with: