Adam Leventhal's blog

Search
Close this search box.

Month: August 2016

Since Noms dropped last week the dev community has seemed into it. “Git for data” — it simultaneously evokes something very familiar and yet unconstrained. Something that hasn’t been well-noted is how much care the team has taken to make Noms fun to build with, and it is.

[youtube_sc url=”https://www.youtube.com/watch?v=3R_4Pdb7ev4″ title=”nomfs%20in%20action%20(video%20for%20non-readers)” width=”300″ class=”alignright”]Noms is a content-addressable, decentralized, append-only database. It borrows concepts from a variety of interesting data systems. Obviously databases are represented: Noms is a persistent, transactional data repository. You can also see the fundamentals of git and other decentralized source code management tools. Noms builds up a chain of commits; those chains can be extended, forked, and shared, while historical data are preserved. Noms shares much in common with modern filesystems such as ZFS, btrfs, or Apple’s forthcoming APFS. Like those filesystems, Noms uses copy-on-write, never modifying data in situ; it forms a self-validating hierarchy of data; and it (intrinsically) supports snapshots and efficient movement of data between snapshots.

After learning a bit about Noms I thought it would be interesting to use it as the foundation for a filesystem. I wanted to learn about Noms, and contribute a different sort of example that might push the project in new and valuable ways. The Noms founders, Aaron and Raf, were fired up so I dove in…

What’s Modern?

When people talk about a filesystem being “modern” there’s a list of features that they often have in mind. Let’s look at how the Noms database stacks up:

Snapshots

A filesystem snapshot preserves the state of the filesystem for some future use — typically data recovery or fast cloning. Since Noms is append-only, every version is preserved. Snapshots are, therefore, a natural side effect. You can make a Noms “snapshot” — any commit in a dataset’s history — writeable by syncing it to a new dataset. Easy.

Dedup

The essence of dedup is that unique data should be stored exactly once. If you duplicate a file, a folder, or an entire filesystem the storage consumption should be close to zero. Noms is content addressable, unique data is only ever stored once. Every Noms dataset intrinsically removes duplicated data.

Consistency

A feature of a filesystem — arguably the feature of a filesystem — is that it shouldn’t ever lose or corrupt your data. One common technique to ensure consistency is to write new data to a new location rather than overwriting old data — so called copy-on-write (COW). Noms is append-only, it doesn’t throw out (or overwrite) old data; copying modified is required and explicit. Noms also recursively checksums all data — a feature of ZFS and btrfs, notably absent from APFS.

Backup

The ability to backup your data from a filesystem is almost as important as keeping it intact in the first place. ZFS, for example, lets you efficiently serialize and send the latest changes between systems. When pulling or pushing changes git also efficiently serializes just the changed data. Noms does something similar with its structured data. Data differences are efficiently computed to optimize for minimal data transfer.

Noms has all the core components of a modern filesystem. My goal was to write the translation layer to expose filesystem semantics on top of the Noms interfaces.

Designing a Schema

Initially, Badly

It’s in the name: Noms eats all the data. Feed it whatever data you like, and let Noms infer a schema as you go. For a filesystem though I wanted to define a fixed structure. I started with a schema modeled on a simplified ZFS. Filesystems keep track of files and directories with a structure called an “inode” each of which has a unique integer identifier, the “inode number”. ZFS keeps track of files and directories with DMU objects named by their integer ID. The schema would use a Map<number, Inode> to serve the same function (spoiler: read on and don’t copy this!):

struct Filesystem {
     inodes: Map<Number, struct Inode {
          attr: struct Attr { /* e.g. permissions, modification time, etc. */ }
          contents: Union {
               struct File { data: Ref /* Noms pile of bytes */ } |
               struct Directory { contents: Map<String, Number> }
          }
     }>
     rootInode: Number
     maxInode: Number
}

Nice and simple. Files are just Noms Blobs represented by a sequence of bytes. Directories are a Map of strings (the name of the directory entry) to the inode number; the inode number can be used to find the actual content by looking in the top-level map.

Schema philosophy

This made sense for a filesystem. Did it make sense for Noms? I wasn’t trying to put the APFS team out of work, rather I was creating a portal from the shell or Finder into Noms. To evaluate the schema, I had the benefit of direct access to the Noms team (and so can all developers at http://slack.noms.io/). I learned two guiding principles for data in Noms:

Noms data should be self-validating. As much as possible the application should rely on noms to ensure consistency. My schema failed this test because the relationship between inode numbers contained in directories and the entires of the inodes map was something my code alone could maintain and validate.

Noms data should be deterministic. A given collection of data should have a single representation; the Noms structures should be path and history independent. Two apparently identical datasets should be identical in the eyes of Noms to support efficient storage and transfer, and identical data should produce an identical hash at the root of the dataset. Here, again, my schema fell short because the inode number assigned to a given file or directory depended on how other objects were created. Two different users with two identical filesystems wouldn’t be able to simply sync the way they would with two identical git branches.

A Better Schema

My first try made for a fine filesystem, just not a Noms filesystem. With a better understanding of the principles, and with help from the Noms team, I built this schema:

struct Filesystem {
     root: struct Inode {
          attr: struct Attr { /* e.g. permissions, modification time, etc. */ }
          contents: Union {
               struct File { data: Ref<Blob> /* Noms pile of bytes */ } |
               struct Directory: { contents: Map<string, Cycle<1>> }
          }
     }
}

Obviously simpler; the thing that bears explanation is the use of so-called “Cycle” types. A Cycle is a means of expressing a recursive relationship within Noms types. The integer parameter specifies the ancestor struct to which the cycle refers. Consider a linked list type:

struct LinkedList {
    data: Blob
    next: Cycle<0>
}

The “next” field refers to immediately containing struct, LinkedList. In our filesystem schema, a Directory’s contents are represented by a map of strings (directory entry names) to Inodes, Cycle<1> referring to the struct “above” or “containing” the Directory type. (Read on for a visualization of this.)

Writing It

To build the filesystem I picked a FUSE binding for Go, dug into the Noms APIs, and wrestled my way through some Go heartache.

Working with Noms requires a slightly different mindset than other data stores. Recall in particular that Noms data is immutable. Adding a new entry into a Map creates a new Map. Setting a member of a Struct creates a new Struct. Changing nested structures such as the one defined by our schema requires unzipping it, and then zipping it back together. Here’s a Go snippet that demonstrates that methodology for creating a new directory:

Demo

Showing it off has all the normal glory of a systems demo! Check out the documentation for requirements.

Create and mount a filesystem from a new local Noms dataset:

$ go build
$ mkdir /var/tmp/mnt
$ go run nomsfs.go /var/tmp/nomsfs::fs /var/tmp/mnt
running...

You can open the folder and drop data into it.

Your database fell into my filesystem!

Now let’s take a look at the underlying Noms dataset:

$ noms show http://demo.noms.io/ahl_blog::fs
struct Commit {
  meta: struct {},
  parents: Set<Ref<Cycle<0>>>,
  value: struct Filesystem {
    root: struct Inode {
      attr: struct Attr {
        ctime: Number,
        gid: Number,
        mode: Number,
        mtime: Number,
        uid: Number,
        xattr: Map<String, Blob>,
      },
      contents: struct Directory {
        entries: Map<String, Cycle<1>>,
      } | struct Symlink {
        targetPath: String,
      } | struct File {
        data: Ref<Blob>,
      },
    },
  },
}({
  meta:  {},
  parents: {
    5v82rie0be68915n1q7pmcdi54i9tmgs,
  },
  value: Filesystem {
    root: Inode {
      attr: Attr {
        ctime: 1.4705227450393803e+09,
        gid: 502,
        mode: 511,
        mtime: 1.4705227450393803e+09,
        uid: 110853,
        xattr: {},
      },
      contents: Directory {
        entries: {
          "usenix_winter91_faulkner.pdf": Inode {
            attr: Attr {
              ctime: 1.4705228859273868e+09,
              gid: 502,
              mode: 420,
              mtime: 1.468425783e+09,
              uid: 110853,
              xattr: {
                "com.apple.FinderInfo": 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // 32 B
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00,
                "com.apple.quarantine": 30 30 30 31 3b 35 37 38 36 36 36 33 37 3b 53 61  // 21 B
                66 61 72 69 3b,
              },
            },
            contents: File {
              data: dmc45152ie46mn3ls92vvhnm41ianehn,
            },
          },
        },
      },
    },
  },
})

You can see the type at the top and then the actual filesystem contents. Let’s look at more complicated example where I’ve taken part of the Noms source tree and copied it to nomsfs. You can navigate around its structure courtesy of the Splore utility (take particular note of nested directories that show the recursive data definition described above):

Embedded ‘Splore! http://splore.noms.io/?db=https://demo.noms.io/ahl_blog&hash=2nhi5utm4s38hka22vt9ilv5i3l8r2ol

You can see the all of the various states that the filesystem has been through — each state change — using noms log http://demo.noms.io/ahl_blog::fsnoms. You can sync it to your local computer with noms sync http://demo.noms.io/ahl_blog::fsnoms /var/tmp/fs or checkout some previous state from the log (just like a filesystem snapshot). Diff two states from the log or make your own changes and diff it with the original using noms diff.

Nom Nom Nom

It took less than 1000 lines of Go code to make Noms appear as a Window in the Finder, eating data as quickly as I could drag and drop (try it!). Imagine what Noms might look like behind other known data interfaces; it could bring git semantics to existing islands of data. Noms could form the basis of a new type of data lake — maybe one that’s simple and powerful enough to bring real results. Beyond the marquee features, Noms is fun to build with. I’m already working on my next Noms application.

I liked Go right away. It was close enough to C and Java to be instantly familiar, the examples and tutorials were straightforward, and I was quickly writing real code. I’ve wanted to learn Go since its popularity was surging few years ago. In no danger of being judged an early adopter, I happily found a great project that—as it happened—had to be in Go (more in a future post).

I Love Go

My first priority was not looking stupid. The folks I’d be doing this for are actual Go developers; I wanted my code to fit in without imposing with too many questions. They had no style guide. Knowing that my 80-column width sensibilities expose an unattractive nostalgia, I went looking for a max line length. There wasn’t one. Then I discovered gofmt, the simple tool that Go employs to liberate developers from the tyranny of stylistic choice. It takes Go code and spits it back out in the One True Style. I love it. I was raised in an engineering culture with an exacting style guide, but any style guide has gaps. Factions form; style-originalists face off against those who view (incorrectly!) the guide as a living document. I updated my decades-old .vimrc to gofmt on save. These Go tyrants were feeling like my kind of tyrant.

One of the things that turned me off of C++ (98, 11, and 14) is its increasing amount of magic. Go is comparatively straightforward. When I reach for something I find that it’s where I expect it to be. Libraries and examples aren’t mysterious. Error messages are non-mysterious (other than quickly resolved confusion about methods with “pointer receivers”). Contrast this with Rust whose errors read like mind-bindingly inscrutable tax forms.

Go’s containment-based inheritance is easy enough to reason about. Its interfaces are similarly no-nonsense and pragmatic. You don’t have to define a structure as implementing an interface. You can use an interface to describe anything that implements it. Simple. This is particularly useful for testing. In a messy environment with components beyond your control you can apply interfaces to other people’s code, and then mock it out.

The toolchain is, again, simple to use—the benefit of starting from scratch—and makes for fast compilation, quick testing, and easy integration with a good-sized ecosystem. I stopped worrying about dependencies, rebuilding, etc. knowing that go run would find errors wherever I had introduced them and do so remarkably quickly.

I Hate Go

Go is opinionated. Most successful products have that strong sense of what they are and what they aren’t; Go draws as sharp a line as any language. I was seduced by its rightheadedness around style, but with anything or anyone that opinionated you’ll find some of those opinions weird and others simply off-putting.

Reading the official documentation, I found myself in the middle of a section prefixed with the phrase “if GOOS is set to plan9”. Wow. I’m a few standard deviations from the norm in terms of being an OS nerd, but I’ve never even seen Plan 9. I knew that the Plan 9 folks got the band back together to create Go; it’s great that their pop audiences don’t dissuade them from playing off their old B-sides. Quirky, but there’s nothing wrong with that.

I wanted to check an invariant; how does Go do assertions? Fuck you, you’re a bad programmer. That’s how. The Go authors feel so strongly about asserts being typically misused that they refuse to provide them. So folks use one (or more) of some workable libraries.

I created an infinite recursion, overflowing the stack. Go produces the first 100 stack frames and that’s it. Maybe you can change that, but I couldn’t figure out how. (“go stackoverflow” is about the most useless thing you can search for; chapeau, Go and Stackoverflow respectively.) I could be convinced that I only want 100 stack frames, but not the last 100, not the same 4 over and over again. I ended up limiting the stack size with runtime.debug.SetMaxStack(), Goldilocks-ing it between too big to catch the relevant frames and too small to allow for normal operation.

I tried using other tools (ahem, DTrace) to print the stack, but, of course, the Go compiler omits frame pointers rendering the stacks unobservable to debuggers. Ditto arguments due to ABI-non-compliant calling conventions, but that’s an aside. The environment variable GOEXPERIMENT=framepointer is supposed to compile with frame pointers, but it was a challenge to rebuild the world. All paths seemed to lead me to my former-colleague’s scathing synopsis: Golang is Trash.

As fun as it is to write code in Go, debugging in Go is no fun at all. I may well just be ignorant of the right tooling. But there sure isn’t a debugger with the simple charm of go build for compilation, go test for testing, or go run for execution.

Immaturity and Promise

Have you ever been in a relationship where minor disagreements immediately escalated to “maybe we should break up?” The Go documentation even seems ready to push you to some other language at the slightest affront. Could I have asserts? Sure, if you’re a bad programmer. Perhaps ABI compliance has its merits? I’m sure you could find that in some other language. Could you give me the absolute value of this int? Is something wrong with your ‘less than’ key?

I’m sure time will, as it tends to, bring pragmatism. I appreciate that Go does have strong opinions (it’s just hard to remember that when I disagree with them). Weak opinions are what turn languages into unreadable mishmashes of overlapping mechanism. My favorite example of this is Perl. My first real programming job was in Perl. I was the most avid reader teenage of the Perl llama and camel books. Ask me about my chat with Larry Wall, Perl’s creator, if you see a beer in my hand. In an interview, Larry said, “In Perl 6, we decided it would be better to fix the language than fix the user”. Contrast this with Go’s statement on assertions:

Go doesn’t provide assertions. They are undeniably convenient, but our experience has been that programmers use them as a crutch to avoid thinking about proper error handling and reporting.

Perl wants to be whatever the user wants to be. The consumate pleaser. Go sees the user as flawed and the strictures of the language as the cure. It’s an authoritarian, steadfast in its ideals, yet too sensitive to find compromise (sound like anyone we all know?).

Despite my frustrations I really enjoy writing code in Go. It’s clean and supported by a great community and ecosystem. I’m particularly heartened that Go 1.7 will compile with frame pointers by default. Diagnosing certain types of bugs is still a pain in the neck, but I’m sure it will improve and I’ll see where I can pitch in.

Recent Posts

January 22, 2024
January 13, 2024
December 29, 2023
February 12, 2017
December 18, 2016
August 9, 2016

Archives

Archives