Fault tolerance in Manta

Since launching Manta last week, we’ve seen a lot of excitement. Thoughtful readers quickly got to burning questions about the system’s fault tolerance: what happens when backend Manta components go down? In this post, I’ll discuss fault tolerance in Manta in more detail. If anything seems left out, please ask: it’s either an oversight or just seemed too arcane to be interesting.

This is an engineering internals post for people interested in learning how the system is built. If you just want to understand the availability and durability of objects stored in Manta, the simpler answer is that the system is highly available (i.e., service survives system and availability zone outages) and durable (i.e., data survives multiple system and component failures).

First principles

First, Manta is strongly consistent. If you PUT an object, a subsequent GET for the same path will immediately return the object you just PUT. In terms of CAP, that means Manta sacrifices availability for consistency in the face of a complete network partition. We feel this is the right choice for an object store: while other object stores remain available in a technical sense, they can emit 404s or stale data both when partitioned and in steady-state. The client code to deal with eventual consistency is at least as complex as dealing with unavailability, except that there’s no way to distinguish client error from server inconsistency — both show up as 404 or stale data. We feel transparency about the state of the system is more valuable here. If you get a 404, that’s because the object’s not there. If the system’s partitioned and we cannot be sure of the current state, you’ll get a 503, indicating clearly that the service is not available and you should probably retry your request later. Moreover, if desired, it’s possible to build an eventually consistent caching layer on top of Manta’s semantics for increased availability.
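To make the client-side consequence concrete, here is a minimal shell sketch of how a caller might treat these status codes: a 404 is final, while a 503 is worth retrying. The names `next_action` and `get_with_retry`, and the `MAX_TRIES` and `RETRY_DELAY` variables, are hypothetical and not part of any Manta client. The wrapped command is assumed to print an HTTP status code and nothing else.

```shell
# Map an HTTP status code to what the caller should do next.
next_action() {
    case "$1" in
        2??) echo done ;;      # success
        404) echo missing ;;   # the object is not there; retrying will not help
        503) echo retry ;;     # service unavailable; try again later
        *)   echo fail ;;      # anything else: give up and report
    esac
}

# get_with_retry CMD [ARGS...]: run CMD, which must print a status code,
# and retry only while the answer is 503.
get_with_retry() {
    tries=0
    while [ "$tries" -lt "${MAX_TRIES:-5}" ]; do
        action=$(next_action "$("$@")")
        if [ "$action" != retry ]; then
            echo "$action"
            return 0
        fi
        tries=$((tries + 1))
        sleep "${RETRY_DELAY:-1}"
    done
    echo unavailable
    return 1
}
```

With curl, this could be invoked as `get_with_retry curl -s -o /dev/null -w '%{http_code}' "$url"`, where `$url` stands for whatever object URL you are fetching.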

While CAP tells us that the system cannot be both available and consistent in the face of an extreme network event, that doesn’t mean the system fails in the face of minor failures. Manta is currently deployed in three availability zones in a single region (us-east), and it’s designed to survive any single inter-AZ partition or a complete AZ loss. As expected, availability zones in the region share no physical components (including power) except for being physically close to one another and connected by a high-bandwidth, low-latency interconnect.

Like most complex distributed systems, Manta is itself made up of several different internal services. The only public-facing services are the loadbalancers, which proxy directly to the API services. The API services are clients of other internal services, many of which make use of still other internal services, and so on.

Stateless services

Most of these services are easy to reason about because they’re stateless. These include the frontend loadbalancers, the API servers, authentication caches, job supervisors, and several others.

For each stateless service, we deploy multiple instances in each AZ, each instance on a different physical server. Incoming requests are distributed across the healthy instances using internal DNS with aggressive TTLs. The most common failure here is a software crash. SMF restarts the service, which picks up where it left off.

For the internal services, more significant failures (like a local network partition, power loss, or kernel panic) result in the DNS record expiring and the instance being taken out of service.

Stateful services

Statelessness just pushes the problem around: there must be some service somewhere that ultimately stores the state of the system. In Manta, that lives in two tiers:

  • the storage tier, which stores the contents of users’ objects
  • the metadata tier, which maps user object names to the servers where the data is stored

The storage tier

The availability and durability characteristics of an object are determined in part by its “durability level”. From an API perspective, this indicates the number of independent copies of the object that you want Manta to store. You pay for each copy, and the default value is 2. Copies are always stored on separate physical servers, and the first 3 copies are stored in separate availability zones.

Durability of a single copy: Objects in the storage tier are stored on raidz2 storage pools with two hot spares. The machine has to sustain at least three concurrent disk failures before losing any data, and could survive as many as eight. The use of hot spares ensures that the system can begin resilvering data from failed disks onto healthy ones immediately, in order to reduce the likelihood of a second or third concurrent failure. Keith discusses our hardware choices in more depth on his blog.

Object durability: Because of the above, it’s very unlikely for even a single copy of an object to be lost as a result of storage node failure. If the durability level is greater than 1 (recall that it’s 2 by default), all copies would have to be lost for the object’s data to be lost.

Object availability: When making a request for an object, Manta selects one of the copies and tries to fetch the object from the corresponding storage node. If that node is unavailable, Manta tries another node that has a copy of the object, and it continues doing this until it either finds an available copy or runs out of copies to try. As a result, the object is available as long as the frontend can reach any storage node hosting a copy of the object. As described above, any storage node failure (transient or otherwise) or AZ loss would not affect object availability for objects with at least two copies, though such failures may increase latency as Manta tries to find available copies. Similarly, in the event of an AZ partition, the partitioned AZ’s loadbalancers would be removed from DNS, and the other AZs would be able to service requests for all objects with at least two copies.
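The fallback behavior described above can be sketched in a few lines of shell. This is an illustration only: storage nodes are modeled as local directories, the function name `fetch_object` is made up, and copies are tried in the order given, whereas the text only says that Manta selects one of the copies.

```shell
# fetch_object COPY...: print the first copy that can be read.
# Returns non-zero only if every copy is unavailable.
fetch_object() {
    for copy in "$@"; do
        if [ -r "$copy" ]; then
            cat "$copy"
            return 0
        fi
        echo "copy unavailable: $copy" >&2
    done
    return 1
}

# Demo: two "storage nodes" as directories; the first has no readable copy.
demo=$(mktemp -d)
mkdir "$demo/node-az1" "$demo/node-az2"
echo "hello manta" > "$demo/node-az2/object"
fetched=$(fetch_object "$demo/node-az1/object" "$demo/node-az2/object" 2>/dev/null)
echo "$fetched"
rm -rf "$demo"
```

The request succeeds as long as any one copy is reachable, at the cost of the extra attempt against the unavailable node.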

Since it’s much more likely for a single storage node to be temporarily unavailable than for data to be lost, it may be more useful to think of “durability level” as “availability level”. (This property also impacts the amount of concurrency you can get for an object — see Compute Jobs below.)

Metadata tier

The metadata tier records the physical storage nodes where each object is stored. The object namespace is partitioned into several completely independent shards, each of which is designed to survive the usual failure modes (individual component failure, AZ loss, and single-AZ partition).

Each shard is backed by a postgres database using postgres-based replication from the master to both a synchronous slave and an asynchronous slave. Each database instance within the shard (master, sync slave, and async slave) is located in a different AZ, and we use Zookeeper for election of the master.

The shard requires only one peer to be available for read availability, and requires both master and synchronous slave for write availability. Individual failures (or partitions) of the master or synchronous slave can result in transient outages as the system elects a new leader.
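As a rough illustration of that rule, the sketch below reports read and write availability for a shard given the state of its three peers. The function name and the "up"/"down" arguments are invented for this example, and it deliberately ignores elections: in the real system a failed master or synchronous slave is replaced, so a "writes=no" result here corresponds to the transient outage described above.

```shell
# shard_availability MASTER SYNC ASYNC: each argument is "up" or "down".
shard_availability() {
    master=$1 sync=$2 async=$3
    reads=no
    writes=no
    # Reads need any one peer.
    if [ "$master" = up ] || [ "$sync" = up ] || [ "$async" = up ]; then
        reads=yes
    fi
    # Writes need both the master and the synchronous slave.
    if [ "$master" = up ] && [ "$sync" = up ]; then
        writes=yes
    fi
    echo "reads=$reads writes=$writes"
}

shard_availability up up down       # reads=yes writes=yes
shard_availability up down up       # reads=yes writes=no
shard_availability down down down   # reads=no writes=no
```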

The mechanics of this component are complex and interesting (as in, we learned a lot of lessons in building this). Look for a subsequent blog post from the team about the metadata tier.

Compute Jobs

Manta’s compute engine is built on top of the same metadata and storage tiers. Like the other supporting services, the compute services are effectively stateless and the real state lives in the metadata tier. The compute engine is therefore subject to the availability characteristics of the metadata tier, but it retries internal operations as needed to survive the transient outages described above.

If a given storage node becomes unavailable when there are tasks running on it, those tasks will be retried on a node storing another copy of the object. (That’s one source of the “retries” counter you see in “mjob get”.) Manta makes sure that the results of only one of these instances of the task are actually used.

The durability level of an object affects not only its availability for compute (for the same reasons as it affects availability over HTTP as described above), but also the amount of concurrency you can get on an object. That’s because Manta schedules compute tasks to operate on a random copy of an object. All things being equal, if you have two copies of an object instead of one, you can have twice as many tasks operating on the object concurrently (on twice as many physical systems).

Final notes

You’ll notice that I’ve mainly been talking about transient failures, either of software, hardware, or the network. The only non-transient failure in the system is loss of a ZFS storage pool; any other failure mode is recoverable by replacing the affected components. Objects with durability of at least two would be recoverable in the event of pool loss from the other copies, while objects with durability of one that were stored on that pool would be lost. (But remember: storage pool loss as a result of any normal hardware failure, even multiple failures, is unlikely.)

I also didn’t mention anything about replication in the storage tier. That’s because there is none. When you PUT a new object, we dynamically select the storage nodes that will store that object and then funnel the incoming data stream to both nodes synchronously (or more, if durability level is greater than 2). If we lose a ZFS storage pool, we would have to replicate objects to other pools, but that’s not something that’s triggered automatically in response to failure since it’s not appropriate for most failures.
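The idea of writing all copies from the one incoming stream, rather than replicating afterwards, can be imitated locally with tee. This is only an analogy under stated assumptions: the directories stand in for storage nodes and `put_object` is a made-up name, not Manta code.

```shell
# put_object DEST...: copy stdin to every destination in a single pass.
# tee writes each chunk to all files as it arrives, so no copy is ever
# made from another copy. It exits non-zero if any destination fails.
put_object() {
    tee "$@" > /dev/null
}

demo=$(mktemp -d)
mkdir "$demo/node-a" "$demo/node-b"
printf 'object contents\n' | put_object "$demo/node-a/obj" "$demo/node-b/obj"
put_status=$?
copies_match=no
cmp -s "$demo/node-a/obj" "$demo/node-b/obj" && copies_match=yes
echo "status=$put_status match=$copies_match"
rm -rf "$demo"
```

A durability level above 2 would simply mean passing more destinations to the same call.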

Whether in the physical world or the cloud, infrastructure services have to be highly available. We’re very up front about how Manta works, the design tradeoffs we made, and how it’s designed to survive software failure, hardware component failure, physical server failure, AZ loss, and network partitions. With a three-AZ model, if all three AZs become partitioned from one another, the system chooses strong consistency over availability, which we believe provides a significantly more useful programming model than the alternatives.

For more detail on how we think about building distributed systems, see Mark Cavage’s ACM Queue article “There’s Just No Getting Around It: You’re Building a Distributed System.”

Inside Manta: Distributing the Unix shell

Today, Joyent has launched Manta: our internet-facing object store with compute as a first-class operation. This is the culmination of over a year’s effort on the part of the whole engineering team, and I’m personally really excited to be able to share this with the world. There’s plenty of documentation on how to use Manta, so in this post I want to talk about the guts of my favorite part: the compute engine.

The super-quick crash course on Manta: it’s an object store, which means you can use HTTP PUT/GET/DELETE to store arbitrary byte streams called objects. This is similar to other HTTP-based object stores, with a few notable additions: Unix-like directory semantics, strong read-after-write consistency, and (most significantly) a Unixy compute engine.

Computation in Manta

There’s a terrific Getting Started tutorial already, so I’m going to jump straight to a non-trivial job and explain how it runs under the hood.

At the most basic level, Manta’s compute engine runs arbitrary shell commands on objects in the object store. Here’s my example job:

$ mfind -t o /dap/stor/snpp | mjob create -qom 'grep poochy'

This job enumerates all the objects under /dap/stor/snpp (using the mfind client tool, analogous to Unix find(1)), then creates a job that runs “grep poochy” on each one, waits for the job to finish, and prints the outputs.

I can run this one-liner from my laptop to search thousands of objects for the word “poochy”. Instead of downloading each file from the object store, running “grep” on it, and saving the result back, Manta just runs “grep poochy” inside the object store. The data never gets copied.

Notice that our Manta invocation of “grep” didn’t specify a filename at all. This works because Manta redirects stdin from an object, and grep reads input from stdin if you don’t give it a filename. (There’s no funny business about tacking the filename on to the end of the shell script, as though you’d run ‘grep poochy FILENAME’, though you can do that if you want using environment variables.) This model extends naturally to cover “reduce” operations, where you may want to aggregate over enormous data streams that don’t fit on a single system’s disks.

One command, many tasks

What does it actually mean to run grep on 100 objects? Do you get one output or 100? What if some of these commands succeed, but others fail?

In keeping with the Unix tradition, Manta aims for simple abstractions that can be combined to support more sophisticated use cases. In the example above, Manta does the obvious thing: if the directory has 100 objects, it runs 100 separate invocations of “grep”, each producing its own output object, and each with its own success or failure status. Unlike with a single shell command, a one-phase map job can have any number of inputs, outputs, and errors. You can build more sophisticated pipelines that combine output from multiple phases, but that’s beyond the scope of this post.1
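To get a feel for these semantics without a Manta account, the following sketch emulates a one-phase map job on local files. It is not how Manta is implemented: `run_map` is a hypothetical helper, inputs are ordinary files, and a non-zero exit status is treated as a task error, which is an assumption of this sketch.

```shell
# run_map OUTDIR COMMAND: read input file names from stdin, one per line,
# and run COMMAND once per input with that file on stdin. Each task gets
# its own output file and its own ok/error line.
run_map() {
    outdir=$1
    cmd=$2
    n=0
    while IFS= read -r input; do
        n=$((n + 1))
        if sh -c "$cmd" < "$input" > "$outdir/task.$n.out" 2>/dev/null; then
            echo "ok $input"
        else
            echo "error $input"
        fi
    done
}

demo=$(mktemp -d)
mkdir "$demo/in" "$demo/out"
echo "poochy is here" > "$demo/in/a.txt"
echo "nothing to see" > "$demo/in/b.txt"
echo "more poochy"    > "$demo/in/c.txt"
results=$(find "$demo/in" -type f | sort | run_map "$demo/out" 'grep poochy')
echo "$results"
rm -rf "$demo"
```

Three inputs yield three independent tasks: two report "ok" and one reports "error", because grep exits non-zero when it finds no match.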

How does it work?

Manta’s compute engine hinges on three SmartOS (illumos) technologies:

  • Zones: OS-based virtualization, which allows us to run thousands of these user tasks concurrently in lightweight, strongly isolated environments. Each user’s program runs as root in its own zone, and can do whatever it wants there, but processes in the zone have no visibility into other zones or the rest of the system.
  • ZFS: ZFS’s copy-on-write semantics and built-in snapshots allow us to completely reset zones between users. Your program can scribble all over the filesystem, and when it’s done we roll it back to its pristine state for the next user. (We also leverage ZFS clones for the filesystem image: we have one image with tens of gigabytes of software installed, and each zone’s filesystem is a clone of that single image, for almost zero disk space overhead per zone.)
  • hyprlofs: a filesystem we developed specifically for Manta, hyprlofs allows us to mount read-only copies of files from one filesystem into another. The difference between hyprlofs and traditional lofs is that hyprlofs supports commands to map and unmap files on-demand, and those files can be backed by arbitrary other filesystems. More on this below.

In a nutshell: each copy of a Manta object is stored as a flat file in ZFS. On the same physical servers where these files are stored, there are a bunch of compute zones for running jobs.

As you submit the names of input objects, Manta locates the storage servers containing a copy of each object and dispatches tasks to one server for each object. That server uses hyprlofs to map a read-only copy of the object into one of the compute zones. Then it runs the user’s program in that zone and uses zfs rollback to reset the zone for the next tenant. (There’s obviously a lot more to making this system scale and survive component failures, but that’s the basic idea.)

What’s next?

In this post I’ve explained the basics of how Manta’s compute engine works under the hood, but this is a very simple example. Manta supports more sophisticated distributed computation, including reducers (including multiple reducers) and multi-phase jobs (e.g., map-map-map).

Because Manta uses the Unix shell as the primitive abstraction for computation, it’s very often trivial to turn common shell invocations that you usually run sequentially on a few files at a time into Manta jobs that run in parallel on thousands of objects. For tasks beyond the complexity of a shell script, you can always execute a program in some other language; that’s, after all, what the shell does best. We’ve used it for applications ranging from converting image files to generating aggregated reports on activity logs. (In fact, we use the jobs facility internally to implement metering, garbage collection of unreferenced objects, and our own reports.) My colleague Bill has already used it to analyze transit times on SF Muni. Be on the lookout for a rewrite of kartlytics based on Manta.

We’re really excited about Manta, and we’re looking forward to seeing what people do with it!

1 Manta’s “map” is like the traditional functional programming primitive that performs a transformation on each of its inputs. This is similar to, but not the same as, the Map in MapReduce environments, which specifically operates on key-value pairs. You can do MapReduce with Manta by having your program parse key-value pairs from objects and emit key-value pairs as output, but you can also do other transformations that aren’t particularly well-suited to key-value representation (like video transcoding, for example).

Debugging dynamic library dependencies on illumos

In this short follow-up to my post on illumos process tools, I’ll expand a bit on ldd and pldd, which print the dynamic linking dependencies of binaries and processes, respectively, and crle, which prints out the runtime linker configuration. These tools are available in most illumos distributions including SmartOS.

Understanding builds (and broken builds in particular) can be especially difficult. I hate running into issues like this one:

$ ffmpeg
ld.so.1: ffmpeg: fatal: libavdevice.so.53: open failed: No such file or directory

You can use ldd to see the dynamic library dependencies of a binary:

$ ldd $(which ffmpeg)
        libavdevice.so.53 =>     (file not found)
        libavfilter.so.2 =>      (file not found)
        libavformat.so.53 =>     (file not found)
        libavcodec.so.53 =>      (file not found)
        libswresample.so.0 =>    (file not found)
        libswscale.so.2 =>       (file not found)
        libavutil.so.51 =>       (file not found)
        libsocket.so.1 =>        /lib/libsocket.so.1
        libnsl.so.1 =>   /lib/libnsl.so.1
        libvpx.so.0 =>   /opt/local/lib/libvpx.so.0
        libm.so.2 =>     /lib/libm.so.2
        libbz2.so.0 =>   /opt/local/lib/libbz2.so.0
        libz.so.1 =>     /lib/libz.so.1
        libc.so.1 =>     /lib/libc.so.1
        libmp.so.2 =>    /lib/libmp.so.2
        libmd.so.1 =>    /lib/libmd.so.1
        libpthread.so.1 =>       /lib/libpthread.so.1
        librt.so.1 =>    /lib/librt.so.1
        libgcc_s.so.1 =>         /opt/local/lib/libgcc_s.so.1

In this case, the problem is that I installed ffmpeg into /usr/local, but the ffmpeg build appears not to have used the -R linker flag, which tells the runtime linker where to look for dynamic libraries when the program is loaded. As a result, ffmpeg doesn’t know where to find its own libraries. If I set LD_LIBRARY_PATH, I can see that it will work:

$ LD_LIBRARY_PATH=/usr/local/lib ldd $(which ffmpeg)
        libavdevice.so.53 =>     /usr/local/lib/libavdevice.so.53
        libavfilter.so.2 =>      /usr/local/lib/libavfilter.so.2
        libavformat.so.53 =>     /usr/local/lib/libavformat.so.53
        libavcodec.so.53 =>      /usr/local/lib/libavcodec.so.53
        libswresample.so.0 =>    /usr/local/lib/libswresample.so.0
        libswscale.so.2 =>       /usr/local/lib/libswscale.so.2
        libavutil.so.51 =>       /usr/local/lib/libavutil.so.51

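I resolved this by rebuilding ffmpeg explicitly with LDFLAGS += -R/usr/local/lib, so that the runtime search path recorded in the binary matches the directory where the libraries were installed.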
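When a binary has a long dependency list, it can help to print only the unresolved entries. The small filter below is a sketch; `missing_libs` is a made-up helper name, and it simply matches the "not found" text in ldd output rather than relying on any ldd option.

```shell
# missing_libs: read ldd output on stdin and print the names of the
# libraries that the runtime linker could not resolve.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Sample input: a few lines from the ldd output shown earlier.
missing=$(missing_libs <<'EOF'
        libavdevice.so.53 =>     (file not found)
        libavutil.so.51 =>       (file not found)
        libsocket.so.1 =>        /lib/libsocket.so.1
        libc.so.1 =>     /lib/libc.so.1
EOF
)
echo "$missing"
```

Against a real binary, the same filter would be used as `ldd "$(which ffmpeg)" | missing_libs`.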

ldd only examines binaries, and so can only print out dependencies built into the binary. Some programs use dlopen to open libraries whose name isn’t known until runtime. Node.js add-ons and Apache modules are two common examples. You can view these with pldd, which prints the dynamic libraries loaded in a running process. Here’s the output on a Node program with the node-dtrace-provider add-on:

$ pfexec pldd $(pgrep -x node)
32113:  /usr/local/bin/node /home/snpp/current/js/snpp.js -l 80 -d

If you want to see where the system looks for dynamic libraries, use crle, which prints or edits the runtime linker configuration:

$ crle
Configuration file [version 4]: /var/ld/ld.config
  Platform:     32-bit LSB 80386
  Default Library Path (ELF):   /lib:/usr/lib:/opt/local/lib
  Trusted Directories (ELF):    /lib/secure:/usr/lib/secure  (system default)
Command line:
  crle -c /var/ld/ld.config -l /lib:/usr/lib:/opt/local/lib

Of course, for more information on any of these tools, check out their man pages. They’re well-documented. If you find yourself debugging build problems, you’ll probably also want to know about nm, objdump, and elfdump, which are available on many systems and well documented elsewhere.

illumos tools for observing processes

illumos, like Solaris before it, has a history of delivering rich tools for understanding the system, but discovering these tools can be difficult for new users. Sometimes, tools just have different names than people are used to. In many cases, users don’t even know such tools might exist.

In this post I’ll describe some tools I find most useful, both as a developer and an administrator. This is not intended to be a comprehensive reference, but more like part of an orientation for users new to illumos (and SmartOS in particular) but already familiar with other Unix systems. This post will likely be review for veteran illumos and Solaris users.

The proc tools (ptools)

The ptools are a family of tools that observe processes running on the system. The most useful of these are pgrep, pstack, pfiles, and ptree.

pgrep searches for processes, returning a list of process ids. Here are some common example invocations:

$ pgrep mysql         # print all processes with "mysql" in the name
                      # (e.g., "mysql" and "mysqld")
$ pgrep -x mysql      # print all processes whose name is exactly "mysql"
                      # (i.e., not "mysqld")
$ pgrep -ox mysql     # print the oldest mysql process
$ pgrep -nx mysql     # print the newest mysql process
$ pgrep -f mysql      # print processes matching "mysql" anywhere in the name
                      # or arguments (e.g., "vim mysql.conf")
$ pgrep -u dap        # print all of user dap's processes

These options let you match processes very precisely and allow scripts to be much more robust than “ps -A | grep foo” allows.

I often combine pgrep with ps. For example, to see the memory usage of all of my node processes, I use:

$ ps -opid,rss,vsz,args -p "$(pgrep -x node)"
 4914 94380 98036 /usr/local/bin/node demo.js -p 8080
32113 92616 95964 /usr/local/bin/node demo.js -p 80

pkill is just like pgrep, but sends a signal to the matching processes.

pstack shows you thread stack traces for the processes you give it:

$ pstack 51862
51862:      find /
 fedd6955 getdents64 (fecb0200, 808ef87, 804728c, fedabd84, 808ef88, 804728c) + 15
 0805ee9c xsavedir (808ef87, 0, 8089a90, 1000000, 0, fee30000) + 7c
 080582dc process_path (808e818, 0, 8089a90, 1000000, 0, fee30000) + 33c
 080583ee process_path (808e410, 0, 8089a90, 1000000, 0, fee30000) + 44e
 080583ee process_path (808e008, 0, 8089a90, 0, 0, fecb2a40) + 44e
 080583ee process_path (8047cbd, 0, 8089a90, 0, fef40c20, fedc78b6) + 44e
 080583ee process_path (8075cd0, 0, 2f, fed59274, 8047b48, 8047cbd) + 44e
 08058931 do_process_top_dir (8047cbd, 8047cbd, 0, 0, 0, 0) + 21
 08057c5e at_top   (8058910, 2f, 8047bb0, 8089a90, 28, 80571f0) + 9e
 08072eda main     (2, 8047bcc, 8047bd8, 80729d0, 0, 0) + 4ea
 08057093 _start   (2, 8047cb8, 8047cbd, 0, 8047cbf, 8047cd3) + 83

This is incredibly useful as a first step for figuring out what a program is doing when it’s slow or not responsive.

pfiles shows you what file descriptors a process has open, similar to “lsof” on Linux systems, but for a specific process:

$ pfiles 32113
32113:      /usr/local/bin/node /home/snpp/current/js/snpp.js -l 80 -d
  Current rlimit: 1024 file descriptors
   0: S_IFCHR mode:0666 dev:527,6 ino:2848424755 uid:0 gid:3 rdev:38,2
   1: S_IFREG mode:0644 dev:90,65565 ino:38817 uid:0 gid:0 size:793928
   2: S_IFREG mode:0644 dev:90,65565 ino:38817 uid:0 gid:0 size:793928
   3: S_IFPORT mode:0000 dev:537,0 uid:0 gid:0 size:0
   4: S_IFIFO mode:0000 dev:524,0 ino:6257976 uid:0 gid:0 size:0
   5: S_IFIFO mode:0000 dev:524,0 ino:6257976 uid:0 gid:0 size:0
   6: S_IFSOCK mode:0666 dev:534,0 ino:23280 uid:0 gid:0 size:0
    sockname: AF_INET  port: 80
   7: S_IFREG mode:0644 dev:90,65565 ino:91494 uid:0 gid:0 size:6999682

This includes details on files (including offset, which is great for checking on programs that scan through large files) and sockets.

ptree shows you a process tree for the whole system or for a given process or user. This is great for programs that use lots of processes (like a build):

$ ptree $(pgrep -ox make)
4599  zsched
  6720  /usr/lib/ssh/sshd
    45902 /usr/lib/ssh/sshd
      45903 /usr/lib/ssh/sshd
        45906 -bash
          54464 make -j4
            54528 make -C out BUILDTYPE=Release
                55719 /opt/local/libexec/gcc/i386-pc-solaris2.11/4.6.2/cc1 -quiet -I
                55758 /opt/local/libexec/gcc/i386-pc-solaris2.11/4.6.2/cc1 -quiet -I
              55769 sed -e s|^bf_null.o|/home/dap/node/out/Release/obj.target/openss
              55771 /bin/sh -c sed -e "s|^bf_nbio.o|/home/dap/node/out/Release/obj.t

Here’s a summary of these and several other useful ptools:

  • pgrep/pkill: search processes (and signal them)
  • pstack: print thread stack traces
  • ptree: print process tree
  • pargs [-e]: print process arguments (and environment variables)
  • pmap: print process virtual address mappings
  • pwdx: print a process’s working directory
  • pstop: stop a process (as a debugger would — useful for testing what happens when a process hangs or otherwise gets delayed)
  • prun: run a stopped process
  • plockstat: print lock statistics for a process
  • psig: print a process’s signal dispositions
  • pwait: wait for a process to terminate
  • ptime: print detailed timing stats for a process
  • pldd: print dynamic libraries for a process
  • fuser: show which processes have a given file open (not technically a ptool, but useful nonetheless)

Some of these tools (including pfiles and pstack) will briefly pause the process to gather their data. For example, “pfiles” can take several seconds if there are many file descriptors open.

For details on these and a few others, check their man pages, most of which are in proc(1).

Core files

Many of the proc tools operate on core files just as well as live processes. Core files are created when a process exits abnormally, for example via abort(3C) or a SIGSEGV. You can also create one on demand with gcore:

$ gcore 45906
gcore: core.45906 dumped

$ pstack core.45906
core 'core.45906' of 45906:     -bash
 fee647f5 waitid   (7, 0, 8047760, f)
 fee00045 waitpid  (ffffffff, 8047838, c, 108a7, 3, 8047850) + 65
 0808f4c3 waitchld (0, 0, 0, 0, 20000, 0) + 87
 0808ffc6 wait_for (108a7, 0, 813c128, 3e, 330000, 78) + 2ce
 08082ee8 execute_command_internal (813b348, 0, ffffffff, ffffffff, 813c128) + 1758
 08083d3d execute_command (813b348, 1, 8047b58, 8071a7d, 0, 0) + 45
 08071c18 reader_loop (fed90b2c, 80663dd, 8047c34, fed90dc8, 8069380, 0) + 240
 080708e3 main     (1, 8047dfc, 8047e04, 80eb9f0, 0, 0) + aff
 0806f32b _start   (1, 8047ea4, 0, 8047eaa, 8047eb3, 8047ebf) + 83

Lazy tracing of system calls

DTrace can trace system calls across the system with minimal impact, but for cases where the overhead is not important and you only care about one process, truss can be a convenient tool because it decodes arguments and return values for you:

$ truss -p 3135
sysconfig(_CONFIG_PAGESIZE)                     = 4096
ioctl(1, TCGETA, 0x080479F0)                    = 0
ioctl(1, TIOCGWINSZ, 0x08047B88)                = 0
brk(0x08086CA8)                                 = 0
brk(0x0808ACA8)                                 = 0
open(".", O_RDONLY|O_NDELAY|O_LARGEFILE)        = 3
fcntl(3, F_SETFD, 0x00000001)                   = 0
fstat64(3, 0x08047940)                          = 0
getdents64(3, 0xFEC84000, 8192)                 = 720
getdents64(3, 0xFEC84000, 8192)                 = 0

When debugging path-related issues (like why Node.js can’t find the module you’re requiring), it’s often useful to trace just calls to “open” and “stat” with “truss -topen,stat”. This is also good for watching commands that traverse a directory tree, like “tar” or “find”.

DTrace and MDB

I mention DTrace and MDB last, but they’re the most comprehensive, most powerful tools in the system for understanding program behavior. The tools described above are simpler and present the most commonly useful information (e.g., process arguments or open file descriptors), but when you need to get arbitrary information about the system, these two are the tools to use.

DTrace is a comprehensive tracing framework for both the kernel and userland apps. It’s safe by design, has zero overhead when not enabled, and minimizes overhead when enabled. DTrace has hundreds of thousands of probes at the kernel level, including system calls (system-wide), the scheduler, the I/O subsystem, ZFS, process execution, signals, and most function entry/exit points in the kernel. In userland, DTrace instruments function entry and exit points, individual instructions, and arbitrary probes added by application developers. At each of these instrumentation points, you can gather information like the currently running process, a kernel or userland stack backtrace, function arguments, or anything else in memory. To get started, I’d recommend Adam Leventhal’s DTrace boot camp slides. (The context and instructions for setup are a little dated, but the bulk of the content is still accurate.)

MDB is the modular debugger. Like GDB on other platforms, it’s most useful for deep inspection of a snapshot of program state. That can be a userland program or the kernel itself, and in both cases you can open a core dump (crash dump, for the kernel) or attach to the running program (kernel). As you’d expect, MDB lets you examine the stack, global variables, threads, and so on. The syntax is a little arcane, but the model is Unixy, allowing debugger commands to be strung together much like a shell pipeline. Eric Schrock has two excellent posts for people moving from GDB to MDB.

Let me know if I’ve missed any of the big ones. I’ll be writing a few more posts on tools in other areas of the system.