Delphix and Flash

I started working with flash in 2006 — fortunate timing as flash was just starting to take hold in the enterprise. I started asking customers I’d visit about flash. I’ll always remember the response from an early adopter when I asked about how he planned on using the new, expensive storage, “We just bought it, and we have no idea.” It was a solution in search of a problem — the garbage can model at play.

Flash has evolved significantly since then, from a raw material used on its own to a component in systems of increasing complexity. I wrote recently about the various techniques being employed to get the most out of flash; all share the basic idea of trading compute and IOPS (abundant commodities) for capacity (still more expensive for flash than hard drives). The ideal use cases are the ones that benefit most from that trade-off, ones where compression and dedup consume cheap compute cycles rather than expensive space on the NAND flash. Flash storage is best with data that contains high degrees of redundancy that clever software can squeeze out. With those loose criteria, it’s been amazing to me how flash storage vendors have flocked to the VDI use case. It’s certainly well-suited (big on IOPS, with nearly identical data from different Windows installs that’s easily compressed and deduped), but seemingly every flash vendor has decided that it’s one part, if not the part, of the market they want to address. Take a look at the information on VDI from various flash storage vendors: Fusion, Nimble, Pure Storage, Tegile, Tintri, Violin, Virident, Whiptail… the list goes on and on.

I worked extensively with flash until 2010, when I left Oracle for a start-up. I ended up not sticking with flash precisely because it was, and is, such a crowded space. I’d happily bet on the space, but it was harder to pick one winner. One of the things that drew me to Delphix, though, was precisely its compatibility with flash. At Delphix we create virtual database copies by sharing blocks; think of it as dedup before the fact, or dedup but without the runtime tax. Creating a virtual copy happens almost instantaneously, saving tremendous amounts of administration time, unblocking developers, and accelerating projects; hence our credo of agile data. Unlike storage-based snapshots, Delphix virtual copies are database-aware: provisioning is fully integrated and automated. Those virtual copies also take up much less physical space, but with as many or more IOPS hitting the aggregate of those virtual copies. Sound familiar yet? One tenth the capacity with the same workload (let’s call it 10x greater IOPS intensity) is ideally suited for flash storage.

Flash storage is best when clever software can squeeze out redundancies; Delphix is that clever software for databases. Delphix customers are starting to combine our product with their flash storage purchases. An all-flash array that’s 5x the $/TB of disk storage suddenly becomes half the price of disk storage when combined with Delphix, since the virtual copies need only about a tenth of the capacity, and it comes with substantially better performance. We as an industry still haven’t realized the full potential of flash storage. Agile data through Delphix fills in another big piece of the flash picture.

Posted on May 6, 2013 at 4:28 am by ahl
In: Delphix, Flash

On Systems Software

A prospective new college hire recently related an odd comment from his professor: systems programming is dead. I was nonplussed; what could the professor have meant? Systems is clearly very much alive. Interesting and important projects march under the banner of systems. But as I tried to construct a less emotional rebuttal, I realized I lacked a crisp definition of what systems programming is.

Wikipedia defines systems software in the narrowest terms: the stuff that interacts with hardware. But that covers a tiny fraction of modern systems. So what is systems software? It depends on when you’re asking the question. At one time, the web server was the application; now it’s the systems software on which many web-facing applications are built. At one time a database was the application; now it’s systems software that supports a variety of custom and off-the-shelf applications. Before my time, shells were probably considered a bleeding-edge application; now they’re systems software on which some of the lowest-level plumbing of modern operating systems is built.

Any layer on which people build applications of increasing complexity is systems software. Most software that endures the transition to systems software does so whether its authors intended it or not. People in the software industry often talk about standing on the shoulders of giants; the layers of systems software accumulated and refined over decades are those giants.

Stable interfaces define systems software. The programs that consume those interfaces expect the underlying systems software to be perfect every time. Initially, innovation might happen in the interfaces themselves; the concurrency model of Node.js is a great example. As software matures, the interfaces become commodified; innovation happens behind those stable interfaces. Systems is only “dead” at its edges. Interfaces might be flexible and well-designed, or sclerotic and poorly designed. Regardless, new or improved systems software can increase performance, enhance observability, or simply fit a different economic niche.

There are a few different types of systems software. First there’s supporting systems software, systems software written as a necessary foundation for some new application. This is systems software written with a purpose and designed to solve an unsolved (or poorly solved) problem. Chronologically, examples include UNIX, compilers, and libraries like jQuery. You write it because you need it, but it’s solving a problem that’s likely not unique to your particular application.

Then there’s accidental systems software. Stick everything from Apache to Excel to the Bourne shell in that category. These didn’t necessarily set out to be the foundation on which increasingly complex software would be written, but they definitely are. I’m sure there were times when indoctrination into systems-hood was painful, where the authors wanted to change interfaces, but good systems software respects its consumers and carries them forward. Somewhat famously, make preserved its arcane syntax because two consumers already existed. JavaScript started as a glorified web design tool; now it sits several layers beneath complex client-side applications. Even more recently, developers of Node.js (itself JavaScript-based) changed a commonly used interface, breaking many applications. Historical mistakes can be annoying to live with, but, as the Node.js team determined, compatibility trumps cleanliness.

The largest bucket is replacement systems software. Linux, Java, ZFS, and DTrace fall into this category. At the time of their development, each was a notionally compatible replacement for something that already existed. Linux, of course, reimplemented the UNIX design to provide a free, compatible alternative. Java set about building a better runtime (the stable interface being a binary provided to customers to execute) designed to abstract away the operating system entirely. ZFS represented a completely new way of thinking about filesystems, but it did so within the tight constraints of POSIX interfaces and storage hardware. DTrace added new and unique observability to most of the stable interfaces that applications build on.

Finally, there’s intentional systems software. This is systems software by design, but unlike supporting systems software, there’s no consumer. Intentional systems software takes an “if you build it, they will come” approach. This is highly prone to failure — without an existence proof that your software solves a problem and exposes the right interfaces, it’s very difficult to know if you’re building the right thing.

Why define these categories? Knowing which you’re working with can inform your decisions. If you’ve written accidental systems software that has had systems-ness thrust upon it, realize that future versions need to respect the consumers, or willfully cast them aside. When writing replacement systems software, recognize the constraints on the system, and know exactly where you’re constrained and where you can innovate (or consider whether you’d rather use the existing solution). If you’ve written supporting systems software, know that others will inevitably need solutions to the same problems. Either invest in maintaining it and keeping it best of breed; resign yourself to the fact that it will need to be replaced as others invest in a different solution; or open source it and hope (or advocate) that it becomes that ubiquitous solution.

TL;DR?

What’s systems software? It is the increasingly complex, increasingly capable, increasingly diverse foundation on which applications are built. It’s that long and growing tail of the corpus of software at large. The interfaces might be static, but it’s a rich venue for innovation. The more critical applications build on an interface, the more value there is in improving the systems software beneath it. Systems software is defined by the constraints; it’s a mission and a mindset with unique challenges, and unique potential.

Posted on February 25, 2013 at 4:46 am by ahl
In: Software

The Holistic Engineer

The idea of the holistic engineer embodies the point of view that an engineer needs to consider the whole system, the whole body of work that makes a product successful. It bears no relation to holistic health — and it’s not some even newer age quackery. There are many specialist roles in the software industry — marketing, product management, project management, documentation, education, support, etc. — but the best software engineers are generalists who can assume a portion of each specialty. Further, some software is particularly well-suited for generalists who can combine a deep understanding of the market, the technology, and the implementation.

Software products are born of many different types of organizations, and even within similar organizations roles might have different names. Here’s a generic example with some names on the roles. New products and features start with product managers. Their role is to talk to customers and sales, educate themselves on the market, and determine the right product or enhancement. The handoff to engineering takes the form of a product requirements document (PRD) — it might sound like jargon, but the term is more or less universal. Software engineers execute against that PRD; QA engineers design tests that assert conformance to the PRD while developers steer the product from point A to point B as described by product management. Documentation writers and learning services take the PRD and the software to generate collateral that teaches customers how to use it. Product marketing makes the PowerPoints; sales presents them to customers.

And that’s where babies come from.

It’s not a perfect process, but it’s birthed many successful products. The shortcoming is that it can bury engineers under filters. Instead of learning about actual customer problems, engineers hear some processed form of what the customer said. Instead of raw critique of a new feature, engineers hear a softened and truncated form. The more technical the product and the market, the more those filters impede innovation and hamper the trajectory of the product.

The holistic engineer augments the jobs of those specialists, participating in each phase of product development. They join in those early conversations with customers, and share the responsibility of market comprehension. They partner to construct the requirements and design that those engineers will then implement. Along the way, engineers of course validate decisions with sales and customers — this is Agile writ large — but engineers also participate in the outbound documentation, training, and marketing activities.

From start to finish, the process is designed to fuel innovation by arming creative engineers with data and understanding. Customers often tell you what they want; they rarely tell you what they need. The more technical or disruptive the product, the more value an engineer has in those conversations, extracting the essence of the problem from the noise of preconceptions. The relationships with customers and the full context around their problems keep engineers grounded as the inevitable gaps emerge in the product specification. Holistic engineers also help to educate the rest of the company and the rest of the world about new products and features. The process of explaining technology informs the way engineers design and build products. When we’re having a hard time explaining a feature or presenting a product, we need to revise our design. We’ve all heard engineering accused of building a product that was too complicated for the market, or engineers complain that a product failed because it was poorly marketed; both are symptoms of poor coordination. Giving engineers holistic responsibility guards against this problem: if the product is failing, the onus is on them to solve it.

Most important though are the feelings of ownership and agency associated with the whole-body approach. The holistic engineer is explicitly tasked with making a product succeed. That’s not to say that he or she goes it alone — specialists in all functions have major roles — rather the engineer is empowered to move the product through all stages; the other side of that coin is that there’s no opportunity to shrug off a responsibility as belonging to someone else.

In this model, everyone in every role at the company has the opportunity to engage in product management. Even so, there’s still value in explicit product management. Channels of communication need to be easy and open for people with ideas to connect to people who will distill them into implementation. And it’s not enough to just create the right environment; hiring processes need to identify broad thinkers, and mentorship needs to nurture and reward holistic execution. Not every engineer can (or wants to) take on those additional responsibilities, but the best thrive with market and technology awareness, unencumbered by filters. They want responsibility and authority to make their ideas succeed.

The idea of the holistic engineer isn’t theoretical; it’s the model we stumbled into in the Solaris Kernel Group, and later implemented deliberately at Fishworks. There, a small team took on wide-ranging responsibilities to build a product that’s now doing $400m/year for Oracle. At Delphix we’re again inculcating and hiring for holistic thinking. At all three I’ve seen engineers develop new products and features that address customer needs that would otherwise never have emerged from customers’ initial requests. It’s not easy to find the right kind of engineers, but if a company can empower the right engineers in the right ways, and they can live up to the responsibility, the payoff is a better product, built more efficiently.

Posted on February 6, 2013 at 8:02 am by ahl
In: Software

ZFS fundamentals: transaction groups

I’ve continued to explore ZFS as I try to understand performance pathologies and improve performance. A particular point of interest has been the ZFS write throttle, the mechanism ZFS uses to avoid filling all of system memory with modified data. I’m eager to write about the strides we’re making in that regard at Delphix, but it’s hard to appreciate without an understanding of how ZFS batches data. Unfortunately, that explanation is literally nowhere to be found.

Back in 2001 I had not yet started working on DTrace, and was talking to Matt and Jeff, the authors of ZFS, about joining them. They had only been at it for a few months; I was fortunate to be in a conference with them as the ideas around transaction groups took shape. Transaction groups are how ZFS batches up chunks of data to be written to disk (“groups” of “transactions”). Jeff stood at the whiteboard and drew the progression of states for transaction groups, from open, accepting new transactions, to quiescing, allowing transactions to complete, to syncing, writing data out to disk. As far as I can tell, that was both the first time that picture had been drawn and the last. If you search for information on ZFS transaction groups you’ll find mention of those states… and not much else. The header comment in usr/src/uts/common/fs/zfs/txg.c isn’t particularly helpful:

/*
 * Pool-wide transaction groups.
 */

I set out to write a proper description of ZFS transaction groups. I’m posting it here first, and I’ll be offering it as a submission to illumos. Many thanks to Matt Ahrens, George Wilson, and Max Bruning for their feedback.

ZFS Transaction Groups

ZFS transaction groups are, as the name implies, groups of transactions that act on persistent state. ZFS asserts consistency at the granularity of these transaction groups. Each successive transaction group (txg) is assigned a 64-bit consecutive identifier. There are three active transaction group states: open, quiescing, and syncing. At any given time, there may be an active txg associated with each state; each active txg may either be processing, or blocked waiting to enter the next state. There may be up to three active txgs, and there is always a txg in the open state (though it may be blocked waiting to enter the quiescing state). In broad strokes, transactions (operations that change in-memory structures) are accepted into the txg in the open state, and are completed while the txg is in the open or quiescing states. The accumulated changes are written to disk in the syncing state.

Open

When a new txg becomes active, it first enters the open state. New transactions — updates to in-memory structures — are assigned to the currently open txg. There is always a txg in the open state so that ZFS can accept new changes (though the txg may refuse new changes if it has hit some limit). ZFS advances the open txg to the next state for a variety of reasons such as it hitting a time or size threshold, or the execution of an administrative action that must be completed in the syncing state.

Quiescing

After a txg exits the open state, it enters the quiescing state. The quiescing state is intended to provide a buffer between accepting new transactions in the open state and writing them out to stable storage in the syncing state. While quiescing, transactions can continue their operation without delaying either of the other states. Typically, a txg is in the quiescing state very briefly since the operations are bounded by software latencies rather than, say, slower I/O latencies. After all transactions complete, the txg is ready to enter the next state.

Syncing

In the syncing state, the in-memory state built up during the open and (to a lesser degree) the quiescing states is written to stable storage. The process of writing out modified data can, in turn, modify more data. For example, when we write new blocks, we need to allocate space for them; those allocations modify metadata (space maps)… which themselves must be written to stable storage. During the syncing state, ZFS iterates, writing out data until it converges and all in-memory changes have been written out. The first such pass is the largest as it encompasses all the modified user data (as opposed to filesystem metadata). Subsequent passes typically have far less data to write as they consist exclusively of filesystem metadata.

To ensure convergence, after a certain number of passes ZFS begins overwriting locations on stable storage that had been allocated earlier in the syncing state (and subsequently freed). ZFS usually allocates new blocks to optimize for large, contiguous writes. For the syncing state to converge, however, it must complete a pass where no new blocks are allocated, since each allocation requires a modification of persistent metadata. Further, to hasten convergence, after a prescribed number of passes, ZFS also defers frees and stops compressing.

In addition to writing out user data, we must also execute synctasks during the syncing context. A synctask is the mechanism by which some administrative activities work such as creating and destroying snapshots or datasets. Note that when a synctask is initiated it enters the open txg, and ZFS then pushes that txg as quickly as possible to completion of the syncing state in order to reduce the latency of the administrative activity. To complete the syncing state, ZFS writes out a new uberblock, the root of the tree of blocks that comprise all state stored on the ZFS pool. Finally, if there is a quiesced txg waiting, we signal that it can now transition to the syncing state.

What else?

Please let me know if you have suggestions for how to improve the descriptions above. There’s more to be written on the specifics of the implementation, transactions, the DMU, and, well, ZFS in general. One thing that I’d note is that Matt mentioned to me recently that were he starting from scratch, he might eliminate the quiescing state. I didn’t understand fully until I researched the subsystem. Typically transactions take a very brief amount of time to “complete”, time measured by CPU latency as opposed, say, to I/O latency. Had the quiescing phase been merged into the syncing phase, the design would be slightly simpler, and it would eliminate the mostly idle intermediate phase where a bunch of dirty data can sit in memory relatively idle.

Next I’ll write about the ZFS write throttle, its various brokenness, and our ideas for how to fix it.

Posted on December 13, 2012 at 6:17 am by ahl
In: ZFS

illumos and ZFS days

Back in October I was pleased to attend — and my employer, Delphix, was pleased to sponsor — illumos day and ZFS day, run concurrently with Oracle Open World. Inspired by the success of dtrace.conf(12) in the Spring, the goal was to assemble developers, practitioners, and users of ZFS and illumos-derived distributions to educate, share information, and discuss the future.

illumos day

The week started with the developer-centric illumos day. While illumos picked up the torch when Oracle re-closed OpenSolaris, each project began with a very different focus. Sun and the OpenSolaris community obsessed over inclusion and developer adoption, often counterproductively. The illumos community is led by those building products based on the unique technologies in illumos: DTrace, ZFS, Zones, COMSTAR, etc. While all are welcome, it’s those who contribute the most whose voices are most relevant.

I was asked to give a talk about technologies unique to illumos that are unlikely to appear in Oracle Solaris. It was only when I started to prepare the talk that the difference in priorities between illumos and Oracle Solaris came into sharp focus. In the illumos community, we’ve advanced technologies such as ZFS in ways that would benefit Oracle Solaris greatly, but Oracle has made it clear that open source is anathema for its version of Solaris. For example, at Delphix we’ve recently been fixing bugs, asking ourselves, “how has Oracle never seen this?”

Yet the differences between illumos and Oracle Solaris are far deeper. In illumos we’re building products that rely on innovation and differentiation in the operating system, and it’s those higher-level products that our various customers use. At Oracle, the priorities are more traditional: support for proprietary SPARC platforms, packaging and updating for administrators, and ease-of-use. In my talk, rather than focusing on the sundry contributions to illumos, I picked a few of my favorites. The slides are more or less incomprehensible on their own; many thanks to Deirdre Straughan for posting the video (and for putting together the event!) — check out 40:30 for a photo of Jean-Luc Picard attending the DTrace talk at OOW.

ZFS day

While illumos day was for developers, ZFS day was for users of ZFS to learn from each other’s experiences, and hear from ZFS developers. I had the ignominious task of presenting an update on the Hybrid Storage Pool (HSP). We developed the HSP at Fishworks as the first enterprise storage system to add flash memory into the storage hierarchy to accelerate reads and writes. Since then, economics and physics have thrown up some obstacles: DRAM has gotten cheaper, and flash memory is getting harder and harder to turn into a viable enterprise solution. In addition, the L2ARC, which adds flash as a ZFS read cache, has languished; it has serious problems that no one has been motivated or proficient enough to address.

I’ll warn you that after the explanation of the HSP, it’s mostly doom and gloom (also I was sick as a dog when I prepared and gave the talk), but check out the slides and video for more on the promise and shortcomings of the HSP.

Community

For both illumos day and ZFS day, it was a mostly full house. Reuniting with the folks I already knew was fun as always, but even better was to meet so many who I had no idea were building on illumos or using ZFS. The events highlighted that we need to facilitate more collaboration — especially around ZFS — between the illumos distros, FreeBSD, and Linux — hell, even Oracle Solaris would be welcome.

Posted on November 25, 2012 at 7:03 pm by ahl
In: illumos, ZFS

ZFS trivia: metaslabs and growing vdevs

Lately, I’ve been rooting around in the bowels of ZFS as we’ve explored some long-standing performance pathologies. To that end I’ve been fortunate to learn at the feet of Matt Ahrens, who was half of the ZFS founding team, and George Wilson, who has forgotten more about ZFS than most people will ever know. I wanted to start sharing some of the interesting details I’ve unearthed.

For allocation purposes, ZFS carves vdevs (disks) into a number of “metaslabs” — simply smaller, more manageable chunks of the whole. How many metaslabs? Around 200:

void
vdev_metaslab_set_size(vdev_t *vd)
{
        /*
         * Aim for roughly 200 metaslabs per vdev.
         */
        vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
        vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
}

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev.c#1553

Why 200? Well, that just kinda worked and was never revisited. Is it optimal? Almost certainly not. Should there be more or fewer? Should metaslab size be independent of vdev size? How much better could we do? All completely unknown.

The space in the vdev is allotted proportionally and contiguously to those metaslabs. But what happens when a vdev is expanded? This can happen when a disk is replaced by a larger disk or if an administrator grows a SAN-based LUN. It turns out that ZFS simply creates more metaslabs, an answer whose simplicity was only obvious in retrospect.

For example, let’s say we start with a 2TB disk; then we’ll have 200 metaslabs of 10GB each. If we then grow the LUN to 4TB, we’ll have 400 metaslabs. If we started instead from a 200GB LUN that we eventually grew to 4TB, we’d end up with 4,000 metaslabs (each 1GB). Further, if we started with a 40TB LUN (why not) and grew it by 100GB, ZFS would not have enough space to allocate a full metaslab (roughly 200GB at that size), and we’d therefore not be able to use that additional space.

At Delphix our metaslabs can become highly fragmented because most of our datasets use an 8K record size (read up on space maps to understand how metaslabs are managed; it’s truly fascinating), and our customers often expand LUNs as a mechanism for adding more space. It’s not clear how much room there is for improvement, but these are curious phenomena that we intend to investigate along with the structure of space maps, the idiosyncrasies of the allocation path, and other aspects of ZFS as we continue to understand and improve performance. Stay tuned.

Posted on November 8, 2012 at 5:24 pm by ahl
In: ZFS

illumos hackathon 2012: user-land types for DTrace

At the illumos hackathon last week, Robert Mustacchi and I prototyped better support for manipulating user-land structures. As anyone who’s used it knows, DTrace is currently very kernel-centric — this both reflects the reality of how operating systems and DTrace are constructed, and the origins of DTrace itself in the Solaris Kernel Group. Discussions at dtrace.conf(12) this spring prompted me to chart a path to better user-land support. This prototype of copyin-automagic was a first step.

What we implemented was a new ‘user’ keyword to denote that a type is a user-land structure. For example, let’s say we had the address of a 4-byte integer; today we’d access its value using copyin():

this->i = *(int *)copyin(this->addr, sizeof (int));

With our prototype, this gets simpler and more intuitive:

this->i = *(user int *)this->addr;

The impact is even more apparent when it comes to pointer chasing through structures. Today if we need to get to the third element of a linked list, the D code would look like this:

this->p = (node_t *)copyin(this->addr, sizeof (node_t));
this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t));
this->p = (node_t *)copyin((uintptr_t)this->p->next, sizeof (node_t));
trace(this->p->value);

Again, it’s much simpler with the user keyword:

this->p = (user node_t *)this->addr;
trace(this->p->next->next->value);

D programs are compiled into a series of instructions (DIF) that are executed by the core DTrace framework when probes fire. We use the new keyword to generate instructions that load from the address space of the currently executing process rather than that of the kernel.

Adding a new keyword feels a little clunky, and I’m not sure if it’s the right way forward, but it does demonstrate a simpler model for accessing user-land structures, and was a critical first step. We already have three main sources of user-land values; the next steps are to make use of those.

Types for system calls

Arguments to system calls (mostly) have well-known types. Indeed those types are encoded in truss in excruciating and exotic detail. We should educate DTrace about those types. What I’d propose is that we create a single repository of all system call metadata. This could be, for example, an XML file that listed every system call, its name, code, subcode, types, etc. Of course we could use that as the source of type information for the syscall provider, but we could also use that to generate everything from the decoding tables in truss to the libc and kernel stubs for system calls.

As an aside, there are a couple of system calls whose parameter types vary; ioctl(2) is the obvious example. It would be interesting to assess the utility of an ioctl provider whose probes would be the various codes that are passed as the second parameter. Truss already has this information; why not DTrace?

Types for the pid provider

Another obvious source of type information is the process being traced. When a user specifies the -p or -c option to dtrace(1M) we’re examining a particular process, and that process can have embedded type information. We could use those types and implicitly identify them as belonging to user-space rather than the kernel. Pid provider probes correspond to the entry and return from user-land functions; we could identify the appropriate types for those parameters. We could simplify it further by doing all type handling in libdtrace (in user-land) rather than pushing the types into the kernel.

Types for USDT

User-land statically defined tracing — tracepoints explicitly inserted into code — can already have types associated with them. A first step would be to implicitly identify those types as belonging to user-land. I believe that this could even be done without adversely affecting existing scripts.

Thorny issues

While there are some clear paths forward, there are some tricky issues that remain. In particular, the fact that processes can have different data models (32-bit v. 64-bit) presents a real challenge. Both the width of a load and offsets into structures change depending on the process that’s running. There might be some shortcuts for system calls, and we might be able to constrain the problem for the pid provider by requiring -p or -c, or we might have to compile our D program twice and then choose which version to run based on the data model of the process. In the spirit of the hackathon, Robert and I punted for our ‘user’ keyword prototype, but these problems need to be well understood and sufficiently solved.

Next steps

I’ll be working on some of these problems on the back burner; I’m especially interested in the Grand Unified Syscall Project, an idea I’ve been touting for more years than I care to relate, to bring types to the syscall provider. If you have ideas for user-land tracing with DTrace, or want to work on anything I’ve mentioned, leave a comment or drop me a note.

Posted on October 11, 2012 at 4:28 pm by ahl · Permalink · 3 Comments
In: DTrace

Webex utilities

I wish that none of our customers encountered problems with our product, but they do, and when they do our means for remotely accessing their systems is often via a Webex shared screen. We remotely control their Delphix server to collect data (often using DTrace). While investigating a customer issue recently I developed a couple of techniques to work around common problems; I thought I’d share them in case others have similar problems — and as a note to my future self who will certainly forget the specifics next time.

Copying and Pasting

Webex makes it fairly easy to copy text from the remote system and paste it locally: just select the text, and that implicitly copies it to the clipboard. I do this very very often as I write DTrace scripts to collect data, and then want to record both the script and the output. To that end, the Mac OS X pbpaste(1) utility is unbelievably helpful; pbpaste emits the contents of the clipboard. For example, I’ll select text in the webex and use pbpaste like this:

$ pbpaste | tee -a data.log

Doing that, I can both verify that I selected the right data, and append it to the log of all data collected. Sometimes, though, the remote data is annoying to copy because I need to scroll up — the mouse latency over webex can make this an exasperating experience. In those cases where the text I want to transfer is longer than a page, I do the following on the remote system:

$ cat output | gzip -9c | uuencode /dev/stdin
begin 644 /dev/stdin
M'XL(`..C4E`"`]5:W7_;-A!_#Y#_@>@P),&0A,<O5=X2=&LWH`_M]K`^%9TK
M2THBU+8\24[3C^UO'TG%L2D1,B6[0ZJG0+[[Z7CWN^,=PRQ-BZ?+?#:-%G<P
...

I then select the text, and back on my mac do this to dump out the data:

$ pbpaste | uudecode -o /dev/stdout | gzip -cd

By compressing and uuencoding the data, even large chunks of output easily fit on one screen. Here are the results on a large-ish chunk of data I copied from a customer system:

$ cat customer.data.txt | wc -l
 234
$ cat customer.data.txt | gzip -9c | uuencode /dev/stdin | wc -l
 44

234 lines would have had me tearing my hair out as I tried to capture the output, scrolling backward with 250ms screen refresh latency; 44 lines wasn't bad at all. Depending on the exact text I seem to get an 80-90% reduction in lines to copy. Many thanks to Brendan Gregg who had mentioned this technique to me; I hadn't appreciated it fully until I absolutely needed it.
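
The two halves of that pipeline are easy to wrap up as a pair of shell functions. This is only a sketch: the names pack, unpack, and report are made up, and the base64 fallback is an assumption for machines that lack uuencode. With the figures above, going from 234 lines to 44 is an 81% reduction.

```shell
# Pick an encoder/decoder pair. uuencode/uudecode are what the commands
# above use; the base64 fallback is an assumption for systems without them.
if command -v uuencode >/dev/null 2>&1 && command -v uudecode >/dev/null 2>&1; then
    armor()   { uuencode /dev/stdin; }
    dearmor() { uudecode -o /dev/stdout; }
else
    armor()   { base64; }
    dearmor() { base64 --decode; }
fi

# pack: stdin -> compressed, printable text that is safe to select and copy.
pack()   { gzip -9c | armor; }

# unpack: the pasted text -> the original bytes.
unpack() { dearmor | gzip -cd; }

# report FILE: show how many lines there are to copy before and after packing.
report() {
    before=$(( $(wc -l < "$1") ))
    after=$(( $(pack < "$1" | wc -l) ))
    [ "$before" -gt 0 ] || { echo "empty input"; return 1; }
    echo "$before lines -> $after lines ($(( (before - after) * 100 / before ))% fewer)"
}
```

On the remote side that would be something like `cat output | pack`, and locally `pbpaste | unpack`. Both ends need to agree on the encoder, so the fallback only helps when you control both systems.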

Screen Savers v. Thinking/Lunch

When diagnosing a problem on a customer system, we like to be as unobtrusive as possible, so it's annoying when we need to disturb the customer to enter his or her password because the screen lock has kicked in while I'm thinking about the next step in the investigation, or I'm getting something to eat. Many enterprise environments make it such that the screen saver delay can't be changed. I spent a day a couple of weeks ago bringing my laptop to meetings, and running to get lunch (and elsewhere) so that I could move the mouse at least every 15 minutes.

I didn't want to modify the customer system ("I let you remotely access my computer, and you're installing what?!"). Instead I wanted to programmatically move the mouse every so often on my system to ensure the remote system wouldn't lock the screen. I couldn't find anything pre-fab, but thanks to the tips at stackoverflow, I pieced something together that wiggles the cursor around if it hasn't moved in a little while. I could post it compressed and uuencoded in keeping with the theme above (it's just 17 lines!), but instead I've added a github repo: github.com/adamleventhal/wiggle.

Happy Webex-ing

I hope people find these tips useful. Given my penchant for looking up past tips on my own blog, I'm sure at least my future self will be thanking me at some point...

Posted on September 14, 2012 at 7:07 pm by ahl · Permalink · Comments Closed
In: Delphix

My New DTrace Favorite

The mantra as we initially developed DTrace was to make impossible things possible, not easy things easier. Since codifying that, the tendency toward the latter in developer tools has been apparent. Our focus on the former however has left certain usability burrs that stymie newbies, and annoy vets. Much of the DTrace development of late has focused on a middle category: simplifying hard things that should be simple.

The print() action

In that vein, my colleague, Eric Schrock, added the print() action to DTrace back in November. Before then, my workflow used to look like this:

fbt::xdr_bytes:entry
{
        trace(args[0]->x_base);
        trace(args[0]->x_handy);
}

Repeat times a thousand, allow for errors, iterate on chased pointers, and sum up the time. With Eric’s fix, DTrace is a lot easier to use:

fbt::xdr_bytes:entry
{
        print(args[0]);
}

print() for translated types

Of course, in addition to tracing any kernel function, DTrace has stable probes that identify points of well-known, (reasonably) well-documented activity. Those probes don’t correspond to kernel functions so mdb isn’t as useful. The workflow is a little more annoying:

io:::start
{
        trace(args[1]->dev_name);
        trace(args[1]->dev_pathname);
}

Repeat another thousand, much more annoying times.

Unfortunately, print() wasn’t as helpful in this case:

# dtrace -n 'io:::start{ print(*args[1]); }'
dtrace: invalid probe specifier io:::start{ trace(*args[1]); }: print( ) may not be applied to a dynamic expression

Stable probes such as the io:::start probe can use translated arguments, synthetic types that DTrace populates with stable data from the unstable underlying implementation. For example, despite very different implementations, the io provider exposes the same data on illumos, FreeBSD, Mac OS X, and Oracle Solaris. Parameters are effectively translated one at a time; the * (dereference) operator was invalid for these expressions.

In a recent push to illumos, I added this support:

# dtrace -n 'io:::start{ print(*args[1]); }'
dtrace: description 'io:::start' matched 6 probes
CPU ID FUNCTION:NAME
0 11307 bdev_strategy:start devinfo_t {
    int dev_major = 0x62
    int dev_minor = 0x40
    int dev_instance = 0x1
    string dev_name = [ "sd" ]
    string dev_statname = [ "sd1" ]
    string dev_pathname = [ "/devices/pci@0,0/pci15ad,1976@10/sd@0,0:a" ]
}

Between Eric’s addition and my own, my most commonly encountered DTrace annoyances are no more.

Behind the scenes

For the DTrace super-nerds out there, I thought I’d share a bit of the implementation. In order to trace() or print() an expression, it needs to exist in memory somewhere. Translated types don’t exist in memory, rather individual members are translated statically. We can see this in the output of the DTrace DIF (D intermediate form) disassembler:

# dtrace -n 'io:::start{ trace(args[1]->dev_name); }' -Se
DIFO 0x75e940 returns string (unknown) by ref (size 256)
OFF OPCODE INSTRUCTION
00: 25000001 setx DT_INTEGER[0], %r1 ! 0x0
01: 28000101 ldga DT_VAR(0), %r1, %r1
02: 0e010002 mov %r1, %r2
03: 25000103 setx DT_INTEGER[1], %r3 ! 0xe0
04: 07020302 add %r2, %r3, %r2
05: 22020002 ldx [%r2], %r2
06: 25000003 setx DT_INTEGER[0], %r3 ! 0x0
07: 0f020300 cmp %r2, %r3
08: 1200000b be 11
09: 0e000002 mov %r0, %r2
10: 1100000c ba 12
11: 25000202 setx DT_INTEGER[2], %r2 ! 0x1
12: 10020000 tst %r2
13: 12000011 be 17
14: 26000102 sets DT_STRING[1], %r2 ! "nfs"
15: 0e020002 mov %r2, %r2
16: 1100001e ba 30
17: 25000302 setx DT_INTEGER[3], %r2 ! 0xfffffffffc031110
18: 22020002 ldx [%r2], %r2
19: 0e010003 mov %r1, %r3
20: 25000404 setx DT_INTEGER[4], %r4 ! 0xa8
21: 07030403 add %r3, %r4, %r3
22: 22030003 ldx [%r3], %r3
23: 33000000 flushts
24: 31000003 pushtv DT_TYPE(0), %r3 ! DT_TYPE(0) = D type
25: 2f001403 call DIF_SUBR(20), %r3 ! getmajor
26: 25000504 setx DT_INTEGER[5], %r4 ! 0x70
27: 08030403 mul %r3, %r4, %r3
28: 07020302 add %r2, %r3, %r2
29: 22020002 ldx [%r2], %r2
30: 23000002 ret %r2

In this case, this logic comes from /usr/lib/io.d, and — in particular — this translation:

        dev_name = B->b_dip == NULL ? "nfs" :
            stringof(`devnamesp[getmajor(B->b_edev)].dn_name);

To allow trace() and print() to work on translated types, we now generate code to first use the DTrace built-in alloca() function to get some scratch space, and then generate the translation for each member of the translated type. For example:

# dtrace -n 'io:::start{ print(*args[1]); }' -Se
DIFO 0x9466b0 returns D type (struct) by ref (size 780)
OFF OPCODE INSTRUCTION
00: 25000001 setx DT_INTEGER[0], %r1 ! 0x0
01: 28000101 ldga DT_VAR(0), %r1, %r1
02: 25000102 setx DT_INTEGER[1], %r2 ! 0x30c
03: 33000000 flushts
04: 31000002 pushtv DT_TYPE(0), %r2 ! DT_TYPE(0) = D type
05: 2f000f02 call DIF_SUBR(15), %r2 ! alloca
06: 0e010003 mov %r1, %r3
07: 25000204 setx DT_INTEGER[2], %r4 ! 0xe0
08: 07030403 add %r3, %r4, %r3
09: 22030003 ldx [%r3], %r3
10: 25000004 setx DT_INTEGER[0], %r4 ! 0x0
11: 0f030400 cmp %r3, %r4
12: 1300000f bne 15
13: 0e000003 mov %r0, %r3
14: 11000010 ba 16
15: 25000303 setx DT_INTEGER[3], %r3 ! 0x1
...
316: 2f001603 call DIF_SUBR(22), %r3 ! ddi_pathname
317: 25001204 setx DT_INTEGER[18], %r4 ! 0x20c
318: 07020404 add %r2, %r4, %r4
319: 25000e05 setx DT_INTEGER[14], %r5 ! 0x100
320: 3b030504 copys %r3, %r5, %r4
321: 23000002 ret %r2

More to come

Usability was a big topic at dtrace.conf a few months ago. Expect to see more contributions along this theme.

Posted on July 28, 2012 at 1:09 am by ahl · Permalink · 4 Comments
In: DTrace

BTrace: DTrace for Java… ish

DTrace first peered into Java in early 2005 thanks to an early prototype by Jarod Jenson that led eventually to the inclusion of USDT probes in the HotSpot JVM. If you want to see where, say, the java.net.SocketOutputStream.write() method is called, you can simply run this DTrace script:

hotspot$target:::method-entry
/copyinstr(arg1, arg2) == "java/net/SocketOutputStream" &&
 copyinstr(arg3, arg4) == "write"/
{
        jstack(50, 8000);
}

And that will work as long as you remember to start your JVM with the -XX:+ExtendedDTraceProbes option or you use the jinfo utility to enable it after the fact. And as long as you don’t mind a crippling performance penalty (hint: you probably do).

Inspired by dtrace.conf a few weeks ago, I wanted to sketch out what the real Java provider would look like:

java$target:java.net.SocketOutputStream:write:entry
{
        jstack(50,8000);
}

And check it out:

# jdtrace.pl -p $(pgrep java) -n 'java$target:java.net.SocketOutputStream::entry{ jstack(50,8000); }'
dtrace: script '/tmp/jdtrace.19092/jdtrace.d' matched 0 probes
CPU     ID                    FUNCTION:NAME
0  64991 Java_com_sun_btrace_BTraceRuntime_dtraceProbe0:event
libbtrace.so`Java_com_sun_btrace_BTraceRuntime_dtraceProbe0+0xbb
com/sun/btrace/BTraceRuntime.dtraceProbe0(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceRuntime.dtraceProbe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;II)I
com/sun/btrace/BTraceUtils$D.probe(Ljava/lang/String;Ljava/lang/String;)I
java/net/SocketOutputStream.$btrace$jdtrace$probe1(Ljava/lang/String;Ljava/lang/String;)V
java/net/SocketOutputStream.write([BII)V
sun/nio/cs/StreamEncoder.writeBytes()V
sun/nio/cs/StreamEncoder.implFlushBuffer()V
sun/nio/cs/StreamEncoder.implFlush()V
sun/nio/cs/StreamEncoder.flush()V
java/io/OutputStreamWriter.flush()V
java/io/BufferedWriter.flush()V
java/io/PrintWriter.newLine()V
java/io/PrintWriter.println()V
java/io/PrintWriter.println(Ljava/lang/String;)V
com/delphix/appliance/server/ham/impl/HAMonitorServerThread.run()V
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V
java/util/concurrent/ThreadPoolExecutor$Worker.run()V
java/lang/Thread.run()V
StubRoutines (1)
libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x21d
libjvm.so`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_+0x27
libjvm.so`__1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_+0x149
libjvm.so`__1cMthread_entry6FpnKJavaThread_pnGThread__v_+0x113
libjvm.so`__1cKJavaThreadDrun6M_v_+0x2c6
libjvm.so`java_start+0x1f2
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start

Obviously there's something fishy going on. First, we're using perl -- the shibboleth of fake-o-ware -- and there's this BTrace stuff in the output.

Faking it with BTrace

BTrace is a dynamic instrumentation tool for Java; it is both inspired by DTrace and contains some DTrace integration. The perl script above takes the DTrace syntax and generates a DTrace script and a BTrace-enabled Java source file.

Like DTrace, BTrace lets you specify the points of instrumentation in your Java program as well as the actions to take. Here's what our generated source file looks like.

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class jdtrace {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void probe1(@ProbeClassName String c,
            @ProbeMethodName String m) {
                String name = "entry";
                String p = Strings.strcat(c, Strings.strcat(":",
                    Strings.strcat(m, Strings.strcat(":", name))));
                D.probe(p, "");
        }
}

Note that we specify where to trace (this can be a regular expression), and then take the action of joining the class, method, and "entry" string into a single string that we pass to the D.probe() method that causes a BTrace USDT probe to fire.

Here's what the D script looks like:

btrace$target:::event
{
        this->__jd_arg = copyinstr(arg0);
        this->__jd_mod = strtok(this->__jd_arg, ":");
        this->__jd_func = strtok(NULL, ":");
        this->__jd_name = strtok(NULL, ":");
}

btrace$target:::event
/((this->__jd_mod == "java.net.SocketOutputStream" &&
 this->__jd_func == "write" &&
 this->__jd_name == "entry"))/
{
        jstack(50,8000);
}

It's pretty simple. We parse the string that was passed to D.probe(), and disassemble it into the DTrace notion of module, function, and name. We then use that information so that the specified actions are executed as appropriate (we could have specified different Java methods to probe, and different actions to take for each). Here's the code if you're interested.
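
The generation step can be sketched in a few lines of shell. This is not the perl script itself, just an illustration of the idea under some assumptions: the function name gen_clause is made up, an empty field is treated as a wildcard, and the first clause that fills in the this->__jd_* variables is assumed to be emitted separately.

```shell
# gen_clause PROBE ACTION
# PROBE is "class:method:name" (exactly two colons are assumed).
# An empty field is treated as a wildcard and left out of the predicate;
# that behavior is an assumption, not something the original script documents.
gen_clause() {
    mod=${1%%:*}; rest=${1#*:}
    func=${rest%%:*}; name=${rest#*:}

    # Build the predicate from whichever fields were given.
    pred=
    for pair in "__jd_mod=$mod" "__jd_func=$func" "__jd_name=$name"; do
        var=${pair%%=*}; val=${pair#*=}
        [ -n "$val" ] || continue
        pred="${pred:+$pred && }this->$var == \"$val\""
    done

    # Emit a D clause on the BTrace USDT probe with that predicate.
    printf 'btrace$target:::event\n'
    [ -z "$pred" ] || printf '/%s/\n' "$pred"
    printf '{\n        %s\n}\n' "$2"
}
```

For example, `gen_clause 'java.net.SocketOutputStream:write:entry' 'jstack(50,8000);'` prints a clause with the same three comparisons as the predicate shown above, while leaving the method field empty drops the __jd_func comparison.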

This isn't the real Java provider, but is it close enough? Unfortunately not. The most glaring problem is that BTrace sometimes renders my Java process unresponsive. Other times it leaves instrumentation behind with no way of extracting it. The word "safe" appears as the third word on the BTrace website ("BTrace is safe"), but apparently there's still some way to go to achieve the requisite level of safety.

A Better BTrace

BTrace is an interesting tool for examining Java programs, but one obvious obstacle is that the programs are pretty cumbersome to write. With BTrace, we should be able to write a simple one-liner to see where we are when the java.net.SocketOutputStream.write() method is called, but instead we have to write a fairly cumbersome program:

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
@BTrace
public class TraceWrite {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite() {
                jstack();
        }
}

DTrace-inspired syntax would let users iterate much more quickly:

$ dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ jstack(); }'
java.net.SocketOutputStream.write(SocketOutputStream.java)
sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
java.io.BufferedWriter.flush(BufferedWriter.java:236)
java.io.PrintWriter.newLine(PrintWriter.java:438)
java.io.PrintWriter.println(PrintWriter.java:585)
java.io.PrintWriter.println(PrintWriter.java:696)
com.delphix.appliance.server.ham.impl.HAMonitorServerThread.run(HAMonitorServerThread.java:56)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

With BTrace, you can trace nearly arbitrary information about a program's state, but instead of doing something like this:

dbtrace -p $(pgrep -n java) -n 'java.net.SocketOutputStream:write:entry{ printFields(this.impl); }'

You have to do this:

import com.sun.btrace.annotations.*;
import com.sun.btrace.AnyType;
import static com.sun.btrace.BTraceUtils.Reflective.*;
@BTrace
public class TraceWrite {
        @OnMethod(clazz="java.net.SocketOutputStream", method="write", location=@Location(Kind.ENTRY))
        public static void onWrite(@Self Object self) {
                Object impl = get(field(classOf(self), "impl"), self);
                printFields(impl);
        }
}
$ ./bin/btrace $(pgrep -n java) TraceWrite.java
{server=null, port=1080, external_address=null, useV4=false, cmdsock=null, cmdIn=null, cmdOut=null, applicationSetProxy=false, timeout=0, trafficClass=0, shut_rd=false, shut_wr=false, socketInputStream=java.net.SocketInputStream@9993a1, fdUseCount=0, fdLock=java.lang.Object@ab5443, closePending=false, CONNECTION_NOT_RESET=0, CONNECTION_RESET_PENDING=1, CONNECTION_RESET=2, resetState=0, resetLock=java.lang.Object@292936, fd1=null, anyLocalBoundAddr=null, lastfd=-1, stream=false, socket=Socket[addr=/127.0.0.1,port=38832,localport=8765], serverSocket=null, fd=java.io.FileDescriptor@50abcc, address=/127.0.0.1, port=38832, localport=8765, }

BTrace needs a language that enables rapid iteration — piggybacking on Java is holding it back — and it needs some hard safety guarantees. With those, many developers and support engineers would use BTrace as part of their daily work — we certainly would here at Delphix.

Back to DTrace. Even with a useable solution for Java only, the ability to have lightweight and focused tracing for Java (and other dynamic languages) could be highly valuable. We’ll see how far BTrace can take us.

Posted on April 24, 2012 at 7:29 am by ahl · Permalink · 4 Comments
In: DTrace