Adam Leventhal's blog


Month: April 2008

It was a good run, but Jarod and I didn’t make the cut for JavaOne this year…

2005

In 2005, Jarod came up with what he described as a jacked up way to use DTrace to get inside Java. This became the basis of the Java provider (first dvm for the 1.4.2 and 1.5 JVMs and now the hotspot provider for Java 6). That year, I got to stand up on stage at the keynote with John Loiacono and present DTrace for Java for the first time (to 10,000 people — I was nervous). John was then the EVP of software at Sun. Shortly after that, he parlayed our keynote success into a sweet gig at Adobe (I was considered for the job, but ultimately rejected, they said, because their door frames couldn’t accommodate my fro — legal action is pending).

That year we also started the DTrace challenge. The premise was that if we chained up Jarod in the exhibition hall, developers could bring him their applications and he could use DTrace to find a performance win — or he’d fork over a free iPod. In three years Jarod has given out one iPod and that one deserves a Bondsian asterisk.

After the excitement of the keynote, and the frenetic pace of the exhibition hall (and a haircut), Jarod and I anticipated at least fair interest in our talk, but we expected the numbers to be down a bit because we were presenting in the afternoon on the last day of the conference. We got to the room 15 minutes early to set up, skirting what we thought must have been the line for lunch, or free beer, or something, but turned out to be the line for our talk. Damn. It turns out that in addition to the 1,000 in the room, there was an overflow room with another 500-1,000 people. That first DTrace for Java talk had only the most basic features like tracing method entry and return, memory allocation, and Java stack backtraces — but we already knew we were off to a good start.

2006

No keynote, but the DTrace challenge was on again and our talk reprised its primo slot on the last day of the conference after lunch (yes, that’s sarcasm). That year the Java group took the step of including DTrace support in the JVM itself. It was also possible to dynamically turn instrumentation of the JVM off and on as opposed to the start-time option of the year before. In addition to our talk, there was a DTrace hands-on lab that was quite popular and got people some DTrace experience after watching what it can do in the hands of someone like Jarod.

2007

The DTrace talk in 2007 (again, last day of the conference after lunch) was actually one of my favorite demos I’ve given because I had never seen the technology we were presenting before. Shortly before JavaOne started, Lev Serebryakov from the Java group had built a way of embedding static probes in a Java program. While this isn’t required to trace Java code, it does mean that developers can expose the higher level semantics of their programs to users and developers through DTrace. Jarod hacked up an example in his hotel room about 20 minutes before we presented, and amazingly it all went off without a hitch. How money is that?

JSDT — as Java Statically Defined Tracing is called — is in development for the next version of the JVM, and is the next step for DTrace support of dynamic languages. Java was the first dynamic language we considered for use with DTrace, and it’s quite a tough environment to support due to the incredible sophistication of the JVM. That support has led the way for other dynamic languages such as Ruby, Perl, and Python, which all now have built-in DTrace providers.

2008

For DTrace and Java, this is not the end. It is not even the beginning of the end. Jarod and I are out, but Jon, Simon, Angelo, Raghavan, Amit, and others are in. At JavaOne 2008 next month there will be a talk, a BOF, and a hands-on lab about DTrace for Java, and it’s not even all Java: there’s some PHP and JavaScript mixed in, and both also have their own DTrace providers. I’ve enjoyed speaking at JavaOne these past three years, and while it’s good to pass the torch, I’ll miss doing it again this year. If I have the time, and can get past security, I’ll try to sneak into Jon and Simon’s talk — though it will be a departure from tradition for a DTrace talk to fall on a day other than the last.

I was having a conversation with an OpenBSD user and developer the other day, and he mentioned some ongoing work in the community to consolidate support for RAID controllers. The problem, he was saying, was that each controller had a different administrative model and utility — but all I could think was that the real problem was the presence of a RAID controller in the first place! As far as I’m concerned, ZFS and RAID-Z have obviated the need for hardware RAID controllers.

ZFS users seem to love RAID-Z, but a frustratingly frequent request is to be able to expand the width of a RAID-Z stripe. While the ZFS community may care about solving this problem, it’s not the highest priority for Sun’s customers and, therefore, for the ZFS team. It’s common for a home user to want to increase his total storage capacity by a disk or two at a time, but enterprise customers typically want to grow by multiple terabytes at once, so adding on a new RAID-Z stripe isn’t an issue. When the request has come up on the ZFS discussion list, we have, perhaps unhelpfully, pointed out that the code is all open source and ready for that contribution. Partly that’s because we don’t have time to do it ourselves, but also because it’s a tricky problem and we weren’t sure how to solve it.

Jeff Bonwick did a great job explaining how RAID-Z works, so I won’t go into it too much here, but the structure of RAID-Z makes it a bit trickier to expand than other RAID implementations. On a typical RAID with N+M disks, N data sectors are written with M parity sectors. Those N data sectors may contain unrelated data, so modifying the data on just one disk involves reading the old data off that disk and updating both that data and the parity. Expanding a RAID stripe in such a scheme is as simple as adding a new disk and updating the parity (if necessary). With RAID-Z, blocks are never rewritten in place, and there may be multiple logical RAID stripes (and multiple parity sectors) in a given row; we therefore can’t expand the stripe nearly as easily.
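To make the contrast concrete, here’s a minimal sketch (not ZFS code; the function name is invented) of the read-modify-write a conventional single-parity RAID performs when one data sector changes — the new parity is just the old parity XORed with the old and new data:

```c
/*
 * Minimal sketch of a conventional RAID small write (not ZFS code):
 * changing one data sector means reading the old data and old parity,
 * then writing the new data and the recomputed parity.
 */
#include <stddef.h>
#include <stdint.h>

#define	SECTOR_SIZE	512

static void
parity_update(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data)
{
	/* new parity = old parity ^ old data ^ new data */
	for (size_t i = 0; i < SECTOR_SIZE; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}
```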

A couple of weeks ago, I had lunch with Matt Ahrens to come up with a mechanism for expanding RAID-Z stripes — we were both tired of having to deflect reasonable requests from users — and, lo and behold, we figured out a viable technique that shouldn’t be very tricky to implement. While Sun still has no plans to allocate resources to the problem, this roadmap should lend credence to the suggestion that someone in the community might work on the problem.

The rest of this post will discuss the implementation of expandable RAID-Z; it’s not intended for casual users of ZFS, and there are no alchemic secrets buried in the details. It would probably be useful to familiarize yourself with the basic structure of ZFS, space maps (totally cool by the way), and the code for RAID-Z.

Dynamic Geometry

ZFS uses vdevs — virtual devices — to store data. A vdev may correspond to a disk or a file, or it may be an aggregate such as a mirror or RAID-Z. Currently the RAID-Z vdev determines the stripe width from the number of child vdevs. To allow for RAID-Z expansion, the geometry would need to be a more dynamic property. The storage pool code that uses the vdev would need to determine the geometry for the current block and then pass that as a parameter to the various vdev functions.
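A rough sketch of what that parameterization might look like (hypothetical types and names, not the actual vdev interfaces):

```c
/*
 * Hypothetical sketch: rather than deriving the stripe width from the
 * vdev's current number of children, the RAID-Z mapping code would be
 * handed the geometry that was in effect when the block was written.
 * All names here are illustrative.
 */
#include <stdint.h>

typedef struct raidz_geometry {
	int	rg_width;	/* total columns (data + parity) */
	int	rg_nparity;	/* parity columns (1 or 2) */
} raidz_geometry_t;

static void
raidz_map_block(uint64_t offset, uint64_t size, const raidz_geometry_t *geom)
{
	/*
	 * Today the column count effectively comes from the number of
	 * child vdevs; here it comes from the per-block geometry instead.
	 */
	uint64_t cols = geom->rg_width;

	/* ... lay the block out across 'cols' columns ... */
	(void) offset;
	(void) size;
	(void) cols;
}
```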

There are two ways to record the geometry. The simplest is to use the GRID bits (an 8-bit field) in the DVA (Device Virtual Address), which have already been set aside but are currently unused. In this case, the vdev would need a new callback to set the contents of the GRID bits, and then a parameter to several of its other functions to pass in the GRID bits indicating the geometry of the vdev when the block was written. An alternative approach suggested by Jeff and Bill Moore is something they call time-dependent geometry. The basic idea is that we store a record each time the geometry of a vdev is modified and then use the creation time for a block to infer the geometry to pass to the vdev. This has the advantage of conserving precious bits in the fixed-width DVA (though at 128 bits it’s still quite big), but it is a bit more complex since it would require essentially new metadata hanging off each RAID-Z vdev.
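As a sketch of how the time-dependent lookup might work (invented structures, just to illustrate the idea): each RAID-Z vdev would keep a small append-only history of widths, and a block’s geometry would be the newest record no later than the block’s creation time.

```c
/*
 * Hypothetical sketch of time-dependent geometry: an append-only table
 * of (creation time, width) records per RAID-Z vdev. A block born at
 * birth_txg uses the newest width that took effect at or before it.
 */
#include <stdint.h>

typedef struct geom_record {
	uint64_t	gr_txg;		/* when this geometry took effect */
	int		gr_width;	/* stripe width from that point on */
} geom_record_t;

static int
width_for_block(const geom_record_t *hist, int nrec, uint64_t birth_txg)
{
	int width = hist[0].gr_width;

	for (int i = 1; i < nrec; i++) {
		if (hist[i].gr_txg > birth_txg)
			break;
		width = hist[i].gr_width;
	}
	return (width);
}
```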

Metaslab Folding

When the user requests that a RAID-Z vdev be expanded (via an existing or new zpool(1M) command-line option), we’ll apply a new fold operation to the space map for each metaslab. This transformation will take into account the space we’re about to add with the new devices. Each range [a, b] under a fold from width n to width m will become

[ m * (a / n) + (a % n), m * (b / n) + (b % n) ]

The alternative would have been to account for m – n free blocks at the end of every stripe, but that would have been overly onerous both in terms of processing and in terms of bookkeeping. For space maps that are resident, we can simply perform the operation on the AVL tree by iterating over each node and applying the necessary transformation. For space maps which aren’t in core, we can do something rather clever: by taking advantage of the log structure, we can simply append a new type of space map entry that indicates that this operation should be applied. Today we have allocated, free, and debug; this would add fold as an additional operation. We’d apply that fold operation to each of the 200 or so space maps for the given vdev. Alternatively, using the idea of time-dependent geometry above, we could simply append a marker to the space map and access the geometry from that repository.
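The fold itself is just integer arithmetic on each segment. Here’s a sketch with invented names, where the division is integer division; folding the range [5, 7] from width n = 4 to width m = 5, for example, yields [6, 8].

```c
/*
 * Hypothetical sketch of the fold transformation applied to one space
 * map segment: remap offsets laid out under a width-n geometry to their
 * positions under a width-m geometry (m > n). Division is integer
 * division. Example: folding [5, 7] from n = 4 to m = 5 gives [6, 8].
 */
#include <stdint.h>

typedef struct seg {
	uint64_t	s_start;
	uint64_t	s_end;
} seg_t;

static void
fold_segment(seg_t *s, uint64_t n, uint64_t m)
{
	s->s_start = m * (s->s_start / n) + (s->s_start % n);
	s->s_end   = m * (s->s_end   / n) + (s->s_end   % n);
}
```

For resident space maps, this function would be applied to each node of the AVL tree; for space maps on disk, the equivalent would be the new fold entry appended to the log as described above.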

Normally, we only rewrite the space map if the on-disk log structure is twice as large as necessary. I’d argue that the fold operation should always trigger a rewrite since processing it always requires an O(n) operation, but that’s really an ancillary point.

vdev Update

At the same time as the previous operation, the vdev metadata will need to be updated to reflect the additional device. This is mostly just bookkeeping, and a matter of chasing down the relevant code paths to modify and augment.

Scrub

With the steps above, we’re actually done, for some definition of done, since new data will be written in stripes that include the newly added device. The problem is that extant data will still be stored in the old geometry, and most of the capacity of the new device will be inaccessible. The solution is to scrub the data, reading every block and rewriting it to a new location. Currently this isn’t possible with ZFS, but Matt and Mark Maybee have been working on something they call block pointer rewrite, which is needed to solve a variety of other problems and nicely completes this solution as well.
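At a very high level, the rewrite pass would amount to something like the loop body below. This is purely illustrative: the types and callbacks are invented, and the hard part (finding and updating every block pointer that references a moved block) is exactly what block pointer rewrite would provide and is elided here.

```c
/*
 * Purely illustrative sketch: walk every block, read it from its old
 * location, and rewrite it so it lands in a stripe that uses the new,
 * wider geometry. Everything difficult is hidden behind the invented
 * callbacks.
 */
#include <stdint.h>

typedef struct blk_ref {
	uint64_t	br_offset;	/* current on-disk location */
	uint64_t	br_size;
} blk_ref_t;

typedef int (*blk_io_t)(blk_ref_t *, void *buf);

static int
rewrite_block(blk_ref_t *br, void *buf, blk_io_t read_old, blk_io_t write_new)
{
	if (read_old(br, buf) != 0)	/* read from the old layout */
		return (-1);
	return (write_new(br, buf));	/* reallocate in the new layout */
}
```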

That’s It

After Matt and I had finished thinking this through, I think we were both pleased by the relative simplicity of the solution. That’s not to say that implementing it is going to be easy — there are still plenty of gaps to fill in — but the basic algorithm is sound. A nice property that falls out is that in addition to changing the number of data disks, it would also be possible to use the same mechanism to add an additional parity disk to go from single- to double-parity RAID-Z — another common request.

So I can now extend a slightly more welcoming invitation to the ZFS community to engage on this problem and contribute in a very concrete way. I’ve posted some diffs which I used to sketch out some ideas; that might be a useful place to start. If anyone would like to create a project on OpenSolaris.org to host any ongoing work, I’d be happy to help set that up.
