Oracle’s port: this is not DTrace

After writing about Oracle’s port of DTrace to OEL, I wanted to take it for a spin. Following the directions that Wim Coekaerts spelled out, I installed and configured a VM to run OEL with Oracle’s nascent DTrace port. Setting up the system was relatively painless.

Here’s my first DTrace invocation on OEL:

[root@screven ~]# uname -a
Linux screven 2.6.32-201.0.4.el6uek.x86_64 #1 SMP Tue Oct 4 16:47:00 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@screven ~]# dtrace -n 'BEGIN{ trace("howdy from linux"); }'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
0      1                           :BEGIN   howdy from linux
^C

Then I wanted to see what was on the system:

[root@screven ~]# dtrace -l | wc -l
574
Are you kidding me? For comparison, my Mac has 154,918 probe available and our illumos-derived Delphix OS has 77,320 (Mac OS X has many probes pre-created for each process). It looks like this beta only has the syscall provider, but digging around I can see that Wim didn’t mention that the profile provider is also there:
[root@screven ~]# modprobe profile
[root@screven ~]# dtrace -l | wc -l
587
Sweet.
[root@screven ~]# dtrace -n profile:::profile-997
dtrace: failed to enable 'profile:::profile-997': Failed to enable probe
Not that sweet.
At least I can run my favorite DTrace script:
[root@screven ~]# dtrace -n syscall:::entry'{ @[execname] = count(); }'
dtrace: description 'syscall:::entry' matched 285 probes
^C
pickup                                                            9
abrtd                                                            11
qmgr                                                             17
rsyslogd                                                         25
rs:main Q:Reg                                                    35
master                                                           52
tty                                                              60
dircolors                                                        80
hostname                                                         92
tput                                                             92
id                                                              198
unix_chkpwd                                                     550
auditd                                                          599
dtrace                                                          760
bash                                                           1515
sshd                                                           8327
I wanted to trace activity when I connected to the system using ssh… but ssh logins fail with all probes enabled. To repeat: ssh logins fail with DTrace probes enabled. I’d try to debug it, but I’m too dejected.

Evaluation

While I’d like to give this obviously nascent port the benefit of the doubt, its current state is frankly embarrassing. It’s very clear now why Oracle wasn’t demonstrating this at OpenWorld last week: it doesn’t stand up to the mildest level of scrutiny. It’s fine that Oracle has embarked on a port of DTrace to the so-called unbreakable kernel, but this is months away from being usable. Announcing a product of this low quality and value calls into question Oracle’s credibility as a technology provider. Further, this was entirely avoidable; there were other DTrace ports to Linux that Oracle could have used as a starting point to produce something much closer to functional.

This is not DTrace

So, OEL users, know that this is not DTrace. This is no better than one of the DTrace knockoffs and in many ways much worse. What Oracle released is worse than worthless by violating perhaps the most fundamental tenet of DTrace: don’t damage the system. And, to the OEL folks, I’m sure you’ll get there, but how about you take down your beta until it’s ready? As it is, people might get the wrong impression about what DTrace is.
Posted on October 10, 2011 at 10:43 pm by ahl · Permalink · 13 Comments
In: DTrace · Tagged with: , ,

DTrace for Linux

Yesterday (October 4, 2011) Oracle made the surprising announcement that they would be porting some key Solaris features, DTrace and Zones, to Oracle Enterprise Linux. As one of the original authors, the news about DTrace was particularly interesting to me, so I started digging.

I should note that this isn’t the first time I’ve written about DTrace for Linux. Back in 2005, I worked on Linux-branded Zones, Solaris containers that contained a Linux user environment. I wrote a coyly-titled blog post about examining Linux applications using DTrace. The subject was honest — we used precisely the same techniques to bring the benefits of DTrace to Linux applications — but the title wasn’t completely accurate. That wasn’t exactly “DTrace for Linux”, it was more precisely “The Linux user-land for Solaris where users can reap the benefits of DTrace”; I chose the snappier title.

I also wrote about DTrace knockoffs in 2007 to examine the Linux counter-effort. While the project is still in development, it hasn’t achieved the functionality or traction of DTrace. Suggesting that Linux was inferior brought out the usual NIH reactions which led me to write a subsequent blog post about a theoretical port of DTrace to Linux. While a year later Paul Fox started exactly such a port, my assumption at the time was that the primary copyright holder of DTrace wouldn’t be the one porting DTrace to Linux. Now that Oracle is claiming a port, the calculus may change a bit.

What is Oracle doing?

Even among Oracle employees, there’s uncertainty about what was announced. Ed Screven gave us just a couple of bullet points in his keynote; Sergio Leunissen, the product manager for OEL, didn’t have further details in his OpenWorld talk beyond it being a beta of limited functionality; and the entire Solaris team seemed completely taken by surprise.

What is in the port?

Leunissen stated that only the kernel components of DTrace are part of the port. It’s unclear whether that means just fbt or includes sdt and the related providers. It sounds certain, though, that it won’t pass the DTrace test suite which is the deciding criterion between a DTrace port and some sort of work in progress.

What is the license?

While I abhor GPL v. CDDL discussions, this is a pretty interesting case. According to the release manager for OEL, some small kernel components and header files will be dual-licensed while the bulk of DTrace — the kernel modules, libraries, and commands — will use the CDDL as they had under (the now defunct) OpenSolaris (and to the consernation of Linux die-hards I’m sure). Oracle already faces an interesting conundum with their CDDL-licensed files: they can’t take the fixes that others have made to, for example, ZFS without needing to release their own fixes. The DTrace port to Linux is interesting in that Oracle apparently thinks that the CDDL license will make DTrace too toxic for other Linux vendors to touch.

Conclusion

Regardless of how Oracle brings DTrace to Linux, it will be good for DTrace and good for its users — and perhaps best of all for the author of the DTrace book. I’m cautiously optimistic about what this means for the DTrace development community if Oracle does, in fact, release DTrace under the CDDL. While this won’t mean much for the broader Linux community, we in the illumos community will happily accept anything of value Oracle adds. The Solaris lover in me was worried when it appeared that OEL was raiding the Solaris pantry, but if this is Oracle’s model for porting, then I — and the entire illumos community I’m sure — hope that more and more of Solaris is open sourced under the aegis of OEL differentiation.

10/10/2011 follow-up post, Oracle’s port: this is not DTrace.

Posted on October 5, 2011 at 9:51 pm by ahl · Permalink · 32 Comments
In: DTrace · Tagged with: , , , , , , ,

On RAID-6 recovery

RAID algorithms have become a particular fascination of mine, and I recently read a very interesting paper that describes an optimization for RAID reconstruction (by Xiang, Xu, Lui, Chang, Pan, and Li). Before writing double- and triple-parity RAID algorithms for ZFS, I spent a fair bit of time researching the subject and have stayed interested since. Before describing the reconstruction optimization, some preamble is required. RAID algorithms can be divided into two buckets: one-dimensional algorithms, and multi-dimensional algorithms (terms of my own choosing; I haven’t seen this distinction discussed in literature).

One-dimensional RAID

A one-dimensional algorithm is one in which all data in a single RAID stripe is used to compute all parities. The RAID algorithm used by ZFS falls into this category as do most algorithms derived from Reed-Solomon coding. For a given RAID stripe’s set of data blocks, D, we can compute the nth parity block with some function p(D, n). For example, ZFS, roughly, uses the formula

    \[ p(D,n) = \sum_{i=1}^{\left|{D}\right|} 2^{(i-1)(n-1)} \cdot D_{i} \]

Here, addition and multiplication are defined over a Galois Field – the explanation would be far longer than it would be interesting or relevant so I’ll omit it from this post. It is worth noting that this particular approach only works for three parity disks or fewer, but that too is an entirely different subject (albeit an interesting one). Reconstruction of a missing block in a one-dimensional algorithm requires reading the available data, and performing some computation; each stripe may be reconstructed separately (and thus, in parallel).

Multi-dimensional RAID

A multi-dimensional algorithm is one in which parts of multiple logical RAID stripes may contribute to parity calculation. Examples of this include IBM’s EVENODD and NetApp’s slight simplification, Row-Diagonal Parity (RDP). These are most easily conveyed through diagrams:

EVEN-ODD

EVENODD

RDP

RDP

With both EVENODD and RDP, calculation of the first parity block simply XOR the data blocks in that RAID stripe. The second parity block is calculated by a simple XOR of data values across RAID stripes more or less diagonally. Both of these techniques place constraints on the width of a RAID stripe.

Optimizing RAID reconstruction for fewer reads

The paper, A Hybrid Approach of Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation, describes a optimization for reconstruction under multi-dimensional RAID algorithms. The key insight is that with parity calculations that effectively overlap, a clever reconstruction algorithm can use fewer blocks, thus incurring fewer disk reads. As described in the paper, normally when a given disk fails, all remaining data blocks and blocks from the first parity disk are used for reconstruction:

simple recovery

simple recovery

It is, however, possible to read fewer total blocks by taking advantage of the fact that certain blocks can be multiply used. In the reconstruction below, blocks with circles are used for “row” reconstruction, and blocks with squares are used for “diagonal” reconstruction.

optimized recovery

optimized recovery

Note that the simple approach requires reading 36 blocks (none from disk 7) whereas the reconstruction described in the paper requires reading only 27 blocks. This applies generally: the new approach requires 25% fewer blocks to perform the same reconstruction. And the paper includes a method of balancing the reads between disks.

Disappointing results

Unfortunately, optimizing for fewer reads didn’t translate to significant performance improvements in the overall recovery. For RDP it was about 12% better in the best case, but typically closer to 7%. For EVENODD the improvement was less than 5% in all cases. Why? The naïve recovery algorithm streams data off the healthy hard drives, performs a simple computation, and streams good data onto the replacement drive. Streaming is what drives do best – 3.5” or 2.5”, 7200, 10K, or 15K RPM; SATA, SAS, or FC they all stream pretty well. There may be some contention for I/O resources, but either that contention isn’t severe or the “skips” in the I/O patterns interrupt the normal streaming efficiency.

Applicability in flash

There’s another medium, however, that has throughput and IOPS to spare where this RAID reconstruction approach could be highly effective: flash. With SSDs, it’s possible to see throughput that strains the limits of I/O systems; reading less data could be a significant improvement, and the non-contiguous read patterns wouldn’t degrade performance as they do with HDDs. For all-flash arrays, this sort of optimization may be one of many in its class; with a surplus of IOPS, compute, and memory, the RAID algorithms designed for slow disks, slow CPUs, and a dearth of DRAM, may need to be scrapped and rethought.

Posted on September 21, 2011 at 10:33 pm by ahl · Permalink · One Comment
In: Software · Tagged with: , , , , ,

Flash news I wish I could read

For a short while, I ran the flash memory strategy at Sun and then Oracle, so I still keep my ear to the ground regarding flash news. That news is often frustratingly light — journalists in the space who are fully capable of providing analysis end up brushing the surface. With a tip of the hat to the FJM crew, here’s my commentary on a recent article.
NetApp has Hybrid Aggregate drives coming, with data moved automatically in real time between flash located next to the spinning disks. The company now says that this is a better technology than PCIe flash approaches.
Sounds interesting. NetApp had previously stacked its chips on a PCIe approach for flash called the performance acceleration module (PAM); I read about it in the same publication. This apparent change of strategy is significant, and I wish that the article would have explored the issue, but it was never mentioned.
NetApp, presenting at an Analyst Day event in New York on 30 June, said that having networked storage move as it were into the host server environment was disadvantageous. This was according to Stifel Nicolaus analyst Aaron Rakers.
1. So is this a quote from NetApp or a quote from an analyst or a quote from NetApp quoting an analyst? I’m confused.
2. This is a dense and interesting statement so allow me to unpack it. Moving storage to the host server is code for Fusion-io. These guys make a flash-laden PCIe card that you put in your compute node for super-fast local data access, and they connect a bunch of them together with an IB backplane to share the contents of different cards between hosts. They recently went public, and customers love the performance they offer over traditional SANs. I assume the term “disadvantageous” was left intentionally vague as those being disadvantaged may be NTAP shareholders rather than customers implementing such a solution.
Manish Goel, NetApp’s product ops EVP, said SSDs used as hard disk drive replacements were not as interesting as using flash at the disk layer in a Hybrid Aggregate drive approach – and this was coming.
An Aggregate is the term NetApp uses for a collection of drives. A Hybrid Aggregate — presumably — is some new thing that mixes HDDs and SSDs. Maybe it’s like Sun’s hybrid storage pool. I would have liked to see Manish Goel’s statement vetted or explained, but that’s all we get.
Flash Cache in the controller is a straightforward array read I/O accelerator. PCIe flash in host servers is a complementary technology but will not decentralise the storage market and move networked storage back into the host servers.
Is this still the NetApp announcement or is this back to the journalism? It’s a new paragraph so I guess it’s the latter. Fusion-io will be happy to learn that it only took a couple of lines to be upgraded from “disadvantageous” to “complementary”. And you may be interested to know why NetApp says that host-based flash is complementary. There’s a vendor out there working with NetApp on a host-based flash PCIe card that NetApp will treat as part of its caching tier, pushing data to the card for fast access by the host. I’d need to dig up my notes from the many vendor roadmaps I saw to recall who is building this, but in the context of a public blog post it’s probably better that I don’t.
NetApp has a patent in this Hybrid Aggregate disk drive area called “Mechanisms for moving data in a Hybrid Aggregate”.
I won’t bore you by reposting the except from the patent, but the broad language of the patent does recall to mind the many recent invalidated NetApp patents…
Surely this is what we all understand as auto-placement of data in a virtual storage pool comprising SSD and fast disk tiers, such as Compellent’s block-level Data Progression? Not so, according to a person close to the situation: “It’s much more automatic, real-time and granular. Compellent needs policies and is not real-time. [NetApp] will be automatic and always move data real-time, rather than retroactively.”
What could have followed this — but didn’t — was a response from a representative from Compellent or someone familiar with their technology. Compellent, EMC, Oracle, and others all have strategies that involve mixing flash memory with conventional hard drives. It’s the rare article that discusses those types of connections. Oracle’s ZFS-products uses flash as a caching tier, automatically populating it with useful data. Compellent has a clever technique of moving data between storage tiers seamlessly — and customers seem to love it. EMC just hucks a bunch of SSDs into an array — and customers seem to grin and bear it. NetApp’s approach? It’s hard to decipher what it would mean to “move data in real-time, rather than retroactively.” Does that mean that data is moved when it’s written and then never moved again? That doesn’t sound better. My guess is that NetApp’s approach is very much like Compellent’s — something they should be touting rather than parrying. And I’d love to read that article.
Posted on July 1, 2011 at 6:46 pm by ahl · Permalink · 3 Comments
In: Flash · Tagged with: , , , , , , , , ,

Engineers and customers

Today I took the train out to Long Island to meet up with our New York sales team for a visit with a prospective customer. You never know with an initial meeting, but this one was great. I thought I’d share a bit about what made these guys so excited which is the same stuff that gets me excited about what we’re doing at Delphix.

First though, there are some engineers who have never spoken with a customer. There are some engineering organizations in which requirements are collected from customers, correlated by product managers, handed to engineering managers, and given to engineers. It’s a fine workflow, but this needs to be balanced against engineers engaging directly with customers, hearing their issues, and brainstorming solutions technologist to technologist. Engineers talking to a small number of customers may miss broad trends or fail to connect certain dots, but it’s a complementary activity and part of being a holistic engineer.

I’ve heard software engineers groan that the right technical decision was trumped by business concerns. Those people might be good engineers, but they aren’t great ones. Engineering can’t stop at the boundaries of software; it must necessarily consider the whole ecosystem of the product and the company. Yes, we might not have architected the feature this way if we didn’t have legacy customers support, but we do (and we should be happy for it). (And, of course, this logic can be taken to the other extreme with equally bad results.) This doesn’t mean that a great engineer collects all data first hand, but the whole system must be considered, and walking into a customer’s office from time to time is a reality check.

In today’s meeting, the customer was learning about Delphix for the first time. And they got it right away. As with many enterprises, they have a initiative around virtualization to enable more self-service and more empowerment of their developers. The data in their relational databases is a big anchor weighing down those efforts; the time and effort required to copy and provision databases is a huge drag. Smart guys, they oscillated between how Delphix works — a super-smart, database-optimized storage gateway — and what Delphix does — virtualizing their Oracle databases, bringing the agility and cost-savings of other virtualization technologies. And the slide-ware made real through a demo of the product GUI elicited an terse expression of comprehension: “That’s cool.”

And maybe the best reason for engineers to get into the field is to witness customers who get how cool the product is.

Posted on June 3, 2011 at 2:03 pm by ahl · Permalink · 7 Comments
In: Delphix, Software · Tagged with: 

Another reason Apple misses ZFS

Apple recently announced a new iMac model — in itself, only as notable as the seasons — but with an interesting option: users can choose to have both an HDD and an SSD. Their use of these two is absolutely pedestrian, as noted on the Apple store:

If you configure your iMac with both the solid-state drive and a Serial ATA hard drive, it will come preformatted with Mac OS X and all your applications on the solid-state drive. Then you can use the hard drive for videos, photos, and other files.

Fantastic. This is hierarchical storage management (HSM) as it was conceived by Alan Turing himself as he toiled against the German war machine (if I remember my history correctly). The onus of choosing the right medium for given data is completely on the user. I guess the user who forks over $500 for this fancy storage probably is savvy enough to copy files to and fro, but aren’t computers pretty good at automating stuff like that?

Back at Sun, we built the ZFS Hybrid Storage Pool (HSP), a system that combines disk, DRAM, and, yes, flash. A significant part of this is the L2ARC implemented by Brendan Gregg that uses SSDs as a cache for often-used data. Hey Apple, does this sound useful?

Curiously, the new iMacs contain the Intel Z68 chipset which provides support for SSD caching. Similar to what ZFS does in software, the Intel chipset stores a subset of the data from the HDD on the SSD. By the time the hardware sees the data, it’s stripped of all semantic meaning — it’s just offsets and sizes. In ZFS, however, the L2ARC knows more about the data so can do a better job about retaining data that’s relevant. But iMac users suffer a more fundamental problem: the SSD caching feature of the Z68 doesn’t appear to be used.

It’s a shame that Apple abandoned the port of ZFS they had completed ostensibly due to “licensing issues” (DTrace in Mac OS X uses the same license — perhaps a subject for another blog post). Fortunately, Ten’s Complement has picked up the reins. Apple systems with HDDs and SSDs could be the ideal use case ZFS in the consumer environment.

Posted on May 20, 2011 at 8:02 am by ahl · Permalink · 5 Comments
In: ZFS · Tagged with: , , ,

Delphix Next

It’s rare to get software right the first time. I’m not referring to bugs in implementation requiring narrow fixes, but rather places in a design that simply missed the mark. Even if getting it absolutely right the first time were possible, it would be prohibitively expensive (and time-consuming) so we make the best decisions we can, hammer it out, and then wait. Users of a product quickly become experts on its strengths and weaknesses for them.  Customers aren’t beta testers — again, I’m not talking about bugs — but rather they expose use cases you never anticipated, and present environments too convoluted to ever conceive at a whiteboard.

When I worked in the Fishworks group at Sun, we learned more about our market in the first three months after shipping 1.0 than we had in the 30 months we spent developing it. We found the product both struggling in unanticipated conditions, and being used to solve problems we could have never predicted.  Some of these we might have guessed earlier given more time, but some will never come to light until you ship. That you need to ship 1.0 before you can write 2.0 is a deeper notion than it appears.

I joined Delphix a couple of weeks before our formal launch at the DEMO conference. Since then, we’ve engaged in more proofs-of-contept (PoCs) and more customers have rolled us into use, and we’ve continued to learn of new use cases for Delphix Server, and found the places where we needed to rethink our assumptions. And we knew this would be the case — you can’t get it right the first time. Over the past several months, we in Delphix engineering have been writing the second version of the most critical components in our stack, incorporating the lessons learned with our customers. The team has enjoyed the opportunity to revisit design decisions with new information; it’s fun to feel like we’re not just getting it done, but getting it right.

When building 1.0, you make a mental list of the stuff you’d fix if only you had the time. In the next release, you figure out all the whizzy stuff you now can build on the stable foundation. We’re excited for the forthcoming 2.6 release — more so even for new ideas we found along the way that will be the basis for our future work. We’ve got a great team working on a great product. Check in on the Delphix blogs in the coming months for details on the 2.6 release and the other stuff we’ve got in the works.

Posted on May 13, 2011 at 3:34 pm by ahl · Permalink · Comments Closed
In: DTrace · Tagged with: ,

Delphix welcomes Matt Ahrens and George Wilson

This week it was my pleasure to welcome my former Sun colleague Matt Ahrens and George Wilson to Delphix. Matt and I studied computer science together at Brown and then joined Sun in 2001. Matt joined Jeff Bonwick to start ZFS while I worked on DTrace. George joined Sun in 1996, and worked in a variety of roles, joining the ZFS team in 2006 (just as I was leaving to help start Fishworks).

George and Matt bring an amazing knowledge of ZFS — the lower, and upper halves respectively — and are also just great engineers who are already contributing tangibly to the the success of Delphix. You can take a look at Matt’s old blog, or watch George in a bunch of videos (including one of him being interviewed by a muppet).

Posted on November 19, 2010 at 4:36 pm by ahl · Permalink · Comments Closed
In: Delphix · Tagged with: , , ,

Delphix CEO joins the blogosphere

Welcome to Jed Yueh, our CEO at Delphix, who recently posted his first blog entry.

With three clicks from our intuitive, consumer-grade interface, you can cut through a decade of frustration and redundancy in enterprise datacenters. And customers are quickly seeing the value of the gun vs. the sword: in our first two quarters of sales alone, we added Fortune 1000 customers across several industries, including financial services, telecommunications, high technology, retail, manufacturing, travel, consumer goods, and SaaS.

Stay tuned to the Delphix blog for more on what we’re up to.

Posted on November 15, 2010 at 5:32 am by ahl · Permalink · One Comment
In: Delphix · Tagged with: ,

ZIL analysis from Chris George

Chris George from DDRdrive put together a great presentation at the OpenStorage summit looking at the ZFS intent log (ZIL), and how their product is particularly well-suited as a ZIL device. Chris did a particularly interesting analysis of the I/O pattern ZFS generates to ZIL devices (using DTrace of course). With writes to a single ZFS dataset, writes are almost 100% sequential, but with activity to multiple datasets, writes become significantly more non-sequential. The ZIL was initially designed to accelerate performance with a dedicated hard drive, but the Hybrid Storage Pool found a significantly better ZIL device in write-optimized, flash SSDs.

In the 7000 series, the performance of these SSDs — called Logzillas — aren’t particularly sensitive to random write patters. Less sophisticated, cheaper SSDs are more significantly impacted by randomness in that both performance and longevity can suffer.

Chris concludes that NV-DRAM is better suited than flash for the ZIL (Oracle’s Logzilla built by STEC actually contains a large amount of NV-DRAM). I completely agree; further, if HDDs and commodity SSDs continue to be target ZIL devices, ZFS could and should do more to ensure that writes are sequential.

Posted on November 15, 2010 at 2:48 am by ahl · Permalink · Comments Closed
In: ZFS · Tagged with: , , , , ,