Adam Leventhal's blog

Category: Delphix

Today marks my third anniversary of joining Delphix. Joining a startup, I knew there would be lots to learn — indeed, there’s been a lesson nearly every day. Here are my top three lessons from my first three years at a startup. Even if the points themselves should have been obvious to me, the degree of their impact certainly wasn’t.

3. Tar-Babies are everywhere

Generalists thrive at a startup — there are many more tasks and problems than there are people. The things you touch inevitably stick to you, for better or for worse. Early on I was the DTrace guy, and the OS guy, and the debugging guy. Later on I became the performance go-to, and upgrade guy, and (proudly) the git neckbeard. But I was also lab manager, and the cabler (running cat 6 when we moved offices, and frenetically stripping wires with my teeth as we discovered a second wiring closet), and the real estate guy. When I got approval to open a San Francisco office and asked about the next steps, the answer was “figure it out.” And so it goes for pretty much everything that happens at a startup. At big companies, roles are subdivided and specialists own their small domains. During my time at Sun I didn’t think about many of the things those specialists did: they seemingly just happened. Being at a startup makes you intimately aware of all the excruciating minutiae that make a company go.

The more you do, the more you own. The answer is not to avoid doing work that needs to be done, but to actively and aggressively recruit people to take over tasks. The stickiness was surprising, and the need to offload can be uncomfortable: you’re asking people to take on tasks — important, trivial, or unglamorous — so that you can take on some additional load.

Ownership is important, but you need to share the load or else these Tar-Babies will drag you down.

2. Hiring über alles

It’s not surprising that it’s all about the people. The right people and the right culture are the hardest problems to solve. What was surprising was how much work it is to recruit and hire the best folks. We have a great team at Delphix and I love working with them. The big lesson for me was that hiring is far and away the highest-leverage thing you can do for your startup. Hiring great people, great engineers, is hard and time consuming. I can’t count the number of coffees and beers I’ve had for Delphix — “first dates” with prospective hires.

College recruiting had been an important focus for me during my years at Sun. I had only been at Delphix a few weeks when I convinced our CEO that we should recruit from colleges, and left to interview at my alma mater, Brown University. Delphix had been more conservative on hiring; some people regarded college recruiting as a bit of a flier, but it has paid off in a huge way. Our first college hire, two years in, is now a clear engineering leader. We’ve expanded the program from just me going to a single school to a dozen engineers going to ten schools this fall. About a quarter of our development team joined Delphix straight out of college. The initial effort is high, and you then have to wait 6–9 months for new hires to show up. But done right, it can be the most efficient way to hire great engineers.

Work your ass off to hire great people; they’ll repay you for the time you’ve spent. If you don’t feel like hiring is a major time suck you’re probably doing it wrong.

1. Everything is your fault

The greatest blessing for a startup is also the greatest curse: having customers. Once people are paying for and using your product, they’ll find the problems with it and you’ll work all hours to fix them. The surprise was how many problems became Delphix problems. Our product has its tendrils throughout the datacenter: we touch databases, operating systems, hypervisors, networks, SANs, and storage. Any butterfly flapping its wings in the datacenter can result in a hurricane on Delphix. As both the new component and the new vendor, we now expect and encourage our customers to talk to us early when diagnosing a problem. Our excellent support team (see hiring above) has diagnosed problems ranging from poor network performance caused by a damaged cat 6 cable to over-provisioned ESX servers and misconfigured init.ora files. Obviously they’ve also addressed problems in our own software. But we always take an expansive view of the Delphix solution and work with the customer to chase the problem wherever it leads us.

This realization has also informed the way we build our software. We not only build our software to be resilient to abnormalities and to detect and report problems, but we also use tools to find problems early. A couple of years ago customers might connect Delphix to poor-performing storage — but that would just look like Delphix performing poorly. Now we run a series of storage benchmarks during every installation and report a grade. We build paranoia into our software and into our sales and support processes.
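For illustration, the check can be as simple as running a canned random-read workload against the candidate storage and grading the result. fio is a stand-in here, not necessarily what we ship, and the thresholds below are invented:

$ fio --name=grade --filename=/mnt/candidate/testfile --size=1g \
      --rw=randread --bs=8k --direct=1 --time_based --runtime=30 \
      --ioengine=psync

Grade the reported IOPS and latency against fixed thresholds (say, an A above 20,000 random-read IOPS), and surface a failing grade during installation — before poor storage can masquerade as a Delphix problem.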

As a startup it’s even more crucial to insulate ourselves against environmental problems, build facilities to detect problems everywhere, and own the total solution with customers.

Starting year four

I hope others can benefit from those lessons; it took me a while to fully realize them. I’m sure there will be many more as I start year four at Delphix. Leave a comment and tell me about your own startup successes, failures, and lessons learned.

I started working with flash in 2006 — fortunate timing, as flash was just beginning to take hold in the enterprise. I took to asking the customers I visited about flash. I’ll always remember the response from an early adopter when I asked how he planned to use the new, expensive storage: “We just bought it, and we have no idea.” It was a solution in search of a problem — the garbage can model at play.

Flash has evolved significantly since then, from a raw material used on its own to a component in systems of increasing complexity. I wrote recently about the various techniques being employed to get the most out of flash; all share the basic idea of trading compute and IOPS (abundant commodities) for capacity (still more expensive for flash than for hard drives). The ideal use cases are the ones that benefit most from that trade-off, ones where compression and dedup consume cheap compute cycles rather than expensive space on the NAND flash. Flash storage is best with data that contains high degrees of redundancy that clever software can squeeze out. With those loose criteria, it’s been amazing to me how flash storage vendors have flocked to the VDI use case. It’s certainly well-suited — big on IOPS, with nearly identical data from different Windows installs that’s easily compressed and deduped — but seemingly every flash vendor has decided that it’s one part — if not the part — of the market they want to address. Take a look at the information on VDI from various flash storage vendors: Fusion, Nimble, Pure Storage, Tegile, Tintri, Violin, Virident, Whiptail; the list goes on and on.

I worked extensively with flash until 2010, when I left Oracle for a startup. I ended up not sticking with flash precisely because it was — and is — such a crowded space. I’d happily bet on the space, but it was harder to pick one winner. One of the things that drew me to Delphix, though, was precisely its compatibility with flash. At Delphix we create virtual database copies by sharing blocks; think of it as dedup before the fact, or dedup without the runtime tax. Creating a virtual copy happens almost instantaneously, saving tremendous amounts of administration time, unblocking developers, and accelerating projects — hence our credo of agile data. Unlike storage-based snapshots, Delphix virtual copies are database aware, and provisioning is fully integrated and automated. Those virtual copies also take up much less physical space, while as many or more IOPS hit the aggregate of those copies. Sound familiar yet? One tenth the capacity with the same workload — let’s call it 10x greater IOPS intensity — is ideally suited for flash storage.
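The block-sharing idea is easy to see in terms of plain ZFS, the technology (as the posts below attest) that our engineers live in. A clone references the blocks of its source snapshot rather than copying them, so creating the copy is nearly free; a sketch with placeholder names:

$ zfs snapshot pool/db@prod           # point-in-time image of the source
$ zfs clone pool/db@prod pool/dbdev   # writable virtual copy; shares every block
$ zfs list -o name,used,refer pool/db pool/dbdev

The clone consumes new space only as it diverges from the snapshot. Delphix layers database awareness and automation on top, but that's the storage primitive.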

Flash storage is best when clever software can squeeze out redundancies; Delphix is that clever software for databases. Delphix customers are starting to combine our product with their flash storage purchases. An all-flash array that’s 5x the $/TB of disk storage suddenly becomes half the price of disk storage when combined with Delphix (5x the cost per terabyte divided by the 10x reduction in capacity) — with substantially better performance. We as an industry still haven’t realized the full potential of flash storage. Agile data through Delphix fills in another big piece of the flash picture.

I wish that none of our customers encountered problems with our product, but they do, and when they do we often access their systems remotely via a Webex shared screen. We remotely control their Delphix server to collect data (often using DTrace). While investigating a customer issue recently I developed a couple of techniques to work around common problems; I thought I’d share them in case others have similar problems — and as a note to my future self, who will certainly forget the specifics next time.
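For instance, a first look at a sluggish system is often a one-liner like this (illustrative, not any particular customer case), counting system calls by program:

$ dtrace -n 'syscall:::entry { @[execname] = count(); }'

Let it run for a few seconds, hit Ctrl-C, and the aggregation prints the counts, sorted.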

Copying and Pasting

Webex makes it fairly easy to copy text from the remote system and paste it locally: just select the text, and that implicitly copies it to the clipboard. I do this very often as I write DTrace scripts to collect data, and then want to record both the script and the output. To that end, the Mac OS X pbpaste(1) utility is unbelievably helpful; pbpaste emits the contents of the clipboard. For example, I’ll select text in the Webex and use pbpaste like this:

$ pbpaste | tee -a data.log

Doing that, I can both verify that I selected the right data and append it to the log of all data collected. Sometimes, though, the remote data is annoying to copy because I need to scroll up — the mouse latency over Webex can make this an exasperating experience. In those cases where the text I want to transfer is longer than a page, I do the following on the remote system:

$ cat output | gzip -9c | uuencode /dev/stdin
begin 644 /dev/stdin
M'XL(`..C4E`"`]5:W7_;-A!_#Y#_@>@P),&0A,<O5=X2=&LWH`_M]K`^%9TK
M2THBU+8\24[3C^UO'TG%L2D1,B6[0ZJG0+[[Z7CWN^,=PRQ-BZ?+?#:-%G<P
...

I then select the text, and back on my mac do this to dump out the data:

$ pbpaste | uudecode -o /dev/stdout | gzip -cd

By compressing and uuencoding the data, even large chunks of output easily fit on one screen. Here are the results on a large-ish chunk of data I copied from a customer system:

$ cat customer.data.txt | wc -l
 234
$ cat customer.data.txt | gzip -9c | uuencode /dev/stdin | wc -l
 44

234 lines would have had me tearing my hair out as I tried to capture the output, scrolling backward with 250ms screen refresh latency; 44 lines wasn’t bad at all. Depending on the exact text, I seem to get an 80–90% reduction in lines to copy. Many thanks to Brendan Gregg, who had mentioned this technique to me; I hadn’t appreciated it fully until I absolutely needed it.
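The two pipelines pack neatly into a pair of shell functions (the names are mine):

shrink() { gzip -9c | uuencode /dev/stdin; }        # run on the remote system
unshrink() { uudecode -o /dev/stdout | gzip -cd; }  # run locally

$ pbpaste | unshrink | tee -a data.log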

Screen Savers v. Thinking/Lunch

When diagnosing a problem on a customer system, we like to be as unobtrusive as possible, so it’s annoying to have to disturb the customer to enter a password because the screen lock kicked in while I was thinking about the next step in the investigation, or while I was getting something to eat. Many enterprise environments are configured so that the screen saver delay can’t be changed. A couple of weeks ago I spent a day bringing my laptop to meetings and to lunch (and everywhere else) just so that I could move the mouse at least every 15 minutes.

I didn’t want to modify the customer system (“I let you remotely access my computer, and you’re installing what?!”). Instead I wanted to programmatically move the mouse every so often on my system to ensure the remote system wouldn’t lock the screen. I couldn’t find anything pre-fab, but thanks to tips on Stack Overflow I pieced something together that wiggles the cursor around if it hasn’t moved in a little while. I could post it compressed and uuencoded in keeping with the theme above (it’s just 17 lines!), but instead I’ve added a github repo: github.com/adamleventhal/wiggle.

Happy Webex-ing

I hope people find these tips useful. Given my penchant for looking up past tips on my own blog, I’m sure at least my future self will be thanking me at some point…

Tonight, my Delphix colleague Zubair Khan and I presented the integration we’ve done with git at the SF Bay Area Large-Scale Production Engineering meetup. When I started at Delphix, we were using Subversion — my ire for which the margins of this blog are too narrow to contain. We switched to git, and in the process I became an unabashed git fanboy.

Git is a powerful tool generally, but its hook points in particular let us enforce our code integration criteria and do some handy things after we integrate. For this, we wrote some custom bash scripts and python integrations with Bugzilla and Review Board. You can check out the slides, and we’ve open sourced it all on github with the hope that it might help people with their own integrations.
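Our actual scripts are on github, but as a flavor of what a hook can enforce, here's a hypothetical commit-msg hook (not our code) that rejects any commit whose message doesn't lead with a bug number:

#!/bin/bash
# commit-msg: git invokes this with the path to the proposed commit message.
msg_file=$1
if ! head -1 "$msg_file" | grep -qE '^[0-9]+'; then
    echo "commit message must start with a Bugzilla bug number" >&2
    exit 1    # a non-zero exit aborts the commit
fi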

ZFS recently celebrated its informal 10th anniversary; to mark the occasion, Delphix hosted a ZFS-themed meetup for the illumos community (sponsored generously by Joyent). Many thanks to Deirdre Straughan, the new illumos community manager, for helping to organize and for filming the event. Three of my colleagues at Delphix presented work they’ve been doing in the ZFS ecosystem.

Matt Ahrens, who (with Jeff Bonwick) invented ZFS back in 2001, started the program with a discussion of a new stable interface for ZFS. Initially libzfs had been designed as a set of helper functions in support of the zfs(1M) and zpool(1M) commands; since then, it has outgrown those humble ambitions and a new, simple, stable interface is needed for programmatic consumers of ZFS. In Matt’s talk and blog post, he lays out a series of guiding principles for the new libzfs_core library; he’s already started to implement these ideas for new ZFS features in development at Delphix.

John Kennedy has been working on a relatively neglected part of illumos: automated testing. At the meetup John spoke about the work he’s been doing to revitalize the ZFS test suite, and to build a unit testing framework for illumos at large. I found the questions and enthusiasm from the people in the room particularly encouraging — everyone knows that we need to be doing more testing, but until John stepped up, no one was leading the charge. The ZFS test suite is available on github. Take a look at John’s blog post to see how to execute the ZFS test suite, and how you can contribute to illumos by helping him diagnose and fix the 60+ outstanding failures.

Chris Siden has been at Delphix only since he graduated from Brown University this past spring, but he’s already made a tremendous impact on ZFS. Chris presented the work he’s done to finish what Basil Crow (also of Brown, and soon full-time at Delphix) started on ZFS feature flags (originally presented to the ZFS community by Matt back in May). Previously, ZFS features followed a single, linear versioning; with Chris and Basil’s work it’s no longer a land grab for the next version number — instead, each feature can be enabled discretely. Chris also implemented the world’s first flagged ZFS feature, Async Destroy (known to the feature-flags system as com.delphix:async_destroy), which allows datasets to be destroyed asynchronously in the background — a huge boon when destroying gigantic ZFS datasets. Chris also presented some work he’s been doing on backward compatibility testing for ZFS; check out his blog post on both subjects.
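Here's how it surfaces at the command line under the feature-flags scheme (the pool name is a placeholder; syntax as the work landed in illumos):

$ zpool set feature@async_destroy=enabled tank
$ zpool get feature@async_destroy tank
NAME  PROPERTY               VALUE    SOURCE
tank  feature@async_destroy  enabled  local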

The illumos meetup was a great success. Thank you to everyone who attended in person or on the web. To get involved with the ZFS community, join the illumos ZFS mailing list, and for information on the next illumos meetup, join the group.

It’s my pleasure to welcome Matt Amdur to Delphix, to the world of DTrace, and — just today — to the blogosphere. Matt joined Delphix about two months ago, after 10 years of software engineering, most recently at VMware. Matt and I met at Brown University in 1997, where we worked together closely for all four years. We’ve had in the back of our minds that our professional lives would converge at some point; I couldn’t be happier to have my good friend onboard.

Matt’s first blog post is a great war story, describing his first use of DTrace on a customer system. It was vindicating to witness firsthand, both in how productive an industry vet could be with DTrace after a short period and in what a great hire Matt was for Delphix. Working with him has been evocative of our collaboration in college — while making all the more apparent the significant distance we’ve both come. Welcome, Matt!
