Adam Leventhal's blog

Month: November 2008

The Sun Storage 7410 is our expandable storage appliance that can be hooked up to anywhere from one to twelve JBODs, each with 24 1TB disks. With all those disks we provide several different options for how to arrange them into your storage pool: double-parity RAID-Z, wide-stripe double-parity RAID-Z, mirror, striped, and single-parity RAID-Z with narrow stripes. Each of these options has a different mix of availability, performance, and capacity, as described both in the UI and in the installation documentation. With the wide array of supported configurations, it can be hard to know how much usable space each will provide.

To address this, I wrote a Python script that presents a hypothetical hardware configuration to an appliance and reports back the available options. The script uses the logic on the appliance itself to ensure that the results are completely accurate: the same algorithms are applied as when the physical pallet of hardware shows up. This, of course, requires you to have an appliance available to query — fortunately, you can run a virtual instance of the appliance on your laptop.

You can download sizecalc.py here; you’ll need Python installed on the system where you run it. Note that the script uses XML-RPC to interact with the appliance, and consequently it relies on unstable interfaces that are subject to change. Others are welcome to interact with the appliance at the XML-RPC layer, but note that it’s unstable and unsupported. If you’re interested in scripting the appliance, take a look at Bryan’s recent post. Feel free to post comments here if you have questions, but there’s no support for the script, implied, explicit, unofficial, or otherwise.
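
For the curious, here’s a rough sketch of the kind of XML-RPC exchange the script makes. This is not the script’s actual code; the URL, port, and method name below are hypothetical placeholders, since the real interfaces are unstable and undocumented:

#!/usr/bin/env python
# Hypothetical sketch of querying an appliance over XML-RPC; the URL,
# port, and method name are placeholders, not the real (unstable,
# unsupported) appliance interfaces.
import xmlrpclib

APPLIANCE = 'catfish'   # appliance name or address
PASSWORD = '*****'      # root password (redacted)

url = 'https://root:%s@%s:215/RPC2' % (PASSWORD, APPLIANCE)
proxy = xmlrpclib.ServerProxy(url)

# Ask the appliance to size a pool for a hypothetical hardware setup.
profiles = proxy.storage.size({'jbods': 1, 'logzillas': []})
for p in profiles:
    print p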

Running the script by itself produces a usage help message:

$ ./sizecalc.py
usage: ./sizecalc.py [ -h <half jbod count> ] <appliance name or address>
<root password> <jbod count>

Remember that you need a Sun Storage 7000 appliance (even a virtual one) to execute the capacity calculation. In this case, I’ll specify a physical appliance running in our lab, and I’ll start with a single JBOD (note that I’ve redacted the root password, but of course you’ll need to type in the actual root password for your appliance):

$ ./sizecalc.py catfish ***** 1
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      11       2            22                18
raidz2 wide    False      23       1            23                21
mirror         False       2       2            22                11
stripe         False       0       0            24                24
raidz1         False       4       4            20                15
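
The columns relate in a simple way: the data drives are what’s left after setting aside spares, those drives are grouped into stripes of the listed width, and each stripe gives up its redundant drives (parity for RAID-Z, the extra copy for a mirror). Here’s a back-of-the-envelope check (my own sketch, not the appliance’s algorithm) against the 24 1TB drives in this JBOD:

# Rough check of the table above (my sketch, not the appliance's logic),
# assuming 24 1TB drives in a single JBOD.
def capacity_tb(total, spares, width, redundant, drive_tb=1):
    data = total - spares         # drives left after setting aside spares
    stripes = data // width       # full stripes of the given width
    return stripes * (width - redundant) * drive_tb

print capacity_tb(24, 2, 11, 2)   # raidz2:      2 stripes of 11   -> 18
print capacity_tb(24, 1, 23, 2)   # raidz2 wide: 1 stripe of 23    -> 21
print capacity_tb(24, 2, 2, 1)    # mirror:      11 two-way mirrors -> 11
print capacity_tb(24, 4, 4, 1)    # raidz1:      5 stripes of 4    -> 15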

Note that with only one JBOD no configurations support NSPF (No Single Point of Failure) since that one JBOD is always a single point of failure. If we go up to three JBODs, we’ll see that we have a few more options:

$ ./sizecalc.py catfish ***** 3
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      13       7            65                55
raidz2          True       6       6            66                44
raidz2 wide    False      34       4            68                64
raidz2 wide     True       6       6            66                44
mirror         False       2       4            68                34
mirror          True       2       4            68                34
stripe         False       0       0            72                72
raidz1         False       4       4            68                51

In this case we have to give up a bunch of capacity in order to attain NSPF. Now let’s look at the largest configuration we support today with twelve JBODs:

$ ./sizecalc.py catfish ***** 12
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      14       8           280               240
raidz2          True      14       8           280               240
raidz2 wide    False      47       6           282               270
raidz2 wide     True      20       8           280               252
mirror         False       2       4           284               142
mirror          True       2       4           284               142
stripe         False       0       0           288               288
raidz1         False       4       4           284               213
raidz1          True       4       4           284               213

The size calculator also allows you to model a system with Logzilla devices, the write-optimized flash devices that form a key part of the Hybrid Storage Pool. After you specify the number of JBODs in the configuration, you can include a list of how many Logzillas are in each JBOD. For example, the following invocation models twelve JBODs with four Logzillas in each of the first two JBODs:

$ ./sizecalc.py catfish ***** 12 4 4
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      13       7           273               231
raidz2          True      13       7           273               231
raidz2 wide    False      55       5           275               265
raidz2 wide     True      23       4           276               252
mirror         False       2       4           276               138
mirror          True       2       4           276               138
stripe         False       0       0           280               280
raidz1         False       4       4           276               207
raidz1          True       4       4           276               207

A very common area of confusion has been how to size Sun Storage 7410 systems and the relationship between physical storage and delivered capacity. I hope that this little tool will help to answer those questions. A side benefit should be still more interest in the virtual version of the appliance — a subject I’ve been meaning to post about, so stay tuned.

Update December 14, 2008: A couple of folks requested that the script allow for modeling half-JBOD allocations because the 7410 allows you to split JBODs between heads in a cluster. To accommodate this, I’ve added a -h option that takes as its parameter the number of half JBODs. For example:

$ ./sizecalc.py -h 12 192.168.18.134 ***** 0
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      14       4           140               120
raidz2          True      14       4           140               120
raidz2 wide    False      35       4           140               132
raidz2 wide     True      20       4           140               126
mirror         False       2       4           140                70
mirror          True       2       4           140                70
stripe         False       0       0           144               144
raidz1         False       4       4           140               105
raidz1          True       4       4           140               105

Update February 4, 2009: Ryan Matthews and I collaborated on a new version of the size calculator that now lists the raw space available in TB (decimal, as quoted by drive manufacturers) as well as the usable space in TiB (binary, as reported by many system tools). The latter also accounts for the sliver (1/64th) reserved by ZFS:

$ ./sizecalc.py 192.168.18.134 ***** 12
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
raidz2       False     14      8          280         240.00         214.87
raidz2        True     14      8          280         240.00         214.87
raidz2 wide  False     47      6          282         270.00         241.73
raidz2 wide   True     20      8          280         252.00         225.61
mirror       False      2      4          284         142.00         127.13
mirror        True      2      4          284         142.00         127.13
stripe       False      0      0          288         288.00         257.84
raidz1       False      4      4          284         213.00         190.70
raidz1        True      4      4          284         213.00         190.70
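
The conversion between the two columns is straightforward arithmetic; here’s a sketch of it (my approximation of the script’s math, checked against the rows above):

# Convert raw decimal TB to usable binary TiB, deducting the 1/64th
# sliver reserved by ZFS (my approximation of the script's arithmetic).
def usable_tib(raw_tb):
    raw_bytes = raw_tb * 10**12         # drive vendors quote decimal TB
    tib = raw_bytes / float(2**40)      # convert to binary TiB
    return tib * (1 - 1.0/64)           # less the ZFS sliver

print '%.2f' % usable_tib(240)   # raidz2 row above: 214.87
print '%.2f' % usable_tib(288)   # stripe row above: 257.84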

Update June 17, 2009: Ryan Matthews, with a bit of help, has again revised the size calculator to model adding expansion JBODs and to account for the now expandable Sun Storage 7210. Take a look at Ryan’s post for usage information. Here’s an example of the output:

$ ./sizecalc.py 172.16.131.131 *** 1 h1 add 1 h add 1
Sun Storage 7000 Size Calculator Version 2009.Q2
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      5           42          21.00          18.80
raidz1       False      4     11           36          27.00          24.17
raidz2       False  10-11      4           43          35.00          31.33
raidz2 wide  False  10-23      3           44          38.00          34.02
stripe       False      0      0           47          47.00          42.08

Update September 16, 2009: Ryan Matthews updated the size calculator for the 2009.Q3 release. The update includes the new triple-parity RAID wide stripe and three-way mirror profiles:

$ ./sizecalc.py boga *** 4
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      4           92          46.00          41.18
mirror        True      2      4           92          46.00          41.18
mirror3      False      3      6           90          30.00          26.86
mirror3       True      3      6           90          30.00          26.86
raidz1       False      4      4           92          69.00          61.77
raidz1        True      4      4           92          69.00          61.77
raidz2       False     13      5           91          77.00          68.94
raidz2        True      8      8           88          66.00          59.09
raidz2 wide  False     46      4           92          88.00          78.78
raidz2 wide   True      8      8           88          66.00          59.09
raidz3 wide  False     46      4           92          86.00          76.99
raidz3 wide   True     11      8           88          64.00          57.30
stripe       False      0      0           96          96.00          85.95
** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.

The Sun Storage 7000 Series launches today, and with it Sun has the world’s first complete product that seamlessly adds flash into the storage hierarchy in what we call the Hybrid Storage Pool. The HSP represents a departure from convention, and a new way of thinking about designing a storage system. I’ve written before about the principles of the HSP, but now that it has been formally announced I can focus on the specifics of the Sun Storage 7000 Series and how it implements the HSP.

Sun Storage 7410: The Cadillac of HSPs

The best example of the HSP in the 7000 Series is the 7410. This product combines a head unit (or two for high availability) with as many as 12 J4400 JBODs. By itself, this is a pretty vanilla box: big, economical, 7200 RPM drives don’t win any races, and the maximum of 128GB of DRAM is certainly a lot, but some workloads will be too big to fit in that cache. With flash, however, this box turns into quite the speed demon.

Logzilla

The write performance of a 7200 RPM drive isn’t terrific. The appalling thing is that the next best solution — 15K RPM drives — isn’t really that much better: a factor of two or three at best. To blow the doors off, the Sun Storage 7410 allows up to four write-optimized flash drives per JBOD, each of which is capable of handling 10,000 writes per second. We call this flash device Logzilla.

Logzilla is a flash-based SSD that contains a pretty big DRAM cache backed by a supercapacitor so that the cache can effectively be treated as nonvolatile. We use Logzilla as a ZFS intent log device so that synchronous writes are directed to Logzilla and clients incur only its roughly 100μs latency. This may sound a lot like how NVRAM is used to accelerate storage devices, and it is, but there are some important advantages to Logzilla. The first is capacity: most NVRAM maxes out at 4GB. That might seem like enough, but I’ve talked to enough customers to realize that it really isn’t, and that performance cliff is an awfully long way down. Logzilla is an 18GB device, which is big enough to hold the necessary data while ZFS syncs it out to disk, even when running full tilt. The second problem with NVRAM is scalability: once you’ve stretched your NVRAM to its limit, there’s not much you can do. If your system supports it (and most don’t) you can add another PCI card, but those slots tend to be valuable resources for NICs and HBAs, and even then there’s necessarily a pretty small number to which you could conceivably scale. Logzilla is an SSD sitting in a SAS JBOD, so it’s easy to plug in more devices and let ZFS use them as a growing pool of intent log devices.

Readzilla

The standard practice in storage systems is to use the available DRAM as a read cache for data that is likely to be frequently accessed, and the 7000 Series does the same. In fact, it can do a much better job of it because, unlike most storage systems which stop at 64GB of cache, the 7410 has up to 256GB of DRAM to use as a read cache. As I mentioned before, that’s still not going to be enough to cache the entire working set for a lot of use cases. This is where we at Fishworks came up with the innovative solution of using flash as a massive read cache. The 7410 can accommodate up to six 100GB, read-optimized flash SSDs; accordingly, we call this device Readzilla.

With Readzilla, a maximum 7410 configuration can have 256GB of DRAM providing sub-μs latency to cached data and 600GB worth of Readzilla servicing read requests in around 50-100μs. Forgive me for stating the obvious: that’s 856GB of cache. That may not suffice to cache all workloads, but it’s certainly getting there. As with Logzilla, a wonderful property of Readzilla is its scalability: you can change the number of Readzilla devices to match your workload. Further, you can choose the right combination of DRAM and Readzilla to provide the requisite service times with the appropriate cost and power use. Readzilla is cheaper and less power-hungry than DRAM, so applications that don’t need the blazing speed of DRAM can prefer the more economical flash cache. It’s a flexible solution that can be adapted to specific needs.

Putting It All Together

We started with DRAM and 7200 RPM disks, and by adding Logzilla and Readzilla the Sun Storage 7410 also delivers great write and read IOPS. Further, you can design the specific system you need with just the right balance of write IOPS, read IOPS, throughput, capacity, power use, and cost. Once you have a system, the Hybrid Storage Pool lets you solve problems with targeted solutions. Need capacity? Add disk. Out of read IOPS? Toss in another Readzilla or two. Writes bogging down? Another Logzilla will net you another 10,000 write IOPS. In the old model, of course, all problems were simple because the solution was always the same: buy more fast drives. The HSP in the 7410 lets you address the specific problem you’re having without paying for a solution to three other problems that you don’t have.

Of course, this means that administrators need to better understand the performance limiters, and fortunately the Sun Storage 7000 Series has a great answer to that in Analytics. Pop over to Bryan’s blog where he talks all about that feature of the Fishworks software stack and how to use it to find performance problems on the 7000 Series. If you want to read more details about Hybrid Storage Pools and how exactly all this works, take a look at my article on the subject in CACM, as well as this post about the L2ARC (the magic behind Readzilla) and a nice marketing pitch on HSPs.
