Storage Performance or Data Resilience? Pick Any Two

Customers should not have to choose between performance and availability.

They should get both.

That isn’t just a radical idea; it’s reality at Pavilion.

Simply put, storage administrators do not need to make that tradeoff, and the sooner this is broadly understood, the better for all.

Pavilion has announced a series of features, automatically available to every customer, that provide both high performance and data resiliency. Each of these features solves a problem inherent in the legacy dual-controller storage architecture.

These Pavilion features enable customers to enjoy consistent, high performance along with controller redundancy, fast rebuilds from drive failures, and data protection from drive firmware issues.

No one else in the market can deliver this.

The first feature deals with an issue that has long plagued storage arrays. Traditional arrays are built around a dual-controller model, in which the two controllers can be configured as either active-passive, where one controller performs no I/O and simply stands by in case the first controller fails, or active-active, where both controllers serve I/O. In an active-active configuration, it is a best practice to limit each controller to no more than 50% utilization, so that if one controller fails, the survivor can take over the full load while still meeting performance requirements.

While that may be the “best practice”, far too often business pressure for greater application performance has necessitated running both controllers at top speed, often with disastrous results when one controller fails and performance goes off a cliff.

The Pavilion Hyperparallel Flash Array (HFA) solves this problem through our multicontroller architecture, which allows multiple controllers to be used in an N+1 configuration. All of the active controllers work together to provide greater performance, while standby controllers are ready to take over in the event of a failure. The result is that up to 16 controllers per 4RU array can be performing I/O at any given time, with additional controllers on standby. That provides far greater aggregate performance than is possible with a dual-controller model, while maintaining consistent performance in the event of a fault.
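The arithmetic behind the 50% rule and the N+1 model can be sketched in a few lines of Python. This is an illustrative model only, not Pavilion software: the function names are hypothetical, and it assumes identical controllers, an evenly spread load, and a standby that fully replaces one failed active controller.

```python
def max_safe_utilization(active: int) -> float:
    """Highest per-controller utilization at which the survivors can absorb
    the load of one failed controller when there is NO standby.
    Assumes identical controllers and an evenly spread load."""
    if active < 2:
        raise ValueError("need at least two active controllers")
    return (active - 1) / active


def capacity_after_failure(active: int, standby: int, failed: int = 1) -> float:
    """Fraction of the original active-controller capacity that remains
    after `failed` active controllers are lost, assuming each standby
    controller can replace exactly one failed active controller."""
    if active < 1 or standby < 0 or failed < 0:
        raise ValueError("invalid controller counts")
    replaced = min(failed, standby)
    remaining = max(0, active - failed + replaced)
    return remaining / active


# Dual-controller, active-active: each controller must stay at or below 50%.
print(max_safe_utilization(2))                      # 0.5
# Run both flat out instead, and a failure leaves half the capacity.
print(capacity_after_failure(active=2, standby=0))  # 0.5
# 16 active controllers plus one standby: a single failure costs nothing.
print(capacity_after_failure(active=16, standby=1)) # 1.0
```

Under these assumptions, the dual-controller array either gives up half its capacity as headroom or loses half its performance on a failure, while the N+1 layout keeps every active controller busy and still absorbs a single fault.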

The next feature is our Swarm RAID recovery. It is a reality of life that drives fail. RAID and erasure coding are the two primary methods used to protect data and maintain access when that drive failure inevitably occurs. RAID has been in use for decades, but has somewhat fallen out of favor lately. There is a wide range of RAID levels, each offering a different balance of data protection and performance. RAID is a well-known and proven technology, but its challenge is that rebuilding a drive from parity is very compute intensive and time consuming. The bigger the drives and the more drives in a given array, the longer the rebuild takes. It is not uncommon for array rebuilds to take hours or, in some cases, even days. During that time, performance is greatly impacted and, even worse, the risk of catastrophic data loss is increased.

Erasure coding addresses this problem by breaking data into smaller chunks, performing a calculation on them, and writing the result elsewhere. The greater the fault tolerance desired, the more calculated results must be written. The result is that, while still CPU intensive, erasure coding is faster than RAID. However, the cost comes in the form of consumed capacity: the amount of total space you give up for that better performance can be significant.

Again, Pavilion uniquely solves this through our multicontroller architecture. We use RAID 6, which provides dual parity protection to survive up to two drive failures. By using RAID 6, we avoid the capacity penalty of erasure coding. This enables customers to use more of their drives for data, not parity. In the event of a drive failure, multiple controllers “swarm” the RAID set to rebuild the failed drive far faster than a controller in a legacy array ever could. Using our swarm technology, the HFA can rebuild a drive at a rate of up to 1 TB every 15 minutes.
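To put those two claims into numbers, here is a small Python sketch. The function names, the 18-drive RAID set, and the 4 TB drive size are hypothetical examples chosen for illustration. The rebuild figure takes the quoted "up to 1 TB every 15 minutes" as a best case, so real rebuild times under load may be longer.

```python
def raid6_usable_fraction(drives: int) -> float:
    """Fraction of raw capacity left for data in a RAID 6 set.
    RAID 6 dedicates the equivalent of two drives to parity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least four drives")
    return (drives - 2) / drives


def best_case_rebuild_minutes(drive_tb: float, tb_per_15_min: float = 1.0) -> float:
    """Best-case time to rebuild one failed drive of `drive_tb` terabytes,
    assuming the rebuild sustains `tb_per_15_min` terabytes every 15 minutes."""
    if drive_tb < 0 or tb_per_15_min <= 0:
        raise ValueError("sizes and rates must be positive")
    return drive_tb / tb_per_15_min * 15.0


# Hypothetical 18-drive RAID 6 set: 16 of 18 drives' worth of space holds data.
print(f"{raid6_usable_fraction(18):.1%} usable")

# Hypothetical 4 TB drive rebuilt at the quoted best-case rate.
print(f"{best_case_rebuild_minutes(4.0):.0f} minutes")
```

The wider the RAID 6 set, the smaller the share of capacity spent on parity, which is why a fast rebuild matters: it shortens the window in which a wide set is running with reduced protection.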

The final part of our announcement is the availability of one of our patent-pending data integrity features. With larger SSDs, particularly in high I/O environments, there is an occasional error in which the drive reports to the metadata that a write has been performed even though the write has not actually occurred, resulting in data corruption. It is an infrequent error, but one that does happen and can have catastrophic results.

Pavilion solved this problem by adding a field to the CRC created by standard T10 DIF. Essentially, with standard T10 DIF, a CRC is computed when data is written and compared on read to confirm that the data is valid. But if a write to that data was supposed to have happened and did not, the CRC will not catch it, because the old data still matches its old CRC. Pavilion adds an additional checksum to confirm that the data is both valid and current.
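The difference between "valid" and "current" can be shown with a toy model in Python. This is not Pavilion's implementation, whose details are not described here. It is a minimal sketch that assumes a per-block generation counter kept outside the drive, uses `zlib.crc32` as a stand-in for the T10 DIF guard CRC, and uses hypothetical class names throughout.

```python
import zlib
from dataclasses import dataclass


class CorruptDataError(Exception):
    """Stored data does not match its own CRC."""


class StaleDataError(Exception):
    """Stored data is intact but older than the metadata expects."""


@dataclass
class StoredBlock:
    data: bytes
    crc: int         # validity check (stand-in for the DIF guard CRC)
    generation: int  # hypothetical extra field: which write produced this block


class ToyDrive:
    """Drive model that can acknowledge a write without persisting it."""

    def __init__(self) -> None:
        self.blocks: dict[int, StoredBlock] = {}
        self.drop_next_write = False

    def write(self, lba: int, block: StoredBlock) -> None:
        if self.drop_next_write:
            self.drop_next_write = False
            return  # the "lost write": reported as done, never stored
        self.blocks[lba] = block

    def read(self, lba: int) -> StoredBlock:
        return self.blocks[lba]


class ToyArray:
    """Keeps the expected generation of every block apart from the drive."""

    def __init__(self, drive: ToyDrive) -> None:
        self.drive = drive
        self.expected_generation: dict[int, int] = {}

    def write(self, lba: int, data: bytes) -> None:
        generation = self.expected_generation.get(lba, 0) + 1
        self.drive.write(lba, StoredBlock(data, zlib.crc32(data), generation))
        self.expected_generation[lba] = generation

    def read(self, lba: int) -> bytes:
        block = self.drive.read(lba)
        if zlib.crc32(block.data) != block.crc:
            raise CorruptDataError(f"LBA {lba}: CRC mismatch")
        if block.generation != self.expected_generation[lba]:
            raise StaleDataError(f"LBA {lba}: lost write detected")
        return block.data
```

In this model, if the drive drops the second write to a block, the old block still passes its CRC check, and only the generation comparison reveals that the data is stale.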

The Pavilion HFA already offered the highest performance density across block, file, and object workloads. Now we are the most resilient as well.

For more information, you can read our press release here.