The “P” in Pavilion Actually Stands for Performance – Part 2

This is the second entry in the Pavilion blog series on three important design areas of storage products: Bandwidth, Latency, and Density. This part focuses on Latency and Density.

 

Latency

The access latency from the host to the media is composed of host storage stack latency, network stack latency, IO controller latency, and media access latency. The first three components are fairly standard; we minimize the number of memory copies and limit data touches to keep them to a minimum. The media access latency is largely governed by the type of media (NVMe NAND in our case) and the associated drive controller overhead.


One of the characteristics of NAND flash is its steep latency tail: an increased latency for a small percentage (typically <1%) of accesses, as shown in Figure 1 above, NAND Access Latency Tail. This is caused by an inherent property of NAND media. Writes are performed in a log-structured format, which leaves some blocks holding stale data. This garbage needs to be cleaned and the dirty blocks reclaimed by copying the still-valid data into new blocks and performing an erase cycle on the entire block. Read operations that target a block during an erase cycle experience much longer latency and contribute to the latency tail.

Further, normal writes also take significantly longer than reads. At the media level, typical read operations are in the 50-60µs range, write operations are in the 600-800µs range, and erase operations take 3ms or more. These numbers reflect the technology available today, and each component is improving over time. The net result is that read latency can become inconsistent depending on the temporal state of operations (a write or erase in progress) at the block in question, and the user can experience significant outliers.
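To make the effect of the tail concrete, here is a minimal, illustrative Python model, not a measurement of any drive: it samples read latencies from the ranges above and assumes that roughly 1% of reads collide with an in-flight write or erase.

```python
import random

# Illustrative media-level latencies in microseconds, taken from the ranges
# quoted above; the ~1% collision probability is an assumption for this sketch.
READ_US = (50, 60)
WRITE_US = (600, 800)
ERASE_US = (3000, 5000)
COLLISION_PROB = 0.01  # fraction of reads that land on a busy block

def read_latency_us() -> float:
    """Model one read: usually fast, occasionally stuck behind a write or erase."""
    latency = random.uniform(*READ_US)
    if random.random() < COLLISION_PROB:
        # The read has to wait for the in-progress program or erase to finish.
        blocker = WRITE_US if random.random() < 0.5 else ERASE_US
        latency += random.uniform(*blocker)
    return latency

samples = sorted(read_latency_us() for _ in range(100_000))
print(f"p50   = {samples[50_000]:7.0f} us")
print(f"p99   = {samples[99_000]:7.0f} us")
print(f"p99.9 = {samples[99_900]:7.0f} us")  # the long tail shows up here
```

Even though the median stays in the tens of microseconds, the highest percentiles jump by an order of magnitude or more, which is exactly the tail shown in Figure 1.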

The Pavilion Array uses two methods to mitigate the effect of these outliers. There are proposals in the NVMe standards body to control the scheduling of garbage collection and other media-management operations under the general categories of Advanced Background Operations (ABO) and IO Determinism. These allow the IO Controller to preemptively (and periodically) schedule ABOs to clean house and thus reduce the average latency at any given time. Some SSD vendors have started implementing these features in their new offerings. The Pavilion Flash Array uses this knowledge along with an additional mechanism to mitigate the latency outliers. As part of its enterprise reliability feature set, the Pavilion Array can tolerate up to two drive failures in an 18-drive group, and up to 8 failures system-wide across 72 drives. The IO Controller that schedules ABOs tracks the drives and blocks that are busy. When new reads arrive from the host that target a drive/block where an ABO is in progress, the IO Controller treats the drive under maintenance as a temporary failure and recovers the data from the complementary set of drives. While this adds some drive-access and data-recompute cost, it is significantly cheaper than waiting for the ABO to complete. The mechanism also uses heuristics and timers to decide when to access the complementary set and when to wait for the ABO.
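The read-path decision can be summarized in a short sketch. This is an illustration of the logic only, not Pavilion's actual code; the class, method names, and wait budget are assumptions made for the example.

```python
# Hypothetical sketch of the "read around an ABO" decision described above.
ABO_WAIT_BUDGET_US = 100  # heuristic wait budget before falling back

class DriveGroup:
    """An erasure-coded group of drives that tolerates a couple of busy members."""

    def __init__(self, num_drives: int):
        self.num_drives = num_drives
        self.under_abo = set()  # drives currently running Advanced Background Operations

    def start_abo(self, drive: int) -> None:
        self.under_abo.add(drive)

    def finish_abo(self, drive: int) -> None:
        self.under_abo.discard(drive)

    def read(self, drive: int, block: int) -> str:
        if drive not in self.under_abo:
            return f"direct read: drive={drive} block={block}"
        # Heuristic: wait briefly in case the ABO is about to finish ...
        if self._abo_completes_within(drive, ABO_WAIT_BUDGET_US):
            return f"direct read after short wait: drive={drive} block={block}"
        # ... otherwise treat the drive as a temporary failure and rebuild the
        # data from the complementary drives in the group.
        return f"reconstructed: block={block} from peers of drive={drive}"

    def _abo_completes_within(self, drive: int, budget_us: int) -> bool:
        # Placeholder: a real controller would consult its ABO schedule and timers.
        return False

group = DriveGroup(num_drives=18)
group.start_abo(drive=3)
print(group.read(drive=3, block=42))   # falls back to reconstruction
print(group.read(drive=7, block=42))   # unaffected drive, direct read
```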

A second cause of read-access outliers is a write followed by a read to the same location. The write can take a long time to complete, and the read can get stuck behind it. The Pavilion Flash Array keeps a cache of recent writes and avoids accessing the media if there is a hit. The drives also employ this mechanism, but they are limited by the amount of memory available in the drive controller.
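A minimal sketch of such a recent-write cache is shown below, assuming a simple LRU eviction policy; the class name and capacity are illustrative choices, not details of the array's implementation.

```python
from collections import OrderedDict

class RecentWriteCache:
    """Keep the most recent writes in memory so reads never wait on a program op."""

    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.entries = OrderedDict()  # LBA -> most recently written data

    def on_write(self, lba: int, data: bytes) -> None:
        # Remember the latest data for this LBA so a following read does not
        # have to queue behind the slow media-level program operation.
        self.entries[lba] = data
        self.entries.move_to_end(lba)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry

    def on_read(self, lba: int):
        # Serve from the cache on a hit; a miss falls through to the media.
        return self.entries.get(lba)

cache = RecentWriteCache()
cache.on_write(lba=100, data=b"freshly written block")
assert cache.on_read(lba=100) == b"freshly written block"  # hit: no media access
assert cache.on_read(lba=200) is None                      # miss: read the media
```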

Thus, the Pavilion Flash Array not only provides the lowest latency but also mitigates the latency tail (shifting it left) and provides consistent latency. This design advantage translates into a benefit for the end customer, who can now reduce the size of host-side buffers (and thus reduce cost or improve effectiveness), and get better application throughput and operational predictability.

 

Density

There are three dimensions to density as seen from a customer’s point of view:

  • Drive density, or the total amount of media available per RU
  • Compute density, or the total CPU processing capability available per RU
  • Network density, or the total number of network ports (× their bandwidth) available per RU

The best drive density available in the market today among designs based on standard U.2 NVMe SSDs is about 24 drives per 2RU. At 6.4 TB per drive, that is about 76 TB/RU. The Pavilion Flash Array can house 72 SSDs in 4RU, or about 115 TB/RU using the same 6.4 TB drives.
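The per-RU figures follow directly from the drive counts; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the TB/RU figures above (6.4 TB drives).
drive_tb = 6.4

u2_server_tb_per_ru = 24 * drive_tb / 2   # 24 drives in a 2RU server
pavilion_tb_per_ru  = 72 * drive_tb / 4   # 72 drives in a 4RU array

print(f"U.2 server: {u2_server_tb_per_ru:.1f} TB/RU")   # ~76.8 TB/RU
print(f"Pavilion:   {pavilion_tb_per_ru:.1f} TB/RU")    # ~115.2 TB/RU
```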

Typical dual-socket 2RU x86 storage arrays are built with two Intel Xeon E5-266x class CPUs with 14 or more cores at 2GHz+. The total compute capability is thus limited to these two CPUs, which perform both control-plane and data-plane functions. The Pavilion Flash Array, in a 4RU form factor, has 20 Broadwell-DE 1548-class CPUs with 8 cores at 2GHz each, dedicated to data-plane processing.

The maximum number of network ports in a typical 2RU server is either 8x40G or 8x100G (depending on the NIC silicon choice). Even though this is the nominal number of network interfaces, it is important to note that they cannot all run concurrently at full bandwidth due to inherent limitations on the number of PCIe lanes available. The Pavilion Flash Array can support up to 40x100G ports in a 4RU box.
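To see why the nominal port count cannot run at line rate, a rough PCIe lane budget makes the point. The figures used here (40 Gen3 lanes per socket, x16 per 100G NIC, x4 per U.2 drive) are assumptions typical of that generation of hardware, not a description of any specific product.

```python
# Illustrative PCIe budget for a dual-socket 2RU server, assuming PCIe Gen3,
# 40 lanes per socket, x16 per 100G NIC, and x4 per U.2 NVMe drive.
GEN3_LANE_GBPS = 8 * 128 / 130           # ~7.9 Gb/s usable per Gen3 lane

host_lanes = 2 * 40                      # two sockets, 40 lanes each
nic_lanes = 8 * 16                       # eight 100G NICs at x16 each
ssd_lanes = 24 * 4                       # twenty-four U.2 NVMe drives at x4 each

nominal_network_gbps = 8 * 100
host_pcie_ceiling_gbps = host_lanes * GEN3_LANE_GBPS

print(f"PCIe lanes available:            {host_lanes}")
print(f"PCIe lanes wanted (NICs alone):  {nic_lanes}")
print(f"PCIe lanes wanted (NICs + SSDs): {nic_lanes + ssd_lanes}")
print(f"Nominal network bandwidth: {nominal_network_gbps} Gb/s")
print(f"Host PCIe ceiling:         {host_pcie_ceiling_gbps:.0f} Gb/s")
```

Under these assumptions the NICs alone would want more lanes than the host provides, and the aggregate PCIe bandwidth falls short of the nominal 800 Gb/s of network capacity, before the SSDs are even counted.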

These significant differences in each dimension of density can be achieved because the system is built with the fabric, rather than the CPU, as the centerpiece. Once the scalable fabric is realized, the density tradeoffs operate within the constraints imposed by power, cost, and mechanical design parameters. We will discuss fabric design and scalability in a separate article.

Besides standard servers with U.2 SSDs, there are other storage arrays that use custom SSDs and can achieve good density. However, this path leads to a business model where in-house SSD designs have to keep pace with flash silicon vendor technology, and it may not be as economically feasible.

The end-customer benefits from a dense storage design because in addition to housing a large amount of media, the Pavilion Flash Array is sharable, manageable and provides ample network connectivity.

 

Summary

In summary, the Pavilion Flash Array provides performance advantages in multiple dimensions: Throughput, Latency, and Density. The Pavilion Flash Array can sustain this edge because of the structural advantage of being a fabric-centric array, as opposed to a CPU-centric traditional array. The two design concepts are contrasted in the figure below (Figure 2 – Traditional Flash Array vs. Pavilion Flash Array).

A traditional storage array is designed around the CPU as the central element. That CPU acts as both the data-plane (I/O) and control-plane processor, so the scalability and performance of this type of design are inherently dependent on the I/O and compute processing capabilities of the CPU. The Pavilion Flash Array is a fabric-centric design. Its scalability and performance are limited only by the radix of the fabric and by thermal/mechanical considerations. The fabric provides separate I/O channels for each CPU in the data plane, so performance can scale with the aggregate I/O and compute processing capabilities of all the CPUs. A separate management CPU controls the platform resources, services, and storage management.