The “P” in Pavilion Actually Stands for Performance

Storage system designers constantly make engineering tradeoffs to maximize three key attributes – Throughput, Latency, and Density. This blog series delves into each of these separately and covers the thought process we went through when designing the Pavilion Storage Platform with respect to these attributes. This first post in the series focuses on Throughput.

Part 1 – Throughput

To understand its storage throughput characteristics, we can generalize the Pavilion Storage Platform as a dataflow system. Such a system can be viewed as having well-defined demarcation boundaries, and a bottleneck analysis in the spirit of Amdahl’s law identifies the weakest link in the chain, which determines overall throughput. Figure 1 (Generic Storage Array) shows the primary dataflow elements in a storage array.

Figure 1 – Generic Storage Array

The clients are shown as Host 0..N, and they connect to the Storage Array over a network. Three boundaries are identified as A, B, and C in the figure. The throughput provided by the Storage Array (as seen by the clients) is determined by the weakest of the three links A, B, or C. In most designs, Bandwidth (A) < Bandwidth (B) < Bandwidth (C), and thus throughput, which follows bandwidth, is determined by A, the Network Connection Layer.
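
As a minimal sketch of this weakest-link reasoning, the snippet below picks the bottleneck boundary from a set of illustrative bandwidth figures (the numbers are placeholders chosen to reflect the ordering above, not measurements of any particular array):

```python
# Minimal bottleneck model: client-visible throughput is bounded by the
# slowest of the three boundaries in Figure 1. Numbers are illustrative.
boundary_bandwidth_gbps = {
    "A: network connection layer": 12.5,   # host-facing links
    "B: IO controller":            25.0,   # controller-internal bandwidth
    "C: media interfaces":         50.0,   # aggregate SSD bandwidth
}

bottleneck = min(boundary_bandwidth_gbps, key=boundary_bandwidth_gbps.get)
print(f"Client-visible throughput ~ {boundary_bandwidth_gbps[bottleneck]} GB/s, "
      f"limited by {bottleneck}")
```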

This IO bandwidth allocation across A, B, and C stems primarily from the choice of processing elements available for designing an IO controller. If an IO controller is built from general-purpose CPUs, the IO bandwidth available is determined by the merchant silicon on the market. A custom ASIC could be designed to provide any IO bandwidth necessary, but building one rarely makes economic sense for every storage array vendor.

Silicon vendors such as Intel (for x86 CPUs) have provided about 40 lanes of PCIe per CPU since the Sandy Bridge generation. They have raised the lane speed of PCIe from Gen1 (2.5 GT/s) to Gen2 (5 GT/s) to Gen3 (8 GT/s), but have not increased the number of PCIe lanes per CPU. Further, these high-lane-count CPUs also have the highest core counts (compute capacity) and can be clocked at the highest frequencies. Other x86 vendors such as AMD have recently announced CPUs with much higher PCIe lane counts (64 and 128), but they sit in a different power envelope and some of those lanes are used for inter-socket communication as well. ARM vendors have similarly limited the number of PCIe lanes to well under 40. Silicon vendors make this design choice as they balance the processing and memory bandwidth capabilities of a given device. This leaves the storage array designer dividing roughly 40 PCIe lanes between networking and media interfaces, as shown in Figure 2, when a general-purpose CPU is used.
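
As a quick reference for those lane speeds, the back-of-the-envelope sketch below estimates usable per-lane bandwidth as transfer rate times encoding efficiency (8b/10b for Gen1/Gen2, 128b/130b for Gen3) and scales it to a 40-lane budget:

```python
# Per-lane PCIe bandwidth: transfer rate (GT/s) x encoding efficiency / 8 bits.
# Gen1/Gen2 use 8b/10b encoding; Gen3 uses 128b/130b.
generations = {
    "Gen1": (2.5, 8 / 10),
    "Gen2": (5.0, 8 / 10),
    "Gen3": (8.0, 128 / 130),
}

for gen, (gt_per_s, efficiency) in generations.items():
    gb_per_s = gt_per_s * efficiency / 8   # GB/s per lane, one direction
    print(f"{gen}: ~{gb_per_s:.2f} GB/s per lane, "
          f"~{40 * gb_per_s:.1f} GB/s across 40 lanes")
```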

Figure 2 – Storage System with General Purpose CPU

The designer, in this case, has to use the built-in PCIe switch to connect the storage media. The division of PCIe lanes between A and B (B and C are collapsed in a high-end x86 CPU) thus decides the throughput. Further, if multiple CPUs are used as shown in the figure, the designer must either access media attached to the other CPU via an inter-processor link and run a multiprocessor-aware OS (SMP or otherwise), or live with a divided media pool. In the latter case, storage media on one socket is not available to the other, and storage islands are created. Note that cross-socket accesses over inter-processor links result in non-uniform access latencies, and the inter-processor link itself can become the bottleneck.

In practice today, two-socket systems are the most common, but when they are used to build a storage array, one of the CPUs must be reserved for redundancy, so throughput is bound by the number of PCIe lanes that can be allocated to the network connection layer. Typical throughput is therefore limited to what ~16 lanes of PCIe Gen3 can deliver (40 lanes split into 16 toward the network and 24 toward storage), which roughly translates to ~12.5 GB/s or ~4M IOPS.
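
Working through that split with the same per-lane arithmetic gives the theoretical ceiling on the network side; the ~12.5 GB/s figure quoted above is a lower, practically achievable number, presumably reflecting protocol and software overhead:

```python
# Throughput of a 40-lane PCIe Gen3 budget split 16 (network) / 24 (media).
LANE_GBPS_GEN3 = 8.0 * (128 / 130) / 8          # ~0.985 GB/s per lane, one direction

network_lanes, media_lanes = 16, 24
network_gbps = network_lanes * LANE_GBPS_GEN3   # ~15.8 GB/s theoretical
media_gbps   = media_lanes * LANE_GBPS_GEN3     # ~23.6 GB/s theoretical

# Client-visible throughput is bounded by the network side of the split.
ceiling_gbps = min(network_gbps, media_gbps)
iops_4k = ceiling_gbps * 1e9 / 4096             # 4 KiB reads

print(f"Network ceiling: ~{ceiling_gbps:.1f} GB/s theoretical, "
      f"~{iops_4k / 1e6:.1f}M 4K IOPS")
```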

This is a fundamental limitation imposed by silicon availability, and the challenge for Pavilion was to come up with a design that surmounts this throughput limit and scales with the number of CPUs, does not divide the storage media pool, runs each IO processor under a uniprocessor OS, and still uses generally available merchant silicon. Such a scalable solution must also meet the practical realities of cost, schedule, and power density. The resulting design is the first generation of the Pavilion Storage Platform and is shown in Figure 3 (Pavilion Storage Platform) below.

Figure 3 – Pavilion Storage Platform

In order to overcome the limitation on PCIe lanes, we chose to shift the focus of the design from the CPU to a PCIe fabric. Once we could build a PCIe fabric with a high radix, we added the best-suited CPUs (not necessarily the ones with the most PCIe lanes). This allowed us to scale throughput linearly with the number of CPUs, limited only by the radix of the PCIe fabric and by power density. We further chose to build the fabric from the highest-port-density devices available, so scalability is limited only by the field widths that define the numbering space in the PCIe protocol.
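
The sketch below illustrates why throughput now scales with controller count rather than with a single CPU's lane budget. The 6 GB/s per-controller figure is implied by the first-generation numbers quoted at the end of this post (120 GB/s across 20 controllers); the fabric radix value is purely illustrative:

```python
# Aggregate throughput scales with the number of IO controllers attached to
# the PCIe fabric, up to the port count (radix) the fabric provides.
# per_controller_gbps follows from the first-generation figures quoted below
# (120 GB/s across 20 controllers); fabric_radix is an illustrative value.
def aggregate_throughput_gbps(num_controllers: int,
                              per_controller_gbps: float = 6.0,
                              fabric_radix: int = 20) -> float:
    usable = min(num_controllers, fabric_radix)
    return usable * per_controller_gbps

for n in (4, 10, 20):
    print(f"{n:2d} controllers -> ~{aggregate_throughput_gbps(n):.0f} GB/s aggregate")
```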

Building such a storage system involved developing a fabric using a combination of hardware and software that enabled the media pool to be available to all the CPUs, while allowing logical and physical management through a separate management plane. The challenges of maintaining independent logical domains while residing in the same physical domain will be discussed in a separate article.

The important point to note here is that the throughput limitation imposed by the number of PCIe lanes in a monolithic CPU is now removed. Further, each CPU can run an independent uniprocessor OS and fully saturate its network connection interface bandwidth, while operating in a lower power envelope. This is why a standard two-socket server cannot match the throughput of the Pavilion Storage Platform.

The new Ryzen CPUs from AMD, and others like them, have more PCIe lanes, but they consume significantly more power than the CPUs in a Pavilion Storage Platform. In the future, two-socket systems built from those new devices may improve throughput. It will be interesting to see whether such a design can balance compute capacity, memory bandwidth, and power density as well as the Pavilion Storage Platform does. From a 30,000-foot perspective, what we have built is effectively such a high-lane-count, many-core CPU, assembled from discrete low-power multi-core CPUs and an external fabric. As silicon vendors ride the technology curve, integrate these components, and arrive at a similar solution, we will embrace it and adapt it to provide even better solutions.

The key takeaway here is that the Pavilion Storage Platform delivers a design that brings these benefits to the end customer. In the first generation of the product, 20 IO controllers (CPUs) deliver an aggregate throughput of 120 GB/s and 20 million 4K read IOPS.
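
For scale, dividing those aggregate figures by the controller count gives the per-controller share they imply, and comparing the aggregate against the ~12.5 GB/s two-socket ceiling derived earlier works out to roughly a 10x difference (simple arithmetic on the numbers quoted in this post):

```python
# Per-controller share implied by the first-generation aggregate figures,
# compared against the ~12.5 GB/s two-socket ceiling discussed earlier.
controllers = 20
aggregate_gbps = 120.0
aggregate_4k_read_iops = 20e6
two_socket_ceiling_gbps = 12.5

print(f"Per controller: ~{aggregate_gbps / controllers:.0f} GB/s, "
      f"~{aggregate_4k_read_iops / controllers / 1e6:.0f}M 4K read IOPS")
print(f"Aggregate vs. two-socket ceiling: "
      f"~{aggregate_gbps / two_socket_ceiling_gbps:.0f}x")
```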

To be continued … keep an eye out for Part 2 (Latency and Density) of this series, coming soon!