Mapping the human genome was just the beginning for the life sciences discipline of bioinformatics. The goal now is to compare each person's genomic map against a healthy baseline to identify the DNA sequences associated with diseases such as diabetes, asthma, migraine, and schizophrenia. However, even with the most powerful compute, network, and storage resources on the planet, full genomic comparisons at that scale are not feasible.
The International HapMap Project offers a more efficient way to isolate genetic variants, using massively parallel processing with Hadoop® MapReduce and Spark® analytics to obtain statistically significant comparisons. HapMap would not exist without the collaboration of hundreds of scientists, data center architects, and software engineers, nor without the powerful data infrastructure at the Texas Advanced Computing Center (TACC), which comprises multiple large clustered computing systems, each designed for a different use case.
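This kind of comparison maps naturally onto Spark: each genetic variant can be tested for a case/control frequency difference independently, so the work parallelizes across the cluster. The sketch below is a minimal illustration of that pattern in PySpark; the input path, record layout, and job name are assumptions for the example, not details of the actual HapMap pipeline, and it assumes SciPy is available on the executors.

```python
# Minimal PySpark sketch of a per-variant case/control association test.
# The HDFS paths and record layout are illustrative assumptions.
from pyspark.sql import SparkSession
from scipy.stats import chi2_contingency  # SciPy must be on the executors

spark = SparkSession.builder.appName("variant-compare").getOrCreate()

# One record per variant: variant_id, case_alt, case_ref, ctrl_alt, ctrl_ref
# (allele counts in the affected and baseline groups).
records = (
    spark.sparkContext.textFile("hdfs:///genomes/allele_counts.tsv")
    .map(lambda line: line.split("\t"))
    .map(lambda f: (f[0], int(f[1]), int(f[2]), int(f[3]), int(f[4])))
)

def p_value(rec):
    """Chi-square test of the 2x2 allele-count table for one variant."""
    vid, case_alt, case_ref, ctrl_alt, ctrl_ref = rec
    _, p, _, _ = chi2_contingency([[case_alt, case_ref],
                                   [ctrl_alt, ctrl_ref]])
    return (vid, p)

# Keep variants whose allele frequencies differ at genome-wide significance
# (5e-8 is the conventional GWAS threshold; assumes nonzero counts per row).
significant = records.map(p_value).filter(lambda kv: kv[1] < 5e-8)
significant.saveAsTextFile("hdfs:///genomes/significant_variants")
```

Because each variant is tested independently, the job scales out with the cluster, which is what makes storage bandwidth, rather than any single node, the limiting factor.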
One of TACC's clusters, Wrangler, is dedicated to some of the most demanding data analytics workloads. TACC describes it as “the most powerful data analysis system allocated in the Extreme Science and Engineering Discovery Environment (XSEDE). The system is designed for large-scale data transfer, analytics, and sharing and provides flexible support for a wide range of software stacks and workflows. Its scalable design allows for growth in the number of users and data applications.”
Wrangler includes:
- Geographically replicated, high-performance data storage (10 PB per site)
- Large-scale flash storage tier for analytics, with 1 TB/s of bandwidth and 250M IOPS (see the note after this list)
- More than 3,000 embedded processor cores for data analysis
- Flexible support for a wide range of data workflows, including those using Hadoop and databases
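Taken together, the two flash-tier figures imply an average transfer size of about 4 KB when the system is driven at full rate: 1 TB/s ÷ 250 million I/O operations per second = 4 KB per operation, which matches the small-block access patterns typical of analytics and database workloads.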
Pavilion Benefits for Clustered Applications
Clustered applications like those running on Wrangler at TACC demand enormous storage performance, and centrally provisioning low-latency flash across a large cluster is a significant challenge. Pavilion Data Systems has had the opportunity to validate its 4U NVMe All-Flash Array with TACC for use on future big data projects.
While our technology is relevant for massively parallel operations in supercomputing, the underlying applications (Hadoop and Spark) can also be deployed on smaller enterprise systems backed by the Pavilion NVMe All-Flash Array.
A fundamental tenet of the Pavilion design is to leverage standards. There are no proprietary hardware controllers or proprietary host-based drivers. The system uses standard NVMe-over-Fabrics drivers, Intel processors, and Gigabit Ethernet or InfiniBand NICs. The beauty of this simplicity comes from the unique software that Pavilion has developed. In conversation with Chris Jordan, Data Management and Collections Group Lead at the Texas Advanced Computing Center, we learned that “TACC is committed to enabling discoveries that advance science and society through the application of advanced computing technologies. The performance offered by Pavilion Data Systems, along with their flexibility and adherence to standards, will help us to provide researchers with unique data-intensive computing resources.”
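To make that standards claim concrete, here is a minimal sketch of what attaching a Linux host to an NVMe-oF target looks like using only the stock nvme-cli tooling, with no vendor-specific driver involved. The target address and subsystem NQN below are placeholders for this example, not Pavilion-specific values, and the commands require root privileges.

```python
# Minimal sketch: attach a Linux host to an NVMe-over-Fabrics target using
# only the standard nvme-cli tooling (no proprietary host driver).
# TARGET_ADDR and TARGET_NQN are placeholders, not Pavilion-specific values.
import subprocess

TARGET_ADDR = "192.168.1.100"  # hypothetical array data port
TARGET_NQN = "nqn.2014-08.org.nvmexpress:example-subsystem"  # placeholder NQN

# Ask the target what subsystems it exports, over RDMA on the standard
# NVMe-oF service port 4420.
subprocess.run(
    ["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", "4420"],
    check=True,
)

# Connect to one subsystem; its namespaces then show up as ordinary
# /dev/nvmeXnY block devices, ready for a filesystem or a Hadoop data dir.
subprocess.run(
    ["nvme", "connect", "-t", "rdma", "-n", TARGET_NQN,
     "-a", TARGET_ADDR, "-s", "4420"],
    check=True,
)
```

Because the host side is just the in-box NVMe-oF initiator, the same procedure works against any spec-compliant target, which is the practical payoff of avoiding proprietary host-based drivers.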
Summary
Whether you are mapping genomes, streaming video content, or deploying a next-generation NoSQL database, the demands on your systems will outpace today’s server-side SSDs and proprietary All-Flash Arrays. If you want a standards-based approach that lets you dial in the exact storage capacity, at exactly the IOPS or bandwidth you need, rethink the traditional approaches: consider disaggregation, and consider Pavilion.