Real-world challenges of Cryo-EM in the cloud – Part 1

Cryo-EM for drug development

The availability of structural information on high-value pharmacological targets is critical to the successful design and development of new therapeutic agents. Structure-based drug discovery (SBDD) relies on techniques such as Cryo-EM to determine the 3D structures of biological molecules.

Advancements in the development of direct electron detectors and modern computational techniques have made it possible to create high-resolution structures of many particles and molecular assemblies using Cryo-EM.

Researchers want to focus on biochemistry, sample preparation, instrument calibration, and other lab techniques that improve outcomes. Instead, they are often held back by inadequate IT services, which they still depend on to complete the data collection, analysis, and structure determination phases of an experiment.

To procure and deploy the IT resources that Cryo-EM requires, many researchers are looking at cloud-based compute services. This post (the first of two) highlights real-world, data-related challenges in executing Cryo-EM pipelines at scale in cloud environments. A subsequent post will demonstrate how Pavilion can solve these data-related problems, helping maximize research potential.

Cryo-EM in the Cloud: Managing Time and Risk

New experiments need storage and data services before computational activities can begin. Inadequate compute and data services are often the reason for delays and analysis backlogs at many Cryo-EM facilities, resulting in days or weeks of wasted research time. Moreover, many of these workflow tasks are iterative and must be performed several times to produce an accurate result, which further compounds the problem.

Let's look at a real-world scenario to understand the storage-related problem and how it directly impacts workflow time.

Data Acquisition

For example, the latest detectors from Gatan, working in conjunction with a state-of-the-art Cryo-EM microscope such as the Titan Krios, can produce multiple terabytes of data per experiment per day. Meeting data capture needs of this kind (tens of GB/sec) requires a highly reliable, high-speed data platform that can sustain these ingest rates. In addition, this platform should scale to accommodate multiple experiments from multiple data sources. In the absence of such a platform, researchers immediately face a massive data storage bottleneck.

Many Cryo-EM facilities routinely try to deploy legacy NAS appliances for data acquisition. Unfortunately, these appliances lack the required throughput. To work around this, data is first cached on workstations with local SSDs and then manually destaged to a NAS appliance. If data is not migrated off the workstations fast enough, the microscopes stop recording. An additional data-prep process that converts raw data into analysis-ready datasets can also be held up until all the data resides on the NAS appliance. This first data copy immediately adds time and risk to the process.
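To make the "microscopes stop recording" risk concrete, here is a minimal back-of-the-envelope sketch. It assumes a constant ingest rate and a constant destage rate, which real facilities will not have, and the function name and all numbers are hypothetical, chosen only for illustration.

```python
def hours_until_cache_full(cache_tb, ingest_gb_per_s, destage_gb_per_s):
    """Estimate how long a local SSD cache lasts while data is destaged.

    Assumes constant rates and decimal units (1 TB = 1000 GB).
    Returns None when destaging keeps up with ingest, i.e. the cache
    never fills under these assumptions.
    """
    net_fill_gb_per_s = ingest_gb_per_s - destage_gb_per_s
    if net_fill_gb_per_s <= 0:
        return None
    cache_gb = cache_tb * 1000
    seconds = cache_gb / net_fill_gb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical figures: 20 TB of local SSD, 2 GB/s from the detector,
    # 1 GB/s destaged to the NAS appliance.
    print(hours_until_cache_full(20, 2.0, 1.0))
```

With these made-up figures the cache fills in about 5.6 hours, after which acquisition would have to pause until space is freed.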

Data Transfer

Once the data is available on a NAS appliance, the datasets must be reliably transferred to a public cloud. To mitigate the risk associated with data transfer, data is typically copied, rather than moved, from the data-generating research facility to cloud-based storage. For this purpose, customers typically need a specialized file-based data copy software package. Data integrity and error checks must be performed at the most granular file level to ensure that data is consistent between source and target systems. Adequate network bandwidth is an essential consideration for time-bound data transfers. This initial data copy can be expensive and may take a few days, depending on the choice of data copy mechanism, the amount of data, and the available network bandwidth. It also requires significant monitoring effort to ensure that all data has been transferred successfully. Customers commonly choose object storage buckets or services as the target for data transfer in a public cloud. The added transformation from the original file format to an object storage format can also increase the risk of data inconsistency.
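As an illustration of what a file-level integrity check can look like, the sketch below builds a checksum manifest for a directory tree and compares two manifests. It is a minimal example under stated assumptions, not a description of any particular copy tool: the function names are hypothetical, SHA-256 is just one reasonable choice of hash, and the target manifest is assumed to be produced separately on the destination side.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large movies never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(source, target):
    """Return (missing, mismatched) relative paths, judged from the source side."""
    missing = sorted(name for name in source if name not in target)
    mismatched = sorted(
        name for name in source if name in target and source[name] != target[name]
    )
    return missing, mismatched
```

A transfer would be considered complete only when both returned lists are empty; anything else identifies exactly which files need to be re-copied.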

Data Processing

While cheaper than conventional storage systems, object storage systems cannot deliver the latency and throughput required for large-scale data transformation and analysis. A low-latency, high-throughput filesystem is often the right choice.

Researchers are now faced with developing a custom software utility that selects the dataset of interest, identifies the objects that make up that dataset, and copies them to a high-performance SSD-backed filesystem. Once again, a rigorous integrity checking mechanism must be in place, because the data is transformed from object back to file as it is copied onto the filesystem. This is yet another complex, time-consuming, error-prone, and risk-inducing activity that must finish before any valuable scientific analysis can begin.
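The core of such a utility can be sketched in a few lines. The example below assumes that a dataset corresponds to a key prefix in the bucket and that an object listing is already available as a list of dictionaries with "key" and "size" fields; both are assumptions made for illustration, as are the function name and the paths. It only plans the copy and leaves the actual download to whichever storage client a facility uses.

```python
from pathlib import PurePosixPath


def plan_staging(object_listing, dataset_prefix, target_root):
    """Select the objects of one dataset and map each key to a file path.

    Returns a list of (object_key, target_path, size_in_bytes) tuples.
    The recorded size lets the copy step verify each file after download.
    """
    # Normalise the prefix so "session_a" does not also match "session_ab".
    prefix = dataset_prefix.rstrip("/") + "/"
    plan = []
    for entry in object_listing:
        key = entry["key"]
        if not key.startswith(prefix):
            continue
        relative = key[len(prefix):]
        if not relative or relative.endswith("/"):
            # Skip zero-length "folder" placeholder keys.
            continue
        target_path = str(PurePosixPath(target_root) / relative)
        plan.append((key, target_path, entry["size"]))
    return plan
```

Keeping the plan separate from the copy makes it easy to review what will be staged, and to re-run only the entries that fail a later size or checksum comparison.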

Once data is available on a high-performance filesystem, compute resources can be deployed, and the work of scientific analysis and molecular structure assembly begins. Using multi-core CPUs, and GPUs where relevant, a large compute complex is put to work to solve these structural biology problems. Once the results are complete and checked for accuracy, the newly produced data must be copied to an object storage location and associated with the original datasets. This critical final step is needed to ensure data lineage and experiment integrity. Developing and using a custom tool for it adds complexity to the process, and the copy back to object storage introduces yet another element of risk.
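One simple way to associate results with their source data is to write a small lineage record next to each result object. The sketch below is only an illustration of the idea: the record fields, the ".lineage.json" sidecar naming convention, and the function names are all hypothetical, not an established standard.

```python
import json


def build_lineage_record(result_key, source_keys, tool, parameters):
    """Describe which source objects and settings produced a result."""
    return {
        "result": result_key,
        "derived_from": sorted(source_keys),
        "tool": tool,
        "parameters": dict(parameters),
    }


def lineage_sidecar_key(result_key):
    """Hypothetical convention: store the record beside the result object."""
    return result_key + ".lineage.json"


def serialize_lineage(record):
    """Render the record as stable, human-readable JSON."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Uploading the serialized record under the sidecar key, in the same step as the result itself, keeps the link between a structure and its raw data from depending on anyone's memory or directory naming habits.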

Conclusion

Researchers want an environment in which scientific experiments and activities are not hampered by IT operations. Scientists are constantly finding new techniques and tools to improve the capabilities and accuracy of Cryo-EM outcomes, and they continually look to their IT counterparts to optimize the execution of these extremely data-intensive workflows. Data movement and copying are responsible for many drawn-out workflow timelines and research backlogs. Coupled with availability and scalability issues, this leaves many Cryo-EM facilities struggling to make full use of their microscope investments.

In part 2 of this blog, we will explain how Pavilion helps remove these data-related impediments so that research scientists can complete their workflows faster. In addition, Pavilion makes it possible for IT departments to deliver on the promise of unmatched performance and uptime, along with flexible scaling and deployment options, on-prem or integrated with cloud services. Together, research and IT can truly collaborate to fast-track new drug discovery activities.