Inside the HIVE, the FDA's Multi-Omics Compute Architecture

October 22, 2014

By Aaron Krol

October 22, 2014 | The FDA has always been a data-hungry organization, keeping extensive records on the clinical trials it uses to validate new drugs and medical devices. But as industry and academic organizations scale up their computing systems to handle the flood of multi-omics data now entering the life sciences, the agency is bracing itself for datasets orders of magnitude larger than it’s accustomed to working with.

The major driver is next generation sequencing (NGS), which can generate gigabytes of raw data just reading a single bacterial genome at high coverage. That volume quickly mounts up in larger studies, especially when working with human samples. While high-throughput sequencing has been a mainstay of basic biomedical research for several years now, the FDA has only recently started to see regulatory submissions that include NGS data. New drug submissions for antivirals, for instance, have lately included NGS screens to monitor for mutations that might confer drug resistance.

“We also anticipate receiving more submissions where next gen sequencing data might be used to support things like vaccine stability or safety of vaccine cell substrates to demonstrate that they’re not contaminated with unintended infectious agents,” says Carolyn Wilson, of the FDA’s Center for Biologics Evaluation and Research (CBER). “And of course there is a great deal of interest in using next gen sequencing to develop a new class of diagnostics,” she adds, in which NGS assays are themselves used as clinical tests — especially for rare hereditary disease and key cancer mutations. The FDA’s Office of In Vitro Diagnostics cleared the first of these new tests in November 2013, and with that precedent set, many more are likely to enter production in the near future.

“It’s clear that what’s come in to the agency is probably just the very tip of the iceberg,” says Wilson. “The use of next generation sequencing has seen exponential growth, and all of that is eventually going to hit our doors.”

To get ahead of NGS data submissions, CBER has implemented a custom distributed computing model called the High-performance Integrated Virtual Environment, or HIVE. The choice to build an in-house system, rather than adopt one of the commercial or open source platforms available for NGS storage and analysis, is partly due to the FDA’s unique security, access and quality control needs. But according to Vahan Simonyan, the lead scientist on the HIVE project, the team at CBER also has the chance to create a faster, more capable computing environment for this kind of data than any currently on the market.

Building the HIVE 

HIVE began over a decade ago, when Simonyan and his colleague Raja Mazumder, currently a professor at George Washington University, developed the code base for a new means of processing biological data. Since then, HIVE has split into two independent projects: the deployment at CBER, and a public version at GWU. While HIVE only entered use for internal FDA research last year, Simonyan says that many of its essential tools are already outpacing their commercial equivalents.

“HIVE is not just a conventional virtual cloud environment,” says Simonyan. “It’s a different system that virtualizes the services.” Most cloud systems store data on multiple servers or compute units until users want to run a specific application. At that point, the relevant data is moved to a server that acts as a node for that computation. By contrast, HIVE recognizes which storage nodes contain data selected for analysis, then transfers executable code to those nodes, a relatively small task that allows computation to be performed wherever the data is stored. “We make the computations on exactly the machines where the data is,” says Simonyan. “So we’re not moving the data to the computational unit, we are moving computation to the data.”

When working with very large packets of data, cloud computing environments can sometimes spend more time on data transfer than on running code, making this “virtualized services” model much more efficient. To function, however, it relies on granular and readily accessed metadata, so that searching for and collecting together relevant data doesn’t consume large quantities of compute time.
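
To make the contrast concrete, here is a minimal sketch of the compute-to-data idea, written in Python with invented names rather than HIVE’s actual API: a scheduler looks up which storage nodes already hold the requested datasets and ships the small executable to each of them, instead of pulling the data to a central compute server.

    # Minimal sketch of "moving computation to the data" (invented names,
    # not HIVE's API): executables travel to the nodes that hold the data.
    from collections import defaultdict

    class StorageNode:
        def __init__(self, name):
            self.name = name
            self.data = {}                      # dataset id -> locally stored reads

        def run(self, executable, dataset_ids):
            for ds in dataset_ids:
                executable(self.data[ds])       # computation happens in place

    class Scheduler:
        def __init__(self):
            self.location = {}                  # dataset id -> storage node

        def register(self, dataset_id, node):
            self.location[dataset_id] = node

        def submit(self, dataset_ids, executable):
            by_node = defaultdict(list)
            for ds in dataset_ids:              # group requests by the node holding them
                by_node[self.location[ds]].append(ds)
            for node, local in by_node.items():
                node.run(executable, local)     # ship the (small) code, not the (large) data

    node = StorageNode("node-1")
    node.data["run-42"] = ["ACGT", "TTGC"]
    sched = Scheduler()
    sched.register("run-42", node)
    sched.submit(["run-42"], lambda reads: print(len(reads), "reads processed on", node.name))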

HIVE’s solution is the honeycomb data model, which stores raw NGS data and metadata together on the same network. The metadata — information like the sample, experiment, and run conditions that produced a set of NGS reads — is stored in its own tables that can be extended with as many values as users need to record. “The honeycomb data model allows you to put the entire database schema, regardless of how complex it is, into a single table,” says Simonyan. The metadata can then be searched through an object-oriented API that treats all data, regardless of type, the same way when executing search queries. The aim of the honeycomb model is to make it easy for users to add new data types and metadata fields, without compromising search and retrieval.
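
As a rough illustration of that single-table, type-agnostic design (the objects and field names below are invented, not drawn from HIVE’s schema), every metadata record can live in one collection and answer the same kind of query, whatever its type:

    # Toy illustration of a single-table, type-agnostic metadata store.
    objects = [
        {"id": 1, "type": "sample",     "organism": "influenza A", "project": "vaccine-QC"},
        {"id": 2, "type": "experiment", "sample": 1, "instrument": "Illumina MiSeq"},
        {"id": 3, "type": "ngs-run",    "experiment": 2, "coverage": "1000x"},
    ]

    def search(**criteria):
        """All objects are queried the same way, regardless of type."""
        return [o for o in objects
                if all(o.get(k) == v for k, v in criteria.items())]

    print(search(type="ngs-run"))                    # find by object type
    print(search(instrument="Illumina MiSeq"))       # or by any metadata field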

Honeycomb also serves CBER’s security needs. Because the FDA works with potentially sensitive data on individual patients and clinical trials, it is important to control access carefully. When a large number of users need to access data on the same network, this can give rise to unwieldy security rules. However, HIVE stores metadata as objects in a branching hierarchy, making it possible to define access privileges based on each object’s position in the tree. Members of a given project, for instance, might be able to see data associated with a certain piece of metadata — say, a specific clinical trial — plus all the data organized below it.

“[This] allows us to have a high level of granularity in our security conditions, while minimizing the number of security rules,” says Simonyan. “We are working with not just a single hierarchy, but a forest of hierarchies… Any particular user can be a member of multiple tree branches, so when you share certain objects with a user, or a group, you can direct your permissions going down or up.”
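
In outline, that scheme can be sketched as a walk up the metadata tree (the structure below is invented for illustration and is not CBER’s actual rule engine): a single grant placed on one object covers everything organized beneath it.

    # Sketch of hierarchy-based access control: a grant on a metadata object
    # is inherited by every object below it in the tree.
    class Node:
        def __init__(self, name, parent=None):
            self.name, self.parent = name, parent
            self.grants = set()                 # users or groups granted access here

        def allowed(self, user):
            node = self
            while node is not None:             # walk up toward the root
                if user in node.grants:
                    return True                 # permission inherited from an ancestor
                node = node.parent
            return False

    trial   = Node("clinical-trial-X")
    arm     = Node("treatment-arm", parent=trial)
    ngs_run = Node("ngs-run-042", parent=arm)

    trial.grants.add("reviewer-group")          # one rule covers the whole subtree
    print(ngs_run.allowed("reviewer-group"))    # True
    print(ngs_run.allowed("outside-user"))      # False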

The Working HIVE 

The HIVE infrastructure is able to support all the standard tools in the bioinformatician’s kit, and CBER has already imported most of the basics: BLAST, BWA, Bowtie, Cufflinks and TopHat, phyloSNP, and many more. “When we adopt a tool from outside, we can run it much faster,” says Simonyan. “Because HIVE is a huge cluster environment, we can run multiple computations at the same time.” For example, he says, running Bowtie on a human genome at 100x coverage, which could eat up days on a desktop platform, can take as little as two hours in the HIVE infrastructure.

However, Simonyan, Mazumder and colleagues have also built a growing set of tools specifically for use in HIVE. One of the fundamentals is HIVE-hexagon, which was the subject of a paper in PLOS ONE this June. HIVE-hexagon is an aligner, mapping the short reads that come out of NGS instruments onto a reference genome to put them back in order.

Unlike popular aligners Bowtie and BWA, which try to map every read in a dataset to the reference, HIVE-hexagon begins by filtering out multiple reads that cover the same area of the genome. “Current technologies do produce huge amounts of coverage, even with smaller arrays,” says Simonyan, noting that viral genomes are often sequenced at 1000x read depth or higher, creating a great deal of duplicate data.

“Recognizing this level of redundancy, we use some double hashing and prefix tree algorithms to detect identical sequences,” he says. The result is that HIVE-hexagon works with a much smaller set of reads when it begins mapping them to the reference, saving time that would otherwise be spent repeatedly realigning reads that cover the same regions. Meanwhile, the read depth is stored separately as metadata, ensuring that information is not lost by the filtering algorithm.
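
A much-simplified stand-in for that step is shown below: identical reads are collapsed with an ordinary hash table and their multiplicity is kept off to the side. HIVE-hexagon’s double-hashing and prefix-tree implementation is considerably more involved, but the payoff is the same, with far fewer sequences reaching the aligner and read depth preserved as metadata.

    # Simplified illustration of redundancy filtering before alignment:
    # collapse identical reads, keep their counts as separate metadata.
    from collections import Counter

    def collapse_reads(reads):
        counts = Counter(reads)                 # identical sequences hash together
        unique_reads = list(counts)             # only these are sent to the aligner
        depth = dict(counts)                    # per-sequence multiplicity, kept aside
        return unique_reads, depth

    reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTGCAA", "TTGCAA"]
    unique_reads, depth = collapse_reads(reads)
    print(unique_reads)                         # ['ACGTAC', 'TTGCAA'] -> 2 alignments, not 5
    print(depth)                                # {'ACGTAC': 3, 'TTGCAA': 2}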

HIVE-hexagon was also built specifically to run in the HIVE cloud architecture, taking steps to minimize the use of random access memory in the system, which provides further efficiencies. Simonyan and Mazumder have reported aligning 100 million reads to the human reference genome in between 15 and 90 minutes, depending on the parameters used, saving hours over Bowtie or BWA. Because of the filtering of redundant reads, the efficiencies offered by HIVE-hexagon are greater at higher coverage.

HIVE also contains a suite of metagenomic tools. Metagenomics deals with the mishmash of genetic data created when sequencing mixed communities of microbes, which needs to be disentangled to reveal which organisms are present in a sample. Because most NGS data submitted to CBER is viral and bacterial, sorting out the species of origin is a high priority for HIVE users.

The central metagenomics tool in HIVE is CensuScope, which estimates the species composition of a sample by mapping random reads against the NCBI taxonomy database. Unlike most metagenomic algorithms, which focus on certain signature gene regions, CensuScope takes advantage of the speed of HIVE-hexagon to more thoroughly cover both the sample and the reference genomes.

“CensuScope is a critical tool,” says Simonyan. “In fact, we run it by default every time we deposit data into HIVE.” The speed of CensuScope allows CBER to use it not only on metagenomic samples, but also as a quality control measure to find contamination in any sample submitted to the HIVE database. “With certain samples, you get contamination from phages, or pollen,” Simonyan adds, “and these are important because we sometimes rely on mutations at such a low level that accidental contaminations may violate the picture.”
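
The sampling idea behind this kind of census can be sketched in a few lines; in the snippet below, the classify step is a stand-in for a real alignment against a reference taxonomy (for example via HIVE-hexagon), not CensuScope’s own code.

    # Sketch of a census-style composition estimate: sample random reads,
    # assign each to a taxon, report the proportions.
    import random
    from collections import Counter

    def estimate_composition(reads, classify, sample_size=500):
        sample = random.sample(reads, min(sample_size, len(reads)))
        hits = Counter(classify(r) for r in sample)     # one taxon label per read
        total = sum(hits.values())
        return {taxon: n / total for taxon, n in hits.items()}

    # Example with a trivial stand-in classifier:
    reads = ["ACGT"] * 900 + ["TTTT"] * 100
    print(estimate_composition(reads, lambda r: "virus-A" if r == "ACGT" else "phage-B"))
    # roughly {'virus-A': 0.9, 'phage-B': 0.1}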

Filling the HIVE 

The FDA’s maxi-HIVE deployment on its White Oak campus runs on 1500 CPU cores, and has roughly two petabytes — two million gigabytes — of storage. Within this network, Simonyan’s team has been experimenting with pre-loading important databases, as a backdrop to key analyses that CBER expects to run time and again.

For example, CBER has imported a large number of public databases describing cancer genomes into maxi-HIVE, giving it a base of information on mutations that indicate a certain prognosis or course of therapy. “Let’s say you’re analyzing a cancer tissue and looking at the mutations, and you’re interested in how it overlaps with known types of cancer,” says Simonyan. “We have tools that will do that mapping for you, and go across all the cancer databases we have uploaded into HIVE, and take your particular sample and quickly detect what annotations are relevant to your cancer genome.”
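
In outline, that lookup amounts to checking each variant called in a sample against the preloaded annotation tables; the sketch below uses invented table contents purely for illustration.

    # Hypothetical sketch of annotating a sample's variants against
    # preloaded databases (contents invented for illustration).
    annotation_dbs = {
        "db_A": {("chr7", 140453136, "A>T"): "kinase mutation, targeted-therapy marker"},
        "db_B": {("chr12", 25398284, "C>T"): "oncogene mutation, prognostic marker"},
    }

    def annotate(sample_variants):
        hits = []
        for db_name, table in annotation_dbs.items():
            for variant in sample_variants:
                if variant in table:
                    hits.append((db_name, variant, table[variant]))
        return hits

    sample = [("chr7", 140453136, "A>T"), ("chr1", 12345, "G>C")]
    for db, variant, note in annotate(sample):
        print(db, variant, note)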

Other Centers at the FDA are also beginning to look into HIVE as a platform for both internal research and clinical trial evaluation. While the architecture was built with NGS data in mind, the data-neutral API makes it easy to build out applications that support a wide variety of data types.

“We’ve been performing pilot studies to look at whether it could support postmarket surveillance data, clinical trial data, and potentially also preclinical data,” says Wilson. “We are also heavily engaged on the research side in various aspects of proteomics. For example, we’ve been looking at stem cell-derived therapies, and asking whether we can apply proteomics to identify new characteristics of those cells that would correlate with desired clinical outcomes.” To bring this kind of multi-omics analysis into HIVE, Simonyan’s team has been building mass spectrometry analytics tools to handle protein data.

As the use of HIVE continues to grow, the system will face new logistical challenges. Already, the FDA has used 25% of maxi-HIVE’s storage capacity, and use of the platform is only accelerating. Before long, CBER will have to work with the Department of Health and Human Services to develop a long-term policy for the storage of its regulatory data.

Still, one way or another, the FDA will soon have to confront the sea change in data management strategies that has already reached the life sciences industry. Implementing HIVE gives the agency a head start at a time when NGS data, and other massive datasets, are only just beginning to trickle into the regulatory process.

“It’s important to recognize that we’re living in a new era of biomedical technologies that have led in recent years to immense amounts of healthcare-relevant information,” says Simonyan. “Not just next generation sequencing, but the whole field of bioinformatics is changing.”