Genome Analysis in the Cloud and for the Masses

May 30, 2014

By Allison Proffitt 
 
May 30, 2014 | 2014 Best Practices Winner | When the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine took on the sequencing and analysis of genomic data from 14,000 individuals, they knew they had a problem of scale. The solution for getting the data analyzed and out to 300 collaborators worldwide won the team and their partners at DNAnexus a 2014 Bio-IT World Best Practices Award for IT Infrastructure and High Performance Computing. 
 
Heart disease and stroke kill millions of people each year. The World Health Organization predicts that by 2020, heart disease and stroke will become the leading cause of both death and disability worldwide. To expand our understanding of genetics’ role in heart disease, the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium enlisted five institutions to participate in a global sequencing project. 
 
HGSC served as the sequencing center for the consortium, sequencing samples collected at the five institutions from over 14,000 individuals—3,751 whole genomes and 10,771 whole exomes. The HGSC then used its own in-house Mercury pipeline, a semi-automated and modular set of bioinformatics tools for the analysis of next-generation sequencing data, to process the sequencing results.
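Conceptually, a modular pipeline of this kind chains self-contained stages, each with explicit inputs and outputs, so any stage can be swapped out or rerun in isolation. The sketch below illustrates that design in Python; the stage commands are placeholders standing in for generic alignment and variant-calling steps, not Mercury's actual components.

```python
"""Minimal sketch of a modular, semi-automated NGS pipeline.
Stage commands are placeholders, not the actual Mercury tools."""
import subprocess
import sys

# Each stage is named and self-contained; modularity means any stage
# can be replaced or rerun independently of the others.
STAGES = [
    ("map_reads",     "bwa mem -t 8 ref.fa {sample}.fq > {sample}.sam"),
    ("sort_bam",      "samtools sort -o {sample}.bam {sample}.sam"),
    ("call_variants", "example_caller --bam {sample}.bam --out {sample}.vcf"),
]

def run_pipeline(sample: str) -> None:
    for name, template in STAGES:
        cmd = template.format(sample=sample)
        print(f"[{name}] {cmd}")
        # A failed stage halts the run so it can be fixed and resumed
        # from that point, rather than restarting from raw reads.
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            sys.exit(f"stage {name} failed for sample {sample}")

if __name__ == "__main__":
    run_pipeline(sys.argv[1])  # e.g. python pipeline.py sample001
```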
 
With 18 Illumina HiSeq instruments, the sequencing center was well equipped to handle the load, said Narayanan Veeraraghavan, lead programmer scientist at the Human Genome Sequencing Center and technical project lead. “I don’t think the bottleneck is the sequencing, because we have the capacity.” The bottleneck was the analysis.  
 
HGSC estimated it would either need to quadruple its compute core capacity in the short term, or monopolize the in-house cluster for four to five weeks, locking all other users out of the system until the job completed. And once the data were processed, the results needed to be delivered securely and efficiently to 300 CHARGE collaborators worldwide.  
 
“It was pretty obvious that we just didn’t have the muscle to churn through this analysis within the given time. It probably would have taken us a few weeks if not a few months,” Veeraraghavan said. 
 
HGSC needed a solution for two main problems. “Number one is just being able to do the compute at scale and at a pretty fast pace. It became obvious that we needed to look to the cloud for compute capacity given the short turnaround time for the project,” Veeraraghavan said. “Beyond the compute, once the results were generated, we also wanted to create an environment where people can access the data securely, then be able to work with the data securely and do fancy science.”
 
To address these challenges, HGSC, DNAnexus, and Amazon Web Services (AWS) teamed up to deploy a cloud-based infrastructure that could handle this ultra-large-scale genomic analysis project quickly and flexibly, with zero capital investment.
 
“One of the reasons we went with DNAnexus and not directly with Amazon was because they have a platform with a user interface that is very intuitive, very easy for a person who has no programming background and no computation background… to just log on to his DNAnexus account, and just like he accesses his email, he can access these different pieces of data and tools,” Veeraraghavan said.
 
The DNAnexus API allowed the data to be delivered only to the researchers with the appropriate permissions to access it, and served as a repository for bioinformatics tools from HGSC and other collaborators. AWS has also provided DNAnexus with a Business Associate Agreement (BAA), allowing DNAnexus to offer best-in-class security and compliance with healthcare laws both in the US and internationally.
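The general shape of that permission model can be sketched with dxpy, the open-source Python client for the DNAnexus API. The project ID, user ID, and file names below are placeholders, and the exact call signatures should be checked against current dxpy documentation; this is a sketch of the access-control pattern, not the CHARGE project's actual setup.

```python
"""Sketch of permission-scoped data delivery via the DNAnexus
platform, using the open-source dxpy client. IDs and names are
placeholders; verify signatures against the current dxpy docs."""
import dxpy

PROJECT_ID = "project-xxxx"  # placeholder: a shared results project

# Upload an analysis result into the shared project.
result_file = dxpy.upload_local_file("sample001.vcf.gz", project=PROJECT_ID)

# Grant one collaborator read-only access; the platform enforces the
# permission server-side on every subsequent API call.
dxpy.api.project_invite(PROJECT_ID,
                        {"invitee": "user-collaborator", "level": "VIEW"})

# A collaborator with VIEW access can then locate and fetch results.
for item in dxpy.find_data_objects(classname="file", project=PROJECT_ID,
                                   name="*.vcf.gz", name_mode="glob"):
    name = dxpy.describe(item["id"])["name"]
    dxpy.download_dxfile(item["id"], name)
```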
 
The first analysis HGSC did on the cloud took 10 days and 21,000 cores—an estimated five times faster than on the local cluster. At the time of the Best Practices entry submission, the project had used 3.3 million Amazon EC2 core hours of compute time for the sequence analysis. Collaborators are using 860 TB of Amazon S3 and Glacier storage for intermediate files and results data. 
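For a sense of scale, a quick back-of-envelope from the reported figures (illustrative only; the suggestion that elastic scale-down accounts for the gap between ceiling and billed usage is an assumption, not a detail from the entry):

```python
# Back-of-envelope check on the reported figures (illustrative only).
peak_cores = 21_000                  # cores in the first cloud run
wall_clock_days = 10                 # duration of that run
ceiling = peak_cores * wall_clock_days * 24
print(f"fully saturated ceiling: {ceiling:,} core-hours")  # 5,040,000

billed = 3_300_000                   # EC2 core-hours at submission time
# The billed total can sit below the ceiling because an elastic
# cluster releases cores whenever a pipeline stage needs fewer of them.
print(f"reported usage:          {billed:,} core-hours")
```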
 
The Best Practices entry authors estimated that using the DNAnexus-Amazon solution saved HGSC about 30% over the costs of using the in-house cluster. 
 
“Our realization has been that it’s extremely cost-effective—probably much, much cheaper—to compute on a need basis, than having to organically expand our infrastructure just for this specific project. Once the project is done, the utilization is not that great. You still have that hardware and you have to maintain them, but there’s not much utilization,” Veeraraghavan said. 
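The arithmetic behind that point is straightforward: owned hardware has a fixed cost spread over however many hours it is actually busy, so its effective price per core-hour balloons when a bursty workload leaves it idle. A sketch with entirely hypothetical numbers:

```python
# Effective cost per core-hour of owned hardware at varying utilization.
# All figures below are hypothetical, for illustration only.
capex_plus_opex = 1_000_000      # 3-year cost of an expanded cluster ($)
lifetime_hours = 3 * 365 * 24    # hours in those 3 years
cores_owned = 4_000

for utilization in (1.0, 0.5, 0.1):
    busy_core_hours = cores_owned * lifetime_hours * utilization
    print(f"{utilization:>4.0%} busy -> "
          f"${capex_plus_opex / busy_core_hours:.3f} per core-hour")
# On-demand cloud pricing stays flat per core-hour no matter how
# bursty the workload is, which is the point of the quote above.
```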
 
The CHARGE samples are still coming in. Veeraraghavan says samples are batched for periodic sequencing and analysis, so the CHARGE workload arrives in discrete but heavy bursts. He expects to launch the next round of sequencing and analysis, with a larger dataset, sometime in late June or early July. And he’s happy to do it on the cloud.