September 27, 2011 | Guest Commentary | We have emerged from the “early adopter” phase of Next Generation Sequencing (NGS) and are well into a critical phase of rapid technology expansion. The goal is to make exome and whole-genome sequencing routine methods in the arsenal of discovery biology, accessible to biologists beyond those immersed in specialized genomic research. While sequencing vendors are diligently updating their platforms to simplify their use, the organizations supporting informatics infrastructure often lag behind, encumbered by a tangle of Perl and Python scripts, single-user software applications and limited availability of expertise—all of which impede the scaling of NGS to meet the diverse needs of the enterprise. In the rapidly evolving NGS space, it is critical that the supporting software infrastructure provide the versatility, agility and scalability to meet researchers’ needs.
Supporting Multiple Projects
Informatics teams supporting the use of NGS data need a harmonized NGS infrastructure that can be used across the organization in various applications. Life science research organizations, whether in academia or industry, tackle a broad spectrum of research activities, running the gamut from basic disease research to translational approaches to oncology and infectious disease therapies. Even within a single pharmaceutical company, use of NGS may extend from basic research on disease targets through toxicology profiling to stratified clinical trials.
Although individual computational elements are often shared among these groups, the NGS analysis pipelines differ widely because each group asks different questions of the data. Oncology researchers, for example, might be focused on radically reorganized tumor genomes that differ fundamentally from those of normal tissues. Researchers studying neurodegenerative diseases like Alzheimer’s might be looking for genes or promoter regions associated with the formation of amyloid plaques in the brain. An infectious disease group might be looking at mutations in microbial populations that compromise therapeutic interventions. Different research areas require different computational methods. Deploying next generation sequencing as a common assay platform across the enterprise means more than providing the raw sequence data; it means equipping diverse research groups with the computational pipelines they need while also ensuring that workflows can be shared efficiently and interpreted correctly by groups doing similar work.
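To make the idea of shared elements composed into different pipelines concrete, here is a minimal sketch in Python. Every function and pipeline name is purely illustrative (a stand-in for real alignment, variant-calling and assembly tools), and none of it corresponds to any specific vendor platform:

```python
# Hypothetical sketch: shared computational steps composed into
# research-area-specific NGS pipelines. All names are illustrative.

# Shared elements: each step is a plain function from data to data.
def quality_filter(reads):
    """Drop reads below an (illustrative) quality threshold."""
    return [r for r in reads if r["qual"] >= 20]

def map_to_reference(reads):
    """Stand-in for read mapping; here it just tags each read as mapped."""
    return [dict(r, mapped=True) for r in reads]

def call_snps(reads):
    """Stand-in for SNP calling on mapped reads."""
    return {"snps": [r["id"] for r in reads if r.get("mapped")]}

def assemble_de_novo(reads):
    """Stand-in for de novo assembly; reports a contig count."""
    return {"contigs": len(reads)}

# Area-specific pipelines reuse the same shared steps in different
# combinations; sharing the steps does not mean sharing the pipeline.
PIPELINES = {
    "oncology":   [quality_filter, map_to_reference, call_snps],
    "infectious": [quality_filter, assemble_de_novo],
}

def run_pipeline(area, reads):
    data = reads
    for step in PIPELINES[area]:
        data = step(data)
    return data
```

The point of the sketch is the shape of the problem: the individual steps are common assets, but each research area wires them together differently, so the platform has to support both reuse of the parts and divergence of the whole.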
Integrating New Methods
In the NGS space, tomorrow’s computational experiments will invariably differ from those of today. Massive, complex NGS datasets, changes in sequencing technologies and continuously evolving analytical methods require an agile software platform to keep pace. Moving from short-read technologies to the longer read lengths anticipated with the next round of NGS platforms will produce continuing change, prompting bioinformaticians to address new challenges such as combining short-read and long-read data in mapping experiments and implementing new methods for detecting variants.
Beyond changes in sequencing technologies, our understanding of the genome is also evolving at breakneck speed. Today, one area of vigorous research focuses on the relative importance of single-nucleotide polymorphisms (SNPs) and small indels versus large-scale structural variation. Recent work suggests, rather counter-intuitively, that large-scale structural variants may be extremely important in explaining normal phenotypic variation among humans. Computationally, this means currently popular methods for detecting SNPs and small indels by mapping short-read sequences against reference sequences may not reveal the most important genetic variants; de novo assembly methods must be used first and the resulting contigs aligned to reference sequences. Genomics is far from a settled science. To get full value from NGS data, research organizations must keep up with best practices, and enterprise NGS platforms must have the agility to rapidly integrate and deploy new computational methods and algorithms across the organization.
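A toy illustration shows why assembly-first matters. Here exact substring search stands in for real read mapping and a naive greedy suffix/prefix merge stands in for a real assembler, on made-up six-base reads; the sample carries a large insertion absent from the reference. Mapping alone leaves the insertion invisible, while assembling the reads first recovers it:

```python
# Toy illustration only: exact substring search stands in for read mapping,
# and a greedy suffix/prefix merge stands in for de novo assembly.
# The reference lacks the insertion ("TATATA") present in the sample.
REF = "AAACCCGGGTTT"                      # reference sequence
READS = ["AAACCC", "CCCTAT", "TATATA",    # reads tiled across a sample
         "TAGGGT", "GGGTTT"]              # that contains the insertion

def map_reads(reads, ref):
    """'Map' a read by exact substring search against the reference."""
    return [r for r in reads if r in ref]

def overlap(a, b, min_olen=2):
    """Length of the longest suffix of a that is also a prefix of b."""
    for olen in range(min(len(a), len(b)) - 1, min_olen - 1, -1):
        if a.endswith(b[:olen]):
            return olen
    return 0

def assemble(reads):
    """Greedy assembly: repeatedly merge the pair with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        olen, i, j = best
        if olen == 0:
            break
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs

print(map_reads(READS, REF))   # only the reads flanking the insertion map
print(assemble(READS))         # the assembled contig contains the insertion
```

Reads spanning the insertion simply fail to map, so a mapping-only pipeline reports nothing unusual at that locus; the assembled contig, aligned back to the reference, would reveal the structural variant.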
Scalability in Supporting HPC
To accelerate the analysis and interpretation of the huge volumes of data generated by today’s DNA sequencing instruments, organizations are implementing computational pipelines on high performance grid systems, either on premises or in the Cloud. Locally managed grid deployments of hundreds of nodes and thousands of computational cores are commonly used to handle the massive NGS computing requirements of larger organizations. The emergence of Cloud-based “utility computing” is bringing these once exotic computing architectures within the grasp of much smaller organizations, requiring the same “elasticity” from NGS software systems that Amazon and others are providing for computational hardware.
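A hypothetical scaling policy makes the elasticity requirement concrete. The function and its parameters below are illustrative, not part of any particular Cloud provider’s API:

```python
import math

def workers_needed(pending_jobs, jobs_per_worker=10,
                   min_workers=1, max_workers=100):
    """Illustrative elasticity policy: size the worker pool to the queue,
    bounded by a floor (keep the pipeline warm) and a ceiling (cost cap)."""
    wanted = math.ceil(pending_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))

# A quiet queue keeps the minimum pool; a burst of lanes from a new
# sequencing run scales out until the cost ceiling is reached.
print(workers_needed(0))      # 1
print(workers_needed(95))     # 10
print(workers_needed(5000))   # 100
```

However simple the policy, the software running the pipelines has to tolerate the worker pool growing and shrinking underneath it; that is the “elasticity” the hardware now makes possible and the NGS software layer must match.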
Beyond hardware, scalability of another sort is just as critical. Modern research organizations can reap enormous productivity benefits from software systems that facilitate sharing of computational assets, yet few software systems make this easy. The greatest productivity gains are realized by systems that minimize barriers to use of a common computational framework through which bioinformaticians can share knowledge that is otherwise difficult to capture and deploy. An organization’s ability to capture and share abstract computational knowledge within a collaborative framework will enable research, now and into the future. •
Clifford Baron is the product marketing director for Accelrys. He can be reached at Clifford.Baron@accelrys.com.
This article also appeared in the 2011 September-October issue of Bio-IT World magazine.