August 8, 2007 | This year has seen a torrent of reports in scientific journals and mainstream media announcing the results of studies linking a gene variant to an increased risk for common diseases — diabetes, Alzheimer’s disease, and heart disease, for example. As the New York Times stated recently, “The sudden spate of new genes marks an acceleration, and perhaps a turning point, in the ability to find disease genes.”
Recent breakthroughs include the discovery of multiple genetic variants linked to type 2 diabetes, thanks to genome-wide association (GWA) efforts by consortia led by the Broad Institute’s David Altshuler (Science 316, 1331-1336; 2007), Michigan’s Michael Boehnke (Science 316, 1341-1345; 2007), and Oxford University’s Mark McCarthy (Science 316, 1336-1341; 2007), as well as by deCODE Genetics. The Icelandic firm was also the source of a July report in Nature (doi:10.1038/nature06007) linking two variants to risk of atrial fibrillation, the most common heart arrhythmia. Interestingly, the mutations lie adjacent to a gene known to be involved in heart development. And in June, deCODE reported that a common variant on chromosome 9 affects the risk of myocardial infarction (Science 316, 1491-1493; 2007).
Alzheimer’s disease has been another focal point for GWA studies. A joint effort by the Phoenix-based Translational Genomics Research Institute and other Arizona Alzheimer’s organizations surveyed more than 500,000 SNPs and identified six variants in the GRB-associated binding protein 2 (GAB2) gene associated with late-onset Alzheimer’s disease (Neuron 54, 713-720; 2007). Another industry-academia collaboration with PGxHealth (New Haven, CT) studying schizophrenia used 500,000 markers to pinpoint several novel rare mutations in a cytokine receptor gene (Molec Psych 12, 572-580; 2007). The group concluded that this candidate gene could help explain epidemiologic data suggesting an increased risk for schizophrenia in individuals with autoimmune disorders, prenatal infection, and familial leukemia.
In May, reports in Nature (447, 1087-1093; 2007) and Nature Genetics (39, 870-874; 2007 and 39, 865-869; 2007) identified up to five common genetic variants that confer an increased risk of breast cancer. These findings emanated from a consortium led by Doug Easton (Cancer Research UK), David Hunter (Harvard School of Public Health), and deCODE.
Mapping disease-related genes and SNPs marks the beginning of a new genomic revolution that will see the convergence of sequencing data, genotyping data, differential gene expression data, pedigree and population data, historical/archived data, and clinical data to enable the assessment of disease risk, guide screening efforts, and pave the way for disease prevention strategies, early diagnosis, safer and more effective treatments, and better prognostic tools.
Several factors are driving the dramatic discovery of these gene-disease associations, notably exponential improvements in hardware and software that enable large-scale GWA studies. Cooperation among vendors to maximize the compatibility of their hardware and software has also made this evolving technology base easier to implement.
Hardware and software providers rightfully claim a contributory role in accelerating the long-awaited outcomes of large-scale GWA studies. Although it is the synergism between microarray platforms, software, and user expertise and feedback that represents the ultimate spark igniting the current firestorm of SNP/disease associations, vendors want to tout their contributions. For example, Affymetrix would like you to know that its technology was used by the Wellcome Trust Case Control Consortium for a major study of 14,000 cases of seven common diseases (Nature 447, 661-678, 2007; see p. 6). And Illumina technology was at the heart of recent GWA studies of prostate cancer and amyotrophic lateral sclerosis.
The Affymetrix Genome-Wide Human SNP Array 6.0 GeneChip, introduced in May, contains nearly one million probes for detecting single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) (see “Adding CNVs to the Mix”). In June, Illumina launched the Human1M DNA Analysis BeadChip, which carries more than 1 million SNPs and also includes about 50,000 probes targeted to genomic hotspots for CNVs.
These two microarray rivals are ironically furthering collaboration across industry and academia by advancing large-scale genotyping efforts. Affymetrix has a GeneChip Consortia Program and the Affymetrix Developers’ Network, a forum for software developers. The Illumina-Connect bioinformatics software partnership, launched in May, includes companies such as Biocomputing Platforms, GenoLogics, GeneData, GeneSifter, Golden Helix, InforSense, Ingenuity Systems, Partek Genomics, Rosetta BioSoftware, Sapio Sciences, and SAS/JMP Genomics. Partner companies create customized plug-ins that facilitate integration of their algorithms and statistics packages with the data generated on Illumina or Affymetrix platforms and software.
“Our partners clearly ... recognize that no single software solution will meet the needs of all researchers,” says Omoshile Clement, Illumina’s senior product manager of platforms. “The goal is to shorten the time and improve the quality of data analysis, ensure that the informatics component of GWA is not a bottleneck, and enable a seamless transfer of data from the microarray platform to diverse computational applications.” In June, Illumina initiated iControlDB, a genotyping control repository for use by researchers conducting case-control GWA studies. The aim is to provide access to a collection of genotypes from nearly 10,000 control samples.
A Turning Point
A host of new computational strategies and tools are increasing the speed and breadth of data analysis and the statistical power of GWA studies. They are also driving improvements in the QA/QC of genotyping projects and inspiring fresh confidence in the significance of disease association studies. This is vital as researchers move beyond the low-hanging fruit, or more intuitive disease-gene linkages, and begin discovering variations in genes that are not necessarily active in pathways or related to known disease processes.
“It is important to build into the software avoidance of flaws and statistical error, so users are not generating spurious association results,” says Andy Ferrin, VP of sales and marketing at Golden Helix. As with any experimental technique, GWA comes with noise and limitations. All too often, researchers do not avail themselves of the expertise of statisticians when designing an experiment, says Brian Gilman, president of Panther Informatics and SciLink. The main statistical challenge in GWA studies concerns the “potential for inherent flaws, especially when combining experimental platforms,” says Gilman. An example might be combining case-control data from an Affymetrix or Illumina platform with a more focused SNP analysis run on the Sequenom platform.
“The biggest use of software systems in large-scale genotyping projects today is for performing QC and QA,” says Gilman, referring to laboratory information management types of applications. Companies need to evaluate the raw data and verify the quality of the sample spotting, the positive and negative controls, the quality of the chips, and the data output.
Gilman says the improving working relationship between industry and academic partners represents a major change, enabling companies to recognize the limitations of their software platforms early on and to improve their commercial products rapidly. “I think people now better understand the use and limitations of a platform for a particular kind of study,” says Gilman, a shift that has contributed to the recent glut of GWA wins.
Tracking a disease-related SNP is one thing, but researchers want “to understand the biological context and significance of the statistical association,” says Megan Laurance, senior scientist at Ingenuity Systems. Indeed, many experts see a shift from simple genetic confirmation — reproduction of an association in a second population — to systematic functional confirmation studies, which validate the results by taking into account the biological context of the variation in specific genes.
Ingenuity’s flagship product, Ingenuity Pathways Analysis, takes experimentally derived gene lists as input and identifies which pathways, cellular processes, disease phenotypes, and molecular interactions are implicated for a particular set of genes.
“The challenge on the software side is to provide a fluid workflow,” says Laurance. For example, the UC-Santa Cruz “Golden Path” Genome Browser is a popular portal for identifying SNPs and mapping genes. Ingenuity ensures that its software supports research workflows from the browser, allowing users to extract and plug genes into Ingenuity software and put them into a biological context. Laurance adds that, “for pharma companies that typically have large bioinformatics/computational teams and might be ahead of public mapping efforts, we provide programmatic tools that facilitate a seamless link to our applications.”
Commenting on the importance of compatibility with the leading microarray platforms, Kevin Cramer, CEO of software producer Sapio Sciences, developer of Exemplar LIMS and Exemplar Analytics for genotyping data, says, “Ninety percent of the challenge is not in running the software but rather in the upfront loading and formatting of the data.” Data input must be as seamless as possible and accommodate researchers using various public domain and vendor tools, each with its own peculiar or proprietary data formats.
Illumina’s Clement identifies several key challenges, including the need for enough statistical power to mine large datasets, with studies designed to have at least an 80% chance of detecting a true association at the chosen P-value threshold. Ideally, novel computational and data analysis methods could increase statistical power without the need to move to larger sample sizes, by maximizing the signal-to-noise ratio of GWA studies. “The crown jewel will be the ability to resequence a specific SNP/gene and validate it as a biologically relevant marker for a particular disease,” says Clement.
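As a rough illustration of the trade-off between statistical power and sample size, the sketch below approximates the power of a simple allelic case-control test at a single SNP. Everything in it is an assumption made for illustration rather than anything Illumina describes: the function name, the normal approximation to a two-proportion test, the allele frequencies, and the Bonferroni-style threshold of 0.05 divided by 500,000 markers.

```python
import math
from statistics import NormalDist


def allelic_test_power(freq_cases, freq_controls, n_cases, n_controls, alpha):
    """Approximate power of a two-sided two-proportion z-test on allele frequencies.

    Hypothetical helper: each person contributes two alleles, the normal
    approximation is assumed to hold, and frequencies must lie strictly
    between 0 and 1.
    """
    alleles_cases = 2 * n_cases
    alleles_controls = 2 * n_controls
    std_err = math.sqrt(
        freq_cases * (1 - freq_cases) / alleles_cases
        + freq_controls * (1 - freq_controls) / alleles_controls
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(freq_cases - freq_controls) / std_err
    # Probability that the test statistic lands in either rejection tail.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


# Assumed genome-wide threshold: 0.05 spread across 500,000 tested markers.
ALPHA = 0.05 / 500_000

# Made-up allele frequencies (35% in cases, 30% in controls).
for n in (500, 1000, 2000, 4000):
    power = allelic_test_power(0.35, 0.30, n, n, ALPHA)
    print(f"{n} cases and {n} controls: power = {power:.2f}")
```

For a fixed effect, power rises with sample size. A method that improves the signal-to-noise ratio would, in this simplified picture, behave like a larger frequency difference at the same sample size.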
Trends and Trajectories
The incorporation of increasingly diverse forms of data in GWA studies, such as analyses of methylation patterns, is pressuring software firms to develop a variety of solutions. Another priority in the evolution of GWA software is to make it more accessible to a broader range of end-users, including clinicians who may be applying the results to patient populations to guide screening, diagnostic, prognostic, and treatment planning decisions.
“With the price of genotyping plummeting, the density of chips increasing,” and the number of GWA studies worldwide soaring, “it is no longer practical for all the data analysis to be done by statistical geneticists at large institutions,” says Ferrin of Golden Helix. The key is to balance powerful functionality with an intuitive yet flexible workflow that can be readily applied by non-programmers and non-statisticians.
As companies are increasingly becoming overloaded with raw microarray data, software vendors are moving toward novel data management solutions. These will also have an important role in helping end-users leverage historical datasets to perform large comparative, population-based studies. “There is a strong need to take legacy data and merge it with the data being generated today,” says Sasha Paegle, senior product manager at Rosetta BioSoftware.
Recognizing the “intimate relationship” between statistical geneticists, clinicians, and genotyping core groups, Rosetta designed its Syllego system to provide a project workspace for organizing genotype and reference data, to give context to the analysis results and provide a way to share the results. Says Paegle: “The goal is to enable scientists to solve their scientific question using computational methods or tools of their choosing.”
“Genetics is an ‘old’ field,” says Paegle, “and as a consequence there are many different file formats, which introduces some friction between collaborators,” who want to share both raw data and analysis results. “There is currently no consistent way to share data,” Paegle adds.
Partek’s GS software “combines powerful statistical analysis, interactive visualization, is intelligent about the human genome, and has a user interface designed to be used by the scientist — putting the analysis into the hands of the medical researcher,” says Partek president and CEO, Tom Downey.
Shifting the Bottleneck
The growing number of GWA data points is staggering, with 1-million SNP microarrays and GWA studies comprising up to 20,000 patient samples. Some groups are even contemplating studies of up to 100,000 patients, says Christophe Lambert, president and CEO of Golden Helix. The growth in genotyping chip density is even exceeding Moore’s Law, up some tenfold in the past 18 months, estimates Lambert. “When you consider the 2 gigabyte limit of 32-bit architecture processors, we’re beginning to hit a wall,” he says.
These expanding numbers have forced companies to evolve. Sapio Sciences took root four years ago, when the state of the art was the Affymetrix 10K Array, recalls CEO Cramer. Over the years the company has added clustering, haplotype analysis, quantitative trait analysis, gene ontology mapping, and, most recently, covariate analysis modules to its Exemplar Analysis Suite. Cramer says Sapio’s newest product release can fully utilize 3 GB of memory (making use of a 3-GB switch in the boot file of Windows XP Professional or Server), allowing users to process 50% more samples than with a traditional 2-GB system.
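The scale behind these memory figures can be checked with back-of-the-envelope arithmetic. The sketch below assumes a dense genotype matrix and two hypothetical encodings (one byte or two bits per genotype call). Neither encoding is attributed to any vendor, and real software also needs memory for everything besides the raw calls.

```python
GIB = 2**30  # bytes in one binary gigabyte


def genotype_matrix_bytes(n_snps, n_samples, bits_per_call):
    """Bytes needed for a dense SNP-by-sample matrix (hypothetical encoding)."""
    total_bits = n_snps * n_samples * bits_per_call
    return (total_bits + 7) // 8  # round up to whole bytes


def max_samples(n_snps, bits_per_call, budget_bytes):
    """Largest sample count whose full matrix fits inside the memory budget."""
    return (budget_bytes * 8) // (n_snps * bits_per_call)


# 1 million SNPs across 20,000 samples, the study sizes quoted above.
one_byte_per_call = genotype_matrix_bytes(1_000_000, 20_000, 8)
two_bits_per_call = genotype_matrix_bytes(1_000_000, 20_000, 2)
print(one_byte_per_call / GIB, two_bits_per_call / GIB)

# Samples that fit at 2 bits per call under a 2 GB versus a 3 GB ceiling.
fit_2gb = max_samples(1_000_000, 2, 2 * GIB)
fit_3gb = max_samples(1_000_000, 2, 3 * GIB)
print(fit_2gb, fit_3gb, fit_3gb / fit_2gb)
```

Under either assumed encoding, the full 20,000-sample matrix exceeds a 2 GB address space. Raising the ceiling from 2 GB to 3 GB increases the number of samples that fit by about half, in line with the 50% figure Cramer cites.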
But the common view that hardware, particularly memory capacity, limits the analysis of large datasets is “a big fallacy in the industry,” says Gilman of SciLink. Bioinformaticians are typically not computer scientists, nor are they trained in how to use memory efficiently and how to tweak a system or algorithm to maximize computer performance.
BC/SNPmax, from Finland’s Biocomputing Platforms, is a scalable database management system for GWA that facilitates the collection and integration of genotype and phenotype data and the interface with downstream data analysis tools. “We focus on the workflows in a study and [by creating a user-defined interface] make it possible to split a problem into small pieces, run these in parallel in a computer cluster, and link to different analytical tools,” says Timo Kanninen, Biocomputing’s founder and technology director.
The genome work pioneered in large genome centers is spreading to smaller groups that typically lack the tools or expertise to create their own software environments for managing data. From a funding perspective, Pekka Mäkelä, managing director of Biocomputing Platforms, observes that whereas grant money has been readily available to develop GWA hardware, funding agencies “now want to see a well-developed plan to manage and analyze the data” and are more willing to provide funds to support bioinformatics needs.
One could debate the relative merits of distributed or grid computing strategies for sharing the load of highly data-intensive analyses across multiple processors located either within a computer or across tens, hundreds, or thousands of CPUs. Regardless of the exact approach, grid computing offers a potential solution to the need for increased speed and memory capacity for GWA studies. Multiple computers can independently analyze small packets of data, such as discrete ranges of SNPs, and the accumulated data can then be combined, interpreted, and presented to the user.
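The split, analyze, and combine pattern described here can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's grid product: worker threads stand in for grid nodes, the per-SNP statistic is a basic allelic chi-square chosen for simplicity, and the function names and allele counts are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def snp_ranges(n_snps, chunk_size):
    """Split SNP indices 0..n_snps-1 into consecutive half-open (start, stop) ranges."""
    return [(start, min(start + chunk_size, n_snps))
            for start in range(0, n_snps, chunk_size)]


def allelic_chi_square(case_alt, case_total, ctrl_alt, ctrl_total):
    """Chi-square statistic (1 d.f.) for a 2x2 table of allele counts."""
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # monomorphic SNP or empty group: no evidence either way
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def analyze_range(counts, start, stop):
    """Work done by one node: score every SNP in its assigned range."""
    return [(i, allelic_chi_square(*counts[i])) for i in range(start, stop)]


def run_in_chunks(counts, chunk_size, workers):
    """Score each range independently, then merge and rank the results."""
    ranges = snp_ranges(len(counts), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: analyze_range(counts, *r), ranges))
    merged = [item for part in parts for item in part]
    return sorted(merged, key=lambda item: item[1], reverse=True)


# Made-up allele counts per SNP: (case_alt, case_total, ctrl_alt, ctrl_total).
counts = [(50, 100, 50, 100), (70, 100, 30, 100), (55, 100, 45, 100)]
ranked = run_in_chunks(counts, chunk_size=2, workers=2)
print(ranked[0])  # index and statistic of the strongest association
```

Because each range is scored without reference to any other, the same functions could be dispatched to separate machines; only the final merge needs to see all the results.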
Golden Helix has developed a GWA grid solution that can manage the data produced by the next generation of GWA platforms using chips with more than one million SNPs and tens of thousands of patient samples. Compared to the open source strategies often developed in academia to handle GWA data in batch jobs, Lambert says that this product can “interactively monitor the progress of these batch jobs within the context of our user interface.” That interface supports both the Condor Pool (a grid system developed at the University of Wisconsin) and United Devices’ Grid MP frameworks.
For now, though, according to Josh Forsythe, marketing director at Golden Helix, the company’s HelixTree software can perform interactive GWA studies on a standard computer in a reasonable time. For example, HelixTree was deployed in a recent study, conducted in several hundred patients using the Affymetrix 500K Array, that identified a schizophrenia susceptibility locus (Molec Psychiatry 12, 572-580; 2007). Forsythe adds that HelixTree has been successfully used internally in conjunction with the Affymetrix SNP GeneChip 6.0 Array.
In a similar vein, Agilent’s desktop tool, GeneSpring GT, has been used to analyze high volume, high density genotyping data in combination with GeneSpring GX for gene expression data analysis to study schizophrenia and bipolar disorder.
Prepping for Personalized Medicine
In May, SAS Institute hosted the FDA-led MicroArray Quality Control Consortium (MAQC), which is conducting a study of microarray technology. MAQC participants from SAS have been using the company’s JMP Genomics software to analyze the MAQC datasets and display results in a graphical format. The consortium recently formed a subgroup that will focus on GWA studies, evidence of the growing interest in using microarray data to predict disease risk and clinical outcomes as part of the trend towards personalized medicine.
JMP Genomics is a tool for importing, analyzing, and visualizing SNP, expression, and proteomics data. Designed for users with limited programming knowledge, it allows them to access advanced SAS genetic analysis tools. Russ Wolfinger, SAS’ director of scientific discovery and genomics, says the company’s software can be used to visualize genotyping data, plotting P values along a representation of a chromosome, with peaks revealing hotspots of association. “From this high level visual view of the data, the software allows you to highlight the most significant results and ‘drill down’” to focus on those gene regions, says Wolfinger.
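The idea of peaks marking hotspots of association can be sketched as follows. This is not JMP Genomics code; it is a minimal illustration in which the function name, the significance threshold, the grouping window, and all positions and P values are made up. Each P value is converted to -log10(P), the height it would have in such a plot, and neighboring significant SNPs are grouped into regions worth drilling into.

```python
import math


def hotspots(results, p_threshold, window):
    """Group significant SNPs into regions and record each region's peak.

    results: iterable of (chromosome, position, p_value) tuples.
    SNPs on one chromosome join a region when they lie within `window`
    base pairs of the region's last SNP.
    """
    hits = sorted(r for r in results if r[2] <= p_threshold)
    regions = []
    for chrom, pos, p_value in hits:
        height = -math.log10(p_value)  # the y-axis value in the plot
        last = regions[-1] if regions else None
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            last["end"] = pos
            last["peak"] = max(last["peak"], height)
        else:
            regions.append({"chrom": chrom, "start": pos, "end": pos, "peak": height})
    return regions


# Hypothetical association results: (chromosome, position, P value).
results = [
    (1, 1_000_000, 1e-9),
    (1, 1_050_000, 1e-12),
    (1, 5_000_000, 0.3),
    (2, 2_000_000, 1e-8),
    (2, 9_000_000, 0.04),
]
regions = hotspots(results, p_threshold=1e-7, window=100_000)
for region in regions:
    print(region)
```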
JMP Genomics can perform principal components analysis of genetics data to look for population substructures or family relationships. Principal components analysis is one of several dimension reduction techniques that can be used to identify patterns or trends in data.
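To show how a principal component can expose population substructure, here is a minimal sketch in plain Python. It is not how JMP Genomics implements the analysis: it uses power iteration to find only the first component, the function name is hypothetical, and the tiny genotype matrix (rows are samples, columns are SNPs coded 0, 1, or 2) is invented so that the first three samples resemble one another and the last three resemble one another.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (loadings, scores) for the first principal component.

    rows: samples-by-SNPs matrix of genotype codes. Columns are mean-centered,
    then power iteration is applied to X^T X (an assumed, simple method).
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]

    rng = random.Random(seed)
    v = [rng.random() for _ in range(m)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]  # X v
        v_new = [sum(centered[i][j] * scores[i] for i in range(n))
                 for j in range(m)]                                        # X^T (X v)
        norm = math.sqrt(sum(w * w for w in v_new))
        if norm == 0:
            break  # no variance left to explain
        v = [w / norm for w in v_new]

    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Made-up genotypes: two groups of three samples with different allele patterns.
genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 1, 2],
]
loadings, scores = first_principal_component(genotypes)
for sample, score in enumerate(scores):
    print(sample, round(score, 3))
```

In this toy dataset the two groups fall on opposite sides of zero along the first component, which is the kind of separation an analyst would look for as a sign of substructure.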
InforSense’s analytical workflow platform targets translational research projects and is designed to support clinical hypothesis generation. “Researchers use the company’s ClinicalSense product, which enables you to slice and dice through large datasets, to generate patient subpopulations such as responders versus non-responders,” explains CSO Jonathan Sheldon.
Describing that as a “logical entry point into a personalized medicine study,” Sheldon says that combining ClinicalSense with the newer GenSense genome-wide genetic analysis software package helps identify the genetic basis for these differences.
“How you interpret the statistical results and share them with the wider user community is a challenge for the GWA field,” says Sheldon, describing the interactive portal approach InforSense took in designing GenSense, which “enables non-statisticians to make biological sense of genetic studies carried out at the whole genome scale.”
Panther’s Gilman predicts that the next evolution of GWA technology will be to use the haplotype map as a tool for stratifying patient populations for clinical trial and pharmacogenomic applications.
In Spring 2007, Washington University’s Genome Sequencing Center (GSC) received a $156 million, four-year grant from the National Human Genome Research Institute (NHGRI). “We are moving more and more to a medicinal focus, re-sequencing regions of the human genome and targeting genes of interest for different diseases,” says GSC assistant director David Dooling. “We are using expression array and information on [CNVs] and loss of heterozygosity to inform decisions on which regions [of the genome] to re-sequence.”
For now, whole-genome SNP and CNV analyses are helping to map regions of interest for medical sequencing, but Dooling predicts, “whole-genome sequence is not too far off, making genome-wide analysis at the sequence level possible.”
Genizon BioSciences’ John Raelson agrees that whole-genome sequencing technology will likely dominate the genome analysis landscape within the next five years. “The real future [for GWA studies] will be in pharmacogenomics, incorporating a whole genome association study into a clinical trial to identify genes affecting response to a drug,” says Raelson.
Sidebar: Genizon Horizon
The Montreal firm Genizon BioSciences leverages its access to DNA samples from more than 50,000 subjects in the Quebec French Canadian founder population to study the genetic basis of 25 diseases (See “The Galileo Code,” Bio•IT World, Feb. 2005). The company has completed ten whole genome scans, including Crohn’s disease, asthma, Alzheimer’s disease, schizophrenia, and attention deficit hyperactivity disorder.
Genizon’s IT infrastructure incorporates an FDA-compliant LabVantage Sapphire LIMS, an eResearch Technology Clinical Database Management System with electronic data capture and signature capabilities, a 32-terabyte storage area network, and a 288-CPU computing grid for data analysis.
Borivoj (Bora) Stojkovic, Genizon’s software architecture leader, describes the company’s proactive approach to the inexorable growth in SNP densities and GWA sample sizes. Genizon simulates large datasets to identify and develop methods aimed at preempting current bottlenecks. By experimenting with compressed versions of the datasets, “we can load them into memory and still have access to all the data without chewing up all of our memory,” says Stojkovic. “We have to decompose the domain so we can converge to a smaller dataset,” partitioning the data into smaller blocks and then processing individual blocks using multiple CPUs.
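One common way to hold genotype data compactly in memory while keeping every call directly readable is to pack each call into two bits. The sketch below illustrates that general idea only; the source does not say which compression scheme Genizon uses, and the function names and genotype coding here are assumptions.

```python
def pack_genotypes(calls):
    """Pack genotype codes four to a byte, two bits each.

    Assumed coding: 0, 1, 2 = copies of the minor allele; 3 = missing call.
    """
    packed = bytearray((len(calls) + 3) // 4)
    for i, code in enumerate(calls):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def get_genotype(packed, index):
    """Read one call back without unpacking the rest of the array."""
    return (packed[index // 4] >> (2 * (index % 4))) & 0b11


# Made-up calls for ten samples at one SNP.
calls = [0, 1, 2, 3, 2, 2, 0, 1, 1, 0]
packed = pack_genotypes(calls)
restored = [get_genotype(packed, i) for i in range(len(calls))]
print(len(calls), "calls stored in", len(packed), "bytes")
```

Relative to one byte per call, this packing saves a factor of four. It does not by itself reach the larger reductions Stojkovic describes, which would require additional techniques.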
Stojkovic says Genizon’s latest compression strategies will reduce the firm’s file storage footprint “by a factor of up to 20.” At various compute-intensive stages of an analysis, Genizon may partition its databases among grids and clusters integrated into its pipeline. The company’s internal grid automatically distributes the data among thousands of CPUs in a cluster. To manage different types of data, Genizon standardizes on its own internal data format, funneling data from external sources through conversion filters.
Stojkovic emphasizes, however, that while distributed or grid computing offers definite advantages, the first goal should be to streamline and optimize the computational approach and the algorithms, improve integration and links between components, and tighten the language and flow of the code.