Aug 15, 2005 | The Human Genome Project has provided gene hunters with a rich terrain to search for errant genes responsible for a host of diseases. But as countless researchers can attest, some disease genes are easier to pinpoint than others. In particular, the quest for genes that underlie extremely rare disorders or contribute to complex traits confounded by environmental factors remains a formidable challenge.
Category: Discovery and Basic Research
Title: Whole-Genome SNP Scanning Pipeline to Identify Disease-Causing Lesions
Organization: Translational Genomics Research Institute (TGen)
Partners: Affymetrix and Silicon Genetics
At the Translational Genomics Research Institute (TGen) in Phoenix, Ariz. (see “A Diamond in the Desert,” December 2003 Bio-IT World, page 26), Dietrich Stephan, David Craig, and colleagues have developed an industrial-style informatics strategy and discovery infrastructure that has pinpointed some 25 disease genes in the past 24 months, many of which are pending identification.
The TGen pipeline uses Affymetrix microarrays and the VARIA algorithm from Silicon Genetics (recently acquired by Agilent Technologies) to locate and identify disease-causing mutations. TGen claims a 100-fold increase in throughput using array-based genetic scanning approaches, automated data extraction and warehousing, and automated analysis.
“This pipeline has allowed us...to scan the genomes at ultra-high density [more than 11,500 positions in each individual] in over 10,000 individuals — and identify the genetic bases of five human conditions,” says Stephan. These include a form of sudden infant death syndrome (SIDS), intractable epilepsy, and forms of mental retardation and spinocerebellar ataxia.
The genotyping strategy uses single nucleotide polymorphisms (SNPs) rather than microsatellite markers because SNPs provide superior “accuracy, informativeness, marker density, availability of analysis options, and throughput,” says Stephan. The core of the TGen pipeline is Affymetrix 10K SNP GeneChip arrays. “We’ve fully equipped the lab with the Affy hardware and find it very easy and reproducible,” says Stephan. “We did buy a Sequenom and an Illumina for the Institute, but haven’t yet worked our way into those technologies.”
DATAGENIC: TGen's Dietrich Stephan and colleagues assisted in pinpointing 25 disease genes in two years.
Setting up the pipeline required an extraordinary amount of validation for accuracy and replication quality. The quality and reproducibility of the Affymetrix arrays were checked against sequence data from various sources, yielding excellent concordance and reproducibility rates from 99.5 to 99.99 percent. Although a single SNP carries less information than a microsatellite marker, the sheer number of SNPs and the speed of SNP genotyping produce much greater information content at any point in the genome.
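For readers who want to see what a concordance check of this kind involves, here is a minimal sketch in Python. It is not TGen's validation procedure, which the article does not describe in detail. The function names, the SNP identifiers, and the genotype encoding (two-letter allele strings, with None for a no-call) are all assumptions made for illustration.

```python
from typing import Dict, Optional

Genotype = Optional[str]  # e.g. "AG"; None means no call (assumed encoding)


def normalize(genotype: str) -> str:
    """Sort the two alleles so that "AG" and "GA" compare as equal."""
    return "".join(sorted(genotype.upper()))


def concordance_rate(array_calls: Dict[str, Genotype],
                     reference_calls: Dict[str, Genotype]) -> float:
    """Fraction of SNPs with matching genotypes, among SNPs called in both sets."""
    shared = [
        snp for snp in array_calls
        if array_calls[snp] is not None and reference_calls.get(snp) is not None
    ]
    if not shared:
        raise ValueError("no SNPs were called in both data sets")
    matches = sum(
        normalize(array_calls[snp]) == normalize(reference_calls[snp])
        for snp in shared
    )
    return matches / len(shared)


# Hypothetical example data: array genotypes versus sequence-derived genotypes.
array = {"snp_1": "AG", "snp_2": "CC", "snp_3": "TT", "snp_4": "AC", "snp_5": None}
sequence = {"snp_1": "GA", "snp_2": "CC", "snp_3": "CT", "snp_4": "AC", "snp_5": "GG"}

rate = concordance_rate(array, sequence)
print(f"Concordance: {rate:.2%}")
```

In this toy example, three of the four SNPs called in both sets agree, and the no-call is excluded from the denominator. A real validation would compare many thousands of positions per sample.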
Stephan says the SNP pipeline workflow “generates two orders of magnitude increase in throughput.” With a genome-wide panel of 400 microsatellites, it would typically take the TGen group two weeks to type 384 individuals — an annual pace of 3.5 million genotypes (8,640 individuals). By contrast, a single SNP array (which takes three days to run) can produce 11,555 SNP genotypes. The TGen lab runs 1,000 arrays per week in parallel, or close to 600 million genotypes per year, with no additional call checking. A bar-coding system tracks all data through the pipeline and into an Affymetrix GCOS warehouse, before they are exported into Silicon Genetics’ VARIA algorithm for analysis.
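The throughput figures above can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the article, plus one assumption that is ours rather than TGen's: that the array lab runs 52 weeks a year.

```python
# Figures quoted in the article.
MICROSATELLITE_MARKERS = 400         # markers in the genome-wide panel
MICROSATELLITE_INDIVIDUALS = 8_640   # individuals typed per year
SNPS_PER_ARRAY = 11_555              # genotypes from one 10K SNP array
ARRAYS_PER_WEEK = 1_000              # arrays run in parallel each week

# Assumption for illustration only: the lab operates every week of the year.
WEEKS_PER_YEAR = 52


def microsatellite_genotypes_per_year() -> int:
    """Annual genotypes from the microsatellite panel."""
    return MICROSATELLITE_MARKERS * MICROSATELLITE_INDIVIDUALS


def snp_genotypes_per_year(weeks: int = WEEKS_PER_YEAR) -> int:
    """Annual genotypes from the SNP array pipeline."""
    return SNPS_PER_ARRAY * ARRAYS_PER_WEEK * weeks


def fold_increase() -> float:
    """Ratio of SNP throughput to microsatellite throughput."""
    return snp_genotypes_per_year() / microsatellite_genotypes_per_year()


print(f"Microsatellites: {microsatellite_genotypes_per_year():,} genotypes/year")
print(f"SNP arrays:      {snp_genotypes_per_year():,} genotypes/year")
print(f"Fold increase:   {fold_increase():.0f}x")
```

Under these assumptions the microsatellite pace works out to about 3.5 million genotypes a year and the SNP pace to about 600 million, a ratio of roughly 170 to 1, which is consistent with the "two orders of magnitude" Stephan describes.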
“What sets VARIA apart,” explains Stephan, “is that it works seamlessly with the Affymetrix 10K genotyping data. The algorithm is able to handle the huge number of genotype calls where other freeware fails... Additionally, the software has wonderful graphics which make visualizing pedigrees with haplotypes trivial.” Stephan’s group served as an alpha site for the software. “Working with [Silicon Genetics], we ironed out the bugs and got it working at a practical level.”
The first real test for Stephan’s team was to apply VARIA to data from a family with SIDDT (SIDS with dysgenesis of the testes). After tracing the errant gene to chromosome 6, DNA sequencing revealed a single-base insertion in a gene called TSPYL. Impressively, the linkage scanning and gene identification portion of the SIDDT project was finished in just five days. These results were published in 2004 (see “Genome Scan Yields SIDS Clue,” August 2004 Bio-IT World, page 10).
The TGen data management and analysis engine workflow is recognized as the largest linkage scanning genotyping pipeline in the country. Stephan’s division is an Affymetrix Center of Excellence in Genotyping and Resequencing. More recently, using Affymetrix 100K and 500K arrays, the group is partnering with consortia studying other diseases, including Alzheimer’s, bipolar disorder, and multiple sclerosis. It is tracking samples in such a way as to enable pharma partners to select subsets of patients with certain clinical nuances or genomic subtypes and then enroll them in prospective trials. Stephan has devoted about one-third of his lab over the past six months to the National Alliance for Autism Research’s Autism Genome Project, performing linkage scans on more than 7,500 DNA samples from 1,500 families.