By Kevin Davies
June 12, 2002 | Much of the fanfare that greeted publication of the human genome sequence last year focused on the curious discovery that we carry only one-third of the previously suspected total number of genes. Although compiling the gene inventory remains an important priority, an equally important quest — to catalog the common discrete variations, or SNPs (single nucleotide polymorphisms), between unrelated individuals — is being mounted in both academic and biopharma centers.
By understanding the scope and pattern of genomic variation in control and diseased populations, scientists hope to pinpoint key genetic variations that shape susceptibility (or resistance) to common ailments. Such work has been underway for several years, but with the exception of a few Herculean efforts, notably in type II diabetes and Crohn's disease, progress has been slow.
The reported genome sequences are typically considered a consensus sequence, but in fact they are a composite of perhaps 15 or 20 different individuals. Celera Genomics Group used DNA from five donors of different ethnic backgrounds in its sequencing efforts, including, we now know, that of former Celera president J. Craig Venter. The public consortium data probably includes sequence from a dozen or more individuals who contributed DNA for various genomic libraries used over the past decade.
R. Judson, B. Salisbury, J. Schneider, A. Windemuth, & J.C. Stephens. "How many SNPs does a genome-wide haplotype map require?" Pharmacogenomics: 337-392 (2002).
J.C. Stephens, J.A. Schneider, D.A. Tanguay et al. "Haplotype variation and linkage disequilibrium in 313 human genes." Science 293: 489-493 (2001).
N. Patil, A.J. Berno, D.A. Hinds et al. "Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21." Science 294: 1719-1723 (2001).
M.J. Daly, J.D. Rioux, S.F. Schaffner, T.J. Hudson, & E.S. Lander. "High-resolution haplotype structure in the human genome." Nature Genetics 29: 229-232 (2002).
The level of genetic variation between any two unrelated individuals has been estimated to be 1 in 600 bases, for a total of some 5.3 million common SNPs in the human genome. Rapid progress is being made in cataloging these polymorphisms, notably through an organization called the SNP Consortium, a loose nonprofit alliance of 11 big pharmas, IBM, Motorola, and the Wellcome Trust. These SNPs are stored in public databases such as dbSNP (National Institutes of Health) and HGBase (Karolinska Institute/European Bioinformatics Institute).
But even a comprehensive catalog of SNPs is of limited usefulness for finding disease associations without information on their frequency and inheritance patterns. Throughout the genome, certain stretches of DNA, defined by contiguous groups of SNPs, tend to be inherited in blocks of varying lengths rather than randomly. In such instances, the typing of just one SNP could define a haplotype of 10 or more polymorphisms. Moreover, many SNPs lie in vast stretches of junk DNA, and therefore might be expected to have little or no physiological significance.
Hail the HapMap
Over the past year, several important papers have appeared on the frequency of variation in the human genome, providing estimates of the total number of SNPs required to saturate the genome and fuel the search for the genetic factors that govern susceptibility to common diseases.
Writing in the May issue of Pharmacogenomics, a team from New Haven, Conn.-based Genaissance Pharmaceuticals Inc., led by Richard Judson and J. Claiborne Stephens, presents a detailed comparison of three of these studies, and offers guidelines for the development of a haplotype map. In addition to extending their own dataset, published last year in Science, the Genaissance team considers a study from Patil and colleagues from Perlegen Sciences Inc., an offshoot of Affymetrix Inc., cofounded by Stanford University geneticist David Cox. The third study included comes from Mark Daly, Eric Lander, and colleagues at the Whitehead Institute Center for Genome Research.
The Genaissance group expanded on its report in Science last year, which examined 313 genes, by looking for variations in the sequence (coding and some flanking non-coding) of 1,274 known genes. By comparing DNA from 82 unrelated individuals of different backgrounds, the group found 11,339 SNPs (with a minor variant frequency of at least 1 percent), or about 9 SNPs per gene. From this, Judson et al. estimate that a gene-based haplotype map would require some 138,000 SNPs, and a complete genome-wide haplotype map would necessitate on the order of 600,000 SNPs. This is in the same neighborhood as the 500,000 first predicted by Leonid Kruglyak, a Howard Hughes Medical Institute associate investigator at the Fred Hutchinson Cancer Research Center, in a widely discussed paper in Nature Genetics a few years ago.
The Perlegen study was an exhaustive high-density microarray survey of the coding regions of chromosome 21, about two-thirds of the sequence, a task that required the synthesis of more than 3 billion oligonucleotides and 160 Affymetrix wafers. Using a pattern recognition algorithm, the study identified close to 36,000 SNPs among 20 individual copies of the chromosome, about 24,000 of which carried the minor variant more than once. Only about 4,500 of these SNPs are required to define the common haplotypes on the chromosome. Extrapolated across the entire genome, the estimate comes to 423,000 SNPs, with 245,000 required for a gene-based map.
The Whitehead study by Daly et al. examined a 500,000-base region of chromosome 5 in 516 samples. Thirty SNPs were required to define the haplotypes in this short region, which works out to 180,000 SNPs for the genome. This figure represents a lower estimate, because unlike the previous two studies, the sampling population was exclusively of European descent. There is considerable genetic variation between different ethnic groups, such that more SNPs would be required to adequately sample a heterogeneous population.
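The Whitehead extrapolation can be reproduced with simple proportional scaling. The sketch below is illustrative only: the genome length of 3 billion bases is an assumption (the article does not state which figure was used), and the function name `extrapolate_snps` is hypothetical.

```python
# Back-of-envelope scaling of a regional SNP count to the whole genome.
# ASSUMPTION: a genome length of about 3 billion bases; the article
# does not say which value underlies its figures.
GENOME_BASES = 3_000_000_000


def extrapolate_snps(snps_in_region: int, region_bases: int,
                     genome_bases: int = GENOME_BASES) -> int:
    """Scale a per-region SNP count linearly to the full genome."""
    return round(snps_in_region * genome_bases / region_bases)


# 30 SNPs defined the haplotypes in a 500,000-base region of chromosome 5.
whitehead_estimate = extrapolate_snps(snps_in_region=30, region_bases=500_000)
print(whitehead_estimate)  # 180000
```

Linear scaling treats the chromosome 5 region as typical of the whole genome, which is one more reason to read the result as a rough figure.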
The results of the Genaissance team's survey can be summed up as follows: Sampling the entire genome would require somewhere between 180,000 and 600,000 SNPs. If the sampling is confined to the 30,000 or so genes (and their flanking areas), the estimate falls to between 138,000 and 245,000. Reasons for the discrepancy include differences in sample size and heterogeneity, and the amount of flanking sequence incorporated in the SNP search. (The Perlegen study included almost twice as much flanking DNA per gene as Genaissance's.) There were also differences in the heuristic algorithms used in the three studies to select the minimal number of SNPs. As Judson and colleagues point out, "an important analytical challenge is to find the SNP selection algorithm that maximizes the total relevant information from a survey while minimizing the overall cost."
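To make the idea of a SNP selection heuristic concrete, here is a minimal sketch of one generic greedy approach. It is not the algorithm used in any of the three studies, whose details are not given here. The toy haplotypes and the function name `greedy_tag_snps` are invented for illustration: each string is one chromosome copy, and each character is the allele (0 for common, 1 for rare) at one SNP.

```python
from itertools import combinations


def greedy_tag_snps(haplotypes):
    """Choose SNP positions until every pair of distinct haplotypes
    differs at one or more of the chosen positions (greedy heuristic)."""
    distinct = sorted(set(haplotypes))
    n_snps = len(distinct[0])
    # Pairs of distinct haplotypes not yet told apart by the chosen SNPs.
    unresolved = set(combinations(range(len(distinct)), 2))
    chosen = []

    def pairs_split(snp):
        """Unresolved pairs whose alleles differ at this SNP."""
        return {(i, j) for (i, j) in unresolved
                if distinct[i][snp] != distinct[j][snp]}

    while unresolved:
        # Take the SNP that separates the most still-unresolved pairs.
        best = max(range(n_snps), key=lambda s: len(pairs_split(s)))
        chosen.append(best)
        unresolved -= pairs_split(best)
    return sorted(chosen)


# Hypothetical data: six chromosome copies typed at five SNPs.
toy_haplotypes = ["00000", "00000", "11100", "11100", "11111", "00011"]
tags = greedy_tag_snps(toy_haplotypes)
print(tags)  # [0, 3]
```

In this toy block, two of the five SNPs are enough to tell all four distinct haplotypes apart. A greedy choice like this is fast but not guaranteed to find the smallest possible set, and different heuristics can return different counts from the same data.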
The Genaissance group also discusses various scenarios and costs for producing a genome-wide haplotype map. Genotyping 2 million existing SNPs, for example, in an ethnically diverse panel of 96 chromosomes, would cost about $68 million, assuming a cost of $1 per SNP genotype. This could come down as genotyping costs fall (see "Genotyping Prices Plummet," page 34). The authors acknowledge efforts at several other companies, including Illumina, Motorola, Orchid, and Sequenom, that should help drive down costs further still, putting a $1,000 genome scan within reach. The National Institutes of Health has taken note, and recently issued a request for applications for the construction of the so-called "HapMap."
Blocking pattern: Haplotype patterns for 147 SNPs among 20 unrelated copies of chromosome 21, as recorded by researchers at Perlegen. The SNPs are stretched along about 100,000 bases of the chromosome; the blue boxes represent the common variant, yellow the less frequent base. The right side shows a block of 26 contiguous SNPs spanning 19,000 bases; different patterns of just two SNPs in this region define 80 percent of the haplotypes.
REPRINTED WITH PERMISSION FROM PATIL ET AL., SCIENCE 294, 1719 (2001). (C) 2001 AMERICAN ASSOCIATION FOR THE ADVANCEMENT OF SCIENCE