By Kevin Davies
March 1, 2008 | Next month marks the fifth anniversary of the official completion of the Human Genome Project, but bioinformaticians are still grappling with a pesky problem: just how many protein-coding genes exist in the human genome?
For the past few years, the conventional wisdom has pegged the total at about 25,000. But recent work from Michele Clamp, Eric Lander, and colleagues at the Broad Institute points to a much lower total — a mere 20,500 (give or take).
In a genome as large as the human genome, many spurious open reading frames can occur that appear to code for proteins (many are even transcribed into messenger RNA) yet do not. Clamp and colleagues suspected that a sizeable fraction of the roughly 24,500 gene sequences listed in databases such as Ensembl are spurious. For example, only 20,000 show matches in the dog genome, which Lander’s team sequenced in 2005. The key question: are the novel human sequences functional genes that simply lack evolutionary counterparts in other mammalian species, or artifacts?
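To see why spurious ORFs are so common, consider a toy scanner (an illustration only, not the Broad pipeline): it simply looks for an ATG followed by a long run of codons before a stop, which is all an "apparent gene" needs to be.

```python
def find_orfs(seq, min_codons=100):
    """Return (start, end) spans of forward-strand open reading frames:
    an ATG followed by at least min_codons codons before a stop codon
    (or the end of the sequence)."""
    stops = ("TAA", "TAG", "TGA")
    orfs = []
    for frame in range(3):          # scan all three forward reading frames
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j + 3] not in stops:
                    j += 3          # extend codon by codon until a stop
                if (j - i) // 3 >= min_codons:
                    orfs.append((i, j))
                i = j               # resume scanning after this ORF
            else:
                i += 3
    return orfs
```

Any stretch of sequence that happens to avoid stop codons for long enough passes this test, which is why length alone cannot distinguish real genes from chance ORFs.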
Clamp built a computational pipeline that used BLASTZ to compare nearly 22,000 putative genes from Ensembl with the high-quality sequences of the mouse and dog genomes. As expected, the majority of human genes (18,752) had corresponding genes, or orthologs, in either the mouse or dog genome. After the first round of analysis, 1,285 “orphan” genes remained, with no discernible relatives in the other genomes. That figure was revised to 1,177 after removal of artifacts and other adjustments, leaving genes “lack[ing] orthology, paralogy, or homology to known genes.”
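The classification step of such a pipeline amounts to simple bookkeeping over cross-species matches. A minimal sketch (with hypothetical gene IDs; the ortholog maps would come from the upstream alignments):

```python
def classify_genes(human_genes, mouse_orthologs, dog_orthologs):
    """Split putative human genes into those with a cross-species match
    and 'orphans' with no confident match in either genome.

    mouse_orthologs / dog_orthologs map a human gene ID to its best
    alignment match; absent or None means no confident match was found."""
    with_ortholog, orphans = [], []
    for gene in human_genes:
        if mouse_orthologs.get(gene) or dog_orthologs.get(gene):
            with_ortholog.append(gene)
        else:
            orphans.append(gene)
    return with_ortholog, orphans
```

The orphans produced by a step like this are exactly the candidates that then need closer scrutiny, since "no match found" could mean either a lineage-specific gene or an annotation artifact.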
The next step was to characterize the orphans to see whether they resembled genuine protein-coding genes or spurious open reading frames. The Broad group found, “The set of orphans shows no tendency whatsoever to conserve reading frame,” strongly suggesting they are not genuine protein-coding genes.
That still left the alternative: that the orphans are real genes, either relatively novel (arising after the divergence from dog and mouse) or ancestral (but subsequently lost in dog and mouse). So Clamp’s group compared the orphans to the genomes of two primate relatives, chimpanzee and macaque. The results were categorically negative.
“If the orphans represent valid human protein-coding genes, we would have to conclude that the vast majority of the orphans were born after the divergence from chimpanzee. Such a model would require a prodigious rate of gene birth in mammalian lineages and a ferocious rate of gene death erasing the huge number of genes born before the divergence from chimpanzee. We reject such a model as wholly implausible,” Clamp et al. wrote.
In short, a reappraisal of Ensembl, Vega, and RefSeq produces a revised gene tally of 20,488 (of 24,551 entries).
In another promising piece of computational biology, Stanford University’s Samuel Gross, Serafim Batzoglou, and coworkers described CONTRAST (Conditionally Trained Search for Transcripts), a gene-prediction program built on a novel training algorithm. Because it can draw on information from multiple genomes, or informants, the authors claim that CONTRAST outperforms earlier de novo gene predictors such as TWINSCAN and N-SCAN.
According to the team, CONTRAST predicted coding region structures for 65% more human genes than other programs, while it missed 46% fewer exons. The authors believe their methods will “expedite efforts to verify the complete set of protein coding genes in human and other organisms experimentally.”
Clamp, M. et al. 2007. Distinguishing protein-coding and noncoding genes in the human genome. PNAS 104, 19428-19433.
Gross, S.S. et al. 2007. CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol. 8, R269.
This article appeared in Bio-IT World Magazine.