Prediction Method Drops Human Gene Total



By Kevin Davies

March 1, 2008 | Next month marks the fifth anniversary of the official completion of the Human Genome Project, but bioinformaticians are still grappling with a pesky problem: just how many protein-coding genes exist in the human genome?

For the past few years, the conventional wisdom has pegged the total at about 25,000. But recent work from Michele Clamp, Eric Lander, and colleagues at the Broad Institute points to a much lower total — a mere 20,500 (give or take).

In a genome as large as the human genome, many spurious open reading frames occur that appear to code for proteins (many are even transcribed into messenger RNAs) but in fact do not. Clamp and colleagues suspected that a sizeable fraction of the roughly 24,500 gene sequences listed in databases such as Ensembl fall into this category. For example, only 20,000 show matches in the dog genome, which Lander’s team sequenced in 2005. The key question is whether the remaining human sequences are functional, novel genes that lack evolutionary similarity with other mammalian species, or artifacts.

Clamp built a computational pipeline in which she used BLASTZ to compare nearly 22,000 putative genes from Ensembl with the high-quality sequences of the mouse and dog genomes. As expected, the majority of human genes (18,752) had corresponding genes, or orthologs, in either the mouse or dog genome. After the first round of analysis, 1,285 “orphan” genes remained, with no discernible relatives in the other genomes. That figure was revised to 1,177 after removal of artifacts and other adjustments, leaving a set of genes “lack[ing] orthology, paralogy, or homology to known genes.”
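The ortholog/orphan split described above can be pictured as a simple set operation. The following is a minimal sketch in Python, not the Broad pipeline itself: the gene identifiers and hit sets are hypothetical toy data, and the function name is invented for illustration. A real classification would rest on BLASTZ alignments rather than plain set membership.

```python
def split_orthologs_and_orphans(human_genes, mouse_hits, dog_hits):
    """Partition genes by whether they match either informant genome.

    All three arguments are sets of gene identifiers (hypothetical inputs;
    in practice the hit sets would be derived from genome alignments).
    """
    # A gene counts as having an ortholog if it matches mouse OR dog.
    with_ortholog = {g for g in human_genes if g in mouse_hits or g in dog_hits}
    # Everything left over is an "orphan" with no discernible relative.
    orphans = set(human_genes) - with_ortholog
    return with_ortholog, orphans


# Toy identifiers, assumed purely for illustration.
human_genes = {"geneA", "geneB", "geneC", "geneD"}
mouse_hits = {"geneA", "geneB"}
dog_hits = {"geneB", "geneC"}

with_ortholog, orphans = split_orthologs_and_orphans(human_genes, mouse_hits, dog_hits)
print(sorted(with_ortholog))  # ['geneA', 'geneB', 'geneC']
print(sorted(orphans))        # ['geneD']
```

In this toy case, geneD plays the role of one of the 1,285 first-round orphans: it has no match in either comparison genome.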

The next step was to characterize the orphans to see whether they resembled genuine protein-coding genes or spurious open reading frames. The Broad group found that “the set of orphans shows no tendency whatsoever to conserve reading frame,” which strongly suggests they are not genuine protein-coding genes.
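To see why reading frame is a useful signal, consider how insertions and deletions behave in real coding sequence: they tend to come in multiples of three, so the codons downstream stay in register. The Python sketch below scores a pairwise alignment by the fraction of aligned bases that remain in the same frame. It is a simplified illustration under stated assumptions (the function name, the gap character `-`, and the scoring rule are all chosen here for clarity) and should not be read as the exact measure used by Clamp et al.

```python
def reading_frame_conservation(human_aln, other_aln):
    """Fraction of aligned bases whose reading frame is preserved.

    Both inputs are equal-length aligned strings using '-' for gaps.
    Assumes the alignment starts with both sequences in the same frame.
    """
    if len(human_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")

    h_pos = o_pos = 0          # ungapped positions consumed so far
    aligned = in_frame = 0
    for h, o in zip(human_aln, other_aln):
        if h != "-" and o != "-":
            aligned += 1
            # Same frame when the positional offset is a multiple of three.
            if (h_pos - o_pos) % 3 == 0:
                in_frame += 1
        if h != "-":
            h_pos += 1
        if o != "-":
            o_pos += 1
    return in_frame / aligned if aligned else 0.0


human = "ATGGCCAAATGA"
# A 3-base deletion keeps the downstream codons in frame.
codon_sized_gap = reading_frame_conservation(human, "ATGGCC---TGA")
# A 1-base deletion shifts every downstream base out of frame.
frameshift_gap = reading_frame_conservation(human, "ATGGCC-AATGA")
print(codon_sized_gap)            # 1.0
print(round(frameshift_gap, 3))   # 0.545
```

Under this toy measure, genuine coding genes would be expected to score near 1.0 against their orthologs, while sequences that accumulate indels of arbitrary length score much lower.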

But that left the alternative, namely that the orphans are real genes that are either relatively novel (arising after divergence from dog and mouse) or more ancient (but since lost in dog and mouse). So Clamp’s group compared the orphans to the genomes of two primate relatives, chimpanzee and macaque. The results were categorically negative.

“If the orphans represent valid human protein-coding genes, we would have to conclude that the vast majority of the orphans were born after the divergence from chimpanzee. Such a model would require a prodigious rate of gene birth in mammalian lineages and a ferocious rate of gene death erasing the huge number of genes born before the divergence from chimpanzee. We reject such a model as wholly implausible,” Clamp et al. wrote.

In short, a reappraisal of Ensembl, Vega, and RefSeq produces a revised gene tally of 20,488 (of 24,551 entries).

By CONTRAST
In another promising piece of computational biology, Stanford University’s Samuel Gross, Serafim Batzoglou, and coworkers described CONTRAST (Conditionally Trained Search for Transcripts), a gene prediction program that uses a novel training algorithm. Because CONTRAST can use information from multiple genomes, or informants, the authors claim that its performance exceeds that of earlier de novo gene predictors such as TWINSCAN and N-SCAN.

According to the team, CONTRAST predicted coding region structures for 65% more human genes than other programs, while it missed 46% fewer exons. The authors believe their methods will “expedite efforts to verify the complete set of protein coding genes in human and other organisms experimentally.” 

FURTHER READING:

Clamp, M. et al. 2007. Distinguishing protein-coding and noncoding genes in the human genome. PNAS 104, 19428-19433.

Gross, S.S. et al. 2007. CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol. 8:R269.

 

 This article appeared in Bio-IT World Magazine.