GUEST COMMENTARY
Dec. 10, 2002 | THE SEQUENCING of the human genome is essentially complete, but scientists are still puzzling over the final tally of human genes. Down from an initial estimate of some 100,000 genes, provisional analyses of the human sequence reveal a mere 30,000 to 40,000 genes: fewer than twice the number in the nematode worm Caenorhabditis elegans, and just 5,000 to 15,000 more than in the mustard weed Arabidopsis thaliana. Assuming we are fractionally more sophisticated than such lowly organisms, how do we account for our complexity? The explanation appears to lie in alternative splicing.
Although discovered in 1980, alternative splicing was considered very rare until just a few years ago. Indeed, researchers generally agreed that no more than 5 percent of human genes undergo alternative splicing. A major breakthrough came when scientists realized that the information from sequencing projects could be used to predict alternative splicing computationally. With multiple-alignment algorithms, messenger RNA (mRNA) sequences and expressed sequence tags (ESTs, which represent fragments of mRNA) can be aligned and compared to identify alternative splicing events. Since then, numerous independent studies have shown that 35 percent to 59 percent of all human genes undergo alternative splicing, with an average of about three splice variants per gene.
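To make the idea of alignment-based prediction concrete, here is a minimal sketch of one kind of event, exon skipping. It assumes the mRNAs and ESTs have already been aligned to the genome and reduced to lists of exon coordinates; the names `introns` and `skipped_exons` and the toy coordinates are illustrative assumptions, not taken from any published pipeline.

```python
from typing import List, Tuple

# An exon as (start, end) genomic coordinates, end exclusive (assumed convention).
Exon = Tuple[int, int]


def introns(exons: List[Exon]) -> List[Exon]:
    """Return the gaps between consecutive exons of one aligned transcript."""
    ordered = sorted(exons)
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def skipped_exons(transcripts: List[List[Exon]]) -> List[Exon]:
    """Exons used by one transcript that lie wholly inside an intron of another."""
    all_exons = {exon for t in transcripts for exon in t}
    all_introns = {intron for t in transcripts for intron in introns(t)}
    return sorted(
        (start, end)
        for (start, end) in all_exons
        if any(i_start <= start and end <= i_end for (i_start, i_end) in all_introns)
    )


# Toy data: a full-length mRNA with three exons, and an EST that joins
# the first exon directly to the third.
mrna = [(100, 200), (300, 400), (500, 600)]
est = [(100, 200), (500, 600)]
print(skipped_exons([mrna, est]))
```

In this toy case the middle exon is reported as alternatively spliced, because the EST's single intron spans it. A real analysis would also have to tolerate alignment noise and handle other event types, such as alternative splice sites and retained introns.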
The notion that about half of our genes have more than one splice variant requires that the old dogma of "one gene, one protein" be replaced by "one gene, multiple proteins." Understanding the genome is no longer sufficient for understanding the proteome, the full complement of proteins in an organism. We now have to consider a new concept, the "transcriptome," or the total complement of spliced mRNAs.
Scientists are beginning to realize the implications of alternative splicing for large-scale biological studies. For example, comprehensive knowledge of the transcriptome is essential to the design and interpretation of DNA microarray experiments (DNA chips), even when alternative splicing is not specifically studied. With knowledge of alternative splicing, probes could be designed to measure expression of specific splice variants, or measure all the possible variants of a gene, as desired. And in the search for novel drug targets, previously unknown splice variants of known genes could serve as new drug targets.
Current predictions for the size of the transcriptome range from 70,000 to more than 100,000 distinct transcripts, with little sign of consensus. One problem is that computational studies predicting alternative splicing depend almost entirely on EST alignments.
ESTs, pioneered by J. Craig Venter's team in the early 1990s, enabled large-scale predictions of alternative splicing but are a problematic source of information. EST databases are flooded with artifacts that complicate accurate predictions. Some artifacts, such as bacterial contamination, sequence errors, and chimeric sequences, can be identified and removed fairly easily; others, like genomic DNA contamination, are more difficult to find.
And even if all ESTs were clean, it would still not mean that all the predicted splice variants have biological importance. Like other complex biological machineries, the cellular splicing machinery, known as the spliceosome, may also make mistakes. The protein-synthesizing ribosome, for example, incorporates an incorrect amino acid about once in every 10,000 amino acids translated. If the spliceosome had a similar error rate per intron, and assuming that every gene contains an average of 10 introns (non-coding sequences spliced out of the final mRNA molecule), there would be a splicing mistake in roughly one out of every 1,000 spliced mRNAs. The products of such splicing errors would appear in the cell as mature mRNAs, but would not be functional molecules, and would probably be degraded. If sequenced as ESTs, however, they would be indistinguishable from genuine, rare splice variants.
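The one-in-1,000 figure can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch under the assumptions stated in the text (an error rate of 1 in 10,000 per intron, 10 introns per gene) plus one added assumption: that errors at different introns are independent. The function name is illustrative.

```python
def fraction_misspliced(per_intron_error: float, introns_per_gene: int) -> float:
    """Probability that at least one intron of a transcript is spliced incorrectly.

    Assumes each intron is an independent trial with the same error rate.
    """
    all_correct = (1.0 - per_intron_error) ** introns_per_gene
    return 1.0 - all_correct


p = fraction_misspliced(per_intron_error=1 / 10_000, introns_per_gene=10)
print(f"fraction of mRNAs with a splicing error: {p:.6f}")
print(f"about one in {round(1 / p)} mRNAs")
```

Because the per-intron rate is so small, the exact result is almost identical to the simple product of 10 introns and a 1-in-10,000 error rate.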
Thus, alternative splicing validation requires other methods to determine which predicted splice variants are functional and biologically important. To understand the size of the transcriptome, large-scale laboratory experiments are essential. In these studies, probes are designed to match exon (coding sequence) boundaries and report on specific expression of alternative splicing events. These studies, however, are expensive and time-consuming.
Analysis of the recently deposited mouse genome sequence, along with more than 2.5 million publicly available mouse ESTs, might be the perfect solution for cost-efficient, computation-based verification of alternative splicing. In evolutionary biology, conservation implies functional importance. A splice variant supported by both human and mouse ESTs is therefore very likely to be functional, even without laboratory validation.
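In computational terms, this kind of cross-species check amounts to keeping only the events seen in both organisms. The sketch below assumes that splicing events from each species have already been mapped to a shared key, here a hypothetical pair of ortholog-group name and skipped-exon number; producing that mapping is the hard part and is not shown. All identifiers and data are made up for illustration.

```python
from typing import Set, Tuple

# (ortholog group, index of the skipped exon); a hypothetical shared key.
Event = Tuple[str, int]


def conserved_events(human: Set[Event], mouse: Set[Event]) -> Set[Event]:
    """Events supported by ESTs from both species."""
    return human & mouse


# Toy data with invented gene names.
human_events = {("geneA", 3), ("geneA", 7), ("geneB", 2)}
mouse_events = {("geneA", 3), ("geneC", 5)}

print(sorted(conserved_events(human_events, mouse_events)))
```

Events found in only one species are not necessarily artifacts; they may be species-specific, or simply missing from the other species' EST collection.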
The mouse genomic sequence might also be used directly for verification of alternative splicing. Preliminary comparative analyses of the human and mouse genomic sequences reveal an average sequence conservation of 85 percent between coding sequences, but only 35 percent between introns. If aligned to the mouse genome, a suspected alternatively spliced exon could be evaluated simply by its level of conservation between human and mouse: a genuine exon should look far more conserved than the intronic sequence around it.
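A minimal sketch of such a conservation test follows. It assumes the human exon and its mouse counterpart have already been aligned, with "-" marking gaps. The 60 percent cutoff is an arbitrary illustrative value chosen between the 85 percent and 35 percent averages quoted above, not a published threshold, and the sequences are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns where both sequences carry the same base.

    Columns with a gap in one sequence count as mismatches; columns with
    gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def looks_exonic(identity: float, cutoff: float = 60.0) -> bool:
    """Crude call: conservation above the (assumed) cutoff suggests a real exon."""
    return identity >= cutoff


# Toy alignment of a candidate exon, human versus mouse.
human_exon = "ATGGCCAAGT"
mouse_exon = "ATGGCTAAGT"
identity = percent_identity(human_exon, mouse_exon)
print(identity, looks_exonic(identity))
```

A single cutoff is a simplification. Individual exons and introns vary widely around the averages, so a practical test would compare each candidate exon with its own flanking introns rather than with a genome-wide figure.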
Comparative analyses between genomic sequences of different organisms — "comparative genomics" — are highly valuable for gene prediction and many other experimental fields. The millions of mouse ESTs and the complete sequence of the mouse genome will allow, for the first time, large-scale "comparative transcriptomics" studies. With these, we may finally pin down the size of the transcriptome. From there, it is only a short way to the ultimate goal — the proteome.
Rotem Sorek is a researcher at Compugen Ltd. He can be reached at firstname.lastname@example.org.