"Over the past twenty-five years, a mere sliver of recorded time, the world of biology, and indeed the world in general, has been transformed by the technical tools of a field now known as genomics ... The amount of factual knowledge has expanded so precipitously that all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information."
It may seem surprising that a peer-reviewed journal, one aimed squarely at geneticists no less, should deem it necessary to publish a "how to" guide for the genome sequence. But as co-authors Andy Baxevanis and Francis Collins from the National Human Genome Research Institute (NHGRI) point out, the degree of bioinformatics literacy in the research community is shockingly low. According to a recent survey conducted on behalf of the Wellcome Trust, only 50 percent of the biomedical researchers actually accessing genome databases are able to extract the information they require. For many biologists, the human genome remains terra incognita, an unacceptable situation given the significant resources invested to compile the sequence.
The Nature Genetics guide, which is freely available online, should provide much-needed practical advice on how to access, use, and interpret the sequence. The guide was written by Baxevanis, Collins, and three NHGRI colleagues: Tyra Wolfsberg, Kris Wetterstrand, and Mark Guyer. Presented as a series of 13 commonly asked questions, the guide illustrates how information can be extracted from the three major portals for genome sequence information: the Golden Path at the University of California at Santa Cruz (the first publicly available genome browser); the National Center for Biotechnology Information (NCBI); and the European Bioinformatics Institute (EBI) in Hinxton, England.
The Genome in View
The public genome-sequencing consortium has agreed that each portal should display the results of a single genome assembly, the one produced by the NCBI. Each of the three major portals presents data from this assembly but offers different tools and capabilities depending on the question at hand. The Map Viewer portal at NCBI is also linked to other important databases including GenBank, OMIM (the online catalogue of genetic disorders and loci compiled by Johns Hopkins University's Victor McKusick), and PubMed (Medline). The UCSC Genome Browser contains multiple tracks aligned in parallel, allowing researchers to zoom in or out, viewing data at resolutions ranging from chromosome to gene to sequence. Ensembl, produced by the Wellcome Trust Sanger Institute and the EBI, contains data on the mouse, mosquito, and zebrafish genomes in addition to the human genome.
The guide offers simple, worked examples showing how data on sequence, genes, and polymorphisms can be extracted from the various portals and put to use. Each of the sites has particular strengths. For example, if you wish to catalog all the SNPs (single nucleotide polymorphisms) in a certain stretch of the genome, then dbSNP at the NCBI is the optimal choice. Ensembl provides a facile comparative view of human and mouse DNA sequence. The UCSC browser is customizable, allowing users to change display parameters and annotate the data, for example with a text box.
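To make the SNP example concrete, here is a minimal Python sketch of what "cataloging all the SNPs in a certain stretch of the genome" amounts to once the records are in hand. It does not query dbSNP or any other portal. The `Snp` record type, the `snps_in_region` helper, and the records themselves are hypothetical, invented purely for illustration, and the coordinates are not real variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    """A hypothetical, simplified SNP record (not the dbSNP schema)."""
    snp_id: str       # placeholder identifier, invented for this sketch
    chromosome: str
    position: int     # 1-based coordinate on the chromosome


def snps_in_region(snps, chromosome, start, end):
    """Return SNPs on `chromosome` with start <= position <= end, sorted by position."""
    hits = [
        s for s in snps
        if s.chromosome == chromosome and start <= s.position <= end
    ]
    return sorted(hits, key=lambda s: s.position)


# Made-up records for illustration only; they do not describe real variants.
EXAMPLE_SNPS = [
    Snp("snp_c", "chr7", 9_900),
    Snp("snp_a", "chr7", 1_500),
    Snp("snp_d", "chr12", 2_000),
    Snp("snp_b", "chr7", 2_750),
]

if __name__ == "__main__":
    for snp in snps_in_region(EXAMPLE_SNPS, "chr7", 1_000, 3_000):
        print(snp.snp_id, snp.chromosome, snp.position)
```

In practice the records would come from one of the portals rather than a hard-coded list; the filtering step, restricting by chromosome and by an inclusive coordinate range, stays the same.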
In keeping with the public consortium's dogged determination to make the genome sequence publicly available, the guide is also freely available online. As Baxevanis and Collins state, "For the full benefits of this dramatic revolution to be felt, all scientists on the planet must be empowered to use these powerful databases to unravel longstanding scientific mysteries."
Nature Genetics, Vol. 32 supplement, pp 1-79 (September 2002).
But answering those questions, which means inferring function from sequence, is a notoriously hazardous exercise, particularly as alternative splicing can result in the production of several functionally distinct proteins from the same gene sequence. For example, phosphoglucose isomerase exists in four distinct forms with functions ranging from glucose metabolism and neuronal survival to cell differentiation and metastasis. The liver enzyme alpha-enolase is produced from the same gene as tau-crystallin, a structural protein in the eye lens, the difference being that the genetic code is read in a different phase.
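The idea of reading the code "in a different phase" can be illustrated with a short Python sketch: the same string of bases splits into entirely different three-letter codons depending on where reading begins. The helper `codons_in_frame` and the sequence are invented for this illustration. The sketch shows only the general notion of a reading frame and says nothing about the biology of the genes mentioned above.

```python
def codons_in_frame(sequence, frame):
    """Split `sequence` into complete three-letter codons, starting at offset `frame`.

    `frame` must be 0, 1, or 2. Trailing bases that do not fill a codon are dropped.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    usable = len(sequence) - frame          # bases available from the offset onward
    end = frame + usable - usable % 3       # stop before any incomplete codon
    return [sequence[i:i + 3] for i in range(frame, end, 3)]


# An invented 15-base sequence, used only to show the effect of the offset.
EXAMPLE_SEQUENCE = "ATGGCCATTGTAATG"

if __name__ == "__main__":
    for frame in (0, 1, 2):
        print(frame, " ".join(codons_in_frame(EXAMPLE_SEQUENCE, frame)))
    # 0 ATG GCC ATT GTA ATG
    # 1 TGG CCA TTG TAA
    # 2 GGC CAT TGT AAT
```

Because each codon is interpreted as a unit, shifting the starting point by a single base changes every codon downstream, which is one reason sequence alone can be a treacherous guide to function.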
Just as the genome sequence is only about 90 percent complete right now, this guide is not the last word. Expect an updated version of the guide in April 2003 — the same time that the human sequence will be declared officially (99.99 percent) "complete."