By Kevin Davies
May 12, 2008 | Any conference that brings together the likes of Nobel laureates Sydney Brenner and Rich Roberts, Craig Venter, and Francis Collins is worth a look. When it marks the 25th anniversary of a resource as valuable as GenBank, it proves irresistible*.
"It's hard to imagine where we would be without the dedication of GenBank," said Collins. GenBank was created in 1982, and moved to its present home at the National Center for Biotechnology Information (NCBI) ten years later. Exchanging data on a daily basis with the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ), GenBank is a critical, yet largely overlooked, component of the biological enterprise.
A procession of speakers paid tribute to the tireless work of NCBI director David Lipman, James Ostell, and their colleagues. Remarkably, the number of staff overseeing GenBank has hardly changed over the past 15 years, despite the exponential growth in data. Following Moore's Law, the volume of data doubles every 18 months. Today, GenBank contains more than 110 million sequences and 200 billion bases from 260,000 organisms.
Craig Venter is one of GenBank's most prolific depositors. He essentially launched the expressed sequence tag (EST) database when he catalogued the first 331 back in 1991. By 2001, there were some 10 million. Today, that number is close to 51 million, including more than 8 million human sequences.
Venter - who quipped that his institute was selling its old ABI capillary sequencing instruments for a bargain $50,000 - said that both his effort and the international genome sequencing consortium "were dramatically flawed. We thought they'd have the same sets of genes, and only 0.1% difference." That proved to be a stark underestimate. Venter also ruefully remembered that Celera sequenced DNA from five individuals (including himself). "Had we sequenced from one individual, we'd have gotten the right answer" about genome variation, he said.
Sydney Brenner said there is an information crisis in biology that needs to be solved. "Data is not enough - we have to convert the data into knowledge. You get credit for collecting data, credit for distributing data, but nobody gets credit for organizing data. The task we have now, I think, is not to lose this tremendous capital that has been accumulated, which will be forgotten."
Brenner said it was "a scandal" that most research papers don't cite papers further back than the mid-1990s. "We must not lose all the information that has been gathered," said Brenner, before adding mischievously, "We could get grants to rediscover it all again." His goal, he said, was to "turn GenBank from a bank into an organization where [scientists] can make withdrawals with interest."
"Most biology today is low input, high throughput, no output biology," said Brenner. "The idea [that] we'll dissect [cellular] complexity by making lots of measurements is bound to fail... Everyone's hoping for a magic computer program - experimental data, pharmacogenomics data, the whole lot - and it will come out with the answer. That's a vague hope. Because I have to tell you, computers are incredibly stupid! It's better to combine human intelligence with artificial stupidity than the other way around."
Collins recalled assembling the longest gene region on paper - 40,000 bases - back in 1984. He paid his then 14-year-old daughter $2/hour for proofreading. Having led the assembly of the human genome, he offered a few "notes from the frontlines." Collins cited comparative genomics, the Cancer Genome Atlas, and the 1000 Genomes and ENCODE projects as exciting areas. "We may get to the $1000 genome much sooner than the 7-8 years that people have been predicting," he said. No kidding!
Progress in mapping genes for common disease has made for "a glorious 18 months," said Collins. The HapMap project helped us understand genetic variation across the genome, enabling 500,000 SNPs to serve as proxies for the rest. Of course, the precipitous drop in genotyping costs, from 50 cents in 2002 to 0.1 cent today, hasn't hurt.
Collins' own field of type 2 diabetes has "moved into totally new territory," with the identification of 16 new gene loci. Many of these rare variants confer only a modest odds ratio, suggesting these loci would not make good drug targets for the broader population. But Collins noted that of the first ten diabetes genes identified, two (KCNJ11, PPARG) are mainstays of diabetic therapy. A new NCBI database, dbGaP (Genotype and Phenotype), would provide further stimulus to the collaborative analysis of vast genome-wide mapping data.
*GenBank Celebrates 25 Years of Service, NIH, April 7-8, 2008. Video Webcast: www.tech-res.com/GenBank25
This article appeared in Bio-IT World Magazine.