The deluge of data and the accompanying proliferation of databases are spiraling out of control. New federations and solutions may offer partial relief.

By Elizabeth Gardner

June 15, 2003 | Whatever gripes a biologist might have in the new genome era, lack of information isn't one of them. The problem is tracking it down.

And the field is ripe for a new solution that could "change the way science is done," according to Mark James, program manager for the coordinating center of the Biomedical Informatics Research Network (BIRN), an initiative funded by the National Institutes of Health that aims to get researchers to share their data quickly and easily — in spite of themselves.

There are hundreds, perhaps thousands, of freely accessible, genomics-related databases on the Web. In the area of mouse genetics, for example, they range from indispensable general-interest resources, such as the Mouse Genome Database at Jackson Laboratory; to more specialized collaborative projects, including the Mouse Brain Library at the University of Tennessee (see "Building a Better Mouse Database," right); to esoteric cottage efforts such as the Genome Pathways Database Browser at Case Western Reserve University. While some are funded by government grants specifically for database development, others may spring out of spontaneous enthusiasm from a graduate student.

Building a Better Mouse Database
Mouse Brain Library may be a shortcut out of the maze of genomic database standards

The journal Nucleic Acids Research lists some 350 curated, peer-reviewed Internet databases in its most recent annual roundup of bioinformatic data sources. Andy Baxevanis, a bioinformatician at the National Human Genome Research Institute (NHGRI), notes that this compendium includes only those sources large and thorough enough for their creators to consider them worthy of peer review.

Squeaking by: Joe Nadeau, a professor in the genetics department at Case Western Reserve University, points out flaws in current genomic databases, such as the lack of standards.

Bret Peterson, project officer at the National Center for Research Resources (NCRR), the NIH component responsible for BIRN, has heard estimates of as many as 3,500 Web repositories of genomic data. A Google search turns up more than 7,000 individual Web pages with "genome database" in their title, and that doesn't include sites with more imaginative names.

These resources are probably all useful to somebody — assuming interested parties can locate them. Many of the larger databases link to one another routinely, but word of smaller ones — what Peterson calls "boutique databases" — may leak out only at meetings or through random mailing-list postings. "I find them by word of mouth," says Kirsten Youngren of the department of genetics at Case Western. "Someone will say, 'Go to [the "Golden Path" portal at University of California at] Santa Cruz for this.'"

David Lipman, head of the National Center for Biotechnology Information (NCBI), is impressed with the persistence and ingenuity of the user community, given the hurdles to finding their target information on the Internet. "If a resource's content really is useful, it tends to get used, no matter how arcane the interface or obscure its [Internet] location," he says. "But there are certainly exceptions to this rule, and it's terrible to see underused systems. We've had some experiences with these internally and found it possible in some cases to increase usage by over 60 percent with only small changes in navigation."

Even when a new resource is tracked down, it can be a time-consuming challenge to master. "Sometimes I wonder if I'm going crazy," says Desmond Smith, head of the Functional Neurogenomics Lab at University of California at Los Angeles. "Everyone is raving about how powerful and useful these new databases are, but to me they seem unrewarding and cumbersome, and they never give me exactly what I need. You can tell something's not working well when there are 50 different solutions to a problem."

Program Guide 

Lipman agrees that many databases are not as accessible as they could — and should — be. "We have a heck of a time working to make it easy for folks to find the best place to answer their questions just among the NCBI resources," he says. "Given the challenges we've had just for our own stuff, the difficulties users face overall are quite daunting."

NCBI is working to make all its databases, including GenBank, UniGene, and dbSNP, accessible through its Entrez gateway, with uniform query interfaces and improved interlinking of resources. In the works is a cross-database search facility that Lipman hopes will make it easier for users to find the most logical starting point for their quests.

Joe Nadeau, a professor in the department of genetics at Case Western, says many of the smaller databases lack "content and correctness." He adds: "The academic community prides itself on academic freedom and doing things their own way. Getting them to get it in usable shape is more complicated." Free Web databases tend to come and go, Nadeau says. "They're not stable. And you have to go to lots of different sites and learn lots of sets of tools."

A Database Day in the Life 
Toshimori Kitami, a student in Joe Nadeau's lab at Case Western Reserve University, describes a typical journey through the multitude of online data sources, illustrating both the wealth of information available — and the torture of tracking it down.

Nadeau's research group has seen the problem from both sides. "Tools on the Web don't do what I want," says graduate student Toshimori Kitami, who mapped a typical tortuous online search (see "A Database Day in the Life," right). "If you make a database specific to a certain type of information, it makes it easier to distinguish your project and get funding for it. But I can't rely on other people to make the search tools I need."

So, Kitami spent a week learning basic MySQL — setting up the databases, downloading the data, and doing simple queries — and several more weeks learning how to run complex queries. He rates the whole process as about an 8 on a pain scale of 1 to 10, and says: "If there were a book or a course on MySQL for biologists, with examples from genomic databases, the pain would be a lot less."

Kitami would like to perform complex queries on multiple genes simultaneously, but most databases don't have sufficiently sophisticated user interfaces — including the one created by his own lab. The Genome Pathways Database Browser suffers from a syndrome common among boutique databases: the absentee programmer. The database houses several dozen genes that the group has studied, with interactive maps showing their interrelationships, structure, and function, along with links to other databases. It's ingenious but incomplete. "Someone from engineering was building the actual database, and I don't think he's interested in doing it any further," Kitami says.
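To make the idea of querying multiple genes simultaneously concrete, here is a minimal sketch of one SQL query that reports on a whole list of genes at once. It is an illustration only: the table names, column names, and gene symbols are hypothetical and are not taken from the Nadeau lab's database, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a server.

```python
import sqlite3

# Hypothetical schema and data: these tables, columns, and gene symbols are
# made up for illustration and do not come from any real genomic database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (symbol TEXT PRIMARY KEY, chromosome TEXT, start_bp INTEGER);
    CREATE TABLE snp  (snp_id TEXT PRIMARY KEY, gene_symbol TEXT REFERENCES gene(symbol));
""")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("GeneA", "1", 1000), ("GeneB", "1", 5000), ("GeneC", "2", 2000)],
)
conn.executemany(
    "INSERT INTO snp VALUES (?, ?)",
    [("snp1", "GeneA"), ("snp2", "GeneA"), ("snp3", "GeneC")],
)


def snp_counts(conn, symbols):
    """Return (symbol, chromosome, SNP count) for every requested gene in one query."""
    # One "?" placeholder per symbol keeps the query parameterized.
    placeholders = ", ".join("?" for _ in symbols)
    sql = f"""
        SELECT g.symbol, g.chromosome, COUNT(s.snp_id)
        FROM gene AS g
        LEFT JOIN snp AS s ON s.gene_symbol = g.symbol
        WHERE g.symbol IN ({placeholders})
        GROUP BY g.symbol, g.chromosome
        ORDER BY g.symbol
    """
    return conn.execute(sql, list(symbols)).fetchall()


results = snp_counts(conn, ["GeneA", "GeneB", "GeneC"])
for row in results:
    print(row)
```

The LEFT JOIN keeps genes that have no recorded SNPs in the result with a count of zero, which a plain inner join would silently drop.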

The state of affairs in mouse data is mirrored for most other species, from human to fruit fly. Standard user interfaces? Forget it. Downloadable query results? Sometimes, but don't count on it. Common nomenclature? If you're lucky, but too often, genomics today is back where chemistry and physics were before everyone agreed on the names of the elements. The same gene may go by multiple different names and acronyms if its discoverers aren't cooperating with organized efforts such as the Gene Ontology (GO) Consortium, whose goal is to establish a common genomic vocabulary.

"If the gene symbols are out of date, it's impossible to connect with what the rest of the world has," says Janan Eppig, director of the Mouse Genome Database and the GO Consortium member charged with coordinating vocabulary for the mouse genome.
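A small sketch shows why out-of-date symbols break connections between databases, and how a shared synonym table can restore them. Everything here is assumed for illustration: the symbols, the synonym table, and the source names are invented, and this is not the Gene Ontology Consortium's or the Mouse Genome Database's actual tooling.

```python
# Hypothetical synonym table mapping outdated or alternate symbols (upper-cased)
# to one agreed-upon symbol. All names are invented for illustration.
SYNONYMS = {"GENEA": "GeneA", "GA1": "GeneA", "OLDNAME2": "GeneB"}


def normalize_symbol(symbol, synonyms=SYNONYMS):
    """Map a symbol to its agreed-upon form; unknown symbols pass through unchanged."""
    cleaned = symbol.strip()
    return synonyms.get(cleaned.upper(), cleaned)


def merge_records(records):
    """Group (symbol, source) pairs by normalized symbol so the same gene lines up."""
    merged = {}
    for symbol, source in records:
        merged.setdefault(normalize_symbol(symbol), []).append(source)
    return merged


merged = merge_records([("ga1", "db_one"), ("GeneA", "db_two"), ("Xyz", "db_three")])
print(merged)
```

Without the synonym table, the first two records would look like different genes and their data would never be joined.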

The Woeful Web

In some ways, the Web set back the cause of scientific cooperation, says Mark Boguski, visiting investigator at the Fred Hutchinson Cancer Research Center in Seattle and one of the original gurus of GenBank, the voluminous NCBI sequence database begun in 1982. GenBank is one of the few databases with enough clout in the community to force researchers to submit their data to it; virtually all journals require scientists to obtain GenBank accession numbers for all the sequences mentioned in their papers. (Even Science, which published draft sequences of the human genome by Celera in 2001 and the rice genome by Syngenta in 2002 without requiring deposition in GenBank, says it will follow a new policy doctrine dubbed UPSIDE [Uniform Principle for Sharing Integral Data and Materials Expeditiously], as laid out in a recent National Academy of Sciences report.)

"When there was no Web, GenBank was the only place to go to publish electronically," Boguski says. "The Web increased the incentive and lowered the bar. There's been a flowering of databases, but there are no standards and no understanding about having links back to common records."

Controlled BIRN 
Funded by the NIH, the federated Biomedical Informatics Research Network (BIRN) is designed to facilitate the sharing of information among researchers and institutions.

"It seems unlikely that too many more GenBanks will be funded: They're expensive and once you've committed the money, it's committed in perpetuity," NCRR's Peterson says.

Mouse geneticists searching for a robust, well-maintained database with a comfortable, flexible user interface can turn to the commercial Celera Discovery System — at a cost of $2,000 per year per academic user. The database has sequence information from three mouse strains, compared with the single strain available through public sources.

Celera recently lowered its academic fees. "We needed to be more in line with the funding cycle," says Tony Kerlavage, senior director of bioinformatics for Celera. He estimates the product has 2,500 academic users across 250 institutions, but expects an increase thanks to the pricing change.

"We started subscribing to Celera a couple of years ago, and, at the time, it blew any other resource away," says Lesil Brihn, another student in the Nadeau lab. "Celera is easier because you can type in a large or small range of sequence you want to see, and you get the sequence, genes in the area, potential genes, BACs (bacterial artificial chromosomes) that cover the area, and SNPs (single nucleotide polymorphisms) all at once."

Celera paid dearly to get its product to its current level of sophistication, with a staff of 150 working over several years. "Realistically, it is not practical for academic groups to design software up to this level," observes John Hancock, head of bioinformatics at the Mammalian Genetics Unit of the U.K. Mouse Genome Centre.

Nor is it realistic to expect more commercial solutions. Though Celera's database business finally achieved profitability last year, other companies, such as Structural GenomiX, Incyte, and Gene Logic, have either abandoned the database business as such or have tailored their offerings to the drug discovery market. And Celera has delegated the marketing of the Discovery System to its sister company, Applied Biosystems, which is selling assays and other add-ons to supplement it.

BIRN to Run 

For better or for worse, boutique databases will play an increasingly important part in research, and that's where BIRN comes in.

The idea is to create a "federation" of databases using a mediating technology that will unite all of them — no matter what their size, structure, or originating software — and allow researchers to query them all simultaneously, without having to understand the query interfaces of each individual database, or even necessarily knowing where the data come from.

NIH has awarded four grants, totaling $30 million, to kick off BIRN. The first funds the coordinating center at the University of California at San Diego, where program manager Mark James is based. The other three are for the initial test projects, on brain morphology, mouse models of human disease, and the development of a magnetic resonance imaging protocol for a large-scale study of schizophrenia. Twelve universities are participating so far, with a total of 75 users, including both technical support staff and researchers. The network uses the 10-gigabit bandwidth of Internet2 and is designed to be scalable to thousands of users.

James' team is using the Storage Resource Broker, middleware created at UCSD, to mediate among the various databases. Metadata on each file go to a central catalog, indicating the type of content and who's authorized to access it. The system contains about 170,000 files so far. "We're attempting to change the way science is done, getting away from small studies at one particular site" and into broader multi-institutional collaborations, James says. "The idea is, let's put everyone together in the same playpen."
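The central catalog described above can be pictured with a toy model in which each file's metadata records its content type and who may access it. This is a sketch under stated assumptions, not the Storage Resource Broker's actual interface: the class names, fields, sites, and users are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Metadata for one file; the data itself stays at the originating site."""
    file_id: str
    site: str
    content_type: str
    authorized: frozenset  # users allowed to see this file


class Catalog:
    """Toy central catalog (hypothetical, not the real SRB API)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.file_id] = entry

    def search(self, user, content_type):
        """Return IDs of files of the given type that this user may access."""
        return sorted(
            e.file_id
            for e in self._entries.values()
            if e.content_type == content_type and user in e.authorized
        )


catalog = Catalog()
catalog.register(CatalogEntry("f1", "site_a", "mri_scan", frozenset({"alice", "bob"})))
catalog.register(CatalogEntry("f2", "site_b", "mri_scan", frozenset({"bob"})))
catalog.register(CatalogEntry("f3", "site_b", "histology", frozenset({"alice"})))

print(catalog.search("alice", "mri_scan"))
print(catalog.search("bob", "mri_scan"))
```

Because the access check happens inside the search itself, a user never learns that a file exists unless its owner has authorized them.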

Controlling access is key to the success of the federated model, James says. "People are used to competing with each other. Some scientists are scared that someone will scoop their data. We have to allay their fears."

BIRN is a closed system rather than an Internet free-for-all, and participants need a special rack of servers provided by NCRR. The whole thing runs on Red Hat Linux, using Oracle for the information found by the Storage Resource Broker. Potentially sticky issues, such as licensing commercial tools across the federation, or giving proper credit to a data source in a paper, are being worked out along the way.

Discovery Channel 

IBM's DiscoveryLink product, on the market since 1995 and embedded in its DB2 software since 2001, has given organizations the ability to unite their various databases internally, as long as they had some commonality of structure. IBM will shortly introduce the DiscoveryLink Information Integrator, which can handle data in virtually any format: databases, spreadsheets, Web sites, XML, and free text.

Product Manager Laura Haas calls it the next best thing to a fully integrated data warehouse. "Warehousing is the best approach when it works, but if you have organizational issues, federation technology can be very useful in allowing people to bring data together and share it without having to give up control. DiscoveryLink preserves the autonomy of individual data sources."

Haas thinks the federated approach is ideal for universities and other academic environments. Though it involves having someone familiar with the data source create a "wrapper" — a piece of software that translates the contents of the data source into a form that the federation can access — she bluntly observes that "labor is cheap in universities. We might as well use it to integrate data."
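The role of a wrapper can be sketched in a few lines: each data source keeps its own format, and a small adapter translates its records into one common form that a federated query can combine. The code below is an assumption-laden illustration, not DiscoveryLink's actual wrapper interface; the class names, field names, and sample records are hypothetical.

```python
import csv
import io


class DictSourceWrapper:
    """Wraps a hypothetical source that stores records as dictionaries."""

    def __init__(self, rows):
        self._rows = rows

    def find(self, symbol):
        # Translate this source's field names into the common form.
        return [
            {"symbol": r["gene_name"], "chromosome": r["chr"], "source": "lab_a"}
            for r in self._rows
            if r["gene_name"] == symbol
        ]


class CsvSourceWrapper:
    """Wraps a hypothetical source that publishes a CSV export."""

    def __init__(self, text):
        self._text = text

    def find(self, symbol):
        reader = csv.DictReader(io.StringIO(self._text))
        return [
            {"symbol": row["Symbol"], "chromosome": row["Chrom"], "source": "lab_b"}
            for row in reader
            if row["Symbol"] == symbol
        ]


def federated_find(wrappers, symbol):
    """Ask every wrapped source the same question and pool the answers."""
    hits = []
    for wrapper in wrappers:
        hits.extend(wrapper.find(symbol))
    return hits


wrappers = [
    DictSourceWrapper([{"gene_name": "GeneA", "chr": "1"}]),
    CsvSourceWrapper("Symbol,Chrom\nGeneA,1\nGeneC,2\n"),
]
hits = federated_find(wrappers, "GeneA")
print(hits)
```

The caller never sees the dictionary fields or the CSV headers, only the common form, which is the autonomy-preserving trade that federation offers: each lab keeps control of its own data and format, at the cost of writing and maintaining its wrapper.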

But Janan Eppig of the Mouse Genome Database says software such as DiscoveryLink or SRB won't solve the fundamental problem of semantics that currently plagues the genomic field. "The definitions of objects have got to be the same [across the databases in the federation], or you'll just get [unusable] stuff when you query," she says. The choice lies between getting everyone to agree on the same vocabulary, or being prepared to create translators for each data source that's speaking a different language. James says BIRN participants are using a combination of the two.

Nonetheless, Eppig sees the need for some kind of cooperative effort, given the sheer size and complexity of the available information. "No one has solved the problem of how to completely cover a domain in a single, integrated, accessible database," she says. "We'll never have everything, and neither will anyone else."

"What's interesting to me about this overall problem is that there's no simple solution," NCBI's Lipman says. "We all want the most knowledgeable people involved in creating these resources, but that inevitably leads to nonuniformity in interfaces, retrieval, and navigation, and that can impact utility. Finding the right balance is tough!"

Elizabeth Gardner is a Chicago-area freelance writer who specializes in technology. She wrote the profile of Structural GenomiX that appeared in our September 2002 issue.