By Tracy Smith Schmidt
December 15, 2002 | The National Institutes of Health recently awarded a three-year, $15-million grant to establish a single global resource for protein sequence information, known as the United Protein Database (UniProt).
The money will go to one American and two European groups to combine and improve three of the most valuable biological databases in the world -- SWISS-PROT, the Protein Information Resource (PIR), and Translated EMBL (TrEMBL). Once the merger is complete, UniProt will provide “one-stop shopping” for scientists seeking comprehensive and up-to-date information about a protein sequence of interest. UniProt will be free of charge on the Web, and the entire data set will be available for downloading.
SWISS-PROT contains more than 100,000 entries, and PIR, nearly 300,000. These databases have been laboriously hand-curated to reflect the most recent literature and to highlight many key protein features such as function, domain structure, and post-translational modifications -- information essential for guiding and interpreting experiments in numerous areas of biology. In contrast, the 700,000 entries in TrEMBL are mainly computer-annotated translations of coding regions in the nucleotide sequence databases. Before including them in UniProt, scientists from the SWISS-PROT and PIR teams will carefully inspect and modify the TrEMBL “rough drafts” to bring them up to the standards that SWISS-PROT users have come to expect.
The UniProt project is spearheaded by principal investigator Rolf Apweiler, head of the sequence database group at the European Bioinformatics Institute (EBI) in Cambridge, England. His co-investigators are Amos Bairoch, the founder of SWISS-PROT and a scientist at the Swiss Institute of Bioinformatics (SIB) in Geneva, and Cathy Wu, director of PIR, which is managed by Georgetown University Medical Center and the National Biomedical Research Foundation in Washington.
The SWISS-PROT and TrEMBL groups have had close interactions for many years, but including the PIR group, which has its own procedures, will be more of a challenge. Yet Apweiler is confident that the mix will work and strengthen the database. “We’ve been discussing [the possibility of this merger] with PIR for two years,” he says. “We have different styles, but that will be good, to question how things are done, to add some spice.” Wu echoes Apweiler’s view: “We tend to specialize in annotating different types of features, which are complementary.”
The six NIH institutes and centers supporting the grant, led by the National Human Genome Research Institute (NHGRI), signaled their interest in a central protein database by hosting a workshop in May 2001. In February 2002, the NIH put out a request for grant applications, encouraging scientists to compete for the funding it had earmarked for such a resource. The SWISS-PROT, PIR, and TrEMBL proposal shone as the best of the bunch.
“This award represents our commitment to databases,” says Peter Good, program director in genome informatics at the NHGRI in Bethesda, Md. Such commitment comes as welcome news to many in the scientific community, particularly at a time when there is no specific international funding strategy to support database infrastructure -- a deficiency that is illustrated by SWISS-PROT’s own history.
In the late 1990s, SWISS-PROT lost some of its funding from Europe. To keep the database running, its leaders were forced to start charging fees for nonacademic use. The money collected from industry has covered about a third of SWISS-PROT’s operational costs in recent years. Now, nearly two-thirds of the NIH grant for UniProt will go toward replacing those license fees, eventually making the data free once again to all users.