A BETTER WAY: Cancer genomics informatics guru Pablo Tamayo helped shape Oracle's new 10g.
Oracle's new 10g database comes complete with bells and whistles that only industrial-strength, compute-savvy life scientists will appreciate.
By Mark D. Uehling
February 18, 2004 | Science today lives in a database. Touching discovery research and clinical practice, reaching academia and industry, Oracle provides the foundation for more major vendors' software and instruments than any other company. Its new database, Oracle 10g, has been stuffed with bioinformatic and statistical algorithms designed to serve scientists.
Especially for analysis that requires uniting data from high-throughput environments and multidisciplinary research, Oracle is the standard for storage. With 10g, the company's larger goal — creating an environment for the integration of data and applications — is clear.
It's too soon to say whether all of 10g's new features will affect network performance or the time needed to run large jobs. But a handful of vendors, including Accelrys, MDL, and InforSense, are adding or evaluating the new functionality. Waters has announced 10g capabilities in its chromatography software. LION Bioscience's new LION SRS engine allows users to connect Oracle and LION. Sooner or later, most of the IT professionals in pharma and biotech will be obliged to examine what 10g might enable in their shops.
Here, Bio·IT World offers three early glimpses at 10g, which could turn out to be the product of the decade.
MIT's Pablo Tamayo: Helping Data Speak
Leukemia. Lymphoma. Brain tumors. Again and again, Eric Lander and his group at MIT's Center for Genome Research have stared at gene-expression data that confound the most eminent physicians. More often than not, the person who extracts insight from those data is Pablo Tamayo, the center's manager of cancer genomics informatics.
"Anyone who hopes to make significant advances in genomic data analysis needs to be able to do three things well," says physician Todd Golub, one of Tamayo's colleagues. "They need to really understand the biological question at hand. They need to understand the underlying statistical and data-mining principles. And they need to be able to convert all this into useful software. There are precious few people who can pull this all together, and Pablo is one of them. He is at the heart of everything we're trying to do with microarray analysis and knowledge discovery."
So why did Tamayo, an academic who also consults for Oracle, become involved in shaping the feature set of a new database? Because he thinks the best methodologies for analyzing data should not be confined to the ivory tower. "There is a big payoff in facilitating the development, use, distribution, and deployment of computational methods," he says. Tamayo suggested that several algorithms be added to the new database, and he views such suggestions as part of his larger mission at what will become the Broad Institute: finding methodologies that combine data and applications to solve a variety of molecular pattern-recognition problems.
One big challenge in his research, Tamayo explains, has been to digest huge quantities of microarray data, whether he's watching thousands of genes flicker on and off, or tracking hundreds of proteins. "We can ask more global questions," he says. "Why does this patient respond to treatment and not this other patient? There should be a reason. We can ask that question at the molecular level."
Before 10g, many of the searching and data-integration challenges he's imagining now would never have been attempted. "It's all about integrating and enabling certain types of analysis to be done easier and better," Tamayo says. "The problem is not so much the tools. The problem is not so much the data." The real trick: identifying and validating a computational methodology that combines the data and the tools in one swipe.
Some analytical challenges will be eased by the new database, he says. Some will require other tools. "What is really important is the method," says Tamayo, a physicist by training. That's where tools such as BLAST come in. The sequence-searching algorithm runs natively in Oracle 10g, as do several less well-known analytical methods, including support vector machines (SVM) and non-negative matrix factorization (NMF).
"The [Oracle] environment enables the computational biologist to spend more time asking the big, deep questions and exploring the corresponding analysis method," Tamayo says, "not developing computational plumbing."
In the case of BLAST and SQL, Tamayo says, Oracle is enabling more than just simple searching. It's helping scientists easily limit a search to certain organisms, or to sequences added after a certain date, or to those for which an FDA-approved protein inhibitor exists. "With this capability added to the database, I can do this easily," Tamayo says.
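The idea Tamayo describes, a similarity search narrowed by ordinary SQL conditions, can be sketched in a few lines. What follows is only an illustration under stated assumptions, not Oracle's BLAST interface: SQLite stands in for the database, a toy k-mer overlap score stands in for BLAST, and the `sequences` table, its columns, and the `kmer_similarity` function are all hypothetical names invented for the example.

```python
import sqlite3


def kmer_similarity(query, subject, k=3):
    """Toy stand-in for an alignment score: Jaccard overlap of k-mers.

    This is NOT BLAST; it only gives the database something to rank by.
    """
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    s = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    if not q or not s:
        return 0.0
    return len(q & s) / len(q | s)


def build_demo_db():
    """Create an in-memory database with a hypothetical sequences table."""
    conn = sqlite3.connect(":memory:")
    # Register the scoring function so it can be called from inside SQL.
    conn.create_function("kmer_similarity", 2, kmer_similarity)
    conn.execute(
        "CREATE TABLE sequences ("
        "seq_id TEXT PRIMARY KEY, organism TEXT, "
        "date_added TEXT, residues TEXT)"
    )
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?, ?, ?)",
        [
            ("hs_1", "Homo sapiens", "2004-01-15", "ACGTACGTGA"),
            ("hs_2", "Homo sapiens", "2003-06-01", "ACGTACGTGA"),
            ("sc_1", "Saccharomyces cerevisiae", "2004-02-01", "ACGTACGTGA"),
            ("hs_3", "Homo sapiens", "2004-02-10", "TTTTTTTTTT"),
        ],
    )
    return conn


def search(conn, query_seq, organism, added_after, min_score=0.3):
    """Rank similar sequences, restricted by plain relational predicates."""
    sql = """
        SELECT seq_id, kmer_similarity(?, residues) AS score
        FROM sequences
        WHERE organism = ?
          AND date_added > ?
          AND kmer_similarity(?, residues) >= ?
        ORDER BY score DESC
    """
    params = (query_seq, organism, added_after, query_seq, min_score)
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in search(conn, "ACGTACGTGA", "Homo sapiens", "2003-12-31"):
        print(row)
```

With this invented data, the query for human sequences added after the end of 2003 returns only `hs_1`: the older human record, the yeast record, and the dissimilar sequence are all filtered out by the `WHERE` clause, with no separate post-processing step.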
When it comes to SVM algorithms, he says, the code is suited to large data sets in which a few very subtle patterns define categories of cancers or patients. SVM has been a key component in the recent work of Golub and Lander, classifying cancers into categories that even oncologists were not previously aware of. "Will SVM solve all the problems of classification?" Tamayo asks. "Probably not. But this algorithm happens to be very good with this type of data."
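For readers who have not met the technique, a linear SVM can be sketched briefly. The example below is a minimal illustration, not Oracle's implementation and not the method used by Golub and Lander: it trains a linear classifier by full-batch subgradient descent on the regularized hinge loss, and the two-gene "expression profiles," labels, and parameter values are invented for the purpose.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    samples: list of feature vectors; labels: +1 or -1 for each sample.
    Returns the weight vector w and the bias b.
    """
    n = len(samples)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        # The regularization term pulls the weights toward zero.
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Only samples inside the margin contribute to the hinge loss.
            if margin < 1:
                for j, xj in enumerate(x):
                    grad_w[j] -= y * xj / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Invented toy data: expression levels of two hypothetical genes per sample.
PROFILES = [
    [2.0, 0.1], [1.8, 0.3], [2.2, 0.2],   # hypothetical class +1
    [0.2, 1.9], [0.1, 2.1], [0.3, 1.8],   # hypothetical class -1
]
LABELS = [1, 1, 1, -1, -1, -1]

if __name__ == "__main__":
    w, b = train_linear_svm(PROFILES, LABELS)
    print("weights:", w, "bias:", b)
    print("new sample [2.0, 0.2] ->", predict(w, b, [2.0, 0.2]))
    print("new sample [0.2, 2.0] ->", predict(w, b, [0.2, 2.0]))
```

Real microarray data involve thousands of genes and far fewer samples than features, which is one reason margin-based methods such as SVM are attractive there; this toy case shows only the mechanics.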
Oracle's Oracles: What the Company Says
NOT YOUR EVERYDAY SEARCH ENGINE: With BLAST built in, 10g enables scientists to easily fine-tune their searches, using query options such as maximum number of alignments.
Enlisting notables like Tamayo was part of Oracle's plan to make the database even more indispensable. As Charlie Berger, Oracle's senior director, life sciences, recounts the virtues of 10g, MySQL seems to be its most likely target. "With MySQL there are a lot of closed doors," Berger says. "You want to handle an image? You can't do it. You want to mine the data? You can't do it. You want to do statistics? You can't."
Berger takes pains to say that one selling point of IBM's DiscoveryLink middleware — integrating data from a variety of sources — is a capability Oracle can match. "If you do have a lot of data that comes off your laboratory information management system (LIMS) or instruments," Berger says, "and you store it in an external file system, and you want to keep that data forever, IBM would call that a federated approach. We have the same functionality. We can index and query and function on those data just as if they were in the Oracle database."
With an estimated 80 percent market share in the life sciences already, Oracle believes it can grow only by absorbing additional data not presently stored in a relational format. Even Berger allows that someone, somewhere, may have a few scraps of information not stored in his favorite product. Such data may be in flat files, or in specialized applications. They may be online, in public databases, or even on individual scientists' hard drives.
"Going forward, don't you want to do a regular expression search on those [data] or mine them or query them?" Berger asks. "Over time, you're going to want to put more and more of those data into a secure, backed-up, real-application-cluster-enabled grid." That remains a point of controversy, as much of the content of vast public databases is not stored in a relational format.
Regardless of whether every last 1 and 0 on the planet gets sucked into Oracle, no one disputes that existing quantities of data are getting harder to manage, let alone analyze. Oracle's Susie Stephens, life science product manager, says: "As the volumes of data grow, it makes more sense to move the algorithms toward the data rather than take the data out of the database toward the algorithms."
That resonates with Xing Heming, manager of bioinformatics at Charles River Associates. Although his shop uses DB2 and MySQL, he is considering 10g. The idea of leaving data in the database "definitely has a lot of value," Heming says.
As one example of what 10g will be able to manage, search, and analyze, Stephens cites images. Some scientists doing systems biology, she notes, want to store mass spec traces. "The images are not used in the core part of the analysis, but people want to be able to link to them," she notes. "Being able to annotate them and query the annotations — it's just critical."
Oracle Does Proteomics: The Prolexys Story
Marcel Davidson almost denigrates himself as "just a DBA" — a database administrator. He's worked with Oracle since version 6. But as his conversation quickly reveals, the scientific and the IT components of his work can be hard to separate.
10g: Life Science Oomph Inside the Database
Oracle has added new features that will be exploited by both third-party vendors and IT departments. This is a partial list of new or improved functionality in the database.
Head of data management at Prolexys Pharmaceuticals (formerly known as Myriad Proteomics), a Utah-based proteomics company, Davidson is proud of the company's vast wet-lab operation and its archive of 280,000 protein-protein interactions. He believes Oracle is making a serious attempt to make its database easier for life scientists to use — once their IT departments have tweaked existing applications to work with it. In general, he says, shorter, simpler SQL queries empowered by 10g will let scientists dredge public databases and compare their contents to other reservoirs of proprietary information.
"We've been able to demonstrate things in our data that we've never dreamed about looking at before," Davidson says. "We have been able to annotate a lot of our proteins with extra data that we have not been able to characterize before," he adds, noting that annotations such as added motif data are now largely automated and rapid rather than cumbersome and manual.
To him, the big picture is almost as important. "We have this information," Davidson says. "It lives inside Oracle. For us to be able to use it inside that context makes it a lot easier. As they discover they can leave the data in one spot, other people will still do everything to the data that they used to do — without having to move it around."
In the case of protein-protein interactions, he notes, comparisons to other species are exceedingly difficult to make, largely because of the number of other databases to check and the difficulty of checking his voluminous, sensitive Prolexys data in and out of the database. Now, with 10g, he has discovered he can mine a major public yeast database and quickly identify which yeast protein-protein interactions correspond to those in humans.
Says Davidson, recalling earlier versions of the database: "When you got into these really difficult situations, things that were hard to express in traditional SQL, you were limited. A lot of times, I think, people said it's not worth trying to bend biology into the relational database. Now Oracle's trying to take a step closer to the solution and say, 'Hey, we recognize there are limitations in what traditional relational databases can do. We'll put these tools in there to make it easier.'"
Davidson is quick to say that the grid-enabled 10g code, with the addition of BLAST, will no doubt need to be patched and debugged a bit as real-world users begin to use it. But on balance, he's about as enthusiastic as a sober, understated scientist can be. "I'm trying to tone it down," Davidson says, "but I am really excited."
PHOTO CREDITS: PHOTO BY KATHLEEN DOOHER