INNOVATION | Software solutions provide context for genomic data analysis
By Malorye A. Branca
July 14, 2004 | As they struggle to extract meaning from ever-growing genomic data sets, an increasing number of scientists are taking the pathways approach. Ideally, scientists would like mathematical representations of these networks, such as systems biology provides, but few researchers have such sophisticated models in hand yet. Nor can most tackle the complex series of experiments needed to establish them, particularly in humans.
Instead, many researchers are building simpler models, mainly from the information on hand. By using the right tools, even a modest genomic data set can generate a good view into a particular biological pathway. Now, a range of new technologies is arising from academia as well as the commercial sector to meet this need.
"A pathway approach lets you interpret complex data using existing biological information, or, even better, you can use additional data from a small directed experiment," explains the Broad Institute's Justin Lamb. Such an experiment might exploit chemical or genetic probes with well-known effects to specifically illuminate a single pathway. Alternatively, "systems biology is predicated on the notion of a mathematical representation of a biological system," Lamb says.
Lamb's group has pioneered a technique called Gene Set Enrichment Analysis, which is driven by the well-worn Kolmogorov-Smirnov statistical test. Gene-expression studies typically generate very "noisy" data, Lamb says. But by using this tool, researchers can analyze those data and estimate the activity of certain pathways in various tumors. This affords researchers valuable information about the specimens, and profoundly changes how they view their microarray data.
"When this notion of a gene set enrichment test came up, it was really a revolution here in how we analyze gene-expression data," Lamb says. A pathways approach highlights trends that researchers might otherwise miss, even in multiple experiments. As Lamb says, the up-regulation of a handful of genes in a sample "doesn't provide much understanding." But if those genes are functionally connected in the same biological process, then "you'd have much more confidence those changes are relevant. Those formerly uninspiring data now have meaning."
The Broad group's approach requires first determining a "gene signature" that lights up when the relevant pathway is active. They can cull this signature from public databases or through their own experiments. Next, the group mines sample data sets to see which tumors express the signature.
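To make the signature test concrete, here is a minimal sketch of a Kolmogorov-Smirnov-style running-sum score. It is an illustration only, not the Broad group's implementation. It assumes the genes are already ranked from most to least up-regulated and that no gene appears twice. The function name, the gene labels, and the unweighted step sizes are all hypothetical choices made for this example.

```python
def enrichment_score(ranked_genes, signature):
    """KS-like running-sum statistic for one gene signature.

    ranked_genes: gene names ordered from most to least up-regulated
                  (assumed to contain no duplicates).
    signature:    gene names assumed to mark an active pathway.
    Returns the largest positive deviation of the running sum from zero.
    """
    signature = set(signature)
    n_hits = sum(1 for gene in ranked_genes if gene in signature)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("need at least one signature gene and one other gene")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in signature:
            running += 1.0 / n_hits      # step up on a signature gene
        else:
            running -= 1.0 / n_misses    # step down on any other gene
        best = max(best, running)
    return best


# Hypothetical ranked list for one tumor sample (labels are placeholders).
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
print(enrichment_score(ranked, {"G1", "G2"}))  # signature clustered at the top
print(enrichment_score(ranked, {"G5", "G6"}))  # signature clustered at the bottom
```

A signature whose genes cluster near the top of the ranking scores close to 1, while one scattered through the list or sitting at the bottom scores close to 0. In practice a score like this would be compared against scores from permuted data to judge significance, a step omitted here.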
The tool has already helped Lamb and colleagues make discoveries about the cyclin D1-related mechanisms that underlie cancer and diabetes. Most importantly, essentially all the work was carried out in silico (Lamb, J. et al. Cell 114, 271-2; 2003; Mootha, V.K. et al. Nature Genetics 34, 244-5; 2003).
Plethora of Pathways
Broad has a particularly rich data set to mine, but now a wide range of new commercial software is helping many other scientists attempt pathway analysis.
At one end of the spectrum, text-mining software from companies such as timeBlaster, bioAlma (TextMiner), and Ariadne Genomics (MedScan TextMiner) can cluster genes together, highlighting functional relationships. Some of these tools come with bells and whistles, including automated article retrieval, which can help scientists keep abreast of fast-moving topics.
For those who want to tweak actual models, new software and databases from companies including Ariadne (PathwayAssist 2.5), Biobase (Transpath), BIOSoftware Systems (BioPathway Explorer), Ingenuity Systems (Pathways Analysis), InforMax (Vector PathBlazer and GeneGo's MetaCore), and Paracel (PathWorks) let users create pathway diagrams, among other features.
For example, with PathwayAssist, one can search all the proteins that interact with one known actor, such as p53. Alternatively, one can search for a "thread" that links together multiple proteins.
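Both kinds of search amount to simple graph queries. The sketch below is not PathwayAssist's code or API. It is a generic illustration that assumes interactions are stored as undirected protein pairs and uses a breadth-first search to find the "thread." The edge list is a toy example (with TP53 standing in for p53), not data drawn from ResNet, and every function name is hypothetical.

```python
from collections import deque

# Toy interaction pairs for illustration only; not a curated data set.
TOY_EDGES = [
    ("TP53", "MDM2"),
    ("TP53", "CDKN1A"),
    ("CDKN1A", "CCND1"),
    ("CCND1", "CDK4"),
    ("MDM2", "MDM4"),
]


def build_graph(edges):
    """Turn a list of protein pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def interactors(graph, protein):
    """All proteins that interact directly with one known actor."""
    return sorted(graph.get(protein, set()))


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph[node]):  # sorted for reproducible results
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def thread(graph, proteins):
    """Link several proteins, in the given order, by chaining shortest paths."""
    if not proteins or proteins[0] not in graph:
        return None
    full = [proteins[0]]
    for a, b in zip(proteins, proteins[1:]):
        leg = shortest_path(graph, a, b)
        if leg is None:
            return None
        full.extend(leg[1:])  # skip the first node to avoid repeating it
    return full


graph = build_graph(TOY_EDGES)
print(interactors(graph, "TP53"))
print(thread(graph, ["MDM4", "TP53", "CDK4"]))
```

Chaining shortest paths in a fixed order is the simplest way to connect several proteins. A production tool would likely also weigh the type and direction of each interaction, which this sketch ignores.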
"Our base is an ontology that defines the type of facts we want to find," says Anton Yuryev, Ariadne's director of application science. The company has also built the ResNet database, containing more than 200,000 biological interactions extracted from published reports using proprietary natural language processing software. "There are several approaches to natural language processing," he says. "We decided to do full sentence parsing."
Yuryev says pathways software is catching on because "many users want to build their own subnetworks or models." With distribution courtesy of Stratagene, PathwayAssist sales are growing. Ariadne recently inked a deal with Jubilant Biosys, obtaining access to the Indian company's hand-curated PathArt database of signaling, metabolic, and disease pathways.
PATHS OF GLORY: The BioAnalytics Group's Model-Based Assay technology reveals upstream sites of action based on even thin bits of downstream data. Here, the pathway is represented in BIOSoftware Systems' BioPathway Explorer.
Source: The BioAnalytics Group
Ingenuity has pioneered the hand-curated approach with Pathways Analysis. "It's a unique combination of software and content," explains Rudy Potenzone, senior vice president of business development and strategic planning. Ingenuity describes its product as "the world's largest manually curated biological database created from millions of individually modeled relationships." Human intervention is critical, the company maintains, to stay abreast of rapidly changing results and theories in molecular biology.
In addition to its extensive database, the company has novel software that searches those data, and advanced visualization capabilities. The software is useful not only to the individual expert but also as a communication tool.
"It allows the scientists to capture concepts and illustrate them to others in the company," Potenzone says. Clients are using the tool for myriad applications, including target validation, mechanism of action, and biomarker discovery.
Further along the spectrum are full-blown networks, such as those provided by KEGG (Kyoto Encyclopedia of Genes and Genomes) and SRI International. Some of these are so well defined that they border on systems biology. But in the human realm, many details are usually still left to fill in.
SRI, which focuses on metabolic pathways, has relatively few entries in the HumanCyc section of its BioCyc database collection. But growing that collection is one of the organization's goals.
The proprietary Pathway Tools, which can predict metabolic pathways from genomic data, should help. Development of this software was sponsored by a large pharmaceutical company, which prefers to remain anonymous. "They are using it, and we get moderate traffic on the site for it," says Peter Karp, director of SRI's bioinformatics program. "But I don't think a lot of people know about it yet."
Using SRI's Pathway Tools, a scientist can "quickly get a first draft of a pathway," Karp says. It allowed his group to generate about 150 metabolic pathways in only two months. SRI's four full-time manual curators then tweak and enhance the actual models.
In a similar approach, a collaboration between Cold Spring Harbor Laboratory and the European Bioinformatics Institute has resulted in the Reactome, a curated database of human and some animal metabolic and regulatory pathways.
PATHWAY PLOTTER: Paracel's PathWorks software is designed to make it easy to draw and annotate pathways, as well as store related data. Elements such as DNA, membranes, and reaction arrows are built in.
Filling in these models will require much hard work. BIOSoftware Systems and The BioAnalytics Group (TBAG) have a unique agreement aimed at using the BioPathway Explorer to "solve urgent problems," says TBAG president G. Scott Lett. Both companies sprang from the ashes of former systems biology pioneer Physiome Sciences after it merged with drug discovery company Predix Pharmaceuticals last year.
BIOSoftware provides the modeling software while TBAG applies additional tools and hard-won experience to extract actionable information. "A big piece of what we do is the consulting," Lett says, although customers can also directly license about 80 pathways. A typical project may involve helping a client identify additional targets in a pathway that a compound is hitting, thereby helping to narrow a long list of clinical prospects.
"Things are starting to kick in," Lett says. TBAG has several top 20 pharmaceutical companies among its clients. Lett attributes part of the success to the value packed into the software and services, but he also credits market acceptance. "Scientists are getting a lot more quantitative data from tools like high-content screens and microarrays, so they are more familiar with computational tools and realize they can't do all of it themselves," he says.
The importance of pathway analysis has not been lost on other kinds of software providers. Several microarray analysis software makers, such as VizX Labs and Silicon Genetics, let customers plug into KEGG.
"It's one of the most important sources of pathway data," explains Kevin Wandryk, vice president of marketing and business development at Silicon Genetics. Clients can also access GenMAPP (Gene MicroArray Pathway Profiler) through GeneSpring. Both Ariadne's PathwayAssist and GeneGo's MetaCore connect easily through the GeneSpring application programming interface.
According to Wandryk, a recent informal survey suggested that 30 percent to 40 percent of the company's clients use pathways in some part of their analysis. "It was a smart decision for us to add that component," he says. That number should increase as more clients start combining data types. "We pride ourselves on helping customers not only sort the signal from the noise but also put the data in a biological context."
It may take a while to work out mathematical models that allow precise predictions, but for now, scientists who were limited to studying one gene or one protein at a time are making great progress — one pathway at a time.