Like sailors of old, genomic data miners dream of discovering riches and fame. Given the recent improvements in analytics — and a little more time — they just might succeed.
By Malorye Branca
Senior Informatics Editor
July 11, 2002 | Soheil Shams once worked in defense, analyzing sonar data to distinguish the whispers made by passing submarines from the noise of surface ships, scuttling shrimp, and the myriad other sounds in the sea. Today Shams is applying the same skills to a different challenge: picking out flecks of "gold" submerged in oceans of genomic and proteomic data.
Shams, president and chief scientific officer of BioDiscovery Inc., is not alone in migrating to genomics. Attracted by
|The New Analytical Challenge
|Searching and matching gene sequence data is straightforward compared to the tasks data miners face today.
the data boom and the challenges it poses (see "The New Analytical Challenge," right), as well as the potential rewards of drug discovery and development, many people with analytical skills are turning to genomic data mining and bringing new tools to the field. Shams, for example, has degrees in electrical and computer engineering and is considered an expert in pattern recognition and parallel processing.
Luckily for these prospectors, new types of data are filling in the blanks in our understanding of how genes govern health and disease. Information from sequencing databases, gene expression chips, protein mapping experiments, and cellular assays all represent the raw material from which data miners hope to extract a deeper understanding of living systems and relevant targets for new drugs. Given the right tools and expertise, these companies plan to turn an effort that used to be like finding a needle in a haystack into a knowledge-driven process with a high degree of success.
"The pharmaceutical industry is facing the question, 'How do I get more information about each of the drug targets or leads that I've selected to go forward with?'" says Rich Hamer, vice president and chief scientific officer of Viaken Systems Inc., an informatics specialist. "We can help them with this."
At least that is the hope. The basic idea is simple: Carefully identified data points can be plugged into maps of cellular and physiological processes to help narrow the search for better drug targets. Of course, separating the hype from the hope can be a challenge. The tools are new and don't yet have a track record. The bioinformatics industry — of which data mining is a subset — has struggled with profitability. Integrating diverse data types remains difficult. And the quality of data, upon which data mining depends, varies widely.
Nevertheless, many data-mining companies like Viaken are focused on delivering results for themselves and their customers. If they succeed, they will become real players in the drug discovery field, selling integral software and services.
Garbage In, Garbage Out
One particularly pernicious problem for data miners is the wide variability in data quality. Seemingly interesting results derived from clever algorithms and rigorous statistical analyses are only as meaningful as the data they are based on. For example, most of the Human Genome Project sequence data are regarded as reliable. But areas exist in which the genetic code is highly repetitive. Because the genome was essentially pieced together from many very small sections, some of those repetitive regions may not be accurate.
In more tedious, expensive, and complex types of studies, such as those of gene and protein expression, the potential for error is even greater. "Almost every chemical manipulation you do introduces a bias," explains Jack Pollard, a principal investigator at 3rd Millennium Inc., a bioinformatics firm. "You hope that people did the controls, but in my experience, they often don't." Badly done studies produce results that are irreproducible and data that are often misleading.
Even data from carefully conducted experiments must be cautiously reviewed and "cleaned up" (normalized, for example) before being analyzed. It is de rigueur to throw out the data that are clearly garbage, and to make adjustments for biases in the experiment. Though these steps may seem obvious, they have become increasingly important as the volume of data produced by high-throughput processes soars and amplifies error potential. "We have customers generating 100 [10,000-well] DNA chips a day. That's a million gene expression values in one day," says Shams.
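In code, the cleanup steps described above can be sketched in a few lines. This is a minimal, illustrative Python sketch, not any vendor's actual pipeline; the background threshold and the log2/median-centering scheme are assumptions made for the example:

```python
import math
from statistics import median

def normalize_chip(values, min_intensity=50.0):
    """Clean and normalize raw spot intensities from one DNA chip.

    Steps (illustrative assumptions, not a real vendor pipeline):
    1. discard spots below a background threshold (the "clear garbage"),
    2. log2-transform to tame multiplicative noise,
    3. median-center so chips run on different days are comparable.
    Returns the list of normalized log2 intensities.
    """
    kept = [v for v in values if v >= min_intensity]
    logged = [math.log2(v) for v in kept]
    m = median(logged)
    return [x - m for x in logged]
```

Run across a million expression values a day, even a crude filter-and-center pass like this removes chip-wide intensity bias before any pattern hunting begins.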
"Not knowing the quality of the information is a big problem, because if you are going to integrate, and you don't understand the errors in the data, then the integration becomes meaningless," says Pollard.
One answer is to grade the data before plunking them into a database. "People need to put quality scores on their data," Pollard says. He advises using such simple classifications as "good, medium, and bad," to keep things practical. Some companies have already started using this practice. "CuraGen has been very successful in implementing quality scores, and that allows them to cross-validate their findings bioinformatically at many levels," notes Pollard.
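A grading scheme of the kind Pollard recommends can be as simple as a threshold rule applied before the data enter the database. The cutoffs below are invented for illustration; this is not CuraGen's actual scoring:

```python
def grade_measurement(signal, background, replicate_cv):
    """Assign a "good / medium / bad" quality grade to one measurement.

    Cutoffs are illustrative assumptions: a strong signal-to-background
    ratio with tight replicates (low coefficient of variation) scores
    "good"; weak signal or noisy replicates scores "bad".
    """
    snr = signal / background if background > 0 else float("inf")
    if snr >= 3.0 and replicate_cv <= 0.15:
        return "good"
    if snr >= 1.5 and replicate_cv <= 0.40:
        return "medium"
    return "bad"
```

Stored alongside each value, even a three-level grade lets later analyses weight or exclude suspect points instead of treating all data as equal.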
Although there are many ways to mine genomic data, they all tend to fall into one of two categories. "Today, people use either a bottom-up or a top-down approach," explains Pollard. The top-down approach is now called systems biology, and it essentially draws a picture of the interacting parts of a system. Data are used initially when developing the "systems" algorithms, and again later, when they are put through the algorithms to identify targets for therapeutic intervention. The bottom-up approach, meanwhile, involves gathering results from one or more types of experiments, and fishing out correlations from them.
A single type of data is sometimes sufficient to turn a pattern into a product. For example, data about gene or protein expression can reveal key disease "signatures" that can then be used as diagnostics (see April Bio·IT World, page 44). "People want to find a set of predictor genes and proteins for a medical diagnosis," explains Tom Downey, president and CEO at data-mining specialist Partek Inc. "So you build a classifier, which is a data-mining tool that allows you to classify different tissues based on a subset of markers."
But researchers need to be wary of the "astrology trap" — that is, believing they can find patterns in the stars, so to speak. "The danger with microarrays is that there are so many data points, you can find whatever you want to find," says Downey. "The challenge is going the extra distance to show that what you have found is beyond chance."
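One way of "going the extra distance" is a permutation test: shuffle the tissue labels many times and ask how often a difference as large as the observed one appears by chance alone. Here is a minimal sketch for a single gene; a full microarray analysis would also have to correct for testing thousands of genes at once:

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10000, seed=0):
    """Estimate how often a mean difference this large arises by chance.

    Shuffles the sample labels `n_perm` times and counts how often the
    shuffled difference matches or exceeds the observed one. A sketch
    for one gene; multiple-testing correction is left out for brevity.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm
```

A "signature" whose permutation p-value is no better than a shuffled deck is exactly the astrology trap in action.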
Sometimes findings can be validated just by further analysis, but increasingly, researchers must pull together
|Breakthroughs in Image Analysis
|Genomics is increasingly a story that's told with pictures. These varied images include protein spots that have migrated across a gel, various intensities of gene expression on a chip, the shape of a cell after exposure to a drug, the altered structure of a mutated protein, or samples of tissue from treated and untreated patients. Now, tools to capture, store, search, integrate, and analyze these images are finally available.
evidence from a variety of experiments (see "Breakthroughs in Image Analysis," right).
One likely pairing is gene and protein expression data, but "mapping the relationships between genes and proteins is not as easy as it sounds," says Eric Neumann, vice president of bioinformatics at Beyond Genomics Inc. Just because a gene is overexpressed, that doesn't mean the protein it codes for will be, too. "Sometimes we find the accompanying protein to be down-regulated, even though the gene is up-regulated," adds N. Stephen Ober, Beyond Genomics' president. This finding should not be too surprising, because there are so many steps between gene and protein expression.
Each protein can have multiple forms, because of alternative splicing and modifications (see May Bio·IT World, page 36). Proteins are also involved in many processes within the cell. A protein may be packed into a complex and then secreted, for example, or broken down in a particular cellular compartment. "If you see more transcript for a gene, then the protein will most likely be synthesized in higher amounts," says Ober. "But at a statistical level, they will not be linearly correlated, and you need to use nonlinear techniques to track these relationships."
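Rank-based statistics are one example of the nonlinear techniques Ober alludes to: Spearman's correlation asks only whether transcript and protein levels rise and fall together, not whether they track linearly. An illustrative sketch, assuming no tied values for brevity:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two measurement series.

    Converts each series to ranks, then applies the classic formula
    1 - 6*sum(d^2) / (n*(n^2 - 1)). Assumes no tied values.
    """
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Transcript and protein series that move together monotonically score near +1 even when the relationship is far from a straight line.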
OmniViz Inc. provides a suite of tools that can be used to analyze the whole range of data in
|Researchers use a range of tools, including clustering and more sophisticated statistical methods, to explore and analyze genomic data. Companies such as OmniViz aim to make these tools user-friendly for biologists, who may not be familiar with hard-core statistical packages.
the drug discovery and development process, starting with sequence information and extending all the way through to toxicology data. Literature, patents, and post-market data can also be analyzed using its tools. The software generates a common type of signature for any data set. "That allows us to apply the same sets of analytical and visualization [techniques] to all types of data," says Jeffrey D. Saffer, president and chief technology officer at OmniViz.
Users of data-mining tools span a wide spectrum, including the statistically naive as well as seasoned number crunchers. Running the software is relatively easy, thanks to built-in "wizards" and other helpful features, but some users stumble with the statistics. "[We] have to provide great technical support, and we have consulting available, too," says Saffer. The company's software and consulting businesses are both growing, he says, because "it is increasingly recognized that the value is in the ability to integrate analysis across the different types of information."
Heidelberg, Germany-based Phase-It Intelligent Solutions AG has taken another route. It doesn't sell analysis software at all; rather, it prefers to partner with other groups, thereby marrying complementary expertise. "We provide a unified approach for analyzing genomic and clinical data," says Roland Eils, the company's chief scientific officer.
Phase-It uses machine learning techniques such as neural networks to derive classification schemes from a range of clinical and functional genomic data, including metabolism, gene expression, and protein-protein interactions. Machine learning is the ability of a computer program to improve at solving a given type of problem over time, and pattern recognition is a key element in making that happen. "We have a deep and broad understanding of machine learning, as well as domain knowledge in areas such as cancer and toxicology," says Eils. Some of Phase-It's staff come from the German Cancer Research Center.
In traditional statistics, Eils says, researchers have to know the relevant parameters to build a statistical test model. In cases where the parameters are poorly understood, it takes a long time to build good models, and they may never end up being very useful. But with machine learning, a computer is taught how to pick out the telltale signatures from the data. With "training," the computer does a better and better job.
"You can use a large class of machine learning techniques as 'black boxes,'" says Eils. In other words, data gets dumped in, answers come out, and you figure out later what they mean. For example, after a particular signature is found to predict a disease, researchers can go back and rank the role of key parameters to determine if the model also makes biological sense.
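The "training" Eils describes can be illustrated with the simplest possible learner, a perceptron: with each pass over the labeled samples, the weights get better at separating the two classes. This is a toy stand-in for illustration only, not Phase-It's actual neural networks:

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Train a single-layer perceptron on labeled samples.

    `samples` is a list of (features, label) pairs with label in {0, 1}.
    Each epoch, every misclassified sample nudges the weights toward
    the correct side of the decision boundary.
    """
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    """Classify a new sample with the trained weights."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

The trained weights are exactly the kind of "black box" output Eils mentions: they classify well before anyone has ranked which input parameters drive the decision.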
Phase-It has already tested its approach on clinical samples, including some from B-cell chronic lymphocytic leukemia and colon cancer patients. The group starts by "reclassifying" samples according to molecular characteristics. Next, the Phase-It researchers try to find highly specific markers for the different diagnoses. An overarching goal is to better understand the molecular mechanisms of disease, but they are making practical findings as well.
"We've developed models that can predict survival time and the likelihood of distant metastasis development," says Eils. "We have results coming out that will have a high impact for the clinician." The company is starting on diagnostics, but therapeutics are part of its long-term vision, he says.
For many common drug targets, enough data has been gathered to start sketching out the dynamic pathways they act in. Such a pathway model can serve as a scaffold to which new data points are pinned. This practice is common among the most sophisticated labs.
"Everybody wants to reconstruct metabolic and signaling pathways," says Andrew DePristo, president of GeneData Inc., a
|Data Mining Tool & Service Suppliers
|Web site toolbox.
subsidiary of Basel, Switzerland-based GeneData AG. "They say, 'I have an organism and a pathway that I'm interested in, and I don't have all the steps in the pathway. Can you reconstruct it?'" GeneData fills in these pathways using data that includes gene expression, metabolic pathways, and protein families. "It's not an easy task," says DePristo. The company has particular expertise in the area of microbial genomics, which has given it a leg up in the anti-infective field, where it has deals with AstraZeneca, Bayer, and Novartis, among others.
Perfecting Pathways
With a detailed pathway or other system model in hand, it should be possible to use data mining and other tools to pinpoint the ideal place to interfere with a disease process. This hope has led to the rise of systems biology companies, such as Beyond Genomics, BioSeek, GeneGo, and Genomatica.
"We are a new breed of biotech that relies on leading-edge technology in several fields," says Beyond Genomics' Ober, "but bioinformatics is the crucial underpinning of what we do."
One challenge the large-scale systems biology companies face is that most genomic tools provide just a snapshot of any biological process. A DNA chip, for example, shows the downstream effect of some event, such as the administration of a drug or the presence of cancer. It doesn't pinpoint what caused those changes, and that's the billion-dollar question. Data gathered over time can help in this respect, but some think an entirely different approach is required.
Systems biologists try to knit together every bit of evidence they can find or generate to build networks of correlations. The foundation for this approach is accurate, high-throughput technology — such as the advanced mass spectrometry and nuclear magnetic resonance tools that Beyond Genomics uses. But the analysis and the models that the analysis generates form the glue.
The trick is to avoid falling in love with a model before it's proved. "A lot of people are jumping in with mathematical simulations too soon," says Neumann. "There are many ways to write any equation, so you can always write an equation that gets you close to the data but is completely wrong." He emphasizes the importance of basing any model on extensive experimental testing. "You have to do the experiment under many different conditions to confirm a model," he says.
Perhaps not surprisingly, Beyond Genomics says it has done sufficient work to get some models to the point where it can say, "this causes that." "Only a handful of people are at this level," says Ober. "But if you have enough evidence and facts, you can see cause and effect. You do not need time series data; instead, you can do it by looking at data from hundreds of patients, and using the right models." Put another way, researchers can substitute a greater sample size for more lengthy specific studies.
Time to Deliver
To prosper, the genomics industry needs to move quickly from information explosion to drug discovery, and many observers believe data mining is the key to successfully making the transition.
"The Human Genome Project was a large hardware and computational issue," explains Viaken's Hamer. "After that task was automated, the question became, 'How do I reassemble all these things together?' Today, the challenge is to extract value and to integrate that value into research processes." Sequencing, gene expression, protein mapping, and cellular assays are all new sources of data feeding into an established system. Because pharmaceutical companies have traditionally looked at very few targets at once, it requires a paradigm shift for the industry to adjust to the exponential growth in target numbers.
But that shift can't take too long. "There has to begin to be a return on the investment," acknowledges Hamer. "I think people initially assumed that some algorithms would just run through their data, giving them what they wanted," says GeneData's DePristo. "Now, more people realize the first step is understanding the application domain and creating a target data set."
The trend toward pathway mapping and systems biology will help data miners succeed. "I think the different types of data all sit next to each other," says Shams. "The key is to create a network that interlinks them. Then, the value doesn't come from a single node, but rather collectively from all the interactions."
Is there enough data already in hand to find new drugs? Or do new fields have to be conquered first? Proteomics in particular has received a lot of attention, because proteins are the ultimate actors in disease. "Do you need proteomics?" asks DePristo. "For understanding the biology of organisms, you will have to have it. But if your goal is to find a new compound, then you don't absolutely need proteomics, as we have found a lot of drugs without it."
Success also depends on how pharmaceutical companies use these new tools and services. Ideally, they need to be able to pull together data from all their laboratories, and that's not always feasible if departments are used to working independently. "We need to keep pushing to help people work with disparate types of data, and to break through organizational silos if those get in the way," says OmniViz's Saffer.
Tighter integration between target discovery and lead discovery and optimization would also speed success. "The next step is to integrate genomics with high-throughput screening diversity," says DePristo. "There is chemical diversity and target diversity. I would love to see them all come together."
Saffer concurs. "Think what you could accomplish if you could feed the information from high-throughput screening back into target discovery," he says. "You'd have a new synergy, and the effects would be amplified throughout the discovery process."
Technically it's all feasible, the data miners say. Just stand back and let them roll up their sleeves, plunge their hands into the murky waters of data sets, and see what of value they extract.
ILLUSTRATION BY CURTIS PARKER