Sept. 13, 2007 | As drug discovery data grow ever more complex, scientists struggle to visualize their results. New products, such as the Affymetrix SNP Array 6.0 with its 1.8 million markers, simply add to the problem. Today’s computers can handle most of the data surge, but the real hurdles arise when combining information from different sources, such as gene-expression data, microscopic images, and so on.
“The largest challenge is the number of variables we now deal with,” says Georges Grinstein, professor of computer science and director of the Institute for Visualization and Perception Research at the University of Massachusetts at Lowell. “If you take all of the properties that a drug might have — chemical and other structure information, hydrophobicity, percent oxygen, calculated logP, and so on — the number of variables is huge, as high as 500,000.”
Today’s visualization techniques, however, remain far from ready to handle so many variables. No one can look at a plot of 500,000 dimensions — say, a two-dimensional display with 250,000 color-coded points on it — and make sense of it. “So the most common first step is to do dimensional reduction,” says Grinstein. “You can reduce the number of variables with principal components analysis or self-organizing maps.”
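The dimensional reduction Grinstein describes can be sketched in a few lines. The example below is a generic illustration, not tied to any product named here: it uses numpy's SVD to project synthetic, high-dimensional "compound descriptor" data onto its top two principal components, turning thousands of variables into a plottable 2-D scatter.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project rows of X onto the top principal components.

    X: (n_samples, n_features) array of, say, compound descriptors.
    Returns the (n_samples, n_components) low-dimensional projection.
    """
    Xc = X - X.mean(axis=0)             # center each variable
    # SVD of the centered data gives the principal axes in Vt,
    # ordered by decreasing singular value (explained variance)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T     # scores on the leading components

# Synthetic example: 100 "compounds" with 50 correlated descriptors
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                # 2 true underlying factors
X = latent @ rng.normal(size=(2, 50)) + 0.01 * rng.normal(size=(100, 50))
Z = pca_reduce(X, n_components=2)
print(Z.shape)  # (100, 2): 50 variables reduced to a 2-D scatter
```

Because the descriptors here are driven by two hidden factors, almost all of the structure survives the reduction; real chemical data are messier, which is why the choice of reduction method matters.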
The key is to find ways to simplify data enough to understand them without losing much information. That requires advances in both software and how scientists think about data visualization. For the moment, such changes seem underway in every corner, from academic and government labs to a wide range of software developers.
In one approach to making drug discovery data more meaningful, Grinstein and his colleagues use RadViz. This visualization tool lays out dimensions around a circle, with the data inside. To get the most out of this method, however, scientists need analytical tools that accompany the visualizations. For example, a cluster might appear in data shown in RadViz. “Then you need to ask which dimensions lend themselves to producing that cluster,” says Grinstein. “This would give the dimensions that are most meaningful.”
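The RadViz layout itself is simple to state: each variable gets an anchor point on a circle, and every data row is placed at the weighted average of those anchors, pulled toward the dimensions where its normalized values are largest. A minimal sketch, with invented data:

```python
import numpy as np

def radviz(X):
    """Map each row of X (n_samples, n_dims) to a 2-D point in the unit circle.

    Each dimension gets an anchor on the circle; a row is pulled toward
    the anchors of the dimensions where its (normalized) values are large.
    """
    # Min-max normalize each dimension to [0, 1] so weights are comparable
    lo, hi = X.min(axis=0), X.max(axis=0)
    W = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    d = X.shape[1]
    angles = 2 * np.pi * np.arange(d) / d
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])  # (d, 2)
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0        # rows with all-zero weight sit at origin
    return (W @ anchors) / totals    # convex combination of the anchors

rng = np.random.default_rng(1)
X = rng.random((200, 6))             # 200 rows, 6 dimensions on the circle
P = radviz(X)
print(P.shape)  # (200, 2)
```

Since each point is a convex combination of anchors on the circle, every point lands inside it; a cluster of rows with similar weight profiles shows up as a visible clump, which is exactly the pattern Grinstein says then needs analytical follow-up.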
Some scientists envision even bigger changes. David Covell, a principal investigator at the National Cancer Institute, says, “We need huge changes in drug discovery, so that scientists can explore new chemical areas.” To do that, he says, “We must develop tools that analyze screening data, genomics data, and so on, and put it in a form that allows one to classify, categorize, and interpret those data.”
Take NCI-60, a collection of 60 human-cancer cell lines. “The capacity to visualize across dimensions of data is very limited,” says Covell. To improve that, Covell and his colleagues developed a self-organizing map that “brings together many different snapshots across 60 dimensions, and puts it into a visualization that aids in the identification of those cells that act similarly or differently when exposed to test compounds.”
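A self-organizing map of the kind Covell describes can be illustrated with a toy training loop. The sketch below is a generic SOM on synthetic "response profiles," not the NCI tool itself; it shows the core idea that similar profiles end up mapped to nearby grid cells.

```python
import numpy as np

def train_som(X, grid=(4, 4), epochs=200, lr0=0.5, sigma0=1.5, seed=0):
    """Fit a small self-organizing map to rows of X; return the grid weights.

    X: (n_samples, n_features), e.g. response profiles of cell lines.
    Returns weights of shape (grid_rows, grid_cols, n_features).
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(size=(rows, cols, X.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)   # (rows, cols, 2)
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                 # decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.5     # shrinking neighborhood radius
        for x in X[rng.permutation(len(X))]:
            # Best-matching unit: the grid cell whose weight is closest to x
            d2 = ((W - x) ** 2).sum(axis=-1)
            bmu = np.unravel_index(d2.argmin(), d2.shape)
            # Pull the BMU and its grid neighbors toward the sample
            g2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-g2 / (2 * sigma ** 2))[..., None]
            W += lr * h * (x - W)
    return W

def bmu_of(W, x):
    """Grid coordinates of the cell whose prototype best matches x."""
    return np.unravel_index(((W - x) ** 2).sum(axis=-1).argmin(), W.shape[:2])

# Two well-separated synthetic clusters of "cell-line profiles"
rng = np.random.default_rng(2)
A = rng.normal(0.0, 0.1, size=(30, 8))
B = rng.normal(3.0, 0.1, size=(30, 8))
W = train_som(np.vstack([A, B]))
print(bmu_of(W, A.mean(0)), bmu_of(W, B.mean(0)))  # distinct grid cells
```

Plotting each sample at its best-matching unit gives the kind of map Covell describes: cells that "act similarly" under test compounds land in the same or adjacent grid positions.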
Some of the organizing simply means gathering the data together. For example, data related to drug discovery often end up scattered across different databases, documents, and so on. IDBS tackled that problem with its K3, which gathers and integrates data in a virtual warehouse. IDBS also provides other data-handling and visualization tools, such as XLfit, which integrates with Microsoft Excel to analyze and fit curves to data.
Adding Some Automation
Although users must play a role in the analysis, computers need to pitch in, too. “Despite all the gadgets, animating visualizations, and so on, scientists basically look at two-dimensional graphs,” says SPSS senior vice president Leland Wilkinson — a statistician who specializes in visualization and the author of The Grammar of Graphics. The key, according to Wilkinson, is adding machine intelligence. “There’s too much to look at without some guidance,” he says.
To start building such guidance, Wilkinson and Dan Rope of Illumitek developed nVIZn. This Java-based program serves as a tool for analytical graphics. They designed nVIZn to reveal statistical relations in data and to let users zoom in on specific aspects of a data set. “nVIZn is driving all graphics in SPSS products,” says Wilkinson. In some SPSS research projects, programs let users drag in any sort of data — from numbers to text — and the program finds the most interesting variables and displays the relationships. “You could even put in Moby Dick,” says Wilkinson, “and ‘whale’ would come up in the middle of the display of relations.”
SPSS’s Clementine, a data mining package, also looks for hidden patterns in data. “Clementine really excites me,” says Wilkinson. “You drag things around and line everything up, press a button, and there’s your analysis.” Future versions of Clementine will also include SPSS graphics tools, such as nVIZn.
Tracking Broad Topics
Scientists also face problems tracking wide-ranging information, including journal articles, patents, and more. Help is available through DyNet, developed by Advanced Technology Assessment (ATA) in Italy. (See “DATAbiotech Website Tracks Innovation,” Bio•IT World, Jan. 2007.)
Mark Supekar, ATA senior life sciences consultant, says DyNet can “analyze and mine heterogeneous knowledge bases.” Given a query — such as ‘drug targets related to Alzheimer’s disease’ — DyNet searches multiple databases for related information, and displays the results as a network of relations. Supekar adds, “We have just finalized the release of a life sciences version featuring a semantic engine able to retrieve key concepts like genes, proteins, drugs, biological processes, diseases, chemicals, and molecular functions from the literature — patents and scientific publications.” This analysis also includes the people who did the research. For example, Supekar says that DyNet can “look at the author’s position in both social networks — co-authorship and scientific collaborations — and semantic networks, such as key research interests.”
Keeping track of so much information and how it all relates cannot be done manually — at least not efficiently or in a reasonable amount of time. Supekar says that DyNet can “track down in a dynamic setting, over time, relations that help data miners and users understand complex interconnections and unearth hidden relations.”
Other tools cover data from drug discovery through regulatory submission. As an example, Jascha Minow, who heads regulatory issues at Image Solutions, says, “Our products help customers visualize data for the FDA or other regulatory groups.” He adds, “This includes data from the entire drug-development cycle.”
Visual i|o helps companies visualize another critical kind of data — business operations. “The dimensions of operational data that are key to good management and decision-making include: project status, attributes, progress/trending, plans, resources, budgets and expenditures, goals, risks, and value,” says Angela Shen-Hsieh, CEO of Visual i|o. “Our visualizations transform hundreds of static reports into interactive pictures of information, with controls and filters to perform on-the-fly analysis.”
In many cases, scientists want to keep control over visualization. At the Duke University Center for Computational Immunology, director Thomas Kepler uses flow cytometry to find changes in populations of cells after administering a vaccine or drug. This involves tracking 7-14 surface molecules. “This is a substantial challenge to visualize changes in 14-D space,” says Kepler. Instead, he lets users select three markers to examine in three dimensions.
To create the tool, Kepler and Cliburn Chan, assistant professor of biostatistics and bioinformatics at Duke, started with the IMSL Numerical Libraries from Visual Numerics, which allowed them to build mathematical models of cell changes. In later stages of development, these scientists switched to the graphics software standard, OpenGL. “We let the user choose which components to use,” says Chan, “but our software will provide hints along the way with statistical summaries for interesting columns of data.” Such a combination of inputs from users and technology can reveal new information. “Given sets of surface markers will appear together and change together,” explains Kepler. “By looking at these patterns of correlations, there’s a wealth of information that we don’t get from looking at markers one or two at a time.”
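The kind of statistical hint Chan describes, flagging "interesting" columns, can be approximated with a simple score. In this sketch the marker names, the data, and the choice of coefficient of variation as the score are all hypothetical; the point is only that flat markers rank low while spread-out or bimodal ones rank high, suggesting which three of 14 dimensions to plot.

```python
import numpy as np

def suggest_markers(X, names, k=3):
    """Rank markers by spread and suggest the top k to examine.

    X: (n_cells, n_markers) intensities; names: one label per column.
    Hypothetical "interestingness" score: coefficient of variation,
    so markers whose signal varies most across cells come first.
    """
    mean = X.mean(axis=0)
    score = X.std(axis=0) / np.where(np.abs(mean) > 0, np.abs(mean), 1.0)
    order = np.argsort(score)[::-1][:k]
    return [names[i] for i in order]

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([
    rng.normal(10, 0.2, n),   # "CD3": flat, uninteresting
    rng.normal(10, 6.0, n),   # "CD4": widely spread
    rng.normal(10, 0.3, n),   # "CD8": flat
    np.where(rng.random(n) < 0.5, 2.0, 18.0) + rng.normal(0, 0.5, n),  # "CD19": bimodal
])
picks = suggest_markers(X, ["CD3", "CD4", "CD8", "CD19"], k=2)
print(picks)  # the spread-out and bimodal markers rank first
```

A production tool would use richer summaries (bimodality, correlation with other markers), but even this crude ranking narrows 14 dimensions down to a viewable three.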
Sometimes, however, the human brain provides the best analysis if partnered with the right technology. At Stony Brook University, Robert Rizzo, assistant professor of applied mathematics and statistics, uses such combinations in structure-based drug design. His lab starts with a crystallographic structure of a drug target, and then models how compounds would interact with it. To set up and analyze the simulations, Rizzo uses an ordinary PC with an NVIDIA Quadro graphics card, which drives a CRT monitor and stereo glasses to give that 3D-movie effect.
The entire setup, including the computer, costs only around $3,000. “It’s inexpensive and works great,” says Rizzo. Moreover, human eyes easily “mine” stereographic images. “Sometimes, you can discover things with this kind of visualization that you’d never discover otherwise,” Rizzo says. As an example, he points out that a drug’s binding can be examined quickly. “Put on the stereo glasses,” he says, “and you see immediately what’s going on.”
Toolboxes of Control
The MathWorks also gives users control through toolboxes, based on MATLAB, that let scientists build semi-custom applications. “It’s open architecture,” says Kristen Zannella, biotech and pharmaceutical industry marketing manager at The MathWorks. “You can tweak it for your own use, but the goal is to not reinvent the wheel.”
In addition, MATLAB provides the same environment for different types of information, including gene-expression data, images, and statistics. “Once you get it in MATLAB,” says Zannella, “it’s all the same environment. You can mix and match and merge the data.” One pharma customer used MATLAB to visualize liquid chromatography-mass spectrometry data. “This was 3-D data that was highly sampled and difficult to analyze,” says Zannella. “So we built a tool for them to visualize the data in a couple of different 3-D ways. By trying the different visualizations, the peaks of interest started to just jump out at you.”
Mats Holmqvist, a research investigator at the Novartis Institutes for BioMedical Research, uses tools from The MathWorks to study drug-related hERG potassium channel cardiotoxicity. Holmqvist and his colleagues developed a hERG-functional assay that uses MATLAB to run quality control on data, automatically generate IC50 plots, and so on. In addition, this software can be customized fairly easily. “If you are an experienced MATLAB user,” says Holmqvist, “it’s not very difficult,” particularly as The MathWorks provides support. He notes, “I’ve used MATLAB a lot, but I didn’t have much experience querying databases. So they helped me.”
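Generating an IC50 plot of the sort Holmqvist automates typically means fitting a four-parameter logistic (Hill) curve to dose-response data. Here is a sketch in Python rather than MATLAB, using scipy's curve_fit on synthetic assay data; the compound, noise level, and parameter values are all invented.

```python
import numpy as np
from scipy.optimize import curve_fit

def hill(conc, bottom, top, log_ic50, slope):
    """Four-parameter logistic dose-response curve (IC50 fit in log space)."""
    return bottom + (top - bottom) / (1.0 + (conc / 10 ** log_ic50) ** slope)

# Synthetic assay: response falls from ~100% to ~0% around a true IC50 of 1 uM
conc = np.logspace(-3, 2, 12)                       # 1 nM to 100 uM
rng = np.random.default_rng(4)
resp = hill(conc, 0.0, 100.0, 0.0, 1.2) + rng.normal(0, 2.0, conc.size)

p0 = [0.0, 100.0, np.log10(np.median(conc)), 1.0]   # rough starting guesses
popt, _ = curve_fit(hill, conc, resp, p0=p0)
bottom, top, log_ic50, slope = popt
ic50 = 10 ** log_ic50
print(round(ic50, 2))   # estimated IC50, close to the true 1.0 uM
```

Fitting log10(IC50) rather than IC50 directly keeps the concentration ratio positive during optimization, a common trick for stable dose-response fits; quality control would then flag curves whose fitted slope or asymptotes fall outside expected ranges.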
In some cases, companies provide researchers with even more than tools. For instance, customers looking for disease-related genes can turn to the National Center for Genome Resources, which provides sequencing and Web-based analysis and visualization with its proprietary Alpheus software.
“Dynamic data visualization has been very underappreciated,” says Doug Robinson, JMP Genomics applications scientist manager at the SAS Institute. He notes that the latest release of JMP Genomics is “very visual and dynamic software.” For example, with gene expression data, JMP Genomics lets users select pieces of the results for further analysis. “This kind of dynamic data visualization,” says Robinson, “lets you explore data you already have. You can go in and look for more underlying structure.” For instance, a user can point and click on part of a data set to apply different visualization tools, such as distribution analysis or clustering.
Last June, TIBCO Software acquired Spotfire — a company long considered a leader in data visualization. Christian Marcazzo, Spotfire’s senior director of life sciences analytics, says its software is “dynamic in terms of responding to user input.” For instance, simple mouse movements query data sets and change graphs. “You can quickly move across time or from one patient to the next,” he says.
According to Marcazzo, TIBCO Spotfire DXP, which was recently updated, simplifies the analysis and visualization of multiple data types. “In a cancer study,” he says, “you often have genomic and genetics data. We provide the ability to look at each one and navigate across them.” A Web version of DXP will be launched as well. Marcazzo likens it to “a really useful dashboard — you can interact with it.”
To capture more therapeutic knowledge, technology must combine data. Rene Hermes, vice president of marketing at Definiens, says, “Our Definiens Enterprise Image Intelligence Suite 7 handles numerous assay types during all phases of drug discovery and development, including high-content screening, predictive toxicology, and more.” The software also provides statistical tools to analyze images on an enterprise level, combining intelligence from different data sources and departments. “Quantification of disparate data types is the key to consistent and reliable results,” says Mark Watson, head of life science product marketing at Definiens. “Our technology enables companies to automate the data-analysis process and to harmonize their existing IT infrastructure.”
Symyx Software (See p. 6) also produces integrative applications. Company president Timothy Campbell says these products feature “real-time, data-warehousing solutions, visualization, and experiment planning and design for scientists.” Using Symyx’s Polyview as an example, Campbell says that it “acts as a comprehensive search engine allowing researchers to query and visualize data using a sample-based querying and reporting tool to compare multiple experimental figures.”
In the long run, getting the most out of drug discovery data depends on analysis, visualization, and user interactions. That combination could spawn entirely new therapies.