Oct. 9, 2002
Mark D. Uehling, senior science editor, spoke with visualization expert Georges Grinstein about how scientists should analyze large data sets.
If you can't really see what's in your data, is there any point to generating it? If your data set is too large for you to see if any trees have fallen, should you still be in the timber business?
Georges G. Grinstein is happy to offer answers. A professor of computer science at the University of Massachusetts at Lowell, Grinstein is a leading expert on scientific visualization who has led seminars on the topic for the Drug Discovery Technology conference, the Institute of Electrical and Electronics Engineers, and the Association for Computing Machinery. Grinstein has also periodically started his own companies, most recently AnVil Inc., which specializes in helping scientists excavate and illuminate the nuggets of gold amid vast heaps of data slag.
Q: You're a former NATO and Pentagon consultant. Is visualization in the life sciences as sophisticated as it is within the U.S. military or NASA?
A:
The really high-end visualizations are still in the aerospace industry. I'm not sure anyone in life sciences is really doing that kind of work. For modeling an airplane or the space shuttle, there are tremendous numbers of equations, points that need to be sampled. That requires large workstations, multiple gigs of RAM, multiple CPUs. For the visualizations in the life sciences, we are nowhere near needing that yet.
Q: Some of the newer techniques that can show all the data (radial diagrams, dendrograms, parallel coordinates, heat maps) are not easy to interpret. Will scientists learn to understand these techniques?
A:
If you take one of my employees, or someone who's never worked with parallel coordinates, they don't understand at all. It's a black box. But it is possible to train people. I could train a student to use some of these visualizations and become reasonably good with one graduate course.
Q: Why are additional dimensions beyond the traditional X, Y, and Z coordinates necessary for looking at large scientific data sets?
A:
Visualization is a tool that allows you to see the end result of computation, to steer the computation, and even to go back to the beginning with the raw data, to get a sense of where it should go. You keep mentioning visualization, and I keep trying to push you back to the analysis of the visualization. My definition of visualization is that it's an interface to the data and the algorithm.
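As a rough illustration of how one of the techniques mentioned above copes with more than three dimensions, here is a minimal Python sketch of the idea behind parallel coordinates: each dimension becomes its own vertical axis, and each record becomes a polyline crossing every axis at its scaled value. This is a sketch under stated assumptions, not any vendor's implementation; the function name `parallel_coordinates` and the sample values are hypothetical.

```python
def parallel_coordinates(rows):
    """Return one polyline per row as a list of (axis_index, scaled_value) vertices.

    Each column is min-max scaled to [0, 1] so axes with different units
    can be drawn side by side. A constant column is placed at mid-axis (0.5).
    """
    n_dims = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_dims)]
    highs = [max(row[j] for row in rows) for j in range(n_dims)]

    polylines = []
    for row in rows:
        line = []
        for j in range(n_dims):
            span = highs[j] - lows[j]
            scaled = (row[j] - lows[j]) / span if span else 0.5
            line.append((j, scaled))
        polylines.append(line)
    return polylines


# Hypothetical five-dimensional records (made-up values, for illustration only).
samples = [
    [2.0, 10.0, 1.0, 7.0, 300.0],
    [4.0, 30.0, 3.0, 7.0, 100.0],
    [3.0, 20.0, 2.0, 7.0, 200.0],
]

for line in parallel_coordinates(samples):
    print(line)
```

Every record keeps all five of its values on screen at once, which is the point of the technique; reading the resulting tangle of lines is the part that takes training.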
AnVil Gets Its Eyes on Head Data
AnVil Inc., based in Burlington, Mass., has developed its own analysis and visualization software called RadViz, but the company doesn't play favorites. If competitors like Spotfire Inc. or Omniviz Inc. have tools that do the job better, AnVil will use them. For each AnVil client, the company's programmers, physicians, scientists, and statisticians use whichever tools they deem appropriate.
Confidentiality agreements prevent the company from discussing recent projects in detail, but the AnVil Web site notes alliances with Applied Biosystems Group and the National Institutes of Health. In July, AnVil inked a deal with HealthSouth Corp., the nation's largest provider of rehabilitation services. HealthSouth sees perhaps a third of all U.S. stroke patients.
AnVil will peer into the depths of the HealthSouth data and look for trends in the efficacy of various cardiac medications on stroke patients. AnVil will also analyze HealthSouth data about drugs used after orthopedic surgery. "What we find with a lot of clients, including HealthSouth," says AnVil's marketing vice president, William Wiegler, "is that the data are so big that they're not really completely sure how to approach it and get the best information out of it."
Q: OK, but you don't seem to think 2-D bar graphs and fever charts are adequate for visualizing large genomic or proteomic data sets ...?
A: A person who flies an airplane is monitoring 10 to 15 parameters. Scientists will begin to look at five, 10, 15 dimensions over the next few years. The machines they're using to generate the data do generate lots of dimensions. That is part of the reason most scientists are turning to analytical approaches. We're at a stage where the quantity of data and the amount of work that the life scientist has to do are too large.
Q: What advice would you give to a small academic lab that would like to analyze and visualize its data more proficiently, but is only comfortable with Microsoft Excel?
A: I wouldn't suggest they buy a tool; tools have limitations. There are so many of them. They may not work. With the competitions that are now occurring with data mining, it's not just one tool that pops up each time as the clear-cut winner. It's dozens and dozens of them. People get reasonable successes with a variety of tools and reasonable failures with a variety of tools.
Q: What would you say to a large pharmaceutical company with many in-house, open source, and commercial visualization and data-mining tools, but no clear insight into how to use such tools?
A:
I would tell them that an integrated environment would be wonderful. Human curation and human participation are going to play a key role. You need the visualization-knowledgeable individual, you need the analysis-knowledgeable individual, and you need the domain person [expert in a particular scientific field]. You put those three together, you're more likely to succeed at the task at hand than if one or two of them were present. Most labs will begin to combine domain knowledge and analysis. And in some cases they realize they now have to add visualization. But they may not hire the visualization expert yet.
Q: They don't think visualization is important? They would prefer to buy new tires before they have the car repainted?
A:
They forget the windshield had better be clear to see the right way. What they have is a very dirty windshield. They say: "That's OK, I can see the gist of most things. That's not really important as long as I have my radar on."
Q: What's different about very large data sets?
A: The interpretation of statistics changes. The way to apply statistics with large data sets is quite different. In some cases, you don't have ground truth. If you have 10 million patients, you can sample them. But if you have a hundred-dimension data set, you can't sample that space and get a sense of what that distribution is like. Because the number of dimensions is so large, we don't have a sense of the behavior of the data in that large dimensional space. You cluster, you find a bunch of clusters. But we don't have a very strong intuition about a hundred-dimensional dataspace.
Q: Clustering techniques can seem foolproof. A patient falls into a particular cluster, the doctor predicts the odds of surviving chemotherapy at 85 percent. What's the biggest misconception about predictions from clusters?
A: How did you, statistically or using other techniques, validate your prediction? Did you toss a coin? Did you compute it on a couple of data sets? Or did you actually begin to statistically and perhaps in vivo validate your predictor? It's easy to use analysis from a software package and out comes a bunch of clusters or predictors. The question now is how valid is [that predictor], how accurate is it, how sensitive is it, how robust is it?
Hidden Clues Revealed
Working with a large but unnamed pharmaceutical company, AnVil's scientists were able to look at data from a drug designed to inhibit the growth of tumors. The company's own analysis of the gene expression data was frustrating, showing a wide range of failure and success for the medication depending on who got the drug. AnVil examined the data and found possible gender effects of the candidate drug that were not well supported by the company's existing epidemiological data. That allowed the drug company to push to find contraindications (reasons not to take the drug), which could prevent surprises later, in the more expensive clinical phases of development.
Deciphering Complex Data Sets
In 1999, a group led by Todd Golub and Donna Slonim of MIT published genetic data about two forms of leukemia. Using a patient's genetic information, they programmed a computer to predict — with no other human involvement — which patients had which form of the disease. The Golub and Slonim data are now used in academia and industry to test the capabilities of both data-analysis software and human experts in the field.
Using the MIT data, AnVil's experts were able to create a series of analytical visualizations. Their algorithms "learned" which genes were associated with which diseases and patients, just as MIT had done. But AnVil scientists took the MIT work a step further. Their tools allowed them to classify the leukemia patients more accurately. Using a subset of 76 genes, AnVil was able to predict how patients with specific collections of genes would respond to medical treatment. In the first image (top), green dots indicate patients who will be helped by treatment, and red dots indicate patients who will not be helped. The yellow dots represent leukemia patients with more ambiguous genetic makeups. In the second image (bottom), some of the MIT data is presented by AnVil's RadViz software.
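For readers curious how a RadViz-style display places its dots, the following is a minimal Python sketch of the radial projection as it is generally described in the visualization literature. It is not AnVil's software, and the details of the company's tool are not given here. Each dimension (a gene, say) gets an anchor on a circle, and each patient is pulled toward the anchors in proportion to its scaled values. The function name `radviz` and the sample values are hypothetical.

```python
import math


def radviz(rows):
    """Project rows of d numeric features onto 2-D points inside the unit circle.

    Assumption: the commonly described RadViz scheme, in which each column is
    min-max scaled to [0, 1] and a row lands at the weighted average of the
    dimension anchors, with the scaled values as weights.
    """
    n_dims = len(rows[0])
    # One anchor per dimension, evenly spaced around the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / n_dims), math.sin(2 * math.pi * j / n_dims))
        for j in range(n_dims)
    ]
    lows = [min(row[j] for row in rows) for j in range(n_dims)]
    highs = [max(row[j] for row in rows) for j in range(n_dims)]

    points = []
    for row in rows:
        weights = [
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_dims)
        ]
        total = sum(weights)
        if total == 0:
            # No pull from any anchor: place the row at the center.
            points.append((0.0, 0.0))
            continue
        x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        points.append((x, y))
    return points


# Hypothetical expression levels for four genes (made-up values).
patients = [
    [1.0, 0.0, 0.0, 0.0],  # dominated by gene 0: lands on that gene's anchor
    [0.0, 1.0, 0.0, 0.0],  # dominated by gene 1
    [1.0, 1.0, 1.0, 1.0],  # balanced profile: lands at the center
    [0.0, 0.0, 0.0, 0.0],  # all-minimum profile: also placed at the center
]

for point in radviz(patients):
    print(point)
```

Note that a balanced profile and an all-minimum profile land on the same spot in this sketch. Ambiguities of that kind are one reason a display like this needs a trained reader and a validated analysis behind it.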