Getting the most out of microarray data requires statistical rigor, says Insightful's Michael O'Connell.
By John Russell
November 19, 2004 | Clinical biostatisticians have long ruled the statistical roost in biomedical science. Their complicated calculations determine the fate of compounds in clinical trials. Discovery statisticians generally get by with simpler methods to sift data. That dichotomy is ending, says Michael O'Connell, director of biopharmaceutical solutions at statistical software supplier Insightful Corp. Bio·IT World executive editor John Russell talked with O'Connell about trends in microarray and mass-spec data analysis and Insightful's recent deal with visualization specialist Spotfire.
Q: Statistical analysis tools don't have a user-friendly reputation. Who really understands and uses these tools?
A: Big pharmaceutical companies have traditionally been divided into the nonclinical statistician group that services discovery, and the clinical statistics group that services the clinical trial area. Our core customer is a statistician or analyst, so we have been providing tools to both groups. In recent times, say, the past five years, these informatics groups have grown up on the discovery side, and we've been servicing them directly as well. Of these groups, the clinical biostatisticians have been far more rigorous folks.
But that's changing. Running microarray or mass-spec experiments to determine if a drug is effective by looking at expression of various genes or proteins means you have to know what you're looking at. That sort of pharmacogenomics work has brought statistical rigor back into discovery. I see the microarray platform as a consolidated platform being widely used in functional genomics and drug discovery and making its way into the clinical world as part of the merging of clinical and discovery data.
Q: What are the big issues when interpreting microarrays?
A: Well, the chips can be big. Some contain 60,000 genes. Screening out variability is an important objective. It's also critical to be able to identify false negatives. Most people concentrate on identifying false positives, but false negatives are just as important: you don't want to screen out a compound that actually works. Until a year or so ago, there weren't many tools powerful enough to do that. The issues of experiment design, as well as data cleaning, are also critical.
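The trade-off between false positives and false negatives on a 60,000-gene chip can be made concrete with a little arithmetic. The following is a minimal sketch, not anything O'Connell or Insightful described: it assumes a two-sided test with a normal approximation, counts false positives as if every gene were unchanged, and uses a hypothetical effect three standard errors in size (`NONCENTRALITY`). It compares an unadjusted 0.05 threshold with a Bonferroni-adjusted one.

```python
from statistics import NormalDist

N_GENES = 60_000       # chip size mentioned in the interview
NONCENTRALITY = 3.0    # hypothetical: true effect divided by its standard error


def expected_false_positives(alpha, n_null=N_GENES):
    """Expected number of unchanged genes flagged at per-gene level alpha."""
    return alpha * n_null


def power(alpha, noncentrality=NONCENTRALITY):
    """Chance that a two-sided z-test at level alpha detects a real effect."""
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(1.0 - alpha / 2.0)
    upper = 1.0 - std_normal.cdf(cutoff - noncentrality)
    lower = std_normal.cdf(-cutoff - noncentrality)
    return upper + lower


loose_alpha = 0.05
strict_alpha = 0.05 / N_GENES  # Bonferroni adjustment

for label, alpha in (("unadjusted", loose_alpha), ("Bonferroni", strict_alpha)):
    miss_rate = 1.0 - power(alpha)
    print(
        f"{label}: about {expected_false_positives(alpha):.2f} false positives "
        f"expected, {miss_rate:.0%} of real effects missed"
    )
```

Under these assumptions the strict threshold nearly eliminates false positives but misses most real effects of this size, which is the false-negative problem raised in the answer above.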
Q: We hear complaints about microarray data quality and the difficulty of comparing results from different platforms. What's being done to solve these problems?
It's funny that you bring up the cross-platform issue because I think that is a frontier being tackled right now as all these groups of data are building up all over the place. In fact, there was a session on it at the Eastern North American regional meeting of the International Biometric Society.
There is interesting work being funded by the NIH on determining good methods for combining data from multiple trials run at different sites. Folks at Johns Hopkins doing some of that work have found that the signal-to-noise ratio, or the t-statistic, has turned out to be a really nice scale to work with when you're trying to combine data, rather than [using] some of the raw data. If one experiment is noisy, it's going to get less weight when you combine it with the other experiments.
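One common way to get the behavior described here, where a noisy experiment counts for less, is inverse-variance weighting. The sketch below is an illustration under that assumption and is not necessarily the method used by the Johns Hopkins group. The function name `combine_studies` and the per-study numbers (log2 fold changes and standard errors for a single gene) are hypothetical.

```python
import math


def combine_studies(effects, std_errors):
    """Fixed-effect, inverse-variance combination of one gene across studies.

    Returns the combined effect, its standard error, the combined
    signal-to-noise ratio, and each study's share of the total weight.
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("need one standard error per effect estimate")
    raw_weights = [1.0 / se ** 2 for se in std_errors]  # noisy study -> small weight
    total = sum(raw_weights)
    combined_effect = sum(w * d for w, d in zip(raw_weights, effects)) / total
    combined_se = math.sqrt(1.0 / total)
    shares = [w / total for w in raw_weights]
    return combined_effect, combined_se, combined_effect / combined_se, shares


# Hypothetical log2 fold changes for one gene from three sites.
effects = [1.0, 1.2, 0.2]
std_errors = [0.25, 0.25, 1.0]  # the third site is much noisier

combined_effect, combined_se, combined_ratio, shares = combine_studies(
    effects, std_errors
)

for effect, se, share in zip(effects, std_errors, shares):
    print(f"effect={effect:.2f}  t={effect / se:.2f}  weight share={share:.1%}")
print(f"combined effect={combined_effect:.3f}  signal-to-noise={combined_ratio:.2f}")
```

Because each weight is one over the squared standard error, the noisy third site contributes only a small share of the combined estimate, while the two precise sites dominate it.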
Q: What about FDA efforts on microarrays? Will they help?
A: We've been monitoring [the FDA's work on providing microarray guidance] pretty closely, although I'm not on any of the committees. You know there are some obvious things there, like using P450 as a safety flag, and other clearly obvious things that are becoming standard practice in that area. When we're brought into an engagement, we always go back to the CEL file [which contains only the intensity information for each probe] on the Affymetrix side because we don't trust the way that someone may have summarized the individual probe data. That's helpful.
Q: How are the challenges of dealing with proteomic data different? And what is Insightful doing in that arena?
It turns out the way of managing the data — putting the framework in place for statistical analysis — is very similar for mass spec as for microarrays. This whole world of genetic biomarkers, proteomic biomarkers, and microarray biomarkers is a priority for us. It's also a newer area for us. There was a very controversial paper* that came out on ovarian cancer using proteomics and mass spec. That got our attention. We've gotten involved looking at those data, and are putting together a toolkit for working with those data in S-PLUS right now. We've also got a new set of tools for genetic biomarkers and the new Affymetrix SNP chip.
Q: Perhaps you could comment on the recent collaboration between Insightful and Spotfire. Why did that come about?
A: We think that our strength is on the analytics and statistics and predictive analytics, and we see Spotfire as having strength with the biologist and chemist. It's more than the visualization, though that's definitely a part; it's the user experience that the Spotfire interface provides. The filtering and the interactive manipulation of the data make for a nice experience. You take that and combine it with the analytics and the reporting that we do, and it's very complementary.
Q: I read a review complaining that Spotfire on Windows didn't have enough computational power. What's your response?
A: Spotfire on Windows is the front end, but S-PLUS on Unix is the back end. With the newer Microsoft versions, we do have some products that have been able to address more RAM and system resources, so it's getting a lot better. But traditionally, it's been Unix, Linux, and Solaris that handle the bigger data sets. The other thing is that S-PLUS runs on MPI, the Message Passing Interface, so you can have a Linux cluster and do a very large job.
* Petricoin, E.F.; Ardekani, A.M.; Hitt, B.A.; Levine, P.J.; Fusaro, V.A.; Steinberg, S.M.; Mills, G.B.; Simone, C.; Fishman, D.A.; Kohn, E.C.; Liotta, L.A. "Use of proteomic patterns in serum to identify ovarian cancer." The Lancet 359, 572-7; 2002.