By Malorye A. Branca
December 15, 2002 | Microarrays are the Ferraris of functional genomics, allowing studies at unprecedented speed.
But in genomics, speed leads to giant data sets, and few biologists are comfortable with the statistics required to tackle those data sets. Plenty of open source and commercial statistical programs for microarray data analysis are available, but in some ways they only add to the confusion by creating more options. Now, BioConductor (www.bioconductor.org) offers a solid set of open source tools, along with some guidance on when and how to use them to tackle these huge, expanding data sets.
Best of all, it’s helping to create sort of a standard for array data analysis.
BioConductor contains statistical analysis and graphical tools based on R -- an open source version of the renowned S language developed by John Chambers at Bell Laboratories Inc. in the 1970s. It can do everything from the preprocessing of data prior to analysis to clustering and classification afterward. Version 1.0 of BioConductor was released in May. By late October, several thousand people had already downloaded the program.
Another spike of downloads is expected now, following the mid-November launch of version 1.1.
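For readers who want a concrete picture of the "preprocessing to clustering" workflow described above, here is a minimal sketch. It is written in plain Python rather than R, it does not use or reproduce any BioConductor function, and the intensity values, gene names, function names, and the 0.9 correlation cutoff are all made up for illustration.

```python
import math
from statistics import median

def log2_median_center(matrix):
    """Log2-transform raw intensities, then subtract each array's (column's) median."""
    logged = [[math.log2(v) for v in row] for row in matrix]
    col_medians = [median(col) for col in zip(*logged)]
    return [[v - m for v, m in zip(row, col_medians)] for row in logged]

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_genes(profiles, min_corr=0.9):
    """Single-linkage grouping: a gene joins every cluster holding a gene it correlates with."""
    clusters = []
    for name, profile in profiles.items():
        linked = [c for c in clusters
                  if any(pearson(profile, profiles[g]) >= min_corr for g in c)]
        merged = [name]
        for c in linked:
            merged = c + merged      # fold each linked cluster into the new one
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Hypothetical raw intensities: rows are genes, columns are four arrays.
raw = {
    "geneA": [100, 200, 400, 800],
    "geneB": [50, 100, 200, 400],
    "geneC": [800, 400, 200, 100],
}

# Step 1: preprocessing. Step 2: clustering on the normalized profiles.
normalized = log2_median_center(list(raw.values()))
profiles = dict(zip(raw.keys(), normalized))
clusters = cluster_genes(profiles)
print(clusters)
```

In this toy data set, geneA and geneB rise together across the arrays while geneC falls, so the first two end up grouped and the third stands alone. Real analyses involve thousands of genes and more careful normalization, which is where vetted statistical packages earn their keep.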
While the software doesn’t come with a 24-hour helpline, it has other advantages. For one thing, it costs nothing yet is reported to be relatively bug-free. “R has very rigorous procedures before something is published,” says microarray analysis guru Terry Speed, a professor at the Royal Melbourne Hospital in Australia. “I would trust their software much more than what you find on some random Web site, and more than [that from] a lot of the private companies too.”
That focus on quality is no accident. R was first developed by statisticians Robert Gentleman and Ross Ihaka while the two were at the University of Auckland, New Zealand. When they started involving researchers on other continents in the project, they were forced to establish strict rules for how to write functions and where to store them so others could test them. “That let us develop it collaboratively, and still keep it from being buggy and hard to use,” says Gentleman, now an associate professor at Harvard School of Public Health.
Ultimately, R “took on a life of its own,” says Gentleman. With a handful of co-founders, he established BioConductor in early 2001. The group was frustrated by the proliferation of bad algorithms. “Many biologists are re-implementing standard statistical programs,” Gentleman says. “They read about it in a book, and then try to write a program.”
With so many neophytes experimenting with statistics, the threat of bad science also looms. “If it’s easier to publish an analysis with some obscure machine-learning approach than with perfectly valid standard tools, that can’t be a good thing,” he says.
R has many attractive features, and it has become so popular in the microarray community that people are starting to regard it as a standard. Christian Stratowa, a bioinformatician at Boehringer Ingelheim in Austria, currently uses both R and its commercial version, S-Plus. “In the future, I will use R more often,” he says. “It seems that today, most algorithm development is done using R and sometimes only tested for R and no longer for S-Plus. This is especially true for microarray-specific algorithms.” Stratowa says he is also drawn by the number of renowned statisticians working in R.
Even the average biologist should be able to use BioConductor’s programs, Gentleman says. “It’s easier than learning Perl,” he adds.
Commercial software, he concedes, is probably the best choice for people looking for the easiest interfaces, but R is becoming more user-friendly. “People are developing new interfaces to R,” says John Quackenbush, an investigator at The Institute for Genomic Research (TIGR) in Rockville, Md. Working in parallel to the BioConductor project, Quackenbush’s group is also developing open source software, written in the Java programming language, for some of the same algorithms. “Probably what you will see is these two efforts converge,” he says.
Some support and enhancements are available to BioConductor users, though you may have to wait for them. Because the software is open source, users can easily customize it and share their innovations if they want to. Help comes via contact with the BioConductor community. BioConductor also does a lot of education, through the Web site and workshops. “We have a big emphasis on training so people can get the most out of it,” says Sandrine Dudoit, one of BioConductor’s founders and a statistician at University of California at Berkeley’s School of Public Health.
BioConductor founders say they are not competing with commercial products, but instead want to influence them. Already Seattle-based Insightful Corp., which markets S-Plus, is offering a microarray analysis package based partly on BioConductor.
“We do hope BioConductor will be the vehicle of choice for people who want to be at the leading edge,” Gentleman says.
If they can achieve even that, it could help the entire field progress.
“The thing that distinguishes BioConductor from other software is that it lives in the open source environment where other people can add to it,” says Quackenbush. Developers around the world can collaborate on the same questions. “There are a lot of smart people out there; it is better to have them on the same team,” he says.
This is especially important because the “best” statistical methods for particular genomic applications are just being worked out.
“I think that new approaches will be essential,” says Quackenbush. “This field is so new we don’t yet know the best ways to analyze data. The cell is still a black box, and the completed genome is a hypothesis.”
BioConductor’s founders concur. “There is a lot of room for improvement in microarray analysis overall (see related story),” says Dudoit. “I think it is partly a question of better communication between the different types of scientists, as these are multi-disciplinary problems.”
Once people reach agreement on which statistical methods to use, and more people are using them correctly, the field will mature. “The final arbitrator in all of this is the data,” says Quackenbush. “Until you can make interchangeable comparisons, you are arguing about which method is best without any real evidence.”